Fast Lead Star Detection in Entertainment Videos

Paper ID 170
Anonymous CVPR submission
Abstract
Video can be summarized in many forms. One natural possibility that has been well explored is extracting key frames from shots or scenes and creating thumbnails. Another natural alternative that has been surprisingly ill-explored is to locate the "lead stars" around whom the action revolves. The few available techniques for detecting lead stars are usually video specific.
In this paper, we highlight the importance of lead star detection and present a generalized method for detecting snippets around lead actors in entertainment videos. Applications that naturally make use of this method include locating the action around the 'player of the match' in sports videos, lead actors in movies and TV shows, and guest-host snippets in TV talk shows. Additionally, our method is fifty times faster than the state-of-the-art spectral clustering technique, with comparable accuracy.
1. Introduction
Suppose an avid cricket fan or coach wants to learn exactly how Flintoff repeatedly got Hughes "out." Or a movie buff wants to watch an emotional scene involving his favourite heroine in a Hollywood movie. Clearly, in such scenarios, one wants to skip the frames that are "not interesting." One possibility that has been well explored is extracting key frames from shots or scenes and then creating thumbnails. Another natural alternative, and the emphasis of this paper, is to determine the frames around what we call lead stars. A lead star in an entertainment video is the actor who, most likely, appears in many significant frames. We define lead stars for other videos as well. For example, the lead star in a soccer match is the hero, or the player of the match, who has scored "important" goals; intuitively, he is the one the audience has paid to come and see. Similarly, the lead star in a talk show is the invited guest or, for that matter, the hostess. This work presents how to detect lead stars in entertainment videos. Moreover, like other forms of video summarization [11, 6, 1], lead stars provide a natural way of summarizing video. (Multiple lead stars are of course allowed.)
Figure 1. Lead star detection. This is exemplified in sports by the
player of the match; in movies, stars are identified; and in TV
shows, the guest and host are located.
1.1. Related Work
Researchers have explored various video-specific applications of lead star detection: anchor detection in news video [9], lead cast detection in comedy sitcoms [2], meeting summarization [8], guest-host detection [6, 5], and locating the lecturer in smart rooms by tracking the face and head [10].
Fitzgibbon and Zisserman [3] use affine-invariant clustering to detect the cast listing of a movie. As the original algorithm has quadratic running time, the authors use a hierarchical strategy to speed up the clustering that is central to their method. Foucher and Gagnon [4] used spectral clustering techniques to cluster actor faces. Their method detects the actors' clusters in an unsupervised way, with a computation time of about 23 hours for a motion picture.
1.2. Our Strategy
Although the lead actor has been defined using a pictorial or semantic concept, an important observation is that the significant frames in an entertainment video are often accompanied by a change in the audio intensity level. It is no doubt true that not all frames containing the lead actors involve significant audio differences; our interest, however, is not at the frame level.
The advent of important scenes and important people bears a strong correlation with the audio level. We surveyed around one hundred movies and found that it is rarely, if at all, the case that the lead star does not appear in the audio-highlighted sections, although the nature of the audio may change from scene to scene. And, as alluded to above, once the lead star has entered the shot, the frames may well contain normal audio levels.
Figure 2. Our strategy for lead star detection. We detect lead stars
by considering segments that involve significant change in audio
level. However, this by itself is not enough!
Our method is built upon this concept: we detect lead stars by considering such important scenes of the video. To reduce false positives and false negatives, our method clusters the faces for each important scene separately and then combines the results. Unlike the method in [3], our method provides a natural segmentation for clustering, and it reduces the computation time of the previously mentioned state of the art for computing the lead star in a motion picture by a factor of about 50. We apply the method to sports videos to identify the player of the match, to motion pictures to find heroes and heroines, and to TV shows to detect guests and hosts.
2. Methodology
As mentioned, the first step is to find important scenes that carry audio highlights. Once such scenes are identified, they are further examined for potential faces. When a potential face is found in a frame, subsequent frames are analyzed to weed out false alarms using concepts from tracking. At this point, several regions are identified as faces. These confirmed faces are grouped into clusters to identify the lead stars.
Figure 3. Strategy further unfolded. We first detect scenes accompanied by a change in audio level. Next we look for faces in these important scenes and, to further confirm their suitability, track the faces in subsequent frames. Finally, a face dictionary representing the lead stars is formed by clustering the confirmed faces.
2.1. Audio Highlight Detection
The intensity of a segment of an audio signal is summarized by its root-mean-square (RMS) value. The audio track of the video is divided into windows of equal size, and the RMS value is computed for each window. From the resulting RMS sequence, the RMS ratio is computed for successive items in the sequence. The ratio is marked as low when its value falls below a user-defined threshold; in our implementation we use 5 as the threshold, and the video frames corresponding to such windows are considered 'important.'
Figure 4. Illustration of highlight detection from the audio signal. We detect highlights by considering segments that involve a significantly low RMS ratio.
2.2. Finding & Tracking Potential People
Once important scenes are marked, we seek to identify people in the corresponding video segment. Fortunately, there are well-understood algorithms for detecting faces in an image frame. We select a random frame within the window and detect faces using the Viola & Jones face detector [7]. Every face detected in the current frame is then voted on for confirmation by attempting to track it across subsequent frames in the window. Confirmed faces are stored for the next processing step in a data matrix: the confirmed faces from each highlight i are stored in the corresponding data matrix Di, as illustrated in Figure 5.
Figure 5. Illustration of data matrix formation. The tracked faces are stacked together to form the data matrix.
2.3. Face Dictionary Formation
In this step, the confirmed faces are grouped based on their features. There are a variety of algorithms for dimensionality reduction and subsequent grouping; we observe that Principal Component Analysis (PCA) has been used successfully for face recognition. We use PCA to extract feature vectors from each Di and the k-means algorithm for clustering, with the number of clusters decided based on minimal mean-square error. Representative faces from the clusters of all highlights are clustered again to obtain the final set of clusters, and the representative faces of these final clusters form the face dictionary.
At this point, we have a dictionary of faces, but not all of the faces belong to lead actors. We use the following parameters to shortlist the faces that form the lead stars; a sketch of one way to combine these cues follows the list.
1. The number of faces in the cluster. If a cluster (presumably of the same face) has a large cardinality, we give the cluster a high weight.
2. Position of the face with respect to the center of the image. Lead stars are expected to be near the center of the image.
3. Size of the detected face. Again, lead stars typically occupy a significant portion of the image.
4. The duration for which the faces in the cluster occur in the current window, as a fraction of the window size.
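One possible way of combining the four cues into a single cluster score is sketched below; the linear combination and its weights are illustrative assumptions, since the paper states which cues are used but not how they are combined.

def cluster_score(num_faces, total_faces, face_center, image_center, image_diag,
                  face_area, image_area, duration, window_size,
                  weights=(0.4, 0.2, 0.2, 0.2)):
    """Combine the four cues above into a single score for a face cluster.
    The weights and the linear combination are illustrative assumptions."""
    w1, w2, w3, w4 = weights
    cardinality = num_faces / float(total_faces)                       # cue 1
    dx = face_center[0] - image_center[0]
    dy = face_center[1] - image_center[1]
    centrality = 1.0 - ((dx * dx + dy * dy) ** 0.5) / image_diag       # cue 2
    size = face_area / float(image_area)                               # cue 3
    coverage = duration / float(window_size)                           # cue 4
    return w1 * cardinality + w2 * centrality + w3 * size + w4 * coverage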
Figure 6. Illustration of face dictionary formation.
The face dictionary formed for the movie Titanic is shown in Figure 7. Our method has successfully detected the lead actors in the movie. As can be noticed, along with the lead stars there are patches that have been misclassified as faces.
Figure 7. Face dictionary formed for the movie Titanic. Lead stars
are highlighted in blue (third image), and red (last image). Note
that this figure does not indicate the frequency of detection.
3. Applications
In this section, we demonstrate our technique with three applications: Lead Actor Detection in Motion Pictures, Player of the Match Identification, and Host-Guest Detection. Among these, player-of-the-match identification has not been well explored, considering the enormous interest in it. In the other two applications, our technique detects lead stars faster than the state-of-the-art techniques, which makes our method practical and easy to use.
3.1. Lead Actor Detection in Motion Pictures
In motion pictures, detecting the hero, heroine, and villain has many interesting benefits. A viewer reviewing a movie can skip the scenes where the lead actors are not present, a profile of the lead actors can be generated, and significant scenes containing many lead actors can be used for summarizing the video.
In our method, the face dictionary that is formed contains the lead actors, and these dictionaries are good enough in most cases. For more accurate results, however, the algorithm scans through a few frames of every shot to determine the faces occurring in the shot; the actors who occur in a large number of shots are identified as the lead actors. The result for the movie Titanic after scanning through the entire movie is shown in Figure 8.
Figure 8. Lead actors detected for the movie Titanic.
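The refinement pass can be sketched as follows: sample faces from a few frames of every shot, match each one to its nearest dictionary entry, and rank the entries by the number of shots in which they appear. The nearest-neighbour matching in flattened pixel space is an assumption, as the paper does not specify the matching metric.

import numpy as np

def rank_actors_by_shots(shot_face_lists, dictionary):
    """shot_face_lists: one 2-D array of flattened face crops per shot.
    dictionary: 2-D array of representative faces (one row per dictionary entry).
    Returns dictionary indices sorted by the number of shots they appear in."""
    shots_per_actor = np.zeros(len(dictionary), dtype=int)
    for faces in shot_face_lists:
        if len(faces) == 0:
            continue
        # nearest dictionary entry for every face found in this shot
        dists = np.linalg.norm(faces[:, None, :].astype(float)
                               - dictionary[None, :, :].astype(float), axis=2)
        present = np.unique(np.argmin(dists, axis=1))
        shots_per_actor[present] += 1
    order = np.argsort(-shots_per_actor)       # lead actors first
    return order, shots_per_actor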
3.2. Player of the Match Identification
In sports, highlights and key frames [6] are the two main methods used for summarization. We summarize sports using the player of the match, capturing the star players.
Detecting and tracking players over the complete sports video does not yield the player of the match: a star player may play for a shorter time and score more, as opposed to players who attempt many times and do not. Analyzing the players when there is a score therefore leads to the identification of the star players. This is easily achieved by our technique, since detecting highlights picks out exciting scenes such as scores.
The lead sports stars detected from the soccer match Liverpool vs Havant & Waterlooville are presented in Figure 9. The key players of the match are detected.
Figure 9. Lead stars detected for the highlights of a soccer match, Liverpool vs Havant & Waterlooville. The first image is erroneously detected as a face; the other results represent players and the coach.
3.3. Host Guest Detection
In TV interviews and other TV programs, detecting the host and guest of the program is key information for video retrieval. Javed et al. [5] have proposed a method for this that removes the commercials and then exploits the structure of the program to detect the guest and host. Their algorithm uses the inherent structure of the interview, namely that the host appears for a shorter duration than the guest. However, this is not always the case, especially when the hosts are equally popular, as in TV shows like Koffee With Karan; in competition shows, the host is shown for a longer duration than the guests or judges.
Our algorithm detects hosts and guests as lead stars. To distinguish hosts from guests, we detect lead stars over multiple episodes and combine the results: intuitively, the lead stars that recur over multiple episodes are the hosts, and the other lead stars detected in specific episodes are the guests.
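A sketch of this episode-level combination is given below: lead stars whose faces recur across all episodes are labelled hosts, and the remainder guests. The feature-space distance threshold and the matching rule are illustrative assumptions.

import numpy as np

def split_hosts_guests(episode_dictionaries, match_threshold=2500.0):
    """episode_dictionaries: one 2-D array of face features per episode.
    Entries of the first episode's dictionary that have a close match in every
    other episode are labelled hosts; the remaining entries are guests."""
    hosts, guests = [], []
    reference = episode_dictionaries[0]
    for idx, face in enumerate(reference):
        recurs = all(
            np.min(np.linalg.norm(other.astype(float) - face.astype(float), axis=1))
            < match_threshold
            for other in episode_dictionaries[1:])
        (hosts if recurs else guests).append(idx)
    return hosts, guests   # indices into the first episode's dictionary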
4. Experiments
We have implemented our system in Matlab and tested it on an Intel Core Duo processor (1.6 GHz, 2 GB RAM). We conducted experiments on 7 popular motion pictures, 9 soccer match highlights, and two episodes of a TV show, summing up to a total of 19 hours 23 minutes of video. Our method detected the lead stars in all the videos, in an average of 14 minutes for one hour of video. The method [4] in the literature computes the lead star of a motion picture in 23 hours, whereas we compute it in an average of 30 minutes. We now provide more details.
4.1. Lead Actor Detection in Motion Picture
We ran our experiments on 7 box-office hit movies, listed in Table 1, summing up to about 16 hours of video.
Table 1. Time taken for computing lead actors for popular movies.

No.  Movie Name                              Duration   Computation Time for   Computation Time for
                                             (hh:mm)    Detecting Lead Stars   Refinement
                                                        (hh:mm)                (hh:mm)
1    The Matrix                              02:16      00:22                  00:49
2    The Matrix Reloaded                     02:13      00:38                  01:25
3    Matrix Revolutions                      02:04      00:45                  01:07
4    Eyes Wide Shut                          02:39      00:12                  00:29
5    Austin Powers in Goldmember             01:34      00:04                  01:01
6    The Sisterhood of the Traveling Pants   01:59      00:16                  00:47
7    Titanic                                 03:18      01:01                  01:42
     Total                                   16:03      03:18                  07:20
Figure 10. Lead actors identified for popular movies appear on the right.
The lead stars in all these movies are computed in 3 hours 18 minutes, so the average computation time per movie is around 30 minutes. From Table 1, we see that the best computation time is 4 minutes, for the movie Austin Powers in Goldmember, which is 1 hour 42 minutes in duration, and the worst is 45 minutes, for the movie Matrix Revolutions, of duration 2 hours 4 minutes. For movies like Eyes Wide Shut and Austin Powers in Goldmember the computation is faster, as there are fewer audio highlights, whereas action movies like the Matrix sequels and Titanic take more time, as there are many audio highlights. This causes the variation in computation time among the movies.
The lead actors detected are shown in Figure 10, with the topmost star highlighted in red and the next top star in blue. As the figure shows, the topmost stars are detected in most of the movies. Since the definition of "top" is subjective, it could be said that in some cases the top stars are not detected. Further, in some cases the program identifies the same actor multiple times; this could be due to disguise or to pose variation. The result is further refined for better accuracy, as mentioned in Section 3.1.
4.2. Player of the Match Identification
We have conducted experiments on nine soccer match highlights taken from the BBC, listed in Table 2. On average, our method takes about half the duration of the video. Note, however, that these timings are for sections that have already been manually edited by the BBC staff. If the video were a routine full soccer match,
we expect our running time to be a lower percentage of the
entire video.
Table 2. Time taken for computing key players from BBC MOTD highlights for the Premier League 2007-08.

No.  Soccer Match                          Duration   Computation
                                           (hh:mm)    (hh:mm)
1    Barnsley vs Chelsea                   00:02      00:01
2    Birmingham City vs Arsenal            00:12      00:04
3    Tottenham vs Arsenal                  00:21      00:07
4    Chelsea vs Arsenal                    00:14      00:05
5    Chelsea vs Middlesbrough              00:09      00:05
6    Liverpool vs Arsenal                  00:12      00:05
7    Liverpool vs Havant & Waterlooville   00:15      00:06
8    Liverpool vs Middlesbrough            00:09      00:05
9    Liverpool vs Newcastle United         00:18      00:04
     Total                                 01:52      00:43

The results of key player detection are presented in Figure 11; the key players of the match are identified for all the matches.
Figure 11. Key players detected from BBC Match of the Day match highlights for the Premier League 2007-08.

4.3. Host Guest Detection
We conducted our experiment on the TV show Koffee with Karan. Two different episodes of the show were combined and fed as input. Our method identified the host in 4 minutes for a video of duration 1 hour 29 minutes, and is faster than the method proposed by Javed et al. [5].

Table 3. Time taken for identifying the host in the TV show Koffee With Karan. Two episodes are combined into a single video and given as input.

TV show             Duration   Computation
                    (hh:mm)    (hh:mm)
Koffee With Karan   01:29      00:04

The result of our method for the TV show Koffee with Karan is presented in Figure 12. Our method has successfully identified the host.
Figure 12. Host detection for the TV show "Koffee with Karan". Two episodes of the show are combined and given as input. The first person in the detected list (sorted by weight) is the host.
5. Conclusion
Detecting lead stars has numerous applications, such as identifying the player of the match, detecting the lead actor and actress in motion pictures, and guest-host identification. Computation time has always been a bottleneck for such techniques. In this work, we have presented a faster method that solves this problem with comparable accuracy, which makes our algorithm usable in practice.
References
[1] A. Doulamis and N. Doulamis. Optimal content-based video decomposition for interactive video navigation. IEEE Transactions on Circuits and Systems for Video Technology, 14(6):757–775, June 2004.
[2] M. Everingham and A. Zisserman. Automated visual identification of characters in situation comedies. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), volume 4, pages 983–986, Washington, DC, USA, 2004. IEEE Computer Society.
[3] A. W. Fitzgibbon and A. Zisserman. On affine invariant clustering and automatic cast listing in movies.
In ECCV ’02: Proceedings of the 7th European Conference on Computer Vision-Part III, pages 304–320,
London, UK, 2002. Springer-Verlag.
[4] S. Foucher and L. Gagnon. Automatic detection and
clustering of actor faces based on spectral clustering
techniques. In Proceedings of the Fourth Canadian
Conference on Computer and Robot Vision, pages
113–122, 2007.
[5] O. Javed, Z. Rasheed, and M. Shah. A framework for segmentation of talk and game shows. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), volume 2, pages 532–537, 2001.
[6] Y. Takahashi, N. Nitta, and N. Babaguchi. Video summarization for large sports video archives. In IEEE International Conference on Multimedia and Expo (ICME 2005), pages 1170–1173, July 2005.
[7] P. Viola and M. Jones. Robust real-time face detection.
International Journal of Computer Vision, 57(2):137–
154, May 2004.
[8] A. Waibel, M. Bett, and M. Finke. Meeting browser:
Tracking and summarizing meetings. In Proceedings
DARPA Broadcast News Transcription and Understanding Workshop, pages 281–286, February 1998.
[9] D. Xu, X. Li, Z. Liu, and Y. Yuan. Anchorperson
extraction for picture in picture news video. Pattern
Recogn. Lett., 25(14):1587–1594, 2004.
[10] Z. Zhang, G. Potamianos, A. W. Senior, and T. S.
Huang. Joint face and head tracking inside multicamera smart rooms. Signal, Image and Video Processing, 1:163–178, 2007.
[11] Y. Zhuang, Y. Rui, T. Huang, and S. Mehrotra. Adaptive key frame extraction using unsupervised clustering. In Proceedings of the International Conference on
Image Processing, volume 1, pages 866–870, 1998.
Appendix: Algorithm

The algorithm for detecting lead stars is presented below. In motion pictures, it yields the lead actors; in sports, it fetches the key players; in TV shows, run over multiple episodes, it fetches the host as the topmost star.

Detection of lead actors

program LeadActorDetection(Video)
var
  Highlights: int[][];
  Faces, FacesInInterval: List<Image>;
  Interval, Shots: int[];
  RandomFrame, i: int;
  LeadActorClusterCenters: List<Image>;
  LeadActors: List<Image>;
begin
  { audio highlights give the intervals of important scenes }
  Highlights := GetAudioHighLights(Video);
  Faces := [];
  i := 0;
  repeat
    Interval := Highlights[i];
    Shots := ComputeShot(Video, Interval);
    { detect faces in a random frame and confirm them over the shot }
    RandomFrame := GetRandomNumber(Interval);
    FacesInInterval := ExtractFace(Video, RandomFrame, Shots);
    Faces.add(FacesInInterval);
    i := i + 1;
  until i >= length(Highlights);
  { cluster the confirmed faces and keep the highest-scoring clusters }
  LeadActorClusterCenters := Cluster(Faces);
  LeadActors := SelectTopClusters(LeadActorClusterCenters, Faces);
end.