Discovering Objects in Videos
Karim Sayed Ahmed
[email protected]
Machine learning project - Final Report

1 Problem

The goal of this project is to identify and localize objects in videos. Given a video V consisting of N frames f_1, ..., f_N, where each frame contains one or more unknown objects, the objective is to identify and localize the primary objects in each frame f_i of V that exhibit coherence in both appearance and motion. Figure 1 illustrates the problem. Two sub-problems must be solved. The first is localizing the primary objects in each frame (a still image). The second is finding the correlation between the detected objects across all frames of the video.

Figure 1: The problem of discovering objects in videos is divided into two sub-problems. The first sub-problem is localization of primary objects in each frame. The second sub-problem is finding the correlation between the detected objects in all of the video's frames.

2 Related Work

The problem of discovering objects in videos is closely related to video segmentation. The following are some state-of-the-art works in this area.
• Representing video by a multi-label Markov Random Field model, and accomplishing segmentation by finding the minimum energy label [7].
• Fragments-based tracking of non-rigid objects using level sets [2].
• Using a layered Directed Acyclic Graph (DAG) based framework for detection and segmentation of the primary object in video [10].

3 Approach Overview

To identify and localize coherent objects in videos, our proposed approach is divided into two main phases, where each phase solves one of the sub-problems illustrated in Figure 1. The key contribution of this work is using deep learning and transductive learning to discover video objectness. The following subsections give an overview of the proposed two-phase approach.
3.1 Phase I: Object localization per frame

The objective of this phase is to find the primary object in each frame of the video; this primary object is represented by a bounding box. To accomplish this task, I used STL, the self-taught object localization approach proposed in [1], which relies on deep learning. STL is an algorithm for still images that leverages deep convolutional neural networks trained for whole-image recognition to localize objects without additional human supervision, i.e., without using any ground-truth bounding boxes for training [1]. It can generate many bounding boxes containing object proposals. After extracting the object proposals in each frame, the next step is to re-rank the top four proposals using a linear SVM. In our experiments this step proved essential: it improves the quality of the detected video objects, as shown later. Figure 2 gives an overview of the steps in Phase I.

Figure 2: Overview of Phase I. The top four object proposals are generated using deep learning (STL [1]); the best object proposal is then selected using an SVM.

3.2 Phase II: Classification of the primary objects in video

After identifying and localizing the primary object in each frame (the output of Phase I), the next step is to classify the primary objects across all frames of the video using a Transductive Support Vector Machine (TSVM). This learning problem can be formulated as detecting similarity between objects while discarding noise objects. There are two main issues with learning the similarity between frames of the same video: 1) there are few labeled samples in the training data, especially positive samples; 2) the testing data are part of the training data. Using a TSVM [4] avoids these problems.
The advantage of a TSVM is that it trains on both labeled and unlabeled data, and the unlabeled data are part of the test data. In other words, TSVM extends SVM to partially labeled data in semi-supervised learning by following the principles of transduction. In addition to the training set D, the learner is also given a set of test examples to be classified:

$$D^\star = \{\, x_i^\star \mid x_i^\star \in \mathbb{R}^p \,\}_{i=1}^{k}$$

The transductive support vector machine is defined by the following primal optimization problem [4]:

$$\min_{w,\, b,\, y^\star} \ \frac{1}{2}\|w\|^2$$

subject to, for any i = 1, ..., n and any j = 1, ..., k,

$$y_i (w \cdot x_i - b) \ge 1, \qquad y_j^\star (w \cdot x_j^\star - b) \ge 1, \qquad y_j^\star \in \{-1, 1\}$$

4 Dataset and extraction of image features

For testing and evaluation, we used the SegTrack dataset [7, 8]. It consists of six videos with pixel-level segmentation ground truth for each video. The standard evaluation measure for this dataset is the average per-frame pixel error. The dataset is widely used by other approaches, including some recently proposed methods [5, 10, 7, 8], so our results can be compared directly with theirs. For extracting image features, we used Histograms of Oriented Gradients (HOG) [3].

Figure 3: Videos used by the SegTrack [7] dataset.

5 Learning and Inference

In this section, we discuss the methods and techniques used for learning and inference in the proposed two-phase approach.

5.1 Learning and Inference for Phase I

Before going into the details of the learning and inference process, we first show why an extra learning step is needed after executing STL. As mentioned earlier, in this part I applied the STL approach [1] to the SegTrack dataset. The testing code for STL has been made available by the authors of [1]. However, selecting the top-1 bounding box as ranked by STL is not a good option, as shown later in the experimental results section.
For now, Figure 4 visualizes the result on some SegTrack video frames when we select the bounding box (object proposal) with the highest confidence score, on the assumption that it is the one most likely to contain the primary object.

Figure 4: Results of applying STL on SegTrack video frames. The red bounding box in each frame is the selected Top-1 score object proposal. Average error rate = 64%.

It is clear from Figure 4 that the top-1 bounding box often fails to capture the primary object in the scene. My experiments on the SegTrack dataset show that the average error rate when selecting the STL top-1 bounding box is 64%. It is crucial for this phase to select the best possible bounding boxes, because the overall performance depends heavily on these results. To tackle this problem, we investigated the different object proposals generated by STL. Figure 5 shows the top four bounding boxes generated by STL.

Figure 5: Top four bounding boxes from STL. The red box is Top-1; the blue boxes are Top-2, Top-3, and Top-4. (a) Although the red bounding box (Top-1) has the highest STL confidence score, Top-4 and Top-2 are much better. (b) Although the red bounding box (Top-1) has the highest STL confidence score, Top-4 and Top-3 are much better.

In Figure 5, STL gives good semantic results overall; however, it evidently tends to give higher scores to bigger bounding boxes. This is a serious problem that will affect the final results of the project. To overcome it, we run an extra learning process on the top-ranked bounding boxes generated by STL, as described in the following subsections. In this phase, we used a linear SVM classifier to re-rank the STL object proposals for each frame. We propose two different methods for applying the SVM.
The first method learns the SVM classifier on the SegTrack ground truth; the second learns it on the top-1 STL bounding boxes. We discuss these two methods in detail below.

5.1.1 Method(1): Learning with linear SVM on SegTrack ground-truth

To refine the STL ranking scores, we applied a linear SVM to re-rank the top four object proposals. Two issues arise in applying this solution. First, SVM is a two-class discriminative learning algorithm. Second, ground-truth data must be available. The following is my proposed and implemented solution to these problems.

First, applying an SVM to re-rank the object proposals in every frame can be seen as detecting the object proposal with the highest objectness value. In other words, it aims at discriminating objects from non-objects in every frame. So class (+1) is assigned to the ground-truth object bounding box; that is, all objects are treated alike and no distinction is made between different objects. Class (-1) is assigned to everything else (any non-object bounding box). The negative examples are extracted from the background of each frame.

Second, we need a ground-truth dataset that provides the perfect tight bounding box containing only the primary object in the frame. Fortunately, the SegTrack dataset includes segmented ground-truth images for each frame of its videos. Figure 6 shows the process of extracting perfect objects (ground truth) and applying the SVM on one frame. At training time, the perfect bounding box is considered class (+1) and parts of the frame background are considered negative examples, class (-1).

5.1.2 Method(2): Learning with linear SVM on top-1 STL

This method is an alternative to Method(1). Here we also used a linear SVM to re-rank the top four STL object proposals; however, we trained the SVM on the top-1 STL bounding boxes extracted from all frames of all videos.
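The re-ranking step shared by Method(1) and Method(2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the two-dimensional toy vectors stand in for HOG descriptors, the SVM is trained by plain subgradient descent on the regularized hinge loss, and the names `train_linear_svm` and `rerank_proposals` as well as the values of `lam`, `lr`, and `epochs` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=500):
    """Fit (w, b) by subgradient descent on
    lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b)))."""
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(samples, labels):
            if y * (dot(w, x) + b) < 1.0:  # sample violates the margin
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def rerank_proposals(w, b, proposal_features):
    """Return proposal indices sorted from highest to lowest SVM score."""
    scores = [dot(w, x) + b for x in proposal_features]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy training set (assumption): +1 = object boxes, -1 = background patches.
train_x = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0],
           [0.1, 0.9], [0.2, 0.8], [0.0, 1.0]]
train_y = [+1, +1, +1, -1, -1, -1]
w, b = train_linear_svm(train_x, train_y)

# Features of the top four STL proposals of one frame, in STL rank order.
top4 = [[0.5, 0.5], [0.3, 0.7], [0.85, 0.15], [0.6, 0.4]]
order = rerank_proposals(w, b, top4)
best = order[0]  # index of the proposal kept as the frame's primary object
```

The only difference between the two methods in such a sketch would be where the positive training vectors come from: ground-truth boxes for Method(1), top-1 STL boxes for Method(2).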
Figure 7 shows the process of selecting the top-1 STL bounding box as ground truth and then applying a linear SVM on all boxes. At training time, the top-1 STL bounding box is considered class (+1) and parts of the frame's background are considered negative examples, class (-1).

Figure 6: Method(1) training. Perfect objects are extracted from the SegTrack dataset (ground truth); we then train a linear SVM classifier on these objects in all frames of all videos. At training time, the perfect bounding box is a positive example (class (+1)), and background parts are negative examples (class (-1)).

Figure 7: Method(2) training. The top-1 STL bounding boxes, extracted from the SegTrack dataset, are treated as the ground-truth objects. We train a linear SVM classifier on these top-1 objects in all frames of all videos. At training time, the top-1 STL bounding box is a positive example (class (+1)), and background parts are negative examples (class (-1)).

5.1.3 Testing for Phase I

After learning a linear SVM classifier with either Method(1) or Method(2), the next step is to test the trained model on the different STL bounding boxes in each frame of each video. Figure 8 shows testing the SVM model (Method(1) or Method(2)) on the top four STL bounding boxes. The highest-ranked bounding box is selected as the primary object in the frame.

Figure 8: Testing for Phase I. The SVM is tested on the top four STL bounding boxes, and the box with the highest score is selected.

5.2 Learning and Inference for Phase II

After identifying and localizing the primary object in each frame (the output of Phase I), the next step is to classify the primary objects across all frames of the video using a Transductive Support Vector Machine (TSVM). This learning problem is treated as detecting similarity between objects. There are two main issues with learning the similarity between frames of the same video:
1) Few labeled samples in the training data, especially positive samples.
2) The testing data are part of the training data.
The advantage of TSVM is that it trains on both labeled and unlabeled data, and the unlabeled data are part of the test data. In other words, TSVM extends SVM to partially labeled data in semi-supervised learning. Figure 9 shows the learning and inference process using TSVM. The primary object in the first frame of the video is considered the only positive example (shown as a red rectangle, class (+1)); the negative examples are extracted from the background parts of the video's frames; and the primary objects in the other frames (shown as yellow rectangles) are considered unlabeled data and assigned the value zero (unlabeled).

Figure 9: Learning and inference using a Transductive Support Vector Machine (TSVM) for Phase II. The primary object in the first frame of the video is the only positive example (red rectangle, class (+1)); the negative examples (class (-1)) are extracted from the background parts; and the primary objects in the other frames (yellow rectangles) are the unlabeled data (value 0).

6 Experimental Results

In this section, we discuss the overall performance of the proposed two-phase approach, including the two proposed methods, "Method(1)" and "Method(2)" (with and without TSVM), as well as experimental results for the other possible methods. Before the details in the next subsections, we first provide a legend for the names of the proposed and implemented methods:
• "Method(1) with TSVM": This method uses a linear SVM classifier to re-rank the top four STL bounding boxes in Phase I. The ground-truth training data are the segmented objects included in the SegTrack dataset. Then, in Phase II, a TSVM classifier is applied to find similar objects in each video.
• "Method(2) with TSVM": This method uses a linear SVM classifier to re-rank the top four STL bounding boxes in Phase I.
The ground-truth training data are the top-1 STL bounding boxes according to the STL ranking. Then, in Phase II, a TSVM classifier is applied to find similar objects in each video.
• "Method(1) without TSVM": This method uses only a linear SVM classifier to re-rank the top four STL bounding boxes. The ground-truth training data are the segmented objects included in the SegTrack dataset. There is no Phase II.
• "Method(2) without TSVM": This method uses only a linear SVM classifier to re-rank the top four STL bounding boxes. The ground-truth training data are the top-1 STL bounding boxes according to the STL ranking. There is no Phase II.
• "Method(3)": This method uses the top-1 ranked bounding box generated by STL directly, without any extra learning process. We use it to assess the effect of the linear SVM classifier trained on top-1 STL bounding boxes, as proposed in "Method(2)".

6.1 Performance of SVM classifier

In this part, we show results for evaluating the SVM (Method(1) in Phase I) used to re-rank the STL bounding boxes. Figure 10 shows that applying the SVM to re-rank the top four STL bounding boxes and then selecting the bounding box with the highest score gives much better results on most videos than using the Top-1 STL bounding box only. The average error rate is 64% for Top-1 STL and 42% for Top-SVM. Note that there is no common bounding-box ground truth used by researchers for this case, so the true labels were generated by manually annotating the bounding boxes. Although this may be inaccurate, and many researchers may be skeptical of this approach, we report these results only to give an approximate view of the effect of using the SVM to re-rank the STL bounding boxes. For accurate results, see subsection 6.3, "Overall performance and quantitative comparison".

Figure 10: Testing error rates using the STL Top-1 bounding box versus the proposed SVM re-ranking method applied to the top four STL bounding boxes (Method(1)).
Error rates are reported for each video.

6.2 Performance of TSVM classifier

In this part, we show evaluation results for the TSVM (used in Phase II). Figures 11-16 show the Precision-Recall (PR) curves for each video, for both TSVM applied on Method(1) and TSVM applied on Method(2). A PR curve reflects the relative proportions of positive and negative samples directly. The PR curves are shown side by side so that the two methods can be compared easily. At the top of each figure we show the Area Under the Curve (AUC), which summarizes the overall quality of the ranking in terms of precision and recall. The AUC is obtained by trapezoidal interpolation of the precision. An alternative and usually almost equivalent metric is the Average Precision (AP), also shown at the top of the figures. This is the average of the precision obtained every time a new positive sample is recalled. It is the same as the AUC if the precision is interpolated by constant segments. Additionally, we show the 11-point interpolated average precision (AP11) at the top of each figure. This is obtained by taking the average of eleven precision values, taken at equally spaced recall levels [9].

Note that comparing Method(1) and Method(2) using the TSVM evaluation alone is not meaningful. For example, if Phase I generates low-quality bounding boxes, applying TSVM to them will not improve the final results, regardless of how good the TSVM classifier is. If Phase I generates good (high-quality) bounding boxes, we then also need a good TSVM classifier to obtain good final results. In summary, a good TSVM classifier is essential but does not guarantee good final results; these depend on the combined results of Phase I and Phase II. In this part, we report the TSVM performance results only to show the quality of the classifier.
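The three summary numbers printed above each PR curve can be sketched as follows. This is an illustrative reading of the definitions in the text, not the evaluation code used for the figures: the function names are hypothetical, the trapezoidal AUC assumes the curve starts at recall 0 with precision 1, and labels are assumed to be +1 / -1 with at least one positive.

```python
def precision_recall(scores, labels):
    """Precision and recall after each item of the ranking (best score first).
    Also returns the labels in ranked order."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    ranked_labels = [y for _, y in ranked]
    num_pos = sum(1 for y in ranked_labels if y > 0)
    if num_pos == 0:
        raise ValueError("at least one positive sample is required")
    precision, recall, tp = [], [], 0
    for k, y in enumerate(ranked_labels, start=1):
        if y > 0:
            tp += 1
        precision.append(tp / k)
        recall.append(tp / num_pos)
    return precision, recall, ranked_labels


def average_precision(scores, labels):
    """AP: mean precision over the ranks at which a new positive is recalled."""
    precision, _, ranked_labels = precision_recall(scores, labels)
    hits = [p for p, y in zip(precision, ranked_labels) if y > 0]
    return sum(hits) / len(hits)


def eleven_point_ap(scores, labels):
    """AP11: mean over recall levels 0, 0.1, ..., 1 of the best precision
    achieved at any recall greater than or equal to that level."""
    precision, recall, _ = precision_recall(scores, labels)
    total = 0.0
    for i in range(11):
        level = i / 10.0
        reached = [p for p, r in zip(precision, recall) if r >= level]
        total += max(reached) if reached else 0.0
    return total / 11.0


def pr_auc(scores, labels):
    """AUC: trapezoidal integration of precision over recall."""
    precision, recall, _ = precision_recall(scores, labels)
    ps = [1.0] + precision  # assumed starting point (recall 0, precision 1)
    rs = [0.0] + recall
    return sum((rs[i] - rs[i - 1]) * (ps[i] + ps[i - 1]) / 2.0
               for i in range(1, len(rs)))


# Toy ranking of five detections (assumption): +1 = same object, -1 = noise.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [+1, -1, +1, +1, -1]
ap = average_precision(toy_scores, toy_labels)
ap11 = eleven_point_ap(toy_scores, toy_labels)
auc = pr_auc(toy_scores, toy_labels)
```

On a short ranking like this toy one the three values differ noticeably; they tend to agree more closely as the number of ranked samples grows.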
The overall results reported later in subsection 6.3 should instead be used to build a reasonable comparison between the two methods. Additionally, the testing labels used in this evaluation are generated automatically: a label is considered true if at least 40% of the ground-truth object overlaps with the detected (test) object. Although this may not be accurate, it provides an approximate measure of the performance of the TSVM classifier.

In the case of video 'Girl': Figure 11 shows that TSVM applied on Method(2) performs slightly better in general than TSVM applied on Method(1). The reason is that the bounding boxes (primary objects) selected by Method(2) are very similar to each other in both appearance and size. Although this is a good point for Method(2), it does not guarantee better overall performance in terms of the average pixel error metric used with the SegTrack dataset, because this is an evaluation of the TSVM only, regardless of the quality of the primary objects selected in Phase I.

[Figure 11 plots: two PR panels, x-axis recall, y-axis precision, each with curves "PR" and "PR rand."; panel titles "PR (AUC: 53.95%, AP: 54.11%, AP11: 58.07%)" and "PR (AUC: 40.48%, AP: 41.47%, AP11: 43.75%)".]

Figure 11: Precision-Recall (PR) curves of video 'Girl'. On the left, PR curve of TSVM testing results applied on Method(1). On the right, PR curve of TSVM testing results applied on Method(2).

In the case of video 'Birdfall2': Figure 12 shows that TSVM applied on Method(2) also performs better in general than TSVM applied on Method(1). In this case, Method(2) is significantly more successful than Method(1).
A closer look at the final output shows that most of the bounding boxes generated by Method(2) are very large and fill almost the whole image. This effectively turns the TSVM classification task into classification between whole images, which are all similar because they come from the same input video, resulting in good performance as shown in the PR curve.

[Figure 12 plots: two PR panels, x-axis recall, y-axis precision, each with curves "PR" and "PR rand."; panel titles "PR (AUC: 86.93%, AP: 87.21%, AP11: 86.78%)" and "PR (AUC: 52.07%, AP: 53.80%, AP11: 54.54%)".]

Figure 12: Precision-Recall (PR) curves of video 'Birdfall2'. On the left, PR curve of TSVM testing results applied on Method(1). On the right, PR curve of TSVM testing results applied on Method(2).

In the case of video 'Cheetah': Figure 13 shows that TSVM applied on Method(1) performs better in general than TSVM applied on Method(2). Unlike in the other videos, the primary objects discovered in Phase I in this video are small, contained in tight bounding boxes, and moving through the scene.

[Figure 13 plots: two PR panels, x-axis recall, y-axis precision, each with curves "PR" and "PR rand."; panel titles "PR (AUC: 53.89%, AP: 54.87%, AP11: 62.81%)" and "PR (AUC: 60.70%, AP: 61.45%, AP11: 64.55%)".]

Figure 13: Precision-Recall (PR) curves of video 'Cheetah'. On the left, PR curve of TSVM testing results applied on Method(1). On the right, PR curve of TSVM testing results applied on Method(2).

In the case of video 'Monkeydog': Figure 14 shows that TSVM applied on Method(2) performs better in general than TSVM applied on Method(1).
[Figure 14 plots: two PR panels, x-axis recall, y-axis precision, each with curves "PR" and "PR rand."; panel titles "PR (AUC: 59.28%, AP: 59.76%, AP11: 62.14%)" and "PR (AUC: 50.13%, AP: 50.87%, AP11: 53.32%)".]

Figure 14: Precision-Recall (PR) curves of video 'Monkeydog'. On the left, PR curve of TSVM testing results applied on Method(1). On the right, PR curve of TSVM testing results applied on Method(2).

In the case of video 'Parachute': Figure 15 shows that TSVM applied on Method(1) performs slightly better in general than TSVM applied on Method(2).

[Figure 15 plots: two PR panels, x-axis recall, y-axis precision, each with curves "PR" and "PR rand."; panel titles "PR (AUC: 59.36%, AP: 59.57%, AP11: 61.80%)" and "PR (AUC: 59.99%, AP: 60.45%, AP11: 60.34%)".]

Figure 15: Precision-Recall (PR) curves of video 'Parachute'. On the left, PR curve of TSVM testing results applied on Method(1). On the right, PR curve of TSVM testing results applied on Method(2).

In the case of video 'Penguin': Figure 16 shows that TSVM applied on Method(2) performs better in general than TSVM applied on Method(1).

[Figure 16 plots: two PR panels, x-axis recall, y-axis precision, each with curves "PR" and "PR rand."; panel titles "PR (AUC: 59.36%, AP: 59.57%, AP11: 61.80%)" and "PR (AUC: 42.60%, AP: 43.39%, AP11: 48.62%)".]

Figure 16: Precision-Recall (PR) curves of video 'Penguin'. On the left, PR curve of TSVM testing results applied on Method(1). On the right, PR curve of TSVM testing results applied on Method(2).

6.3 Overall performance and quantitative comparison

In the SegTrack dataset [7] used in this work, the overall performance is measured using the average per-frame pixel error rate compared to the ground truth.
Lower error values are better. To calculate the average per-frame pixel error, we use the following equation:

$$\text{Error} = \frac{\mathrm{XOR}(f, GT)}{F}$$

where f is the segmentation labeling produced by the method, GT is the ground-truth labeling of the video, and F is the number of frames in the video. Table 1 compares the average number of error pixels per frame of "Method(1) with TSVM", "Method(2) with TSVM", and other approaches found in the literature. In this comparison, we used the full implementation of our proposed methods, including the details of both Phase I and Phase II described earlier. Best values are shown in green. Blue values mark results of our approaches that are close to the best values; some of these are also much better than other approaches (but not the best).

Video       Method(1) with TSVM   Method(2) with TSVM   Tsai et al. [7]   Chockalingam et al. [2]   Zhang et al. [10]
Girl        2113                  3452                  1304              1755                      1488
Birdfall2   1259                  4204                  252               454                       155
Cheetah     1013                  2442                  1142              1217                      633
Monkeydog   1094                  2254                  563               683                       365
Parachute   1457                  7016                  235               502                       220
Penguin     1890                  3095                  1705              6627                      1895

Table 1: Average number of error pixels per frame for different approaches. Lower values are better.

Comparison between "Method(1) with TSVM" and "Method(1) without TSVM"

In this part, we compare the performance of Method(1) when using TSVM (Phase I + Phase II, as proposed in the earlier sections) against Method(1) without TSVM (using the SVM of Phase I only). In other words, we examine the benefit of the transductive learning proposed in Phase II. To do so, we used the same average number of error pixels per frame; in the case without TSVM it is calculated only on the results of the SVM classifier applied to the top four STL object proposals, as described earlier. Table 2 compares the average number of error pixels per frame of "Method(1) with TSVM" and "Method(1) without TSVM". The results show that "Method(1) with TSVM" outperforms "Method(1) without TSVM" in five out of six cases (videos: Birdfall2, Cheetah, Monkeydog, Parachute, and Penguin).

Video       Method(1) with TSVM   Method(1) without TSVM
Girl        2113                  1600
Birdfall2   1259                  2504
Cheetah     1013                  1136
Monkeydog   1094                  1427
Parachute   1457                  1792
Penguin     1890                  2495

Table 2: Average number of error pixels per frame of "Method(1) with TSVM" against "Method(1) without TSVM". Lower values are better.

Comparison between "Method(2) with TSVM" and "Method(2) without TSVM"

In this part, we compare the performance of Method(2) when using TSVM (Phase I + Phase II, as proposed in the earlier sections) against Method(2) without TSVM (using the SVM of Phase I only). Here too we used the average number of error pixels per frame; in the case without TSVM it is calculated only on the results of the SVM classifier applied to the top four STL object proposals, with the top-1 bounding box taken as ground truth as described earlier. Table 3 compares the average number of error pixels per frame of "Method(2) with TSVM" and "Method(2) without TSVM". The results show that "Method(2) with TSVM" outperforms "Method(2) without TSVM" in all six cases (videos: Girl, Birdfall2, Cheetah, Monkeydog, Parachute, and Penguin).

Video       Method(2) with TSVM   Method(2) without TSVM
Girl        3452                  4849
Birdfall2   4204                  7084
Cheetah     2442                  5328
Monkeydog   2254                  2937
Parachute   7016                  8532
Penguin     3095                  4017

Table 3: Average number of error pixels per frame of "Method(2) with TSVM" against "Method(2) without TSVM". Lower values are better.
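The average per-frame pixel error reported in the tables above can be sketched as follows. This is a minimal illustration, not the evaluation code used for the reported numbers: masks are assumed to be binary nested lists (one per frame), and `box_to_mask` is a hypothetical helper that assumes a predicted bounding box is scored as a filled rectangle, which the report does not state explicitly.

```python
def box_to_mask(height, width, box):
    """Rasterize a box (x0, y0, x1, y1), end-exclusive, into a binary mask.
    Assumption: a predicted box is scored as a filled rectangle."""
    x0, y0, x1, y1 = box
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0
             for x in range(width)]
            for y in range(height)]


def average_pixel_error(pred_masks, gt_masks):
    """Error = XOR(f, GT) / F: number of pixels whose label differs from the
    ground truth, summed over all frames and divided by the frame count F."""
    if len(pred_masks) != len(gt_masks) or not gt_masks:
        raise ValueError("need the same, non-zero number of frames")
    mismatches = 0
    for pred, gt in zip(pred_masks, gt_masks):
        for pred_row, gt_row in zip(pred, gt):
            mismatches += sum(1 for p, g in zip(pred_row, gt_row)
                              if bool(p) != bool(g))
    return mismatches / len(gt_masks)


# Two toy 2x3 frames (assumption): predictions versus ground truth.
pred = [box_to_mask(2, 3, (0, 0, 2, 1)),   # covers row 0, columns 0-1
        [[0, 0, 0], [0, 1, 1]]]
gt = [[[1, 0, 0], [0, 0, 0]],
      [[0, 0, 0], [1, 1, 0]]]
error = average_pixel_error(pred, gt)
```

Because the measure counts raw pixels, an oversized bounding box is penalized for every background pixel it covers, which is consistent with the larger errors reported for the methods that tend to select big boxes.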
Comparison between "Method(2) without TSVM" and "Method(3) without TSVM"

In this part, we explore the benefit of a linear SVM learned on the Top-1 STL bounding boxes versus Method(3), which involves no learning and directly takes the top-1 STL bounding box as the primary object in the current frame. We compare the performance of Method(2) without TSVM (Phase I, as proposed in the earlier sections) against Method(3). In other words, we examine the benefit of the linear SVM learned on top-1 STL bounding boxes as proposed in Phase I. We again used the average number of error pixels per frame, calculated for Method(2) only on the results of the SVM classifier applied to the top four STL object proposals, as described earlier. Table 4 compares the average number of error pixels per frame of "Method(2) without TSVM" and "Method(3) without TSVM". The results show that "Method(2) without TSVM" outperforms "Method(3) without TSVM" in four out of six cases (videos: Birdfall2, Cheetah, Monkeydog, and Penguin).

Video       Method(2) without TSVM   Method(3) without TSVM
Girl        4849                     4706
Birdfall2   7084                     7982
Cheetah     5328                     6809
Monkeydog   2937                     4040
Parachute   8532                     6919
Penguin     4017                     6311

Table 4: Average number of error pixels per frame of "Method(2) without TSVM" against "Method(3) without TSVM". Lower values are better.

6.4 Qualitative Comparison

This part provides some visualizations and analysis of the final output of both "Method(1) with TSVM" and "Method(2) with TSVM". Figure 17 shows selected frames for each method after applying the SVM in Phase I and the TSVM in Phase II. Despite the improvement achieved by "Method(2)" over "Method(3)" (using the top-1 STL bounding box without learning an SVM), the generated output still shows that "Method(2)" tends to generate bigger bounding boxes than "Method(1)".
This is reasonable because "Method(1)" was trained on a much better ground truth than "Method(2)", which uses the top-1 STL bounding box as a reference; that bounding box may be inaccurate in many cases. As a result of selecting bigger bounding boxes, the average pixel error per frame of "Method(2)" on the dataset's videos is greater than that of "Method(1)", as shown earlier in Table 1.

Figure 17: Selected output frames (frame t and frame t') with the final discovered objects for "Method(1) with TSVM" (on the left) and "Method(2) with TSVM" (on the right).

7 Conclusions

From the experimental results shown in the previous section, it is clear that "Method(1) with TSVM" is the best of the approaches proposed in this work. Its performance is also close to that of some state-of-the-art methods. Additionally, we showed that using TSVM as proposed in Phase II enhances the performance and gives better results than using the SVM of Phase I only. We also showed that the learning process conducted using the SVM in Phase I (whether in Method(1) or Method(2)) was essential, and that it improved the performance considerably compared with using only the top-1 STL bounding box as the primary object.

References

[1] Alessandro Bergamo, Loris Bazzani, Dragomir Anguelov, and Lorenzo Torresani. Self-taught object localization with deep networks. arXiv preprint arXiv:1409.3964, 2014.
[2] Prakash Chockalingam, Nalin Pradeep, and Stan Birchfield. Adaptive fragments-based tracking of non-rigid objects using level sets. In IEEE International Conference on Computer Vision (ICCV), pages 1530-1537. IEEE, 2009.
[3] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 886-893. IEEE, 2005.
[4] Thorsten Joachims. Transductive inference for text classification using support vector machines. In ICML, volume 99, pages 200-209, 1999.
[5] Yong Jae Lee, Jaechul Kim, and Kristen Grauman. Key-segments for video object segmentation. In IEEE International Conference on Computer Vision (ICCV), pages 1995-2002. IEEE, 2011.
[6] David Tsai, Matthew Flagg, and James M. Rehg. Motion coherent tracking with multi-label MRF optimization. BMVC, 2010. http://cpl.cc.gatech.edu/projects/SegTrack/.
[7] David Tsai, Matthew Flagg, Atsushi Nakazawa, and James M. Rehg. Motion coherent tracking using multi-label MRF optimization. International Journal of Computer Vision, 100(2):190-202, 2012.
[8] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/.
[9] Dong Zhang, Omar Javed, and Mubarak Shah. Video object segmentation through spatially accurate and temporally dense extraction of primary object regions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 628-635. IEEE, 2013.