Supplementary Material
This supplemental material contains (1) a more exhaustive benchmark of sketch-based image retrieval variants, (2) more details about the training parameters for the deep networks, (3) per-category retrieval evaluation for all 125 categories, (4) qualitative examples of image-based sketch retrieval, sketch-based sketch retrieval, and image-based image retrieval, and (5) numerous additional within-domain and across-domain retrieval results.

Figure 1: Benchmark plot with additional models and baselines.
GN - GoogLeNet [3]
AN - AlexNet [2]; specifically the Caffe variant, CaffeNet
SN GN - GoogLeNet fine-tuned for sketch classification
SN AN - AlexNet fine-tuned for sketch classification
GN Cat - GoogLeNet cross-domain classification network
AN Cat - AlexNet cross-domain classification network
GFHOG - Gradient field HOG, based on code from the authors [1]
GALIF - Gabor local line-based feature, based on the OpenSSE implementation by Zhang Dongdong, available on GitHub at https://github.com/zddhub/opensse

Table 1: Parameter details for training the Triplet/Siamese networks. Momentum was set to 0.9 for all experiments.

Model        Learning rate   Iterations   Step size¹   Batch size   Weight decay
GN Triplet   10e-5           160k         100k         16           2e-3
GN Siamese   10e-5           120k         100k         32           2e-3
AN Siamese   10e-5           80k          50k          128          5e-4

Model        Classification loss weight   Embedding weight²   Margin(s)³
GN Triplet   10                           1                   15
GN Siamese   0.1                          1                   2500, 2.5
AN Siamese   10e-3                        1                   2500, 2.5

¹ The learning rate drops by a factor of 0.1 after each step size.
² Shows the final value; the value was adjusted during training to avoid getting stuck in local optima.
³ For positive pairs and negative pairs, respectively.

Table 2: Comparison between different output dimensions.

Output dimension   Recall @ K=1
2048               35.98%
1024               36.09%
512                35.70%
256                35.49%
128                36.33%
64                 34.71%
32                 31.62%
2                  4.07%

Figure 2: Per-class Recall @ K=1 for all 125 categories in the Sketchy database. Note that chance is 1/1250 because there is only one correct photo for each sketch query in the test set.
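The recall numbers in Table 2 and Figure 2 count a sketch query as correct when its single matching photo is the closest photo in the shared embedding space (K = 1); with 1,250 test photos and one correct match per query, a random ranking yields the 1/1250 chance rate noted above. The snippet below is a minimal NumPy sketch of that measurement, using hypothetical array names rather than the authors' evaluation code.

import numpy as np

def recall_at_k(sketch_feats, photo_feats, correct_photo_idx, k=1):
    """Fraction of sketch queries whose matching photo is among the k
    nearest photos by Euclidean distance in the shared embedding space.

    sketch_feats: (num_sketches, d) embeddings from the sketch branch.
    photo_feats: (num_photos, d) embeddings from the photo branch.
    correct_photo_idx: (num_sketches,) index of the single correct photo
        for each sketch query (hypothetical ground-truth layout).
    """
    # Pairwise squared Euclidean distances between every sketch and photo.
    d2 = (np.sum(sketch_feats ** 2, axis=1, keepdims=True)
          - 2.0 * sketch_feats @ photo_feats.T
          + np.sum(photo_feats ** 2, axis=1))
    # Indices of the k closest photos for each sketch query.
    top_k = np.argsort(d2, axis=1)[:, :k]
    hits = np.any(top_k == correct_photo_idx[:, None], axis=1)
    return hits.mean()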
Figure 3: Comparison between networks trained for one epoch with MS COCO boundaries vs. the Sketchy database. Both networks were fine-tuned from an ImageNet-trained model with no additional pre-training. Note that this experiment covers only the 13 MS COCO categories that have enough training data and overlap with Sketchy database categories, so the accuracies are not comparable to other experiments.

Figure 4: Contribution of different pre-training schemes. We train each network with our Sketchy database for 160k iterations and then measure the recall at K = 1.

Figure 5: In our paper, we evaluate our learned cross-domain embedding with sketch-based image retrieval. But the embedding just as naturally supports (b) image-based sketch retrieval, (c) sketch-based sketch retrieval, and (d) image-based image retrieval. (a) shows the average of aligned sketches retrieved for various query photos. The photo query at the top of (b) did not work particularly well because "person" is not a category in the Sketchy database.

The remainder of this supplementary material contains additional cross-domain and within-domain retrieval results. All results use the "GN Triplet" network pair (or half of the network pair in the case of image-to-image and sketch-to-sketch retrieval). It may be that focusing on within-domain retrieval at learning time would improve the results. Or the opposite could be true: image-to-image retrieval may be improved by the presence of cross-domain sketches, because they train the image network to encode the salient object details. Our aim with these qualitative results is to show that our learned embedding space is reasonable, not that it is "state-of-the-art" for these retrieval tasks. A minimal sketch of this nearest-neighbor lookup appears after the references below.

Figure 6: Sketch-based image retrieval results
Figure 7: Sketch-based image retrieval results
Figure 8: Sketch-based image retrieval results
Figure 9: Sketch-based image retrieval results
Figure 10: Sketch-based image retrieval results
Figure 11: Sketch-based image retrieval results
Figure 12: Image-based image retrieval results
Figure 13: Image-based image retrieval results
Figure 14: Image-based image retrieval results
Figure 15: Image-based image retrieval results
Figure 16: Sketch-based sketch retrieval results
Figure 17: Sketch-based sketch retrieval results
Figure 18: Sketch-based sketch retrieval results
Figure 19: Image-based sketch retrieval results
Figure 20: Image-based sketch retrieval results

References

[1] R. Hu and J. Collomosse. A performance evaluation of gradient field HOG descriptor for sketch based image retrieval. Computer Vision and Image Understanding, 117(7):790–806, 2013.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In 26th Annual Conference on Neural Information Processing Systems (NIPS), pages 1106–1114, 2012.
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
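As a closing illustration of how the within-domain and cross-domain results in Figures 5 through 20 are produced: each query is embedded by the branch of the "GN Triplet" pair that matches its domain, and gallery items are ranked by Euclidean distance to the query in the shared embedding space. The sketch below assumes hypothetical embed_sketch and embed_photo functions standing in for the two trained branches; it is not the authors' Caffe pipeline.

import numpy as np

def retrieve(query_feat, gallery_feats, top_k=5):
    """Return indices of the top_k gallery items closest to the query embedding."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    return np.argsort(dists)[:top_k]

# Hypothetical embedding functions standing in for the sketch and photo
# branches of the trained "GN Triplet" network pair:
#   sketch_feats = embed_sketch(all_test_sketches)   # (N_s, d)
#   photo_feats  = embed_photo(all_test_photos)      # (N_p, d)
#
# Sketch-based image retrieval: sketch query against the photo gallery.
#   results = retrieve(embed_sketch(query_sketch), photo_feats)
# Image-based sketch retrieval: photo query against the sketch gallery.
#   results = retrieve(embed_photo(query_photo), sketch_feats)
# Within-domain retrieval uses a single branch for both query and gallery, e.g.
#   results = retrieve(embed_photo(query_photo), photo_feats)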