Supplementary Material
This supplementary material contains (1) a more exhaustive benchmark of sketch-based image retrieval variants, (2) more details about the training parameters for the deep networks, (3) per-category retrieval evaluation for all 125 categories, (4) qualitative examples of image-based sketch
retrieval, sketch-based sketch retrieval, and image-based image retrieval, and (5) numerous additional within-domain and across-domain retrieval results.
Figure 1: Benchmark plot with additional models and baselines
GN - GoogLeNet [3]
AN - AlexNet [2]; specifically, the Caffe variant, CaffeNet
SN GN - GoogLeNet fine-tuned for sketch classification
SN AN - AlexNet fine-tuned for sketch classification
GN Cat - GoogLeNet cross-domain classification network
AN Cat - AlexNet cross-domain classification network
GFHOG - Gradient field HOG, based on code from the authors [1]
GALIF - Gabor local line based feature, based on the OpenSSE implementation by Zhang Dongdong,
available on GitHub at https://github.com/zddhub/opensse
Table 1: Parameter details for training the Triplet/Siamese networks

Model        Learning rate   Iterations   Step size¹   Batch size   Weight decay
GN Triplet   10e-5           160k         100k         16           2e-3
GN Siamese   10e-5           120k         100k         32           2e-3
AN Siamese   10e-5           80k          50k          128          5e-4

Model        Classification loss weight   Embedding weight   Margin(s)
GN Triplet   10²                          1                  15
GN Siamese   2500, 2.5³                   1                  0.1
AN Siamese   2500, 2.5³                   1                  10e-3

¹ The learning rate drops by a factor of 0.1 at each step size.
² Shows the final value; the value was adjusted during the training phase to avoid getting stuck in local optima.
³ For positive pairs and negative pairs, respectively.
Momentum was set to 0.9 for all experiments.
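To make the roles of these parameters concrete, the following is a minimal sketch (not our Caffe implementation) of the combined objective implied by Table 1 for the "GN Triplet" model: a triplet embedding loss weighted by the embedding weight, plus an auxiliary category classification loss weighted by the classification loss weight, using the margin from the table. The helper names embed_sketch, embed_photo, and classify are hypothetical placeholders.

import torch
import torch.nn.functional as F

# Values below follow the "GN Triplet" row of Table 1.
CLASSIFICATION_WEIGHT = 10.0   # final value; adjusted during training
EMBEDDING_WEIGHT = 1.0
MARGIN = 15.0

def gn_triplet_objective(sketch, photo_pos, photo_neg, labels,
                         embed_sketch, embed_photo, classify):
    # Anchor is a sketch; positive/negative are photos of the same/different object.
    a = embed_sketch(sketch)
    p = embed_photo(photo_pos)
    n = embed_photo(photo_neg)
    # Triplet loss pulls the matching photo closer to the sketch than the
    # non-matching photo by at least MARGIN.
    triplet = F.triplet_margin_loss(a, p, n, margin=MARGIN)
    # Auxiliary classification loss on the sketch embedding.
    cls = F.cross_entropy(classify(a), labels)
    return EMBEDDING_WEIGHT * triplet + CLASSIFICATION_WEIGHT * cls

The learning rate schedule in the first half of the table is a standard step decay: training starts at 10e-5 and the rate drops by a factor of 0.1 every "step size" iterations.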
References
[1] R. Hu and J. Collomosse. A performance evaluation of gradient field HOG descriptor for sketch based image retrieval. Computer Vision and Image Understanding, 117(7):790–806, 2013.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In 26th Annual Conference on Neural Information Processing Systems (NIPS), pages 1106–1114, 2012.
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
Table 2: Comparison between different output dimensions

Output dimension   Recall @ K=1
2048               35.98%
1024               36.09%
512                35.7%
256                35.49%
128                36.33%
64                 34.71%
32                 31.62%
2                  4.07%
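As a point of reference for how these numbers are computed, here is a minimal sketch, under assumed variable names, of the Recall @ K=1 measurement used in Table 2 and Figure 2: each sketch query retrieves its single nearest photo in the embedding space, and the query counts as correct only if that photo is its one true match.

import numpy as np

def recall_at_1(sketch_embs, photo_embs, true_photo_idx):
    # sketch_embs: (num_queries, d) query embeddings
    # photo_embs:  (num_photos, d) gallery embeddings
    # true_photo_idx: (num_queries,) index of the one correct photo per query
    dists = np.linalg.norm(
        sketch_embs[:, None, :] - photo_embs[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)  # rank-1 retrieval for every query
    return float((nearest == true_photo_idx).mean())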
Figure 2: Per-class Recall at K = 1 for all 125 categories in the Sketchy database. Note that chance
is 1/1250 because there is only one correct photo for each sketch query in the test set.
Figure 3: Comparison between networks trained for one epoch with MS COCO boundaries vs. the Sketchy database. Both networks were fine-tuned from an ImageNet-trained model with no additional pre-training. Note that this experiment covers only the 13 MS COCO categories that have enough training data and overlap with Sketchy database categories, so the accuracies are not comparable to those of the other experiments.
Figure 4: Contribution of different pre-training schemes. We train each network with our Sketchy
database for 160k iterations and then measure the recall at K = 1.
Figure 5: In our paper, we evaluate our learned cross-domain embedding with sketch-based image retrieval. But the embedding just as naturally supports (b) image-based sketch retrieval, (c) sketch-based sketch retrieval, and (d) image-based image retrieval. (a) shows the average of aligned sketches retrieved for various query photos. The photo query at the top of (b) did not work particularly well because “person” is not a category in the Sketchy database. The remainder of this supplementary material contains additional cross-domain and within-domain retrieval results. All results use the “GN Triplet” network pair (or half of the network pair in the case of image-to-image and sketch-to-sketch retrieval). It may be that focusing on within-domain retrieval at learning time would improve the results. Or the opposite could be true: image-to-image retrieval may be improved by the presence of cross-domain sketches because they train the image network to encode the salient object details. Our aim with these qualitative results is to show that our learned embedding space is reasonable, not that it is “state-of-the-art” for these retrieval tasks.
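As an illustration (a hypothetical sketch, not our code), within-domain retrieval differs from cross-domain retrieval only in which half of the network pair produces the query and gallery embeddings; the nearest-neighbor search itself is unchanged. Here embed_sketch and embed_photo stand in for the two halves of the “GN Triplet” pair.

import numpy as np

def retrieve(query_emb, gallery_embs, k=5):
    # Return indices of the k nearest gallery items to the query embedding.
    dists = np.linalg.norm(gallery_embs - query_emb, axis=1)
    return np.argsort(dists)[:k]

# Cross-domain, sketch-based image retrieval:
#   retrieve(embed_sketch(sketch), embed_photo(photos))
# Within-domain, image-based image retrieval (photo branch only):
#   retrieve(embed_photo(query_photo), embed_photo(photos))
# Within-domain, sketch-based sketch retrieval (sketch branch only):
#   retrieve(embed_sketch(query_sketch), embed_sketch(sketches))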
Figure 6: Sketch-based image retrieval results
Figure 7: Sketch-based image retrieval results
Figure 8: Sketch-based image retrieval results
Figure 9: Sketch-based image retrieval results
Figure 10: Sketch-based image retrieval results
Figure 11: Sketch-based image retrieval results
Figure 12: Image-based image retrieval results
Figure 13: Image-based image retrieval results
Figure 14: Image-based image retrieval results
Figure 15: Image-based image retrieval results
Figure 16: Sketch-based sketch retrieval results
Figure 17: Sketch-based sketch retrieval results
Figure 18: Sketch-based sketch retrieval results
Figure 19: Image-based sketch retrieval results
Figure 20: Image-based sketch retrieval results