Structured deep learning : Pose and gestures - LIRIS
Transcription
Structured deep learning: Pose and gestures
Christian Wolf, Université de Lyon, INSA-Lyon, LIRIS UMR CNRS 5205
April 30th, 2015

Pose and gestures
- Deep Learning: overview
- Pose estimation: hand
- Direct deep gesture recognition (without pose estimation)

Hand pose estimation
Articulated pose from Kinect (V1) does not provide hand pose (finger joint positions).
- A complex problem:
  – small images (given the distance between hand and sensor);
  – large variation in hand poses;
  – real time is a challenge.
- Our solution:
  – segmenting hands into parts;
  – structured deep learning;
  – a semi-supervised setting.
PhD of Natalia Neverova.

Pose estimation through segmentation
- Calculate human pose: a set of joint positions.
- Use an intermediate representation: body / hand part segmentation. [PRL 2014] [ACCV 2014]
[Figure: from Shotton et al., "Real-Time Human Pose Recognition in Parts from Single Depth Images", CVPR 2011 (Microsoft Research Cambridge & Xbox Incubation). Overview: from a single input depth image, a per-pixel body part distribution is inferred, and 3D joint proposals are derived from it (depth image → body parts → 3D joint proposals).]

Segmentation and spatial relationships: are spatial relationships between pixels and labels exploited?
- Pixelwise classification (independent): features F_i → label l_i. No.
- Pixelwise classification with features computed over a spatial neighbourhood N_i. Yes.
- Auto-context models: the prior improves the classifier. Yes.
- MRF/CRF/BN: inference of a global solution, with high computational complexity. Yes.
Related publications and projects: [ICPR 2002a], [ICPR 2002b], [ICPR 2008], [EG-W-3DOR 2008], DIBCO 2009 (ranked 5th/43), [ICPR 2010], [IEEE-Tr-PAMI 2010], [Neurocomputing 2010], [Pattern Recognition Letters 2014], [BMVC 2014], [ACCV 2014], work in progress (R. Khan), work in progress (N. Neverova); projects: ANR Canada, ANR Madras, Labex IMU-Rivière, ANR Solstice, INTERABOT.
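As an illustration of the intermediate part-segmentation representation above (a sketch, not the method of Shotton et al., who run mean shift on depth-weighted votes), a joint proposal can be obtained as the probability-weighted 3D centroid of each part. All names, shapes and the centroid shortcut are assumptions made for this transcription.

```python
# Hypothetical sketch: per-pixel part probabilities -> one 3D joint proposal per part.
import numpy as np

def joint_proposals(part_probs, points_3d, min_mass=1e-3):
    """part_probs: (H, W, K) per-pixel probabilities over K body/hand parts.
    points_3d:  (H, W, 3) back-projected 3D coordinates of the depth pixels.
    Returns a (K, 3) array of joint proposals (NaN rows where a part has no support)."""
    H, W, K = part_probs.shape
    probs = part_probs.reshape(-1, K)              # (H*W, K)
    pts = points_3d.reshape(-1, 3)                 # (H*W, 3)
    proposals = np.full((K, 3), np.nan)
    for k in range(K):
        w = probs[:, k]
        mass = w.sum()
        if mass > min_mass:                        # ignore parts that are not visible
            proposals[k] = (w[:, None] * pts).sum(axis=0) / mass
    return proposals
```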
(Deep) learning
The pixels of an image are indexed: $X^m = \{X_i^m\}$. We seek to learn a segmentation model from images $\{X^1, \ldots, X^M\}$ and their associated labels. The model consists of two parts:
- a feature mapping $Z_i^m = f(X_i^m \mid \theta_f)$, which embeds each pixel's receptive field into a feature representation $Z_i^m \in \mathbb{R}^Q$;
- a classifier $\hat{l}_i^m = g(Z_i^m \mid \theta_g)$, with parameters $\theta_g$, giving an estimate of the label.
The parameters $\theta_f$ are learned from training data, taking into account Euclidean distances $\lVert Z_i^m - Z_j^m \rVert_2$ of pairs of feature representations (see Section 3.1 of the paper). As is common in Deep Learning, f can be trained in an unsupervised way (Salakhutdinov and Hinton, 2007) or with some other kind of inductive principle.

Results
- Input: real depth videos from a Kinect sensor.
- Synthetic training data: 600 000 images rendered with a 3D modeler («Poser»); rendering distributed over 4 workstations (several weeks).
- Testing: pure classification, no graphical model (speed!).
- Training: leverage structural information to get a loss on real data without ground truth.

(Parenthesis: semantic full scene labelling)
Goal: label each pixel of an image with a semantic class.
Joint work with LHC, St. Étienne: Elisa Fromont, Rémi Emonet, Taygun Kekec, Alain Trémeau, Damien Muselet. [BMVC 2014]

(Independent pixelwise classification)
Figure by [Sermanet 2012].

(Deep auto-context: augmented learner)
[Figure 5 of the BMVC 2014 paper: raw image labelling by (a) the multiscale ConvNet, (b) our multiscale augmented learner, and (c) the ground-truth labels. We experiment with an AugL that does not use any true context label injection (t = 0) and another AugL with an injection parameter t = 0.05.]
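A minimal sketch of the two-part segmentation model (feature mapping f and classifier g) introduced in the (Deep) learning slide above: f embeds each pixel's receptive field into a Q-dimensional vector, and g (here a 1×1 convolution) maps it to per-pixel class scores. Layer sizes, the number of classes and the architecture itself are illustrative, not those of the actual system.

```python
# Sketch of the f/g decomposition for pixel-wise labelling (illustrative sizes).
import torch
import torch.nn as nn

Q, NUM_CLASSES = 64, 21          # feature dimension and number of labels (assumed)

f = nn.Sequential(               # feature mapping f(.; theta_f)
    nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv2d(32, Q, kernel_size=5, padding=2), nn.ReLU(),
)
g = nn.Conv2d(Q, NUM_CLASSES, kernel_size=1)   # pixel-wise classifier g(.; theta_g)

depth = torch.randn(1, 1, 72, 72)    # one depth image / patch
Z = f(depth)                         # (1, Q, 72, 72): per-pixel features Z_i
logits = g(Z)                        # (1, NUM_CLASSES, 72, 72)
labels = logits.argmax(dim=1)        # per-pixel label estimates l_i
```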
[Excerpt from the BMVC 2014 paper, shown on the slide:]
Intermediate results: context learner. The classification accuracies obtained from the context learner ("ContextL" in the table) are given in Table 1 for both datasets. In Fig. 4 we show the responses of our context learner maps for some input patches. The second row shows strong responses for the object, tree and building classes. For the second and third rows, although the context learner outputs a strong response for the object class, due to its relative simplicity it is not able to provide an accurate classification (e.g., for the building class). Nevertheless, this network is useful for the augmented learner and its training time is negligible: the time per sample is lower (see Table 1) and it converges faster than the Convnet.
Classification accuracy results. Table 1 shows the classification results obtained with the different approaches. Overall, we observe that our method provides better results for both the Stanford and the SIFT Flow datasets. For the Stanford dataset, another state-of-the-art technique is reported by Munoz et al. [12]. They reported their pixel accuracy as 76.9 and class accuracy as 66.2 without a deep learning architecture. With our technique, we were able to obtain a much higher class accuracy. While the accuracy gain varies between single-scale and multiscale implementations, we observe that our approach consistently improves both pixel and class accuracies. The gain on single-scale experiments is higher compared to multiscale implementations. This brings …

Deep auto-context (augmented learner), approach:
- Learn a context learner f_c to predict a whole patch of semantic labels.
- Integrate this patch as «features» into the direct feature extraction function f_d: the «augmented learner» combines the learning of context features f_c and dependent features f_d.
- 1-in-K encoding («one hot») of the semantic labels when they are used as input.

Structured loss (1): local context
- The loss generated from a context learner is calculated on a segmented output patch.
- The context learner is trained on synthetic images.
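A sketch of how the auto-context input could be assembled for the direct («augmented») learner: the label patch predicted by the context learner is 1-in-K encoded, the centre pixel to be predicted is punctured (zeroed, in the spirit of the punctured neighbourhood maps in the figure below), and the result is stacked with the depth patch as extra input channels. The function name and tensor layout are assumptions, not the authors' code.

```python
# Hypothetical helper: depth patch + punctured one-hot context labels as one input tensor.
import torch
import torch.nn.functional as F

def punctured_context_input(depth_patch, label_patch, num_classes):
    """depth_patch: (1, H, W) float tensor of depth values around the target pixel.
    label_patch: (H, W) integer (long) labels predicted by the context learner.
    Returns a (1 + num_classes, H, W) tensor: depth plus punctured one-hot labels."""
    H, W = label_patch.shape
    one_hot = F.one_hot(label_patch, num_classes)   # (H, W, K)
    one_hot = one_hot.permute(2, 0, 1).float()      # (K, H, W)
    one_hot[:, H // 2, W // 2] = 0.0                # puncture the centre (to-be-predicted) pixel
    return torch.cat([depth_patch, one_hot], dim=0)
```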
[Fig. 2 of the ACCV 2014 paper («Hand segmentation with structured convolutional learning»): the two learning pathways involve a direct learner f_d(·; θ_d) and a context learner f_c(·; θ_c). The context learner operates on punctured neighbourhood maps n^(i,j), where the (to be predicted) middle pixel is missing. Both pathways perform forward passes and backpropagation against the ground-truth map.]

Structured loss (2): global context
- A single region is supposed to exist for each label.
- Unconnected outlier regions are identified to generate a loss.
Pixel coordinates, in vector form, are denoted as $r^{(i,j)}$. If the pixel (i, j) is close enough (within a distance τ) to the barycentre $R_k$ of its predicted class, as estimated from the labelled synthetic data, then the pixel is considered correctly classified and is used to update the parameters of the direct learner $\theta_d$. The loss function term for one pixel (i, j) is then:

$$Q^{+}_{glb}(\theta_d \mid y_d^{(i,j)}) = -F^{(i)}_{y_d}\, \log P\!\left(Y_d = y_d^{(i,j)} \mid x^{(i,j)}, \theta_d, \theta_c\right),$$

where $F^{(i)}_k$ is a weight related to the size of the class components,

$$F^{(i)}_k = \left|\{\, j : Y_d^{(j)} = k \,\}\right|^{\alpha},$$

and α > 0 is a gain parameter. In the opposite case, when the pixel is out of range of its predicted class ($\lVert r^{(i,j)} - R_k \rVert > \tau$), the current prediction is penalized and the class corresponding to the closest segment within the given distance τ is promoted:

$$Q^{-}_{glb}(\theta_d \mid y_d^{(i,j)}) = -F^{(i)}_{\lambda}\, \log P\!\left(Y_d = \lambda^{(i,j)} \mid x^{(i,j)}, \theta_d, \theta_c\right), \qquad \lambda^{(i,j)} = \arg\min_k \lVert r^{(i,j)} - R_k \rVert.$$

(Slide annotation: if the pixel is out of range, the class of the nearest barycentre is promoted; else, the predicted class is reinforced.)

High resolution segmentation
- Classical approach: resolution reductions between layers.
- Here: the system resolution is kept high by keeping different shifts. (Neverova, Wolf, Taylor, Nebout, ACCV 2014)

Results: supervised vs. semi-supervised
Real test data: 50 manually annotated frames.
- Supervised (trained on synthetic data): on synthetic test data, 85.9% per-pixel / 78.5% per-class accuracy; on real test data, 47.2% / 35.0%.
- Semi-supervised (trained on all data): on synthetic test data, 75.5% / 78.3%; on real test data, 50.5% / 43.4%.
Average gain of a single update (stochastic gradient descent), in percentage points:
- Local: +0.60; Global: +0.41; Local+Global: +0.82 (no labels required).
- Supervised (with labels): +16.05.
[ACCV 2014]

Results on real images: one step of unsupervised training.
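Before moving on to gestures, a rough sketch of the global-context idea above: a pixel that lies within τ of the barycentre of its predicted class keeps that class as its target, otherwise the class of the nearest barycentre is promoted. This simplified version only computes per-pixel target labels; the component-size weights F_k, the gain α and the exact loss terms are omitted, and all names are illustrative.

```python
# Simplified sketch of the global-context targets (not the full Q_glb terms).
import numpy as np

def global_context_targets(pred_labels, coords, barycenters, tau):
    """pred_labels: (N,) predicted class index per pixel.
    coords:      (N, 2) pixel coordinates r.
    barycenters: (K, 2) class barycentres R_k, estimated on labelled synthetic data.
    Returns (N,) target labels: the predicted class if the pixel is within tau of its
    class barycentre, otherwise the class of the nearest barycentre."""
    d_own = np.linalg.norm(coords - barycenters[pred_labels], axis=1)
    d_all = np.linalg.norm(coords[:, None, :] - barycenters[None, :, :], axis=2)
    nearest = d_all.argmin(axis=1)
    return np.where(d_own <= tau, pred_labels, nearest)
```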
Pose and gestures
- Pose estimation: full body
- Pose estimation: hand
- Direct deep gesture recognition (without pose estimation)
Object recognition, pose estimation and scene recognition usually need part segmentation. Features are typically based on appearance (SIFT, HOG/HOF, …); they can also be informed by spatial relationships.

Gesture recognition
Communicative gestures.
Multiple modalities:
- color and depth video;
- skeleton (articulated pose);
- audio.
Multiple scales:
- full upper-body motion;
- fine hand articulation;
- short- and long-term dependencies.
PhD of Natalia Neverova.

A multi-scale architecture
Operates at 3 temporal scales, corresponding to dynamic poses of 3 different durations.
[Figure caption (journal paper): the deep convolutional multi-modal architecture operating at 3 temporal scales. Although the audio modality is not present in the 2014 ChaLearn Looking at People Challenge dataset, additional experiments were conducted by augmenting the visual signal with audio.]

Single-scale deep architecture
- Path V1 (depth and intensity video, right hand): convolutional layers ConvD1, ConvD2 and ConvC1, ConvC2 with max pooling, followed by hidden layers HLV1, HLV2.
- Path V2 (depth and intensity video, left hand): same structure.
- Path M (mocap stream): pose feature extractor, followed by hidden layers HLM1, HLM2, HLM3.
- Path A (audio stream, mel-frequency spectrograms): ConvA1, followed by HLA1, HLA2.
- All paths feed a shared hidden layer HLS, followed by the output layer.

Articulated pose descriptor
1. Based on 11 upper-body joints.
2. Position normalization: HipCenter is the origin of a new coordinate system.
3. Size normalization by the mean distance between each pair of joints.¹
4. Calculation of basis vectors for each frame (shown in blue) by applying PCA to the 6 torso joints (shown in white).
¹ Zanfir, M., Leordeanu, M., Sminchisescu, C., «The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection», ICCV 2013.
[Figure caption (truncated): the pose descriptor is calculated from normalized coordinates of the 11 upper-body joints, also including their velocities and … angles (triples of joints forming …).]

[Table 1 of the journal paper: hyper-parameters chosen for the deep learning models at a single temporal scale (layers, filter sizes / numbers of units, numbers of parameters, pooling). In summary: the video paths V1 and V2 take 72×72×5 inputs and use convolutional layers of 25 filters of size 5×5(×3) with 2×2×1 and 2×2×3 max pooling, followed by hidden layers HLV1/HLV2 of 900/450 units; path M takes the 183-dimensional pose descriptor through hidden layers of 700, 700 and 350 units; path A takes 40×9 mel spectrograms through 25 filters of 5×5 and hidden layers of 700/350 units; the shared layers have 1600 and 84 units, and the output layer has 21 units.]

[Truncated paper excerpt on the baseline features: HoG features are extracted on a 2-level spatial pyramid and a magnified version of the images; histograms of depth are extracted at two scales; derivatives h(t) − h(t−1) of the HoG and depth histograms are added; combined together, this gives a 270-dimensional descriptor per frame for the dynamic pose. Extremely randomized trees are used for data fusion and gesture classification, with the channels trained independently.]

Training algorithm
Difficulties:
- number of parameters: ~12.4M per scale, ~37.2M in total;
- number of training gestures: ~10 000;
- result: poor convergence.
Proposed solution:
- structured weight matrices;
- pretraining of the individual channels separately;
- «clever» initialization of the shared layers;
- an iterative training algorithm gradually increasing the number of parameters to learn.

Initialization of the shared layer: structured weight matrices
The top hidden layer from each path is initially wired to a subset of neurons in the shared layer. During fusion, additional connections between the paths and the shared hidden layer are added.
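A sketch of the block-structured initialization described above: each path's top hidden layer is initially wired only to its own block of shared units (the diagonal blocks of W1), while the cross-modality blocks start at zero and are only released in later training phases. Modality names and layer sizes are illustrative, not the values of Table 1.

```python
# Illustrative block-diagonal initialization of the shared layer's weight matrix W1.
import torch
import torch.nn as nn

in_sizes = {"video_right": 450, "video_left": 450, "mocap": 350, "audio": 350}
out_sizes = {"video_right": 200, "video_left": 200, "mocap": 150, "audio": 100}

shared = nn.Linear(sum(in_sizes.values()), sum(out_sizes.values()))

with torch.no_grad():
    W = torch.zeros_like(shared.weight)          # shape: (total out, total in)
    r = c = 0
    for k in in_sizes:                           # fill only the diagonal blocks
        ro, co = out_sizes[k], in_sizes[k]
        W[r:r + ro, c:c + co] = 0.01 * torch.randn(ro, co)
        r += ro
        c += co
    shared.weight.copy_(W)                       # off-diagonal (cross-modality) blocks stay zero
# In later phases the off-diagonal blocks can be released, e.g. by removing a
# gradient mask that kept them at zero during the early iterations.
```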
A slightly different view
[Diagram: the paths (V1: depth, hand 1; V2: depth, hand 2; M: mocap data; A: audio signal) feed, through a block-structured weight matrix W1, a hidden layer HLS with units shared across modalities, followed by the output layer with weights W2. The diagonal blocks of W1 connect each modality to its own group of shared units; the off-diagonal blocks model cross-modality connections once training is completed.]
Blocks of the weight matrices are learned iteratively, after proper initialization of the diagonal elements.

Gesture localization
An additional binary classifier is employed for filtering and for refining the temporal position of each gesture.
[Diagram: over the frame number, the main classifier produces a confident prediction of the target class during the gesture and «no gesture» elsewhere; a binary motion detector outputs motion / no motion, delimiting the pre-stroke and post-stroke phases.]

2014 ChaLearn Looking at People Challenge (ECCV), Track 3: Gesture recognition
Rank, team, score:
1. Ours (LIRIS): 0.8500
2. C. Monnier (Charles River Analytics, USA): 0.8339
3. Ju Yong Chang (ETRI, KR): 0.8268
4. Xiaojiang Peng et al. (Southwest University, CN): 0.7919
5. L. Pigou et al. (Gent University, BE): 0.7888
6. D. Wu et al. (University of Sheffield, UK): 0.7873
7. N. C. Camgoz et al. (Bogazici University, TR): 0.7466
8. G. Evangelidis et al. (INRIA Rhône-Alpes, FR): 0.7454
9. “Telepoints”: 0.6888
10. G. Chen et al. (TU München, D): 0.6490
…
17. “YNL”: 0.2706
Escalera et al., «ChaLearn Looking at People Challenge 2014: Dataset and Results», ECCV Workshops, 2014.

Results
Error evolution during iterative training [plot].
Post-competition improvements (models without audio); scores without / with the motion detector, and virtual rank:
- Multi-scale deep learning (proposed): 0.821 / 0.865, virtual rank (1).
- Extremely randomized trees (baseline): 0.729 / 0.781, virtual rank (6).
- Deep learning + ERT (hybrid): 0.829 / 0.867, virtual rank (1).

Dropout
- Introduced in 2012 for ImageNet: A. Krizhevsky, I. Sutskever, G. Hinton, «ImageNet Classification with Deep Convolutional Neural Networks», 2012.
- During training, for each training sample, 50% of the units are disabled.
- Punishes co-adaptation of units.
- Large performance gains.
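For reference, the vanilla dropout described above can be sketched as follows (here in its inverted form, which rescales during training; the 2012 formulation rescales at test time instead). ModDrop, presented next, applies the same Bernoulli idea per modality rather than per unit.

```python
# Minimal sketch of inverted dropout on a float tensor of unit activations.
import torch

def dropout(x, p=0.5, training=True):
    """Zero each unit with probability p during training and rescale the survivors,
    so that no rescaling is needed at test time."""
    if not training or p == 0.0:
        return x
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1.0 - p)
```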
Our network: dropout on the shared layer
[Paper excerpt shown on the slide; it begins mid-sentence:]
… shared layer) is therefore related to a specific gesture class. Let us note that this block structure (and meaning) is forced on the weight matrix during initialization and in the early phases of training. If only the diagonal blocks are non-zero, which is forced at the beginning of the training procedure, then the individual modalities are trained independently and no cross-correlations between modalities are modelled. During the final phases of training, no structure is imposed and the weights can evolve freely. Considering the shared layer, this can be formalized by expressing the output of each hidden unit as follows:

$$h_j^{(k)} = \left[ \sum_{i=1}^{F_k} w_{i,j}^{(k,k)} x_i^{(k)} + \sum_{\substack{n=1 \\ n\neq k}}^{K} \sum_{i=1}^{F_n} w_{i,j}^{(n,k)} x_i^{(n)} + b_j^{(k)} \right] \qquad (10)$$

… inputs of modality k are dropped with probability $p^{(k)}$.

ModDrop: modality-wise dropout
- Punish co-adaptation of individual units (like vanilla dropout).
- Train a network which is robust/resistant to the dropping of individual modalities (e.g. failure of the audio channel).
In a dropout network, this can be formulated such that the input to the activation function of a given output unit (denoted $s_d$) involves a Bernoulli selector variable $\delta^{(k)}$ for each modality k, which takes values in {0, 1} and is activated with probability $p^{(k)}$:

$$\bar h_j^{(k)} = \left[ \delta^{(k)} \sum_{i=1}^{F_k} w_{i,j}^{(k,k)} x_i^{(k)} + \sum_{\substack{n=1 \\ n\neq k}}^{K} \delta^{(n)} \sum_{i=1}^{F_n} w_{i,j}^{(n,k)} x_i^{(n)} + b_j^{(k)} \right]$$

$$s_d = \sum_{k=1}^{K} \delta^{(k)} \sum_{j=1}^{F_k} w_j^{(k)} x_j^{(k)} \qquad (20)$$

At this step, the network takes as input multi-modal training samples $\{x_d^{(k)}\}$, k = 1 … K, from the training set D (19), where for each sample each modality component $x_d^{(k)}$ is dropped (set to 0) with a certain probability $p^{(k)}$, which is indicated by its Bernoulli selector:

$$P(\delta^{(k)} = 1) = p^{(k)}. \qquad (15)$$

Accordingly, one step of gradient descent is given an input with a certain number of non-zero modality components …

ModDrop: results
Classification accuracy on the validation set (dynamic poses), Dropout vs. Dropout + ModDrop:
- All modalities: 96.77 vs. 96.81
- Mocap missing: 38.41 vs. 92.82
- Audio missing: 84.10 vs. 92.59
- Hands missing: 53.13 vs. 73.28
Jaccard index on the test set (full gestures), Dropout vs. Dropout + ModDrop:
- All modalities: 0.875 vs. 0.875
- Mocap missing: 0.306 vs. 0.859
- Audio missing: 0.789 vs. 0.854
- Hands missing: 0.466 vs. 0.680
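A sketch of the modality-wise ModDrop step above: during training, each modality's input is zeroed out as a whole, per sample, by its own Bernoulli selector. The dictionary interface, the modality names and the probabilities are assumptions made here; the convention used is that the given value is the probability of dropping the modality.

```python
# Illustrative modality-wise dropout (ModDrop-style) on a dict of input tensors.
import torch

def moddrop(inputs, drop_probs, training=True):
    """inputs: dict modality name -> tensor of shape (batch, ...).
    drop_probs: dict modality name -> probability of dropping that modality."""
    if not training:
        return inputs
    out = {}
    for name, x in inputs.items():
        keep = 1.0 - drop_probs.get(name, 0.0)
        shape = (x.shape[0],) + (1,) * (x.dim() - 1)   # one selector per sample
        delta = torch.bernoulli(torch.full(shape, keep, device=x.device))
        out[name] = x * delta                          # whole modality kept or zeroed
    return out

# Usage (illustrative probabilities):
# batch = moddrop({"mocap": m, "audio": a, "video": v},
#                 {"mocap": 0.1, "audio": 0.2, "video": 0.1})
```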
Current and future work
- Integrate a model learned for pose estimation into the gesture recognition algorithm.
- There is still no direct pose estimation.

Conclusion (1)
1. A general method for gesture and near-range action recognition from a combination of color and depth video and articulated pose data.
2. Each channel captures a spatial scale; the system operates at three temporal scales.
3. Possible extensions:
  – additional modalities;
  – additional temporal scales;
  – feedback connections to handle noisy or missing channels.
[ECCV Workshop 2014] [ACCV 2014] [Journal under review] [Conference in writing] [Pattern Recognition Letters 2014]

Conclusion (2)
Visual recognition, Deep Learning, structured and semi-structured models: structured deep learning.
[The slide shows a collage of thumbnails from this line of work, including hypergraph-matching figures from a thesis chapter on graph matching («Figure 4.8 – matching of two hypergraphs built from video sequences»; «Figure 4.9 – (a) a generalized hypergraph; (b) …») and Fig. 4 of the ACCV 2014 paper: the proposed deep convolutional architecture of a single learner, with three convolutional layers F1, F2 and F3 with rectified linear units (ReLU); layers F1 and F2 are followed by 2×2 max pooling and reduction.]

[Excerpt from the ACCV 2014 paper reproduced on the slide:]
As opposed to most existing methods for scene labeling, instead of randomly sampling pixels (or patches), training is performed image-wise, i.e. all pixels from the given image are provided to the classifier at once and each pixel gets assigned an output class label based on information extracted from its neighborhood. Applying the convolutional classifier with pooling/reduction layers to an image in the traditional way would lead to a loss in resolution by a factor of 4 (in the given configuration). On the other hand, simply not reducing the image resolution will prevent higher layers from learning higher-level features, as the size of the filter support does not grow with respect to the image content. To avoid this dilemma, we employ specifically designed splitting functions, originally proposed for image scanning in [27] and further exploited in OverFeat networks [28]. Intuitively speaking, each map at a given resolution is reduced to four different maps of lower resolution using max pooling. The number of elements is preserved, but the resolution of each map is lower compared to the maps of previous layers.
In more detail, consider the output of the first convolutional layer F1 of the network. Once the output feature maps are obtained, 4 virtual extended copies of them are created by zero padding with 1) one column on the left, 2) one column on the right, 3) one row on top, 4) one row at the bottom. Therefore, each copy will contain the original feature map, but shifted in 4 different directions. In the next step, we apply max pooling (2×2 with stride 2×2) to each of the extended maps, producing 4 low-resolution maps. By introducing the shifts, the pixels from all extended maps combined together can reconstruct the original feature map as if max pooling with stride 1×1 had been applied. This operation allows the network to preserve the results of all computations for …

Including structural terms into deep feed-forward models is a goal and a challenge!
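To make the shift-based pooling in the excerpt above concrete, here is a small sketch: four shifted copies of a feature map are max-pooled with stride 2, and together they retain every response that stride-1 pooling would produce, so no resolution is lost. This is a generic shift-and-pool variant written for this transcription; the paper's padding of one row/column in four directions amounts to choosing these pooling phases.

```python
# Sketch: resolution-preserving 2x2 max pooling via four pooling phases.
import torch
import torch.nn.functional as F

def shift_pool(x):
    """x: (N, C, H, W) feature maps with even H and W.
    Returns four (N, C, H/2, W/2) maps, one per 2x2 pooling phase."""
    maps = []
    for dy, dx in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        # Shift by zero-padding on the top/left and cropping back to the original size.
        shifted = F.pad(x, (dx, 0, dy, 0))[:, :, : x.shape[2], : x.shape[3]]
        maps.append(F.max_pool2d(shifted, kernel_size=2, stride=2))
    return maps
```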