Vision in Brains and Machines - Redwood Center for Theoretical Neuroscience
Transcription
Vision in Brains and Machines
Bruno A. Olshausen
Helen Wills Neuroscience Institute, School of Optometry, and Redwood Center for Theoretical Neuroscience, UC Berkeley

Brains vs. machines. Three attitudes recur in this debate: "the brain doesn't do things the way an engineer would"; "we needn't be constrained by the way biology does things"; and "brains and machines use a common set of operating principles."

Cybernetics / neural networks: Norbert Wiener, Warren McCulloch & Walter Pitts, Frank Rosenblatt. "By the study of systems such as the perceptron, it is hoped that those fundamental laws of organization which are common to all information handling systems, machines and men included, may eventually be understood." (Frank Rosenblatt, Psychological Review, 1958.)

Two modern developments arose from this confluence of ideas between disciplines. The slide lists: information theory, natural scene statistics, signal and image processing, feature detection, machine learning, models of visual cortex. The 'Ratio Club' (1952).

Redundancy reduction: H. B. Barlow, Physiological Laboratory, Cambridge University, "Possible Principles Underlying the Transformations of Sensory Messages" (1961). "The hypothesis is that sensory relays recode sensory messages so that their redundancy is reduced but comparatively little information is lost." And further: "...it constitutes a way of organizing the sensory information so that, on the one hand, an internal model of the environment causing the past sensory inputs is built up, while on the other hand the current sensory situation is represented in a concise way which simplifies the task of the parts of the nervous system responsible for learning and conditioning."
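Barlow's redundancy idea can be made concrete with a small numerical illustration. The sketch below is ours, not from the talk: two correlated binary "pixels" are recoded invertibly into (first pixel, agreement bit), and the between-channel redundancy of the code drops to zero while no information is lost. All names and numbers are illustrative.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (bits) of a probability vector."""
    p = np.asarray(list(probs), dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def marginal(dist, idx):
    """Marginal distribution of component idx of a joint dict {(x1, x2): p}."""
    m = {}
    for key, p in dist.items():
        m[key[idx]] = m.get(key[idx], 0.0) + p
    return m.values()

# Two correlated binary "pixels" that agree 90% of the time.
joint = {(0, 0): 0.45, (1, 1): 0.45, (0, 1): 0.05, (1, 0): 0.05}
H_joint = entropy(joint.values())        # information actually present (~1.47 bits)

# Invertible recoding: channel 1 = x1, channel 2 = "do the pixels disagree?"
recoded = {(x1, int(x1 != x2)): p for (x1, x2), p in joint.items()}

for name, code in [("raw", joint), ("recoded", recoded)]:
    H_channels = entropy(marginal(code, 0)) + entropy(marginal(code, 1))
    print(f"{name}: joint entropy {H_joint:.3f} bits, "
          f"sum of channel entropies {H_channels:.3f} bits, "
          f"between-channel redundancy {H_channels - H_joint:.3f} bits")
# The recoding preserves all information (it is invertible) but removes the
# statistical dependence between channels: redundancy drops from ~0.53 to 0 bits.
```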
Barlow's opening analogy: "A wing would be a most mystifying structure if one did not know that birds flew. One might observe that it could be extended a considerable distance, that it had a smooth covering of feathers with conspicuous markings, that it was operated by powerful muscles, and that strength and lightness were prominent features of its construction. These are important facts, but by themselves they do not tell us that birds fly. Yet without knowing this, and without understanding something of the principles of flight, a more detailed examination of the wing itself would probably be unrewarding. I think that we may be at an analogous point in our understanding of the sensory side of the central nervous system. We have got our first batch of facts from the anatomical, neurophysiological, and psychophysical study of sensation and perception, and now we need ideas about what operations are performed by the various structures we have examined. For the bird's wing we can say that it accelerates downwards the air flowing past it and so derives an upward force which supports the weight of the bird; what would be a similar summary of the most important operation performed at a sensory relay? It seems to me vitally important to have in mind possible answers to this question when investigating these structures, for if one does not one will get lost in a mass of irrelevant detail and fail to make the crucial observations. In this paper I shall discuss three hypotheses according to which the answers would be as follows: 1. Sensory relays are for detecting, in the incoming messages, certain 'passwords' that have a particular key significance for the animal. 2. They are filters, or recoding centers, whose 'pass characteristics'..." (Barlow, 1961)

Models and experimentally-tested predictions arising from the redundancy reduction theory:
• Histogram equalization: Laughlin (1981); Bell & Sejnowski (1995). Laughlin measured relative intensities across natural scenes (dry sclerophyll woodland, lakeside vegetation) with a photodiode scanning horizontally, like the ommatidium of a turning fly, and compared the cumulative probability distribution of the resulting contrasts with the contrast-response function of light-adapted LMC interneurons recorded in the compound eye of the blowfly Calliphora. The two curves coincide: the cell amplifies inputs in proportion to their expected frequency of occurrence, using its response range for better resolution of common events rather than reserving large portions for the improbable, the procedure known in digital image processing as histogram equalization. (A sketch of this coding rule follows the list.)
• Predictive coding: Srinivasan, Laughlin & Dubs (1982).
• Whitening: Atick & Redlich (1990; 1992); van Hateren (1992; 1993); Dan, Atick & Reid (1996).
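A minimal sketch of the histogram-equalization coding rule described above: a neuron with a bounded output range makes best use of it by mapping each input through the cumulative probability distribution of its inputs, so that all response levels are used equally often. The data and names below are illustrative, not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for measured natural contrasts: a skewed, heavy-tailed sample.
# (Laughlin used contrast fluctuations dI/I measured from natural scenes.)
contrasts = rng.laplace(loc=0.0, scale=0.2, size=100_000)

def histogram_equalize(x, samples):
    """Map inputs x through the empirical cumulative distribution of `samples`,
    giving a response in [0, 1] that uses all output levels equally often."""
    sorted_s = np.sort(samples)
    # Fraction of samples less than or equal to each x = empirical CDF.
    return np.searchsorted(sorted_s, x, side="right") / len(sorted_s)

responses = histogram_equalize(contrasts, contrasts)

# The transformed responses are (approximately) uniformly distributed:
hist, _ = np.histogram(responses, bins=10, range=(0, 1))
print("counts per response decile:", hist)   # roughly equal counts in each bin
# This is the sense in which the LMC contrast-response function tracks the
# cumulative probability distribution of contrasts in natural scenes.
```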
V1 is highly overcomplete (Barlow, 1981): there are far more neurons in layer 4 of primary visual cortex than there are LGN afferents feeding them. [Figure: a tracing of the outlines of the granule cells of area 17 in layers IVb and IVc of monkey cortex, where the incoming geniculate fibres terminate (from fig. 3c of Hubel & Wiesel, 1972); the dots at the top indicate the calculated separation of the geniculate sample points.] Barlow also raises the problem of temporal reconstruction of the image: the image is often moving continuously, yet it is represented only by impulses at discrete moments in time, much as cinema and television represent scenes sampled at quite long intervals.

A spectrum of coding schemes (from Foldiak & Young, 1995): dense codes (e.g., ASCII), in which any pattern over N units is meaningful, can represent on the order of 2^N items; sparse, distributed codes with K of N units active can represent (N choose K); local codes (e.g., grandmother cells) can represent only N.
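A back-of-the-envelope comparison of the three coding schemes on the Foldiak & Young slide, assuming N binary units and exactly K active units in the sparse case (the numbers here are chosen purely for illustration):

```python
from math import comb

N = 100   # number of units
K = 5     # active units in the sparse code

print("local code (grandmother cells):", N)              # one unit per item
print("sparse, distributed code (K of N):", comb(N, K))  # ~7.5e7 patterns
print("dense code (e.g., ASCII):", 2 ** N)                # ~1.3e30 patterns
# A sparse code gives up some capacity relative to a dense code, but each item
# is still represented by a combination of units, and far fewer units are
# active at any one time, which is the regime emphasized in the talk.
```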
Field (1987), "Relations between the statistics of natural images and the response properties of cortical cells," J. Opt. Soc. Am. A 4(12). From the abstract: the relative efficiency of any particular image-coding scheme should be defined only in relation to the class of images that the code is likely to encounter; to understand the representation of images by the mammalian visual system, it is useful to consider the statistics of images from the natural environment (trees, rocks, bushes, etc.). Various coding schemes are compared in terms of how they represent the information in such natural images, with coefficients carried by arrays of mechanisms that respond to local regions of space, spatial frequency, and orientation (Gabor-like transforms). For many classes of image such codes are not an efficient means of representing information, but results on six natural images suggest that the orientation and spatial-frequency tuning of mammalian simple cells are well suited to coding natural images if the goal of the code is to convert higher-order redundancy (e.g., correlations between the intensities of neighboring pixels) into first-order redundancy (the response distribution of the coefficients). Such coding produces a relatively high signal-to-noise ratio and permits information to be transmitted with only a subset of the total number of cells, supporting Barlow's theory that the goal of natural vision is to represent the information in the natural environment with minimal redundancy. Key point on the slide: simple-cell receptive fields are well suited to encode images with a small fraction of active units.

As described by Gabor and more recently by Daugman, a Gabor function represents a minimum in the spread of uncertainty in space and spatial frequency (time and frequency in Gabor's original treatment). The Gabor code is mathematically "pure" only in Cartesian coordinates, where all channels have the same frequency bandwidth and hence the same size in space; in that case it packs the information space with minimal spread and minimal overlap between neighboring units in both space and frequency. Modifying the code to a polar distribution of channels (the "rosettes" of the paper's Fig. 3) alters the relative spread and overlap between neighbors, so Field introduces the log Gabor function, whose frequency response is a Gaussian on a log-frequency axis:

G(f) = exp{ -[log(f/f0)]^2 / (2 [log(σ/f0)]^2) }
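A short sketch of the log-Gabor frequency response defined above (our own illustration; parameter values are arbitrary). In this parameterization the ratio σ/f0 sets the bandwidth; a value near 0.75 gives roughly a one-octave channel, which is stated here as an assumption rather than a value from the paper.

```python
import numpy as np

def log_gabor(f, f0, sigma_on_f=0.75):
    """Log-Gabor frequency response: exp(-log(f/f0)^2 / (2 * log(sigma_on_f)^2)).
    sigma_on_f plays the role of sigma/f0 in Field's equation.
    Defined for f > 0; the response at f = 0 is taken to be 0 (no DC)."""
    f = np.asarray(f, dtype=float)
    out = np.zeros_like(f)
    pos = f > 0
    out[pos] = np.exp(-np.log(f[pos] / f0) ** 2 / (2 * np.log(sigma_on_f) ** 2))
    return out

freqs = np.linspace(0, 0.5, 512)          # cycles/pixel
G = log_gabor(freqs, f0=0.1)

# Half-height bandwidth in octaves: log2(f_high / f_low) at G = 0.5.
above = freqs[G >= 0.5]
print("bandwidth (octaves): %.2f" % np.log2(above.max() / above.min()))
# Unlike a Gaussian on a linear frequency axis, the log-Gabor has no DC
# response and its tail extends farther toward high frequencies, which suits
# the roughly 1/f amplitude spectra of natural images.
```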
Further results from Field (1987): cortical neurons are inherently rather noisy in their responses, so the relative response of the most strongly activated subset of sensors can be taken as a measure of signal-to-noise ratio. For images with roughly 1/f amplitude spectra, spatial-frequency bandwidths in the range of one octave produce the highest signal-to-noise ratio, and coding into such channels yields a representation in which a small proportion of cells carries a large proportion of the information with high signal-to-noise; the optimal aspect ratio (spatial-frequency bandwidth relative to orientation bandwidth) is around 0.5-1.0, although its effect is small. Field also notes that although, since Hubel and Wiesel's classic experiments, we have learned a great deal about the behavior and connections of visual cortical neurons, theories of why they behave as they do have ranged from Fourier analysis to edge detection with no clear favorite, and, with few exceptions, have failed to give serious consideration to the statistical properties of the natural environment.

Field (1994): simple-cell receptive fields are well suited to maximize the non-Gaussianity (kurtosis) of response histograms.
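Kurtosis here is the fourth standardized moment of the response histogram. A minimal illustration of how it separates a dense (Gaussian) histogram from a sparse (peaked, heavy-tailed) one, using synthetic samples rather than real filter outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(x):
    """Fourth standardized moment minus 3: zero for a Gaussian histogram,
    positive for sparse (peaked, heavy-tailed) response histograms."""
    x = np.asarray(x, dtype=float)
    x = (x - x.mean()) / x.std()
    return float(np.mean(x ** 4) - 3.0)

n = 200_000
gaussian_responses = rng.standard_normal(n)           # dense, Gaussian histogram
sparse_responses = rng.laplace(scale=1.0, size=n)     # peaked at zero, heavy tails

print("excess kurtosis, Gaussian responses: %.2f" % excess_kurtosis(gaussian_responses))  # ~0
print("excess kurtosis, sparse responses:   %.2f" % excess_kurtosis(sparse_responses))    # ~3
# Field's argument: filters matched to the structure of natural images
# (localized, oriented, roughly one-octave bandwidth) push the response
# histogram toward the second kind, with most responses near zero and a few
# large ones, so that any given image is described by a small fraction of
# strongly active units.
```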
Internal model / external world.

Sparse coding image model (Olshausen & Field, 1996; Chen, Donoho & Saunders, 1995):

I(x) = Σ_{i=1}^{M} a_i φ_i(x) + ε(x)

image = neural activities (sparse) × features + other stuff.

Learned dictionary {φ_i}: training on natural images yields basis functions that are localized, oriented, and bandpass, spanning a range of spatial frequencies (plotted in cycles/window on the slide) (Olshausen & Field, 1996). The encoding is non-linear: the outputs a_i of the sparse coding network are a non-linear function of the pixel values of the image I(x,y). (A minimal inference sketch appears after the list below.)

Many other developments:
• Divisive normalization: Schwartz & Simoncelli (2001); Lyu (2011)
• Non-Gaussian, elliptically symmetric distributions: Zetzsche et al. (1999); Sinz & Bethge (2010); Lyu & Simoncelli (2009)
• Contours: Geisler et al. (2001, 2009); Sigman et al. (2001); Hoyer & Hyvarinen (2002)
• Complex cells, higher-order structure: Hyvarinen & Hoyer (2000); Karklin & Lewicki (2003, 2009); Berkes et al. (2009); Cadieu & Olshausen (2012)
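A minimal sketch of inference in a sparse coding model of the form above: given a fixed dictionary Φ, find sparse coefficients a minimizing ||I - Φa||^2 + λ||a||_1 by iterative shrinkage-thresholding (ISTA). This is one standard way to do the sparse inference step and is offered purely as an illustration; it is not the specific algorithm of Olshausen & Field, who used a different sparsity penalty and a neural-network dynamics. All names, sizes, and parameter values are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def ista(I, Phi, lam=0.2, n_steps=200):
    """Infer sparse coefficients a minimizing ||I - Phi @ a||^2 / 2 + lam * ||a||_1."""
    L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        grad = Phi.T @ (Phi @ a - I)         # gradient of the reconstruction error
        a = a - grad / L
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)   # soft threshold
    return a

# Toy problem: a 64-pixel "image" generated from 3 active elements of a
# 128-element (overcomplete) random dictionary, plus a little noise.
n_pixels, n_basis = 64, 128
Phi = rng.standard_normal((n_pixels, n_basis))
Phi /= np.linalg.norm(Phi, axis=0)           # unit-norm basis functions
a_true = np.zeros(n_basis)
a_true[rng.choice(n_basis, size=3, replace=False)] = 3 * rng.standard_normal(3)
I = Phi @ a_true + 0.01 * rng.standard_normal(n_pixels)

a_hat = ista(I, Phi)
print("active coefficients found:", np.flatnonzero(np.abs(a_hat) > 0.1))
print("truly active coefficients:", np.flatnonzero(a_true))
print("reconstruction error:", np.linalg.norm(I - Phi @ a_hat))
```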
Natural Scene Statistics meeting (Hancock, MA, 1997): Barlow, Donoho, Li.

Applications of sparse coding:
• Denoising / inpainting / deblurring (an inpainting sketch follows the list)
• Compressed sensing
• Computer vision
• Deep learning
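One of the applications listed above, inpainting, follows almost directly from the generative model: infer sparse coefficients from the observed pixels only, then read off the missing pixels from the reconstruction. A self-contained toy sketch with a random dictionary and hypothetical sizes, using the same soft-thresholding inference as in the previous example:

```python
import numpy as np

rng = np.random.default_rng(2)

n_pixels, n_basis = 64, 128
Phi = rng.standard_normal((n_pixels, n_basis))
Phi /= np.linalg.norm(Phi, axis=0)

# Generate a toy image from 4 active dictionary elements, then hide ~30% of pixels.
a_true = np.zeros(n_basis)
a_true[rng.choice(n_basis, 4, replace=False)] = 3 * rng.standard_normal(4)
image = Phi @ a_true
observed = rng.random(n_pixels) > 0.3            # True where the pixel is known

# Sparse inference using only the observed pixels (ISTA on the masked problem).
Phi_obs, I_obs = Phi[observed], image[observed]
L = np.linalg.norm(Phi_obs, 2) ** 2
lam = 0.05
a = np.zeros(n_basis)
for _ in range(300):
    a -= Phi_obs.T @ (Phi_obs @ a - I_obs) / L
    a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)

reconstruction = Phi @ a                          # includes the missing pixels
err = np.linalg.norm(reconstruction[~observed] - image[~observed])
print("RMS error on the missing pixels: %.3f" % (err / np.sqrt((~observed).sum())))
```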
Stimulating the left region alone resu then have an elongated 'on' centre indicated by the interrupted lines in the excitation of the cell, whereas stimulat synapses are supposed to be excitatory. The receptive field of the cortical cell will regions together is without effec a straight line on the retina. All of these project upon a single cortical cell, and the both proposed to explain the prope the upper right in the figure, have receptive fields with 'on' centres arranged along scheme fields. A large number of lateral geniculate cells, of which four are illustrated in of a hypercomplex cell responding Text-fig. receptive 20. hypothetical simple The of organization illustrated explaining the has a complex field like thatdouble-stopped Text-fig. 19. Possible scheme for cell slit (such as that desc their exact retinal positions. An example of such a scheme is given in for the differ all have identical axis orientation, but would differ from one another inin Figs. 16 and 17, except or the hypercomplex ce having cells with simple fields as their afferents. These simple fields wouldin orientation, in Fig. 27). The cell re noted in Part I suggests that cells with complex fields are of higher order,small spikes input from a complex cell body. Rather, the correspondence between simple and complex fieldsexcitatory posing that these cells receive afferents directly from the lateral geniculate vertically oriented field is indicated The properties of complex fields are not easily accounted for by sup-left by a continuous rectangle; two occur. tional complex cells inhibitory to the not been distinguished, but there is no reason to think that both do not complex cell have vertically oriented CAT VISUAL CORTEX1 143flanking cortical fields. the first one above and Simple Neocognitron (Fukushima 1980) image feature pooling feature pooling objects extraction extraction “LeNet” (Yann LeCun et al., 1989) articles ‘HMAX’ (Riesenhuber & Poggio, 1999; Serre, Wolf & Poggio, 2005) n extension of simple cells4, ‘S’ units in the matching, solid nits6, performonlinear MAX he cell’s inputs odel’s propersummation of e two types of invariance to different posiover afferents ap in space e ‘complex’ mulus size, nvariance! a simplified View-tuned cells Complex composite cells (C2) Composite feature cells (S2) Complex cells (C1) Simple cells (S1) weighted sum MAX network to recover the data from the code. minimizing the discrepancy between the original data and its reconstruction. The required gradients are easily obtained by using the chain rule to backpropagate error derivatives first through the decoder network and then through the encoder network (1). The whole system is Deep Belief Networks Department of Computer Science, University of Toronto, 6 King’s College Road, Toronto, Ontario M5S 3G4, Canada. (Hinton & Salakhutdinov, 2006) *To whom correspondence should be addressed; E-mail: [email protected] Decoder 30 W4 500 Top RBM T T W 1 +ε 8 W1 2000 2000 T 500 1000 1000 W3 1000 T W 2 +ε 7 W2 T T W 3 +ε 6 W3 RBM 500 500 T W4 30 1000 W4 W2 2000 Code layer 500 RBM W3 1000 W2 2000 2000 W1 W1 T W 4 +ε 5 30 W 4 +ε 4 500 W 3 +ε 3 1000 W 2 +ε 2 2000 W 1 +ε 1 It is difficult nonlinear autoen hidden layers (2– autoencoders typi with small initial early layers are train autoencoder the initial weights gradient descent initial weights req algorithm that lea time. We introduc for binary data, ge and show that it data sets. An ensemble ages) can be mo work called a Bre (RBM) (5, 6) in w are connected t detectors using s nections. 
The pi units of the RB observed; the fea Bhidden[ units. A the visible and h given by X Eðv, hÞ 0 j iZpix j RBM Pretraining X i, j Encoder Unrolling Fine-tuning Fig. 1. Pretraining consists of learning a stack of restricted Boltzmann machines (RBMs), each where vi and hj a and feature j, bi a the cortex. They also demonstrate that convolutional DBNs (Lee et al., 2009), trained on aligned images of faces, can learn a face detector. This result is interesting, but unfortunately requires a certain degree of Le etconstruction: al. 2012)their training supervision(Quoc during dataset images (i.e., Caltech 101 images) are aligned, homogeneous and belong to one selected category. ‘Google Brain’ logical and co Lyu & Simonc As mentioned of local conne ments, the fir pixels and the lapping neighb The neurons in input channel second sublay (or map).3 W responses, the the sum of th is known as L normalization Our style of ules, switch ance layers, HMAX (Fuk 1998; Riesenh been argued t brain (DiCarlo pooling sparse coding Figure 1. The architecture and parameters in one layer of Although we not convoluti across differe We visualize histograms of activation values for face images and random images in Figure 2. It can be seen, even with exclusively unlabeled data, the neuron learns to differentiate between faces and random distractors. Specifically, when we give a face as an input image, the neuron tends to output value larger than the threshold, 0. In contrast, if we give a random image as an input image, the neuron tends to output value less than 0. ‘Google Brain’ (Quoc Le et al. 2012) Figure 2. Histograms of faces (red) vs. no faces (blue). The test set is subsampled such that the ratio between faces and no faces is one. Figure 3. Top: Top 48 stimuli of the best neuron from test set. Bottom: The optimal stimulus according t merical constraint optimization. 4.5. Invariance properties We would like to assess the robustness of the fac tector against common object transformations, logical and computational models (Pinto the cortex. They also demonstrate that convolutional et al., 2008; translation, scaling and out-of-plane rotation. F 2 In this section, we will present two visualization techLyu & Simoncelli, 2008; Jarrett et al.,we2009). DBNs (Lee et al., 2009), trained on aligned images of chose a set of 10 face images and perform di niques verify the optimal stimulus of the neuron is faces, can learn a face detector. Thisto result is if interAs mentioned above, central to our approach use e.g., scaling and translating. For tions tois the them, esting, but unfortunately requires a certain of method is visualizing the most indeed a face.degree The first of local connectivity between neurons. In our experiof-plane rotation, we used 10 images of faces rota supervision during dataset construction: their training responsive stimuli in thements, test set. Since the test the first sublayer has set receptive of 18x18 infields 3D (“out-of-plane”) as the test set. To check th images (i.e., Caltech 101 images) are aligned, homoge- can reliably detect near optimal is large, this method pixels and the second sub-layer poolsbustness over 5x5ofoverthe neuron, we plot its averaged resp neous and belong to one selected category. stimuli of the tested neuron. second approach lapping The neighborhoods of features (i.e., pooling size). 
"Google Brain" (Quoc V. Le et al., 2012), "Building high-level features using large-scale unsupervised learning." The network stacks a series of uniform modules that alternate between selectivity and tolerance (pooling) layers, reminiscent of the Neocognitron and HMAX (Fukushima & Miyake, 1982; LeCun et al., 1998; Riesenhuber & Poggio, 1999) and argued to resemble the architecture of the brain (DiCarlo et al., 2012). Receptive fields are local but not convolutional: the parameters are not shared across locations in the image. In the first layer, the first sublayer has 18x18-pixel receptive fields connecting to all input channels, and the second sublayer pools over 5x5 overlapping neighborhoods of features within a single channel; where the first sublayer outputs linear filter responses, the pooling layer outputs the square root of the sum of the squares of its inputs (L2 pooling). Trained on exclusively unlabeled data, one neuron learns to differentiate faces from random distractors: face images tend to drive its output above threshold and random images below, as shown by histograms of activation values for faces versus non-faces, by the top 48 most responsive test-set stimuli, by the optimal stimulus found through numerical constrained optimization, and by averaged responses under translation, scaling, and out-of-plane (3D) rotation of a set of 10 face images. Convolutional DBNs trained on aligned face images can also learn a face detector (Lee et al., 2009), but that result requires a degree of supervision during dataset construction, since the training images are aligned, homogeneous, and drawn from a single category.

Deep learning (Krizhevsky, Sutskever & Hinton, 2012): image → layer 1 (96x55x55) → layer 2 (256x27x27) → layer 3 (384x13x13) → layer 4 (384x13x13) → layer 5 (256x13x13) → layer 6 (4096) → layer 7 (4096) → classification (1000). 60 million weights! (Performance graph credit: Matt Zeiler, Clarifai.)

Place cells, grid cells, face cells, ... "Gabor filters." Can deep learning provide insights into cortical representation? (And vice versa?)

Deep learning (Krizhevsky, Sutskever & Hinton, 2012): learned first-layer filters.

Visualization of filters learned at intermediate layers: "Visualizing and Understanding Convolutional Networks" (Zeiler & Fergus, 2013), layers 2, 3, 4, and 5.

http://cs.stanford.edu/people/karpathy/cnnembed/

Visual metamers of a deep neural network (Nguyen, Yosinski & Clune, 2014): evolved images that are unrecognizable to humans, yet the network classifies them with high confidence.

Binding by synchrony (Reichert & Serre, 2014). [Figure: simple segmentation from phases. The 3-shapes set consists of binary shapes (square, triangle, rotated triangle); in the MNIST+shape dataset an MNIST digit and a shape are drawn into each image. Panels show the input, the visible states after synchronization, the phase mask (phases 0 to 2π), and the decoded output.]

Learning how and where to attend (Cheung, Weiss & Olshausen, work in progress). The model is built from a glimpse network, a recurrent network, a location network, and a classification network; the glimpse network feeds the recurrent network, from which the location and classification networks receive their input, and the whole system is trained end-to-end using backpropagation to minimize classification error on a modified MNIST dataset. Remarkably, the model learns to perform a visual search over the image, correcting mistaken movements caused by distractors; in the diagram, the red line shows the movements (saccades) made by the attention model over the MNIST dataset modified for visual search. Furthermore, the model's "cone cells" tile themselves in a fashion similar to those found in the human retinae: a high-acuity region at the center (low-variance Gaussians) surrounded by low acuity (high-variance Gaussians). These initial results indicate the possibility of using deep learning as a mechanism to discover the optimal tiling of cone cells in a data-driven manner, and with the emergent visual search behavior we can also investigate the optimal saliency-map features for selecting where to attend next. (A toy sketch of such a sampling lattice follows.)
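A minimal sketch of the kind of learnable "retinal" sampling lattice such a model can use: each "cone" is a Gaussian kernel with its own centre and width, and a glimpse is the vector of image values sampled through those kernels at the current fixation point. This is our own toy illustration of the idea, not the authors' implementation; in the real model the kernel positions and widths are parameters learned by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def glimpse(image, fixation, centres, sigmas):
    """Sample an image through a lattice of Gaussian 'cone' kernels.
    fixation: (row, col) of the current fixation point.
    centres:  (n_cones, 2) kernel offsets relative to the fixation.
    sigmas:   (n_cones,) kernel widths in pixels."""
    h, w = image.shape
    rows, cols = np.mgrid[0:h, 0:w]
    out = np.zeros(len(centres))
    for k, ((dy, dx), s) in enumerate(zip(centres, sigmas)):
        kernel = np.exp(-((rows - fixation[0] - dy) ** 2 +
                          (cols - fixation[1] - dx) ** 2) / (2 * s ** 2))
        kernel /= kernel.sum()
        out[k] = np.sum(kernel * image)      # one "cone" response
    return out

# A foveated lattice: small, tightly packed kernels near the centre,
# larger and sparser kernels in the periphery (the layout the model learns).
radii = np.repeat([0, 3, 8, 16], [1, 6, 6, 6])
angles = np.concatenate([[0], np.tile(np.linspace(0, 2 * np.pi, 6, endpoint=False), 3)])
centres = np.stack([radii * np.sin(angles), radii * np.cos(angles)], axis=1)
sigmas = 0.5 + 0.25 * radii                  # acuity falls off with eccentricity

image = rng.standard_normal((64, 64))
features = glimpse(image, fixation=(32, 32), centres=centres, sigmas=sigmas)
print("glimpse vector shape:", features.shape)   # (19,)
```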
The goal of computer vision. In order to gain new insights about visual representation we must consider the tasks that vision needs to solve:
• How and why did vision evolve?
• What do animals use vision for?

Visual navigation in box jellyfish. [Figure: the rhopalia of the box jellyfish contain heavy crystals that keep them oriented so that the upper lens eyes point straight upward regardless of body orientation; modeling of the peripheral visual field and the angular sensitivity of the photoreceptors shows where these eyes are directed.] Other examples: the jumping spider, the sand wasp.

What do you see? Lorenceau & Shiffrar (1992); Murray, Kersten, Schrater, Olshausen & Woods (2002).

Perceptual "explaining away" (Kersten & Yuille, 2003): in the graphical model on the slide, target object(s) and occluding object(s) both give rise to the image measurements, while auxiliary image measurements bear on the occluder; evidence about the occluder can thus "explain away" otherwise ambiguous measurements of the target.

Just as principles of optics govern the design of eyes, so do principles of information processing govern the design of brains.