Lecture Slides File - Moodle
Transcription
Unsupervised and Reinforcement Learning in Neural Networks, Fall 2015
Reinforcement learning and Biology
- Biological insights
- Review: eligibility traces and continuous states
- Algorithmic application
- Biological interpretation
- Q-learning vs V-learning
- Biological application: rodent navigation

Learning by reward
The rat learns to find the reward.

Learning by reward: conditioning
The dog likes to do the things for which it gets a reward.
Trial-and-error learning; instrumental conditioning = action learning.

Detour: classical conditioning (Pavlov, 1927)
Click-clack? ... Reward! After repeated pairing, the click-clack predicts the reward.

Learning by reward: conditioning with the click-clack
The dog likes to do the things for which it gets a click-clack.
Trial-and-error learning (the click-clack replaces the explicit reward).

Learning by reward: humans?
Chocolate, money, praise: what is the reward?
Learning actions, skills and tricks by reward. (Images taken from public web pages.)

Unsupervised vs. reinforcement learning; review: Hebbian learning
"When an axon of cell j repeatedly or persistently takes part in firing cell i, then j's efficiency as one of the cells firing i is increased." (Hebb, 1949)
- local rule at the synapse w_ij from presynaptic neuron j to postsynaptic neuron i
- driven by simultaneous activity (correlations)
Hebbian learning = unsupervised learning.

Detour: receptive field development
Hebbian learning leads to specialized neurons (developmental learning; see Lectures 1-3).
Initially the connections are random and the output neurons are unselective; with correlated input, the output neurons specialize.

Hebbian learning of actions
A cue (the input, the current state) leads to an action choice, and the action leads to a result:
- green light, go left -> no reward
- green light, go right -> reward
- red light, go right -> reward
Hebbian learning fails here: it strengthens whatever is correlated, independent of reward.

Reinforcement learning = reward + Hebb
Not all actions are successful (SUCCESS vs. shock); you only want to learn the rewarding ones.
\Delta w_{ij} \propto F(\mathrm{pre}, \mathrm{post}, \mathrm{SUCCESS})
where pre and post are local factors and SUCCESS is a global factor.
Most models of Hebbian learning/STDP do not take neuromodulators into account and cannot describe success or shock.
Exceptions: e.g. Schultz et al. 1997; Wickens 2002; Izhikevich 2007.

Synaptic plasticity and Hebbian learning
- could/should be good for memory formation
- could/should be good for action learning
BUT, theoretical postulate: we need to include a global factor for feedback (a neuromodulator?),
\Delta w_{ij} \propto F(\mathrm{pre}, \mathrm{post}, \mathrm{3rd\ factor}).
Do such three-factor rules exist? (A code sketch of a three-factor rule follows below.)
Reinforcement learning = reward + Hebb + success signal.

Classification of plasticity: unsupervised vs reinforcement
LTP/LTD/Hebb; theoretical concept: passive changes that exploit the statistical correlations between pre (j) and post (i). Functionality: useful for development (wiring of receptive fields).
Reinforcement learning; theoretical concept: changes conditioned on a success signal, maximising reward. Functionality: useful for learning a new behavior.
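As a concrete illustration of such a rule, here is a minimal Python sketch of a three-factor (reward-modulated Hebbian) update. The factorization F = success * post * pre, the function name three_factor_update and all parameter values are illustrative assumptions, not something specified on the slides.

```python
import numpy as np

def three_factor_update(w, pre, post, success, eta=0.01):
    """Sketch of a reward-modulated Hebbian rule: dw_ij ~ F(pre, post, SUCCESS).

    Here F is assumed to factorize as success * post_i * pre_j:
    w       -- weight matrix, shape (n_post, n_pre)
    pre     -- presynaptic activity vector r_j (local factor)
    post    -- postsynaptic activity vector r_i (local factor)
    success -- global scalar signal (reward / neuromodulator)
    """
    # Local Hebbian term (outer product of post- and presynaptic rates),
    # gated by the global success factor.
    return w + eta * success * np.outer(post, pre)

# Toy usage: with success = 0 the correlation is NOT learned (unlike plain Hebb);
# with success = 1 the co-active pre/post pair is strengthened.
w = np.zeros((2, 3))
pre = np.array([1.0, 0.0, 0.5])
post = np.array([0.0, 1.0])
w_no_reward = three_factor_update(w, pre, post, success=0.0)
w_rewarded = three_factor_update(w, pre, post, success=1.0)
```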
Unsupervised and Reinforcement Learning in Neural Networks: Reinforcement learning and Biology
(Outline recap; next part: review of eligibility traces and continuous states.)

SARSA algorithm
Initialise the Q-values and start from an initial state s.
1) Being in state s, choose action a according to the policy.
2) Observe the reward r = R^{a}_{s \to s'} and the next state s'.
3) Choose action a' in state s' according to the policy.
4) Update \Delta Q(s,a) = \eta\,[\, r - (Q(s,a) - \gamma\, Q(s',a')) \,].
5) s' becomes s; a' becomes a.
6) Go to 1).

Eligibility trace
Notation: s = state, a = action, R^{a}_{s\to s'} = reward, s' = new state.
SARSA: \Delta Q(s,a) = \eta\,[\, r - (Q(s,a) - \gamma\, Q(s',a')) \,].
Idea: at the moment of reward, also update the values of previous actions along the trajectory.

Eligibility trace: algorithm
Sarsa(0): \Delta Q(s,a) = \eta\, \delta_t with \delta_t = r_{t+1} - [\, Q(s,a) - \gamma\, Q(s',a') \,].
Sarsa(\lambda): \Delta Q_{aj} = \eta\, \delta_t\, e_{aj}, with
e_{aj}(t) = \gamma\lambda\, e_{aj}(t-\Delta t) + 1 if a = action taken in state j,
e_{aj}(t) = \gamma\lambda\, e_{aj}(t-\Delta t) else.

Continuous states: example
Actions: a1 = right, a2 = left.
(Figure: Q(s,a1) and Q(s,a2) plotted as surfaces over the continuous (x, y) state space.)

Continuous states: function approximation
Q(s,a) = \sum_j w_{aj}\, r_j(s) = \sum_j w_{aj}\, \Phi(s - s_j)
with basis functions \Phi centred at points s_j (a spatial representation of the state).

Eligibility and continuous states: TD(\lambda)
TD error in SARSA: \delta_t = R_{t+1} - [\, Q(s,a) - \gamma\, Q(s',a') \,].
Function approximation: Q(s,a) = \sum_{i=1}^{n} w_{ai}\, r_i(s) = \sum_{i=1}^{n} w_{ai}\, \Phi(s - s_i).
Synaptic update: \Delta w_{aj} = \eta\, \delta_t\, e_{aj}, with
e_{aj}(t) = \Phi(s - s_j) if a = action taken, and 0 otherwise.

ALGO: eligibility traces and continuous states, TD(\lambda)
0) Initialise.
1) Use the policy for the action choice a; most often pick a^{*}_t = \arg\max_a Q_w(s,a), with Q_w(s,a) = \sum_{i=1}^{n} w_{ai}\, r_i(s).
2) Observe R and s', and choose the next action a'.
3) Calculate the TD error (SARSA): \delta_t = R_{t+1} - [\, Q(s,a) - \gamma\, Q(s',a') \,].
4) Update the eligibility trace: e_{aj}(t) = \gamma\lambda\, e_{aj}(t-\Delta t) + r_j if a = action taken, + 0 otherwise.
5) Update the weights: \Delta w_{aj} = \eta\, \delta_t\, e_{aj}.
6) The old a' becomes a, the old s' becomes s; return to 2).
(A code sketch of this algorithm follows below.)

Unsupervised and Reinforcement Learning in Neural Networks: Reinforcement learning and Biology
(Outline recap; next part: algorithmic application.)

Algorithmic application: backgammon
15 white and 15 black pieces, 24 locations on the board.
\Delta w = \eta\,[\, r - (V(s) - \gamma\, V(s')) \,]\, e, \quad e_t = \lambda\, e_{t-1} + \ldots
(Figure: board and input encoding; pieces per location grouped as 1, 2, 3, \ge 4.)
Blackboard.
Strategy:
1. Throw the dice.
2. Estimate the value V(s') of all possible next states s'.
3. Choose the move leading to the state with maximal V(s').
Initial training: the program plays against itself.

Unsupervised and Reinforcement Learning in Neural Networks: Reinforcement learning and Biology
(Outline recap; next part: biological interpretation.)

ALGO: eligibility traces and continuous states, TD(\lambda)
(Recap of the algorithm above, steps 0-6, as the starting point for the biological interpretation.)

Eligibility and continuous states: TD(\lambda)
Eligibility trace (a memory at the synapse):
e_{aj}(t) = \gamma\lambda\, e_{aj}(t-\Delta t) + r_j \cdot 1 if a = \hat{a} (the action taken), and + r_j \cdot 0 else;
compactly, e_{aj}(t) = \gamma\lambda\, e_{aj}(t-\Delta t) + r_j\, \delta_{a,\hat{a}}.
Weight update: \Delta w_{aj} = \eta\, \delta_t\, e_{aj}.
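To make steps 0) to 6) concrete, here is a minimal Python sketch of SARSA(lambda) with linear function approximation over Gaussian basis functions. The toy LineWorld environment, the epsilon-greedy policy with random tie-breaking and all parameter values are illustrative assumptions; only the update equations follow the slides.

```python
import numpy as np

class LineWorld:
    """Toy environment: the agent moves left/right on [0, 1]; reward 1 at the right end."""
    def reset(self):
        self.s = 0.1
        return self.s
    def step(self, a):                      # a = 0: left, a = 1: right
        self.s = float(np.clip(self.s + (0.05 if a == 1 else -0.05), 0.0, 1.0))
        done = self.s >= 1.0
        return self.s, (1.0 if done else 0.0), done

centres = np.linspace(0.0, 1.0, 21)         # centres s_j of the basis functions

def features(s, sigma=0.05):
    """r_j(s) = Phi(s - s_j): Gaussian basis functions over the continuous state."""
    return np.exp(-(centres - s) ** 2 / (2 * sigma ** 2))

def policy(w, r_s, epsilon=0.1):
    """Mostly pick a* = argmax_a Q(s,a) (ties broken at random); sometimes explore."""
    q = w @ r_s
    if np.random.rand() < epsilon:
        return np.random.randint(len(q))
    return int(np.random.choice(np.flatnonzero(q >= q.max() - 1e-12)))

def sarsa_lambda(env, episodes=50, eta=0.1, gamma=0.95, lam=0.9):
    w = np.zeros((2, len(centres)))         # weights w_aj: Q(s,a) = sum_j w[a,j] r_j(s)
    for _ in range(episodes):
        e = np.zeros_like(w)                # eligibility traces e_aj
        s = env.reset(); r_s = features(s)
        a = policy(w, r_s)                  # 1) choose action a from the policy
        for _ in range(400):                # cap the episode length
            s2, R, done = env.step(a)       # 2) observe R and s'
            r_s2 = features(s2)
            a2 = policy(w, r_s2)            # 2) choose the next action a'
            q = w[a] @ r_s
            q2 = 0.0 if done else w[a2] @ r_s2
            delta = R - (q - gamma * q2)    # 3) TD error (SARSA)
            e *= gamma * lam                # 4) decay all traces ...
            e[a] += r_s                     #    ... and add r_j for the action taken
            w += eta * delta * e            # 5) weight update
            if done:
                break
            s, r_s, a = s2, r_s2, a2        # 6) s' -> s, a' -> a
    return w

w = sarsa_lambda(LineWorld())               # Q-values for "right" grow toward the rewarded end
```

Because the eligibility trace is stored per weight, a reward at the end of an episode still credits the state-action pairs visited shortly before it.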
Biological interpretation
Postsynaptic action cells interact through a winner-takes-all mechanism; presynaptic neurons provide the state input.
Eligibility trace (a memory at the synapse), written with pre- and postsynaptic activity:
e_{aj}(t) = \lambda\, e_{aj}(t-\Delta t) + r_j\, r_a if a = \hat{a} (the action taken), and e_{aj}(t) = \lambda\, e_{aj}(t-\Delta t) otherwise (the trace only decays).

Eligibility and continuous states: TD(\lambda)
(Recap: \delta_t = R_{t+1} - [\,Q(s,a) - \gamma\, Q(s',a')\,], Q_w(s,a) = \sum_{i=1}^{n} w_{ai}\, r_i(s), \Delta w_{aj} = \eta\, \delta_t\, e_{aj}, e_{aj}(t) = \gamma\lambda\, e_{aj}(t-\Delta t) + r_j\, \delta_{a,\hat{a}}.)

Classification of plasticity: unsupervised vs reinforcement
(Recap of the earlier slide: Hebbian plasticity exploits correlations and is useful for development; reinforcement learning conditions the changes on success and is useful for learning a new behavior.)

Plasticity models: unsupervised vs reinforcement
- STDP/Hebb (model: STDP, see the previous lecture): \Delta w_{ij} \propto \mathrm{pre} \cdot \mathrm{post} + \text{other terms}.
- Reinforcement learning (protocol: maximise reward; model: a spike-based model, later lecture?): \Delta w_{ij} \propto \mathrm{success} \cdot (\mathrm{pre} \cdot \mathrm{post} + \text{other terms}).

Timing issues in reinforcement learning
The success signal is delayed: spikes act on a time scale of 1-10 ms, while the reward arrives with a delay of 100-5000 ms.
-> We need a memory trace at the synapse: the eligibility trace.

Reward-based action learning
A connection is reinforced if action a taken in state s is successful (success signal).
Function approximation: Q_w(s,a) = \sum_{i=1}^{n} w_{ai}\, r_i(s); an update of Q is an update of the weights w_{aj}.
\Delta Q(s,a) = \eta\,[\, r - (Q(s,a) - \gamma\, Q(s',a')) \,] = \eta\, \delta.
Synaptic update for the current action a: \Delta w_{aj} = \eta\, \delta_t\, r_j if a is taken; equivalently \Delta w_{aj} = \eta\, \delta_t\, r_j\, r_a.

Reward-based action learning and dopamine
Success = reward minus expected (predicted) reward; \Delta w_{aj} = \eta\, \delta_t\, e_{aj}.
(Figure: recordings from dopamine neurons, W. Schultz; panels for an unpredicted reward, a predicted reward, and an omitted reward.)

Blackboard. Exercise at home: dopamine as the delta signal
Consider a state diagram with N possible states leading up to the light and N states after the reward.
Assume that at the beginning the animal receives the reward stochastically; later the reward always arrives 2 seconds after the light.
Calculate the dopamine signal, if dopamine codes for \delta = R - [\, V^{\pi}(s) - \gamma\, V^{\pi}(s') \,].
(A simulation sketch of this exercise follows below.)

Reinforcement learning and dopamine
Protocol: maximise reward. Model: \Delta w_{ij} \propto \mathrm{success} \cdot (\mathrm{pre} \cdot \mathrm{post} + \text{other terms}), with dopamine as the success signal.

Reward-based action learning: example
Example action a = north; success = reward - expected reward. The state is encoded by the activity r(s).
\Delta w_{aj} = \eta\, \delta_t\, e_{aj}, with eligibility trace e_{aj}(t) = r_j(s_t)\, r_a(t) + g\, e_{aj}(t-1).

Reward-based action learning: learning rule
\Delta w_{aj} = \eta\, \delta_t\, e_{aj}; the success signal is dopamine; the state is given by a spatial representation.

Reward-based action learning: molecular mechanism?
Changes in synaptic connections: the success signal (dopamine) acts on the synaptic signalling cascade.
(Figure: presynaptic vesicles; Na+ and Ca2+ influx through AMPA and NMDA receptors; CaMKII; PKA; CREB; gene expression.)
CaMKII = Ca2+/calmodulin-dependent protein kinase II; PKA = cAMP-dependent protein kinase.
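The following Python sketch simulates the home exercise above. One modelling assumption is made explicit in the code: the states before the light carry no predictive value (the light arrives at an unpredictable moment), so only the states from the light onward are updated; the state counts, gamma, the learning rate and the number of trials are likewise illustrative choices.

```python
import numpy as np

N = 5                         # N states before the light and N states after the reward
T = 2 * N + 3                 # light at t = N, reward two steps (~2 s) later at t = N + 2
light_t, reward_t = N, N + 2
gamma, eta = 0.95, 0.2

V = np.zeros(T + 1)           # V(s) for every state in the chain (terminal value stays 0)

def run_trial(reward_given=True):
    """One pass through the chain; returns the delta (putative dopamine) trace."""
    delta = np.zeros(T)
    for t in range(T):
        R = 1.0 if (reward_given and t == reward_t) else 0.0
        delta[t] = R - (V[t] - gamma * V[t + 1])   # delta = R - [V(s) - gamma V(s')]
        if t >= light_t:                           # pre-light states stay unpredictive
            V[t] += eta * delta[t]                 # TD(0) update of the value estimate
    return delta

print(run_trial())                      # early trials: delta peaks at the reward time
for _ in range(300):                    # train until the reward is well predicted
    run_trial()
print(run_trial())                      # now delta appears at the light onset (the cue)
print(run_trial(reward_given=False))    # omitted reward: negative delta (dopamine dip)
```

Early trials reproduce a delta peak at the reward; after training the peak moves to the light, and omitting the reward produces a negative delta, matching the dopamine recordings discussed above.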
Unsupervised and Reinforcement Learning in Neural Networks: Reinforcement learning and Biology
(Outline recap; next part: Q-learning vs V-learning and neuroeconomics.)

Action values Q versus state values V: Bellman equation
Q(s,a) = \sum_{s'} P^{a}_{s \to s'} \Big[ R^{a}_{s \to s'} + \gamma \sum_{a'} \pi(s',a')\, Q(s',a') \Big]
V^{\pi}(s) = \sum_{a} \pi(s,a)\, Q(s,a)
V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{s \to s'} \Big[ R^{a}_{s \to s'} + \gamma \sum_{a'} \pi(s',a')\, Q(s',a') \Big]
V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{s \to s'} \big[ R^{a}_{s \to s'} + \gamma\, V^{\pi}(s') \big]

V-values represented in the brain: fMRI and neuroeconomics
(Figure: state s with actions a1, a2 leading to state s'; the action values Q(s',a1), Q(s',a2) versus the state value V(s').)

Exercise at home: action values Q versus state values V
Derive the learning algorithm for V, starting from
V^{\pi}(s) = \sum_{a} \pi(s,a)\, Q(s,a) and V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{s \to s'} \big[ R^{a}_{s \to s'} + \gamma\, V^{\pi}(s') \big].

(Figure: conditioning experiment of Seymour, O'Doherty, Dayan et al., 2004; cues A-D with transition probabilities of 50%, 82%, 18% and 100% leading to a big or small negative outcome (-R_A, -R_B); fitted value V and TD-error signals.)

How would you decide? (Figure: alternative paths to a goal.)
Risk-seeking vs. risk-averse choices in the brain ('neuroeconomics').

How would you decide? 750 Euro tomorrow, or 50'000 Euro in 12 months?
Such choices measure the 'discount factor' ('neuroeconomics').

How would you decide? Lottery A: 110 CHF with 50% probability. Lottery B: 50 CHF with 99.99% probability.
Such choices measure 'value vs. risk' ('neuroeconomics').

Unsupervised and Reinforcement Learning in Neural Networks: Reinforcement learning and Biology
(Outline recap; next part: biological application, rodent navigation.)

Biological principles of learning: spatial learning
Spatial representation: place cells are sensitive to the rat's location; the set of place fields forms a map of the environment in the brain.
(Figure: environment box, recorded neurons and their place fields.)

Neurophysiology of the rat hippocampus
Areas CA1, CA3 and DG of the rat brain; electrode recordings from pyramidal cells (synapses, axon, soma, dendrites) reveal place fields.

Hippocampal place cells (O'Keefe)
The place field depends on visual cues, but the cells also fire in the dark.

Behavior: navigation to a hidden goal (Morris water maze; Morris 1981)
- invisible platform in the water pool
- distal visual cues

Behavior: navigation to a hidden goal (Morris water maze; Foster, Morris & Dayan 2000)
- different starting positions
- task learning depends on the hippocampus

Space representation: place cells (hippocampus; O'Keefe & Dostrovsky 1971)
- code for the position of the animal
- learn the action towards the goal

Reward-based action learning in the water maze
A connection is reinforced if action a taken in state s is successful.
Action cells code for movement directions (e.g. a = north, west, south-west, south, east); success = reward - expected reward; the state is the place-cell activity r(s).
\Delta w_{aj} = \eta\, \delta_t\, e_{aj}, with e_{aj}(t) = r_j(s_t)\, r_a(t) + g\, e_{aj}(t-1).
(Figure: preferred directions of the action cells before and after many trials.)
A code sketch of this place-cell-based navigation model follows below.

Results: Morris water maze
Experiments (Foster, Morris & Dayan 2000) compared with the model (Sheynikhovich et al., July 2009).

Validating the model
The KHEPERA mobile miniature robot at EPFL; experimental arena of 80 x 80 cm (422 x 316 pixels); aim: a map of actions (Biol. Cybern., 2000).
(Figure: robot trajectory with obstacles and goal; navigation map after 20 training trials.)
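Here is a minimal Python sketch of the place-cell-based navigation learning described above: Gaussian place cells tile a square arena, there is one action cell per movement direction, and the weights from place cells to action cells are updated with \Delta w_{aj} = \eta\, \delta_t\, e_{aj} using the trace e_{aj}(t) = r_j(s_t)\, r_a(t) + g\, e_{aj}(t-1), where r_a = 1 for the chosen action cell (a winner-takes-all readout). Arena size, place-field width, the goal criterion, the soft action-selection rule and all parameter values are illustrative assumptions rather than those of the published model.

```python
import numpy as np

rng = np.random.default_rng(1)
N_PC, N_ACT = 100, 8
centres = rng.random((N_PC, 2))                       # place-field centres in a unit arena
angles = np.linspace(0.0, 2.0 * np.pi, N_ACT, endpoint=False)
moves = 0.05 * np.stack([np.cos(angles), np.sin(angles)], axis=1)  # displacement per action
goal = np.array([0.8, 0.8])                           # hidden platform
w = np.zeros((N_ACT, N_PC))                           # Q(s,a) = sum_j w[a,j] r_j(s)
eta, gamma, g, sigma = 0.1, 0.95, 0.9, 0.1

def place_rates(pos):
    """Gaussian place-cell activity r_j(s) around each field centre."""
    return np.exp(-np.sum((centres - pos) ** 2, axis=1) / (2 * sigma ** 2))

def choose_action(q, beta=2.0):
    """Soft winner-takes-all over the action cells."""
    p = np.exp(beta * (q - q.max()))
    return int(rng.choice(N_ACT, p=p / p.sum()))

def run_trial(max_steps=1000):
    """One swim from a random start; returns the number of steps to the platform."""
    global w
    pos = rng.random(2)
    e = np.zeros_like(w)                              # eligibility traces e_aj
    r = place_rates(pos); a = choose_action(w @ r)
    for step in range(max_steps):
        new_pos = np.clip(pos + moves[a], 0.0, 1.0)
        at_goal = np.linalg.norm(new_pos - goal) < 0.1
        R = 1.0 if at_goal else 0.0
        r2 = place_rates(new_pos); a2 = choose_action(w @ r2)
        q, q2 = w[a] @ r, (0.0 if at_goal else w[a2] @ r2)
        delta = R - (q - gamma * q2)                  # TD error: success minus expected success
        e = g * e; e[a] += r                          # e_aj = r_j * r_a + g * e_aj (r_a = 1 for chosen cell)
        w = w + eta * delta * e                       # dw_aj = eta * delta * e_aj
        if at_goal:
            return step + 1
        pos, r, a = new_pos, r2, a2
    return max_steps

escape_latency = [run_trial() for _ in range(30)]     # escape latencies tend to shrink over trials
```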
Unsupervised and Reinforcement Learning in Neural Networks: Reinforcement learning and Biology
(Outline recap; next: detour on the unsupervised learning of place cells.)

Detour: unsupervised learning of place cells
Space representation: place cells (hippocampus)
- code for the position of the animal: now the place cells themselves are learned (unsupervised learning of place cells)
- learn the action towards the goal (reinforcement learning of actions)

Model network of navigation: vision
Simulated robot with a 300-degree view field.

Model: storing views of visited places
The robot moves in an environment and receives visual input at each time step.
A local view is the activation of a set of 9000 Gabor wavelets: L(p, \Phi) = \{ F_k \}.
A single view cell stores one local view (unsupervised competitive learning). Blackboard.
During exploration of the environment, all local views are stored in an incrementally growing population of view cells.

Model: extracting direction
The stored local view of view cell i (taken at gaze direction \Phi_i upon entry) is compared with the new local view; the alignment of the two views (\Delta\Phi) yields the current gaze direction.

Model: extracting position
A small difference between local views means spatially close positions.
Difference between the sets of filter responses: \Delta L = \| F_i - F(t) \|.
Similarity measure (view-cell activity): r_i^{VC} = \exp\!\big( -(\Delta L)^2 / (2\sigma_{vc}^2) \big).

Model: learning place cells
A population of place cells is created incrementally during exploration, driven by view cells (and grid cells).
Activity of a place cell: r_i^{pc} = \sum_j w_{ij}^{pc}\, r_j^{pre}.
Competitive Hebbian learning: \Delta w_{ij}^{pc} = \eta\, [\, r_i^{pc} - \vartheta \,]_+ \, ( r_j^{pre} - w_{ij}^{pc} ).
(Compare the lecture on competitive Hebbian learning; a code sketch of this rule follows below.)

Biological mechanisms: grid cells (dorsomedial entorhinal cortex; Hafting et al. 2005)
- firing grids extend over the whole space
- subpopulations with different grid sizes exist
- grid cells project to the CA1 area
- grid-cell firing fields rotate following the rotation of the cue
- biological locus of path integration

Results: learning the environment (unsupervised learning).

Space representation: place cells (recap)
- code for the position of the animal (unsupervised learning of place cells)
- learn the action towards the goal (reinforcement learning of actions)

Behavior in water mazes: navigation to a hidden goal
- Different tasks -> different navigation strategies: map-based (locale) vs. stimulus-response (taxon).
- Different tasks -> different brain areas: hippocampus vs. dorsal striatum.
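A minimal Python sketch of the two equations of the detour: the view-cell similarity r_i^{VC} = \exp(-(\Delta L)^2 / (2\sigma_{vc}^2)) and the competitive Hebbian place-cell update \Delta w_{ij}^{pc} = \eta\, [r_i^{pc} - \vartheta]_+ (r_j^{pre} - w_{ij}^{pc}). The feature dimensions, the threshold and all other parameter values are illustrative assumptions, not those of the original model.

```python
import numpy as np

rng = np.random.default_rng(0)

def view_cell_rates(F_t, F_stored, sigma_vc=1.0):
    """r_i^VC = exp(-(dL)^2 / (2 sigma_vc^2)), with dL = |F_i - F(t)|."""
    dL2 = np.sum((F_stored - F_t) ** 2, axis=1)
    return np.exp(-dL2 / (2.0 * sigma_vc ** 2))

def place_cell_update(W, r_pre, eta=0.05, theta=0.3):
    """One competitive Hebbian step: dw_ij = eta * [r_i^pc - theta]_+ * (r_j^pre - w_ij)."""
    r_pc = W @ r_pre                                    # r_i^pc = sum_j w_ij r_j^pre
    gate = np.maximum(r_pc - theta, 0.0)                # only sufficiently active cells learn
    W = W + eta * gate[:, None] * (r_pre[None, :] - W)  # move weights toward the current input
    return r_pc, W

# Toy usage: 20 stored views with 8 filter responses each, 5 place cells.
F_stored = rng.normal(size=(20, 8))                     # stored local views F_i (one per view cell)
F_t = F_stored[3] + 0.1 * rng.normal(size=8)            # a new view, close to view cell 3
r_vc = view_cell_rates(F_t, F_stored)                   # view-cell population activity
W = rng.uniform(0.0, 1.0, size=(5, 20))                 # place-cell weights onto the view cells
r_pc, W = place_cell_update(W, r_vc)                    # place-cell rates and updated weights
```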
Behavior: navigation to a hidden goal, but with a fixed starting position (Eichenbaum, Stewart & Morris 1990)
- The starting position is always the same.
- Task learning does not require the hippocampus.

Results: Morris water maze (model; map learned after 10 trials)
- The place strategy depends on the memorized representation of the room.
- It requires alignment of the stored map with the available cues.

Biological mechanisms: navigation (Mink 1986; Graybiel 1998)
- Hippocampus -> basal ganglia, ventral striatum / nucleus accumbens: place-based action learning, with the reward-prediction error carried by the VTA.
- Sensory input -> dorsal striatum: stimulus-response (S-R) learning, with the reward-prediction error carried by the SNc.

Model: place-cell-based vs landmark-based navigation (hidden goal)
- Place cells j -> (weights w_ij, reinforcement learning) -> action cells i in the nucleus accumbens: movement towards the goal.
- Visual filters -> (reinforcement learning) -> action cells i in the dorsal striatum: turning to the landmark.

Conclusion
- Unsupervised learning leads to an interaction between vision and path integration.
- Model of landmark-based navigation.
- Model of map-based navigation.
- Learning of actions: reinforcement learning.
- Learning of views and places: unsupervised learning.
Joint work with Denis Sheynikhovich and Ricardo Chavarriaga.