Unsupervised and Reinforcement Learning
In Neural Networks
Fall 2015
Reinforcement learning and Biology
- Biological Insights
- review: eligibility traces and continuous states
- Algorithmic application
- Biological interpretation
- Q-learning vs V-learning
- Biological application: rodent navigation
Learning by reward
Rat learns to find the reward
Learning by reward: conditioning
The dog likes to do things for which it gets a reward!
Trial and error learning
Instrumental conditioning: action learning
Detour: Classical Conditioning
Pavlov, 1927
[Figure: click-clack sound alone → ?; click-clack sound paired with reward]
Pavlov, 1927: same stuff as always
Learning by reward: conditioning
The dog likes to do things for which it gets a click-clack!
Trial and error learning
(the click-clack replaces the explicit reward)
Learning by reward: humans?
Chocolate!!
Money!!
Praise!!!!!
What is the reward?
Learning actions, skills, tricks, by reward
Images taken from public web pages
Unsupervised vs. reinforcement learning
review: Hebbian Learning
[Diagram: presynaptic neurons j, k → synapse w_ij → postsynaptic neuron i]
When an axon of cell j repeatedly or persistently
takes part in firing cell i, then j’s efficiency as one
of the cells firing i is increased
Hebb, 1949
- local rule
- simultaneously active (correlations)
Hebbian Learning
= unsupervised learning
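As a minimal illustration of the Hebbian idea (a sketch only, not the course's reference implementation): the weight grows with the product of pre- and postsynaptic activity; the learning rate, toy input and normalisation step are assumptions.

```python
import numpy as np

# Minimal Hebbian sketch: Delta w_ij ~ eta * pre_j * post_i (local, correlation-based).
eta = 0.01                                   # learning rate (assumed value)
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(2, 5))        # weights from 5 presynaptic to 2 postsynaptic cells

for _ in range(1000):
    pre = rng.random(5)                      # presynaptic rates (toy input)
    post = w @ pre                           # linear postsynaptic response
    w += eta * np.outer(post, pre)           # Hebb: strengthen co-active pre/post pairs
    w /= np.linalg.norm(w, axis=1, keepdims=True)  # normalise to keep weights bounded

print(w)
```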
Detour: Receptive field development
Hebbian learning leads to specialized neurons (developmental learning);
(Lectures 1-3)
[Figure: initially random connections from unselective neurons; with correlated input, the output neurons specialize]
Hebbian Learning of Actions
[Diagram: input (current state, cues) → action choice, e.g. "go left"]
cue – action → result:
green light – go left → no reward
green light – go right → reward
red light – go right → reward
Hebbian Learning fails!
Reinforcement Learning
= reward + Hebb
SUCCESS / shock
Not all actions are successful.
You only want to learn the rewarding stuff.
Δw_ij ∝ F(pre, post, SUCCESS)
(pre, post: local; SUCCESS: global)
Most models of Hebbian learning/STDP do not take neuromodulators into account
and cannot describe success/shock.
Except: e.g. Schultz et al. 1997, Wickens 2002, Izhikevich 2007
Synaptic Plasticity & Hebbian Learning
- could be/should be good for memory formation
- could be/should be good for action learning
BUT
Theoretical Postulate:
we need to include global factors for feedback (neuromodulator?)
Δw_ij ∝ F(pre, post, 3rd factor)
Do three-factor rules exist?
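A minimal sketch of such a three-factor rule (an illustration only, not the specific model postulated here): the local Hebbian term pre·post is gated by a global success signal. The multiplicative gating form, learning rate and toy numbers are assumptions.

```python
import numpy as np

def three_factor_update(w, pre, post, success, eta=0.01):
    """Sketch of Delta w_ij ~ F(pre, post, success):
    a local Hebbian term gated by a global (neuromodulatory) success signal."""
    return w + eta * success * np.outer(post, pre)

# Toy usage: the synapses change only when the global signal is non-zero.
w = np.zeros((2, 3))
pre, post = np.array([1.0, 0.5, 0.0]), np.array([0.8, 0.2])
w = three_factor_update(w, pre, post, success=1.0)   # rewarded trial -> weights change
w = three_factor_update(w, pre, post, success=0.0)   # unrewarded trial -> no change
print(w)
```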
Reinforcement Learning
= reward + Hebb
success
Classification of plasticity:
unsupervised vs reinforcement
LTP/LTD/Hebb
Theoretical concept
- passive changes
- exploit statistical correlations
[Diagram: presynaptic neuron j → postsynaptic neuron i]
Functionality
- useful for development (wiring for receptive fields)
Reinforcement Learning
Theoretical concept
- conditioned changes
- maximise reward
[Diagram: presynaptic neuron j → postsynaptic neuron i, gated by a success signal]
Functionality
- useful for learning a new behavior
Unsupervised and Reinforcement Learning
In Neural Networks
Reinforcement learning and Biology
- Biological Insights
- review: eligibility traces and continuous states
- Algorithmic application
- Biological interpretation
- Q-learning vs V-learning
- Biological application: rodent navigation
SARSA algorithm
Initialise Q values; start from initial state s.
1) Being in state s, choose action a according to the policy
2) Observe reward r = R^a_{s→s'} and next state s'
3) Choose action a' in state s' according to the policy
4) Update: ΔQ(s,a) = η [r − (Q(s,a) − γ Q(s',a'))]
5) s ← s'; a ← a'
6) Go to 1)
[Diagram: state s with action values Q(s,a1), Q(s,a2); successor state s' with Q(s',a1), Q(s',a2)]
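As a concrete illustration of steps 1)-6), here is a minimal tabular SARSA sketch; the toy chain environment, the ε-greedy policy and all parameter values are assumptions made for the example.

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    """Policy used in steps 1) and 3): mostly greedy, sometimes exploratory."""
    return rng.integers(Q.shape[1]) if rng.random() < eps else int(np.argmax(Q[s]))

def sarsa(n_states=6, n_actions=2, eta=0.1, gamma=0.9, eps=0.1,
          episodes=500, seed=0):
    """Tabular SARSA on a toy chain: action 1 moves right, action 0 moves left;
    reward 1 is given on reaching the last state (all assumptions)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))                # initialise Q values
    for _ in range(episodes):
        s = 0                                          # start from initial state s
        a = epsilon_greedy(Q, s, eps, rng)             # 1) choose action a
        while s < n_states - 1:
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == n_states - 1 else 0.0           # 2) observe r, s'
            a_next = epsilon_greedy(Q, s_next, eps, rng)         # 3) choose a'
            # 4) update: dQ = eta * [r - (Q(s,a) - gamma * Q(s',a'))]
            Q[s, a] += eta * (r - (Q[s, a] - gamma * Q[s_next, a_next]))
            s, a = s_next, a_next                      # 5) s <- s', a <- a'
    return Q

print(sarsa())
```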
Eligibility trace
s = state, a = action, R^a_{s→s'} = reward, s' = new state
SARSA: ΔQ(s,a) = η [r − (Q(s,a) − γ Q(s',a'))]
Idea: at the moment of reward, also update the values of previous actions along the trajectory.
Eligibility trace – algorithm
SARSA(0):
ΔQ(s,a) = η [r − (Q(s,a) − γ Q(s',a'))]
δ_t = r_{t+1} − [Q(s,a) − γ Q(s',a')]
SARSA(λ):
ΔQ_aj = η δ_t e_aj
e_aj(t) = γλ e_aj(t−Δt) + 1   if a = action taken in state j
e_aj(t) = γλ e_aj(t−Δt)       else
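A sketch of this bookkeeping (variable names and parameter values are assumptions): all traces decay by γλ, the trace of the state-action pair just taken is incremented by 1, and every eligible Q value then moves by η δ_t e.

```python
import numpy as np

def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next, eta=0.1, gamma=0.9, lam=0.8):
    """One SARSA(lambda) update: decay all traces, bump the visited (s,a),
    then move every eligible Q value by eta * delta * trace."""
    delta = r - (Q[s, a] - gamma * Q[s_next, a_next])   # TD error, as in SARSA(0)
    e *= gamma * lam                                    # e(t) = gamma*lambda*e(t-dt) ...
    e[s, a] += 1.0                                      # ... + 1 for the action taken
    Q += eta * delta * e                                # update all eligible entries
    return Q, e

# Toy usage with 4 states and 2 actions.
Q, e = np.zeros((4, 2)), np.zeros((4, 2))
Q, e = sarsa_lambda_step(Q, e, s=0, a=1, r=0.0, s_next=1, a_next=1)
Q, e = sarsa_lambda_step(Q, e, s=1, a=1, r=1.0, s_next=2, a_next=0)
print(Q)
```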
Continuous states – example
Actions: a1 = right, a2 = left
[Figure: Q(s,a1) and Q(s,a2) plotted as surfaces over the continuous (x, y) state space, one surface per action]
Continuous states
Q(s,a) = Σ_j w_aj r_j(s)
Q(s,a) = Σ_j w_aj Φ(s − s_j)
[Figure: spatial representation of Q(s,a1)]
Eligibility and continuous states: TD(λ)
TD error in SARSA:
δ_t = R_{t+1} − [Q(s,a) − γ Q(s',a')]
Function approximation:
Q(s,a) = Σ_{i=1}^n w_i^a r_i(s) = Σ_{i=1}^n w_i^a Φ(s − s_i)
Synaptic update:
Δw_aj = η δ_t e_aj
e_aj(t) = Φ(s − s_j) if a = action taken, 0 otherwise
ALGO: Eligibility and continuous states: TD(λ)
0) Initialise
1) Use the policy for action choice a; most often pick a*_t = argmax_a Q(s,a)
   Function approximation: Q_w(s,a) = Σ_{i=1}^n w_i^a r_i(s)
2) Observe R, s', choose next action a'
3) Calculate the TD error as in SARSA: δ_t = R_{t+1} − [Q(s,a) − γ Q(s',a')]
4) Update the eligibility trace: e_aj(t) = γλ e_aj(t−Δt) + r_j if a = action taken (+ 0 otherwise)
5) Update the weights: Δw_aj = η δ_t e_aj
6) Old a' becomes a, old s' becomes s, and return to 2)
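Putting steps 0)-6) together, here is a runnable sketch for a one-dimensional continuous state with Gaussian basis functions Φ(s − s_j); the toy task (move right along [0,1] to reach a reward), the basis centres, the step cap and all parameter values are assumptions.

```python
import numpy as np

def features(s, centres, sigma=0.2):
    """Place-cell-like basis activations r_j(s) = Phi(s - s_j)."""
    return np.exp(-(s - centres) ** 2 / (2 * sigma ** 2))

def run_td_lambda(episodes=300, n_basis=20, n_actions=2,
                  eta=0.05, gamma=0.95, lam=0.8, eps=0.1, seed=0):
    """Continuous-state SARSA(lambda) with Q(s,a) = sum_j w_aj r_j(s).
    Toy task (assumed): move left/right on [0,1], reward on reaching s = 1."""
    rng = np.random.default_rng(seed)
    centres = np.linspace(0.0, 1.0, n_basis)
    w = 0.01 * rng.standard_normal((n_actions, n_basis))   # 0) initialise (small random)

    def policy(r_s):
        return rng.integers(n_actions) if rng.random() < eps else int(np.argmax(w @ r_s))

    for _ in range(episodes):
        s, e = 0.0, np.zeros_like(w)
        r_s = features(s, centres)
        a = policy(r_s)                                    # 1) choose action a
        for _ in range(500):                               # step cap keeps the toy episode finite
            s_next = float(np.clip(s + (0.05 if a == 1 else -0.05), 0.0, 1.0))
            R = 1.0 if s_next >= 1.0 else 0.0              # 2) observe R, s' ...
            r_next = features(s_next, centres)
            a_next = policy(r_next)                        # ... and choose a'
            delta = R - (w[a] @ r_s - gamma * (w[a_next] @ r_next))  # 3) TD error
            e *= gamma * lam                               # 4) traces decay ...
            e[a] += r_s                                    # ... and grow for the action taken
            w += eta * delta * e                           # 5) weight update
            s, r_s, a = s_next, r_next, a_next             # 6) s' -> s, a' -> a
            if R > 0:
                break
    return w

print(run_td_lambda().round(2))
```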
Unsupervised and Reinforcement Learning
In Neural Networks
Reinforcement learning and Biology
- Biological Insights
- review: eligibility traces and continuous states
- Algorithmic application
- Biological interpretation
- Q-learning vs V-learning
- Biological application: rodent navigation
15 white and 15 black pieces
24 locations
Δw = η [r − (V(s) − γ V(s'))] e = η δ e
e_t = λ e_{t−1} + …
[Figure: board encoding — separate input units for 1, 2, 3 and ≥4 pieces at a location]
Blackboard
Strategy:
1. Throw the dice
2. Estimate V(t+1) for all possible future states s'
3. Choose the move with maximal V(t+1)
Initial training:
- play against itself
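A schematic of this V-learning scheme (TD-Gammon style), heavily simplified: a linear value function over a board-feature vector ranks the afterstates reachable with the current dice roll, and its weights are updated with the TD error and an eligibility trace. The feature vectors, dimensions and parameter values below are placeholders, not the actual network.

```python
import numpy as np

def choose_move(w, afterstates):
    """Strategy from the slide: evaluate V for every possible next state s'
    (after the dice roll) and pick the move with maximal V."""
    values = [w @ phi for phi in afterstates]
    return int(np.argmax(values))

def td_update(w, e, phi_s, phi_next, r, eta=0.05, gamma=1.0, lam=0.7):
    """TD(lambda) update of the linear value V(s) = w . phi(s):
    delta = r - [V(s) - gamma*V(s')], e <- lambda*e + phi(s), w <- w + eta*delta*e."""
    delta = r - (w @ phi_s - gamma * (w @ phi_next))
    e = lam * e + phi_s
    w = w + eta * delta * e
    return w, e

# Toy usage with a 10-dimensional stand-in for the board encoding.
rng = np.random.default_rng(1)
w, e = np.zeros(10), np.zeros(10)
phi_s = rng.random(10)                             # current board features (placeholder)
afterstates = [rng.random(10) for _ in range(3)]   # boards reachable with this dice roll
best = choose_move(w, afterstates)
w, e = td_update(w, e, phi_s, afterstates[best], r=0.0)
```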
Unsupervised and Reinforcement Learning
In Neural Networks
Reinforcement learning and Biology
- Biological Insights
- review: eligibility traces and continuous states
- Algorithmic application
- Biological interpretation
- Q-learning vs V-learning
- Biological application: rodent navigation
ALGO: Eligibility and continuous states: TD(λ)
0) Initialise
1) Use the policy for action choice a; most often pick a*_t = argmax_a Q(s,a)
   Function approximation: Q_w(s,a) = Σ_{i=1}^n w_i^a r_i(s)
2) Observe R, s', choose next action a'
3) Calculate the TD error as in SARSA: δ_t = R_{t+1} − [Q(s,a) − γ Q(s',a')]
4) Update the eligibility trace: e_aj(t) = γλ e_aj(t−Δt) + r_j if a = action taken (+ 0 otherwise)
5) Update the weights: Δw_aj = η δ_t e_aj
6) Old a' becomes a, old s' becomes s, and return to 2)
Eligibility and continuous states: TD(λ)
Eligibility trace (memory at the synapse):
e_aj(t) = γλ e_aj(t−Δt) + r_j·1   if a = â (= action taken)
e_aj(t) = γλ e_aj(t−Δt) + r_j·0   else
i.e. e_aj(t) = γλ e_aj(t−Δt) + r_j δ_{a,â}
Δw_aj = η δ_t e_aj
[Diagram: presynaptic input (pre) and postsynaptic action cells (post) with winner-takes-all interaction]
Winner-takes-all interaction
Eligibility trace (memory at the synapse):
e_aj(t) = λ e_aj(t−Δt) + r_j r_a   if a = â, + 0 else
Eligibility and continuous states: TD(λ)
TD error in SARSA:
δ_t = R_{t+1} − [Q(s,a) − γ Q(s',a')]
Function approximation:
Q_w(s,a) = Σ_{i=1}^n w_i^a r_i(s)
Synaptic update:
Δw_aj = η δ_t e_aj
Eligibility trace (memory at the synapse):
e_aj(t) = γλ e_aj(t−Δt) + r_j·1   if a = â (= action taken)
e_aj(t) = γλ e_aj(t−Δt) + r_j·0   else
i.e. e_aj(t) = γλ e_aj(t−Δt) + r_j δ_{a,â}
Classification of plasticity:
unsupervised vs reinforcement
LTP/LTD/Hebb
Theoretical concept
- passive changes
- exploit statistical correlations
[Diagram: presynaptic neuron j → postsynaptic neuron i]
Functionality
- useful for development (wiring for receptive fields)
Reinforcement Learning
Theoretical concept
- conditioned changes
- maximise reward
[Diagram: presynaptic neuron j → postsynaptic neuron i, gated by a success signal]
Functionality
- useful for learning a new behavior
Plasticity models:
unsupervised vs reinforcement
STDP/Hebb
[Diagram: presynaptic neuron j → postsynaptic neuron i]
Model
- STDP (see previous lecture)
Δw_ij ∝ pre · post + other terms
Reinforcement Learning
Theoretical protocol
- maximise reward
[Diagram: presynaptic neuron j → postsynaptic neuron i, gated by a success signal]
Model
- spike-based model (later lecture)?
Δw_ij ∝ success · (pre · post + other terms)
Timing issues in Reinforcement Learning
[Diagram: presynaptic neuron j → synapse w_ij → postsynaptic neuron i]
The success signal is delayed:
- spike time scale: 1-10 ms
- reward delay: 100-5000 ms
→ need a memory trace (eligibility trace)
Reward-based Action Learning
Connection reinforced if
action a at state s successful
Success signal
Function approximation:
Q_w(s,a) = Σ_{i=1}^n w_ai r_i(s)
Update of Q = update of the weights Δw_aj:
ΔQ(s,a) = η [r − (Q(s,a) − Q(s',a'))] = η δ
Synaptic update of the current action a:
Δw_aj = η δ_t r_j   if a taken
Δw_aj = η δ_t r_j r_a
Reward-based Action Learning
Connection reinforced if
action a at state s successful
Success = reward − expected (predicted) reward
Δw_aj = η δ_t e_aj
[Figure: dopamine neuron recordings (W. Schultz): responses in reward, predicted reward and no-reward trials]
Blackboard
Exercise at home:
Dopamine as delta signal
Consider the following state diagram:
[Diagram: a chain of N states leading to the light, the reward R, and V(s) plotted over time]
Assume that at the beginning the animal receives the reward stochastically,
and later always 2 seconds after the light. Calculate the dopamine signal,
if dopamine codes for δ = R − [V(s) − γ V(s')].
There are N possible states on the way to the light and N states after the reward.
Reinforcement and dopamine
Reinforcement Learning
Theoretical protocol
- maximise reward
[Diagram: presynaptic neuron j → postsynaptic neuron i, gated by a success signal]
Model
Δw_ij ∝ success · (pre · post + other terms)
Reward-based Action Learning
Connection reinforced if
action a at state s successful
Action a = north
Success = reward − expected reward
Δw_aj = η δ_t e_aj
e_aj(t) = r_j(s_t) r_a(t) + g · e_aj(t−1)
State = activity r(s)
Eligibility trace
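A sketch of this rule (names, sizes and numbers are assumptions): the eligibility trace at synapse j → a grows when place-cell activity r_j(s_t) coincides with action-cell activity r_a(t) and otherwise decays with factor g; when the dopamine-like success signal δ arrives, the tagged synapses change by η δ e.

```python
import numpy as np

def update_trace(e, r_state, r_action, g=0.9):
    """e_aj(t) = r_j(s_t) * r_a(t) + g * e_aj(t-1): Hebbian coincidence plus decay."""
    return np.outer(r_action, r_state) + g * e

def apply_success(w, e, delta, eta=0.05):
    """Three-factor update Delta w_aj = eta * delta * e_aj when the
    (dopamine-like) success signal delta = reward - expected reward arrives."""
    return w + eta * delta * e

# Toy usage: 4 place cells, 3 action cells (e.g. north / east / west).
n_place, n_action = 4, 3
w = np.zeros((n_action, n_place))
e = np.zeros_like(w)
r_state = np.array([0.1, 0.8, 0.3, 0.0])        # place-cell activity r(s_t)
r_action = np.array([1.0, 0.0, 0.0])            # action cell "north" was active
e = update_trace(e, r_state, r_action)          # tag the synapses that were just used
w = apply_success(w, e, delta=0.5)              # delayed success reinforces tagged synapses
print(w)
```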
Reward-based Action Learning
Connection reinforced if
action a at state s successful
Success signal
Learning rule: Δw_aj = η δ_t e_aj
Success signal = dopamine
[Figure: spatial representation]
Reward-based Action Learning
Connection reinforced if
action a at state s successful
Success signal
Molecular mechanism?
Changes in synaptic connections
[Figure: synapse with vesicles, AMPA-R and NMDA-R, Na+ and Ca2+ influx; dopamine as the success signal; intracellular CaMKII, PKA, CREB and gene expression]
CaMKII = Ca2+/calmodulin-dependent protein kinase II
PKA = cAMP-dependent protein kinase
Unsupervised and Reinforcement Learning
In Neural Networks
Reinforcement learning and Biology
- Biological Insights
- review: eligibility traces and continuous states
- Algorithmic application
- Biological interpretation
- Q-learning vs V-learning: Neuroeconomics
- Biological application: rodent navigation
Action values Q versus State values V
Bellman equation:
Q(s,a) = Σ_{s'} P^a_{s→s'} [ R^a_{s→s'} + γ Σ_{a'} π(s',a') Q(s',a') ]
V^π(s) = Σ_a π(s,a) Q(s,a)
V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{s→s'} [ R^a_{s→s'} + γ Σ_{a'} π(s',a') Q(s',a') ]
V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{s→s'} [ R^a_{s→s'} + γ V^π(s') ]
[Diagram: state s with value V(s) and action values Q(s,a1), Q(s,a2)]
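Side remark: the Bellman equations above correspond to the TD errors used in this lecture, one for action values (SARSA/Q-learning) and one for state values (V-learning, as in the TD-Gammon and dopamine slides); the summary below uses the same sign convention as the SARSA slides.

```latex
\begin{align}
Q(s,a) &= \sum_{s'} P^{a}_{s\to s'}\Big[ R^{a}_{s\to s'}
           + \gamma \sum_{a'} \pi(s',a')\, Q(s',a') \Big] \\
V^{\pi}(s) &= \sum_{a} \pi(s,a)\, Q(s,a)
           = \sum_{a}\pi(s,a)\sum_{s'} P^{a}_{s\to s'}
             \Big[ R^{a}_{s\to s'} + \gamma\, V^{\pi}(s') \Big] \\
\delta^{Q}_t &= r_{t+1} - \big[ Q(s,a) - \gamma\, Q(s',a') \big]
   \quad\text{(TD error on action values: SARSA/Q-learning)} \\
\delta^{V}_t &= r_{t+1} - \big[ V(s) - \gamma\, V(s') \big]
   \quad\text{(TD error on state values: V-learning)}
\end{align}
```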
V-values represented in the brain:
fMRI and Neuroeconomics
[Diagram: successor state s' with action values Q(s',a1), Q(s',a2) and state value V(s')]
Exercise at home:
Action values Q versus State values V
derive the learning algorithm
[Diagram: state s with Q(s,a1), Q(s,a2); successor state s' with Q(s',a1), Q(s',a2)]
V^π(s) = Σ_a π(s,a) Q(s,a)
V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{s→s'} [ R^a_{s→s'} + γ V^π(s') ]
[Figure: conditioning experiment — cues A, B, C and D predict big or small outcomes (−R_A, −R_B) with probabilities 18%, 50%, 82% and 100%; brain responses correlate with the value V and with the TD error]
Seymour, O'Doherty, Dayan et al., 2004
How would you decide?
[Figure: two routes to a goal]
Risk-seeking vs. risk-averse: in the brain
('neuroeconomics')
How would you decide?
750 Euro tomorrow!
50’000 Euro in 12 months!
Measure ‘discount factor’
(‘neuroeconomics’)
How would you decide?
Lottery A: get 110 CHF with 50% probability
Lottery B: get 50 CHF with 99.99% probability
Measure ‘value vs risk’
(‘neuroeconomics’)
Unsupervised and Reinforcement Learning
In Neural Networks
Reinforcement learning and Biology
- Biological Insights
- review: eligibility traces and continuous states
- Algorithmic application
- Biological interpretation
- Q-learning vs V-learning: Neuroeconomics
- Biological application: rodent navigation
Biological Principles of Learning:
spatial learning
Spatial representation
Place cells - sensitive to rat’s location
[Figure: environment (box) and its map in the brain; neurons with place fields]
Neurophysiology of the Rat Hippocampus
[Figure: rat brain with a recording electrode in the hippocampus; areas CA1, CA3, DG; pyramidal cells with dendrites, soma, axon and synapses; example place field]
Hippocampal Place Cells
[Figure: place field of a single cell (O'Keefe)]
Depends on
- visual cues
- works also in the dark
[Figure: Morris water maze — invisible platform, distal cues]
Morris 1981
Behavior: Navigation to a
hidden goal (Morris water maze)
• Invisible platform in the water pool
• Distal visual cues
Morris 1981
Behavior: Navigation to a
hidden goal (Morris water maze)
- Different starting positions
- Task learning depends on the hippocampus
Foster, Morris & Dayan 2000
Space representation: place cells
- Code for position of the animal
- Learn action towards goal
[Diagram: hippocampus with place cells]
O’Keefe & Dostrovsky 1971
Reward-based Action Learning
Connection reinforced if
action a at state s successful
Action a = north
Success = reward − expected reward
Δw_aj = η δ_t e_aj
e_aj(t) = r_j(s_t) r_a(t) + g · e_aj(t−1)
State = activity r(s)
[Figure: maps of preferred actions (north, south, east, west, south-west) across the arena, before learning and after many trials]
Results: Morris water maze
Experiments
Foster, Morris & Dayan 2000
Model: Sheynikhovich et al., July 2009
Validating the Model
The KHEPERA mobile miniature robot @EPFL
Experimental arena: 422 x 316 pixels (80 x 80 cm)
Aim: a map of actions
[Figure: robot trajectory around obstacles to the goal; navigation map after 20 training trials]
(Biol. Cybern., 2000)
The end
Unsupervised and Reinforcement Learning
In Neural Networks
Reinforcement learning and Biology
- Biological Insights
- review: eligibility traces and continuous states
- Algorithmic application
- Biological interpretation
- Q-learning vs V-learning: Neuroeconomics
- Biological application: rodent navigation
Detour: unsupervised learning of place cells
Space representation: place cells
- Code for position of the animal: learn the place cells!
- Learn action towards goal
[Diagram: hippocampal place cells — reinforcement learning of actions, unsupervised learning of place cells]
Model network of navigation: vision
Simulated robot, view field 300°
Model: stores views of visited places
Robot in an environment
Visual input at each
time step
Local view: activation of a set of 9000 Gabor wavelets
L(p, Φ) = {F_k}
A single view cell stores a local view.
Blackboard
Model: stores views of visited places
Robot in an environment
Visual input at each
time step
Local view: activation of a set of 9000 Gabor wavelets
L(p, Φ) = {F_k}
A single view cell stores a local view: unsupervised competitive learning.
Environment exploration
Population of view cells
All local views are stored in an incrementally growing population of view cells.
Model: extracting direction
[Figure: stored local view of view cell VC_i (position and heading Φ_i upon entry) compared with the new local view at heading Φ; angular difference ΔΦ]
Alignment of the views → current gaze direction
Model: extracting position
[Figure: place cells; stored local view of view cell VC_i compared with the new local view at the position upon re-entry]
A small difference between local views indicates spatially close positions.
Set of filter responses {F_k}
Difference: ΔL = |F_i − F(t)|
Similarity measure: r_i^VC = exp( −(ΔL)² / (2σ_vc²) )
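A runnable sketch of this similarity measure, combined with the incremental recruitment of view cells described two slides above; the recruitment threshold, the 16-dimensional stand-in for the 9000 Gabor responses and all numbers are assumptions, not the model's actual parameters.

```python
import numpy as np

def view_cell_activity(F_stored, F_now, sigma_vc=1.0):
    """r_i^VC = exp(-(dL)^2 / (2 sigma_vc^2)) with dL = |F_i - F(t)|:
    small differences between local views -> spatially close positions."""
    dL = np.linalg.norm(F_stored - F_now, axis=1)
    return np.exp(-dL ** 2 / (2 * sigma_vc ** 2))

def recall_or_store(view_cells, F_now, threshold=0.3, sigma_vc=1.0):
    """Incrementally growing population: recruit a new view cell whenever the
    current local view is too dissimilar from all stored ones (threshold assumed)."""
    if view_cells:
        r = view_cell_activity(np.vstack(view_cells), F_now, sigma_vc)
        if r.max() > threshold:
            return r                                 # familiar view: population response
    view_cells.append(F_now.copy())                  # novel view: store it as a new cell
    return view_cell_activity(np.vstack(view_cells), F_now, sigma_vc)

# Toy usage with 16-dimensional stand-ins for the Gabor responses F_k.
rng = np.random.default_rng(2)
view_cells = []
for _ in range(5):
    r = recall_or_store(view_cells, rng.random(16))
print(len(view_cells), "view cells stored")
```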
Model: Learning place cells
Population of place cells created incrementally during exploration
[Diagram: grid cells and view cells project through weights w_ij^pc to place cells]
Activity of a place cell: r_i^pc = Σ_j w_ij^pc r_j^pre
Competitive Hebbian learning: Δw_ij^pc = η [r_i^pc − ϑ]_+ (r_j^pre − w_ij^pc)
Compare the lecture on competitive Hebbian learning
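A sketch of this competitive Hebbian rule (toy sizes, learning rate and threshold are assumptions): only place cells whose activity exceeds the threshold ϑ move their weights towards the current presynaptic (view/grid cell) activity pattern.

```python
import numpy as np

def place_cell_step(w_pc, r_pre, eta=0.05, theta=0.5):
    """One step of the rule on this slide:
    r_i^pc = sum_j w_ij^pc r_j^pre
    dw_ij^pc = eta * [r_i^pc - theta]_+ * (r_j^pre - w_ij^pc)."""
    r_pc = w_pc @ r_pre                                    # place-cell activity
    gate = np.maximum(r_pc - theta, 0.0)                   # [r_i^pc - theta]_+
    w_pc += eta * gate[:, None] * (r_pre[None, :] - w_pc)  # winners move towards the input
    return w_pc, r_pc

# Toy usage: 3 place cells driven by 8 presynaptic (view/grid) cells.
rng = np.random.default_rng(3)
w_pc = rng.random((3, 8)) * 0.5
for _ in range(200):
    r_pre = rng.random(8)
    w_pc, r_pc = place_cell_step(w_pc, r_pre)
print(w_pc.round(2))
```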
Biological mechanisms: grid cells
[Figure: recording electrode in the dorsomedial entorhinal cortex; grid-cell firing fields]
- Firing grids extend over the whole space
- Subpopulations with different grid sizes exist
- Grid cells project to the CA1 area
- Grid-cell firing fields rotate following the rotation of the cue
- Biological locus of path integration
Hafting et al. 2005
Results: learning the environment
(unsupervised learning)
Space representation: place cells
- Code for position of the animal
- Learn action towards goal
[Diagram: hippocampal place cells — reinforcement learning of actions, unsupervised learning of place cells]
The end
Behavior in water mazes:
Navigation to a hidden goal
• Different tasks → different navigation strategies
- map-based (locale)
- stimulus-response (taxon)
• Different tasks → different brain areas
- hippocampus
- dorsal striatum
Behavior: Navigation to a hidden goal, but
fixed starting position
• Starting position is always the
same
• Task learning does not require the
hippocampus
Eichenbaum, Stewart, Morris 1990
Results: Morris water maze
- Place strategy depends on the memorized representation of the room
- Requires alignment of the stored map with available cues
Results: Morris water maze
Map learned after
10 trials
- Place strategy depends on the memorized representation of the room
- Requires alignment of the stored map with available cues
Biological mechanisms:
navigation
[Diagram: sensory input reaches the hippocampus and the basal ganglia; the ventral striatum/NA receives a reward prediction error from the VTA and supports place-based action learning; the dorsal striatum receives a reward prediction error from the SNc and supports S-R learning]
Mink, 1986; Graybiel, 1998
Model:
Place cell based vs landmark-based
navigation
[Diagram: place cells j and visual filters project through weights w_ij (reinforcement learning) onto action cells (AC) i: action cells in the nucleus accumbens drive movement towards the hidden goal; action cells in the dorsal striatum drive turning to the landmark]
Conclusion
• Unsupervised learning leads to an interaction between vision and path integration
• Model of landmark-based navigation
• Model of map-based navigation
• Learning of actions: reinforcement learning
• Learning of views and places: unsupervised learning
joint work with Denis Sheynikhovich and Ricardo Chavarriaga