Vocabulary for Deep Learning (In 40 Minutes! With Pictures!)
Graham Neubig
Transcription
Deep Learning Vocabulary
Graham Neubig, Nara Institute of Science and Technology (NAIST)

Prediction Problems

● Given x, predict y.

Example: Sentiment Analysis

    x            y
    I am happy   +1
    I am sad     -1

Linear Classifiers

    y = step(w⋅φ(x))

● x: the input
● φ(x): a vector of feature functions
● w: the weight vector
● y: the prediction, +1 if "yes", -1 if "no"

[Figure: plot of the step function, which jumps from -1 to +1 at w⋅φ(x) = 0]

Example Feature Functions: Unigram Features

                     I   was  am  sad  the  happy  …
    φ(I am happy)    1   0    1   0    0    1
    φ(I am sad)      1   0    1   1    0    0

The Perceptron

● Think of it as a "machine" that calculates a weighted sum w⋅φ(x) and inserts it into an activation function to produce y = step(w⋅φ(x)).

Sigmoid Function (Logistic Function)

● The sigmoid function is a "softened" version of the step function:

    P(y=1|x) = e^{w⋅φ(x)} / (1 + e^{w⋅φ(x)})

● It can account for uncertainty.
● It is differentiable.

[Figure: the logistic function rises smoothly from 0 to 1, while the step function jumps from 0 to 1 at w⋅φ(x) = 0]

Logistic Regression

● Train based on the conditional likelihood: find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i:

    ŵ = argmax_w ∏_i P(y_i | x_i; w)

● How do we solve this?

Stochastic Gradient Descent

● An online training algorithm for probabilistic models (including logistic regression):

    create map w
    for I iterations
        for each labeled pair x, y in the data
            w += α * dP(y|x)/dw

● In other words, for every training example, calculate the gradient (the direction that will increase the probability of y), then move in that direction, multiplied by the learning rate α.

Gradient of the Sigmoid Function

● Take the derivative of the probability:

    dP(y=1|x)/dw = d/dw [ e^{w⋅φ(x)} / (1 + e^{w⋅φ(x)}) ]
                 = φ(x) e^{w⋅φ(x)} / (1 + e^{w⋅φ(x)})²

    dP(y=-1|x)/dw = d/dw [ 1 − e^{w⋅φ(x)} / (1 + e^{w⋅φ(x)}) ]
                  = −φ(x) e^{w⋅φ(x)} / (1 + e^{w⋅φ(x)})²

[Figure: dP(y|x)/dw is a bell-shaped curve in w⋅φ(x), largest near w⋅φ(x) = 0]
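Putting the last three slides together, here is a minimal sketch, in Python, of logistic regression with unigram features trained by the SGD update above. The data is the sentiment example from earlier; the helper names (phi, dot, train_sgd) and the hyperparameters are illustrative assumptions, not code from the lecture.

    import math
    from collections import defaultdict

    def phi(x):
        # Unigram feature function: one binary feature per word
        features = defaultdict(float)
        for word in x.split():
            features[word] = 1.0
        return features

    def dot(w, features):
        return sum(w[name] * value for name, value in features.items())

    def train_sgd(data, iterations=100, alpha=0.1):
        w = defaultdict(float)                 # "create map w"
        for _ in range(iterations):            # "for I iterations"
            for x, y in data:                  # "for each labeled pair x, y"
                features = phi(x)
                p = 1.0 / (1.0 + math.exp(-dot(w, features)))  # P(y=1|x)
                # dP(y|x)/dw = y * φ(x) * p * (1 - p), the gradient derived above
                for name, value in features.items():
                    w[name] += alpha * y * value * p * (1.0 - p)
        return w

    w = train_sgd([("I am happy", +1), ("I am sad", -1)])
    # "happy" ends up with a positive weight and "sad" with a negative one;
    # "I" and "am" appear with both labels, so their weights roughly cancel.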
Neural Networks

Problem: Linear Constraint

● The perceptron cannot achieve high accuracy on nonlinear functions.

[Figure: four points in an XOR configuration (X O above, O X below) that no single line can separate]

● Example: "I am not happy" (negative, even though it contains "happy")

Neural Networks (Multi-Layer Perceptron)

● Neural networks connect multiple perceptrons together.

Example

● Take the four points φ(x1) = {-1, 1}, φ(x2) = {1, 1}, φ(x3) = {-1, -1}, φ(x4) = {1, -1}, where x2 and x3 are positive (O) and x1 and x4 are negative (X). Build two classifiers:

    y1 = step( 1⋅φ1 + 1⋅φ2 − 1)    (weights w1)
    y2 = step(−1⋅φ1 − 1⋅φ2 − 1)    (weights w2)

● These classifiers map the points to a new space:

    y(x1) = {-1, -1}
    y(x2) = { 1, -1}
    y(x3) = {-1,  1}
    y(x4) = {-1, -1}

● In the new space, the examples are classifiable: they are linearly separable, so a third classifier can label them:

    y3 = step(1⋅y1 + 1⋅y2 + 1)    (weights w3)

● The final neural network stacks w1 and w2 as a hidden layer feeding the output classifier w3 (a runnable sketch of this network appears at the end of this transcription).

Hidden Layer Activation Functions

● Step function: cannot calculate the gradient.
● Linear function: the whole net becomes linear.
● Tanh function: the standard choice (as is the 0-1 sigmoid).
● Rectified linear function: keeps gradients at large values.

[Figure: plots of step(w⋅φ), w⋅φ, tanh(w⋅φ), and relu(w⋅φ)]

Deep Networks

● Why deep? Gradually more detailed functions (e.g. [Lee+ 09]).

Learning Neural Nets

Learning: Don't Know the Derivative for Hidden Units!

● For neural nets, we only know the correct tag for the last layer. With hidden-layer output h(x) and output weights w4, the output unit's gradient is known:

    dP(y=1|x)/dw4 = h(x) e^{w4⋅h(x)} / (1 + e^{w4⋅h(x)})²

● But dP(y=1|x)/dw1, dP(y=1|x)/dw2, and dP(y=1|x)/dw3 for the hidden units are unknown.

Answer: Back-Propagation

● Calculate the derivative with the chain rule:

    dP(y=1|x)/dw1 = dP(y=1|x)/d(w4⋅h(x)) ⋅ d(w4⋅h(x))/dh1(x) ⋅ dh1(x)/dw1

  where dP(y=1|x)/d(w4⋅h(x)) = e^{w4⋅h(x)} / (1 + e^{w4⋅h(x)})² is the error of the next unit (δ4), and d(w4⋅h(x))/dh1(x) = w1,4.

● In general, calculate δ for unit i based on the next units j:

    dP(y=1|x)/dwi = [ Σ_j δ_j w_{i,j} ] ⋅ dh_i(x)/dw_i

  i.e. the error propagated back from the next units, times the weight gradient of this unit. (A sketch of this update in code appears at the end of this transcription.)

BP in Deep Networks: Vanishing Gradients

[Figure: in a deep network, the back-propagated errors shrink from δ ≈ 0.1 near the output y down to δ ≈ 1e-6 near the input φ(x)]

● Exploding gradients occur as well.

Layerwise Training

● Train one layer at a time: train the first layer, fix it, train the next layer on top of it, and repeat until the output layer is reached.

Autoencoders

● Initialize the NN by training it to reproduce its own input, φ(x) → φ(x).
● Advantage: no need for labels y!

Dropout

● Problem: overfitting.
● Deactivate part of the network on each update.
● Can be seen as an ensemble method.

Network Architectures

Recurrent Neural Nets

[Figure: a chain of identical networks over inputs x1…x4, each producing an output y1…y4 and passing its state to the next step]

● Good for modeling sequence data, e.g. speech.

Convolutional Neural Nets

[Figure: hidden units computed over overlapping windows (x1–x3), (x2–x4), (x3–x5), combined into a single output y]

● Good for modeling data with spatial correlation, e.g. images.

Recursive Neural Nets

[Figure: networks composed bottom-up over a tree structure covering x1…x5, producing a single output y]

● Good for modeling data with tree structures, e.g. language.

Other Topics

● Deep belief networks
● Restricted Boltzmann machines
● Contrastive estimation
● Generating with neural nets
● Visualizing internal states
● Batch/mini-batch update strategies
● Long short-term memory
● Learning on GPUs
● Tools for implementing deep learning
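As promised above, here is a minimal sketch of the final network from the multi-layer perceptron example, using exactly the weights given on the slides. The function names are illustrative assumptions; the ±1 step convention matches the lecture (its behavior at exactly 0 is an assumption, and none of the test inputs hit it).

    def step(v):
        # Step activation with outputs in {+1, -1}
        return 1 if v >= 0 else -1

    def network(phi1, phi2):
        y1 = step( 1 * phi1 + 1 * phi2 - 1)   # first hidden perceptron  (w1)
        y2 = step(-1 * phi1 - 1 * phi2 - 1)   # second hidden perceptron (w2)
        return step(1 * y1 + 1 * y2 + 1)      # output perceptron        (w3)

    # The four points from the example: O maps to +1, X maps to -1
    assert network(-1,  1) == -1   # x1: X
    assert network( 1,  1) == +1   # x2: O
    assert network(-1, -1) == +1   # x3: O
    assert network( 1, -1) == -1   # x4: X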
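And here is a minimal sketch of back-propagation itself: a two-unit tanh hidden layer and a sigmoid output, trained by SGD on the same four points. The architecture, initialization, and hyperparameters are assumptions for illustration (the slides do not specify them); the delta variables play the role of the "error of the next unit" in the chain-rule formula above.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.5, size=(2, 3))   # hidden weights (bias in last column)
    w2 = rng.normal(scale=0.5, size=3)        # output weights (bias in last entry)

    data = [([-1, 1], -1), ([1, 1], +1), ([-1, -1], +1), ([1, -1], -1)]

    alpha = 0.5
    for _ in range(1000):
        for x, y in data:
            phi = np.array(x + [1.0])              # input features plus bias
            h = np.tanh(W1 @ phi)                  # hidden layer h(x)
            h1 = np.append(h, 1.0)                 # hidden output plus bias
            p = 1.0 / (1.0 + np.exp(-w2 @ h1))     # P(y=1|x)
            delta_out = y * p * (1.0 - p)          # dP(y|x)/d(w2⋅h): output error
            # Hidden errors: output error, back through w2 and tanh's gradient
            delta_hid = delta_out * w2[:2] * (1.0 - h ** 2)
            w2 += alpha * delta_out * h1           # SGD step for the output unit
            W1 += alpha * np.outer(delta_hid, phi) # SGD step for the hidden units

With most initializations this pushes P(y|x) toward the correct label on all four points, but the point here is the mechanics: compute δ at the output, propagate it back through the weights and the activation's gradient, and update every layer with the same SGD rule as before.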