Vocabulary for Deep Learning (In 40 Minutes! With Pictures!)
Graham Neubig
Transcription
Deep Learning Vocabulary
Graham Neubig, Nara Institute of Science and Technology (NAIST)

Prediction Problems

● Given x, predict y.

Example: Sentiment Analysis

    x            y
    I am happy   +1
    I am sad     -1

Linear Classifiers

    y = step(w⋅φ(x))

● x: the input
● φ(x): a vector of feature functions
● w: the weight vector
● y: the prediction, +1 if "yes", -1 if "no"

[Figure: plot of the step function, which jumps from -1 to +1 at w⋅φ(x) = 0]

Example Feature Functions: Unigram Features

                     I   was  am  sad  the  happy  …
    φ(I am happy)    1   0    1   0    0    1
    φ(I am sad)      1   0    1   1    0    0

The Perceptron

● Think of it as a "machine" that calculates a weighted sum w⋅φ(x) and inserts it into an activation function to produce y = step(w⋅φ(x)).

Sigmoid Function (Logistic Function)

● The sigmoid function is a "softened" version of the step function:

    P(y=1|x) = e^{w⋅φ(x)} / (1 + e^{w⋅φ(x)})

● It can account for uncertainty.
● It is differentiable.

[Figure: the logistic function rises smoothly from 0 to 1, while the step function jumps from 0 to 1 at w⋅φ(x) = 0]

Logistic Regression

● Train based on the conditional likelihood: find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i:

    ŵ = argmax_w ∏_i P(y_i | x_i; w)

● How do we solve this?

Stochastic Gradient Descent

● An online training algorithm for probabilistic models (including logistic regression):

    create map w
    for I iterations
        for each labeled pair x, y in the data
            w += α * dP(y|x)/dw

● In other words, for every training example, calculate the gradient (the direction that will increase the probability of y), then move in that direction, multiplied by the learning rate α.

Gradient of the Sigmoid Function

● Take the derivative of the probability:

    dP(y=1|x)/dw = d/dw [ e^{w⋅φ(x)} / (1 + e^{w⋅φ(x)}) ]
                 = φ(x) e^{w⋅φ(x)} / (1 + e^{w⋅φ(x)})²

    dP(y=-1|x)/dw = d/dw [ 1 − e^{w⋅φ(x)} / (1 + e^{w⋅φ(x)}) ]
                  = −φ(x) e^{w⋅φ(x)} / (1 + e^{w⋅φ(x)})²

[Figure: dP(y|x)/dw is a bell-shaped curve in w⋅φ(x), largest near w⋅φ(x) = 0]
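Putting the last three slides together, here is a minimal sketch, in Python, of logistic regression with unigram features trained by the SGD update above. The data is the sentiment example from earlier; the helper names (phi, dot, train_sgd) and the hyperparameters are illustrative assumptions, not code from the lecture.

    import math
    from collections import defaultdict

    def phi(x):
        # Unigram feature function: one binary feature per word
        features = defaultdict(float)
        for word in x.split():
            features[word] = 1.0
        return features

    def dot(w, features):
        return sum(w[name] * value for name, value in features.items())

    def train_sgd(data, iterations=100, alpha=0.1):
        w = defaultdict(float)                 # "create map w"
        for _ in range(iterations):            # "for I iterations"
            for x, y in data:                  # "for each labeled pair x, y"
                features = phi(x)
                p = 1.0 / (1.0 + math.exp(-dot(w, features)))  # P(y=1|x)
                # dP(y|x)/dw = y * φ(x) * p * (1 - p), the gradient derived above
                for name, value in features.items():
                    w[name] += alpha * y * value * p * (1.0 - p)
        return w

    w = train_sgd([("I am happy", +1), ("I am sad", -1)])
    # "happy" ends up with a positive weight and "sad" with a negative one;
    # "I" and "am" appear with both labels, so their weights roughly cancel.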
Neural Networks

Problem: Linear Constraint

● The perceptron cannot achieve high accuracy on nonlinear functions.

[Figure: four points in an XOR configuration (X O above, O X below) that no single line can separate]

● Example: "I am not happy" (negative, even though it contains "happy")

Neural Networks (Multi-Layer Perceptron)

● Neural networks connect multiple perceptrons together.

Example

● Take the four points φ(x1) = {-1, 1}, φ(x2) = {1, 1}, φ(x3) = {-1, -1}, φ(x4) = {1, -1}, where x2 and x3 are positive (O) and x1 and x4 are negative (X). Build two classifiers:

    y1 = step( 1⋅φ1 + 1⋅φ2 − 1)    (weights w1)
    y2 = step(−1⋅φ1 − 1⋅φ2 − 1)    (weights w2)

● These classifiers map the points to a new space:

    y(x1) = {-1, -1}
    y(x2) = { 1, -1}
    y(x3) = {-1,  1}
    y(x4) = {-1, -1}

● In the new space, the examples are classifiable: they are linearly separable, so a third classifier can label them:

    y3 = step(1⋅y1 + 1⋅y2 + 1)    (weights w3)

● The final neural network stacks w1 and w2 as a hidden layer feeding the output classifier w3 (a runnable sketch of this network appears at the end of this transcription).

Hidden Layer Activation Functions

● Step function: cannot calculate the gradient.
● Linear function: the whole net becomes linear.
● Tanh function: the standard choice (as is the 0-1 sigmoid).
● Rectified linear function: keeps gradients at large values.

[Figure: plots of step(w⋅φ), w⋅φ, tanh(w⋅φ), and relu(w⋅φ)]

Deep Networks

● Why deep? Gradually more detailed functions (e.g. [Lee+ 09]).

Learning Neural Nets

Learning: Don't Know the Derivative for Hidden Units!

● For neural nets, we only know the correct tag for the last layer. With hidden-layer output h(x) and output weights w4, the output unit's gradient is known:

    dP(y=1|x)/dw4 = h(x) e^{w4⋅h(x)} / (1 + e^{w4⋅h(x)})²

● But dP(y=1|x)/dw1, dP(y=1|x)/dw2, and dP(y=1|x)/dw3 for the hidden units are unknown.

Answer: Back-Propagation

● Calculate the derivative with the chain rule:

    dP(y=1|x)/dw1 = dP(y=1|x)/d(w4⋅h(x)) ⋅ d(w4⋅h(x))/dh1(x) ⋅ dh1(x)/dw1

  where dP(y=1|x)/d(w4⋅h(x)) = e^{w4⋅h(x)} / (1 + e^{w4⋅h(x)})² is the error of the next unit (δ4), and d(w4⋅h(x))/dh1(x) = w1,4.

● In general, calculate δ for unit i based on the next units j:

    dP(y=1|x)/dwi = [ Σ_j δ_j w_{i,j} ] ⋅ dh_i(x)/dw_i

  i.e. the error propagated back from the next units, times the weight gradient of this unit. (A sketch of this update in code appears at the end of this transcription.)

BP in Deep Networks: Vanishing Gradients

[Figure: in a deep network, the back-propagated errors shrink from δ ≈ 0.1 near the output y down to δ ≈ 1e-6 near the input φ(x)]

● Exploding gradients occur as well.

Layerwise Training

● Train one layer at a time: train the first layer, fix it, train the next layer on top of it, and repeat until the output layer is reached.

Autoencoders

● Initialize the NN by training it to reproduce its own input, φ(x) → φ(x).
● Advantage: no need for labels y!

Dropout

● Problem: overfitting.
● Deactivate part of the network on each update.
● Can be seen as an ensemble method.

Network Architectures

Recurrent Neural Nets

[Figure: a chain of identical networks over inputs x1…x4, each producing an output y1…y4 and passing its state to the next step]

● Good for modeling sequence data, e.g. speech.

Convolutional Neural Nets

[Figure: hidden units computed over overlapping windows (x1–x3), (x2–x4), (x3–x5), combined into a single output y]

● Good for modeling data with spatial correlation, e.g. images.

Recursive Neural Nets

[Figure: networks composed bottom-up over a tree structure covering x1…x5, producing a single output y]

● Good for modeling data with tree structures, e.g. language.

Other Topics

● Deep belief networks
● Restricted Boltzmann machines
● Contrastive estimation
● Generating with neural nets
● Visualizing internal states
● Batch/mini-batch update strategies
● Long short-term memory
● Learning on GPUs
● Tools for implementing deep learning
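As promised above, here is a minimal sketch of the final network from the multi-layer perceptron example, using exactly the weights given on the slides. The function names are illustrative assumptions; the ±1 step convention matches the lecture (its behavior at exactly 0 is an assumption, and none of the test inputs hit it).

    def step(v):
        # Step activation with outputs in {+1, -1}
        return 1 if v >= 0 else -1

    def network(phi1, phi2):
        y1 = step( 1 * phi1 + 1 * phi2 - 1)   # first hidden perceptron  (w1)
        y2 = step(-1 * phi1 - 1 * phi2 - 1)   # second hidden perceptron (w2)
        return step(1 * y1 + 1 * y2 + 1)      # output perceptron        (w3)

    # The four points from the example: O maps to +1, X maps to -1
    assert network(-1,  1) == -1   # x1: X
    assert network( 1,  1) == +1   # x2: O
    assert network(-1, -1) == +1   # x3: O
    assert network( 1, -1) == -1   # x4: X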
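And here is a minimal sketch of back-propagation itself: a two-unit tanh hidden layer and a sigmoid output, trained by SGD on the same four points. The architecture, initialization, and hyperparameters are assumptions for illustration (the slides do not specify them); the delta variables play the role of the "error of the next unit" in the chain-rule formula above.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.5, size=(2, 3))   # hidden weights (bias in last column)
    w2 = rng.normal(scale=0.5, size=3)        # output weights (bias in last entry)

    data = [([-1, 1], -1), ([1, 1], +1), ([-1, -1], +1), ([1, -1], -1)]

    alpha = 0.5
    for _ in range(1000):
        for x, y in data:
            phi = np.array(x + [1.0])              # input features plus bias
            h = np.tanh(W1 @ phi)                  # hidden layer h(x)
            h1 = np.append(h, 1.0)                 # hidden output plus bias
            p = 1.0 / (1.0 + np.exp(-w2 @ h1))     # P(y=1|x)
            delta_out = y * p * (1.0 - p)          # dP(y|x)/d(w2⋅h): output error
            # Hidden errors: output error, back through w2 and tanh's gradient
            delta_hid = delta_out * w2[:2] * (1.0 - h ** 2)
            w2 += alpha * delta_out * h1           # SGD step for the output unit
            W1 += alpha * np.outer(delta_hid, phi) # SGD step for the hidden units

With most initializations this pushes P(y|x) toward the correct label on all four points, but the point here is the mechanics: compute δ at the output, propagate it back through the weights and the activation's gradient, and update every layer with the same SGD rule as before.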