Scaling up Deep Learning towards AI

Transcription

Scaling up Deep Learning towards AI
Yoshua Bengio, October 13, 2015, IBM Cognitive Colloquium, San Francisco

Breakthrough
•  Deep Learning: machine learning algorithms inspired by brains, based on learning a composition of multiple transformations = levels of representation / abstraction.

Impact
Deep learning has revolutionized:
•  Speech recognition
•  Object recognition
More on the way, including other areas of computer vision, NLP, dialogue, reinforcement learning, robotics, control…

Challenges to Scale towards AI
•  Computational challenge
•  Reasoning, natural language understanding and knowledge representation
•  Large-scale unsupervised learning

Computational Scaling
•  Recent breakthroughs in speech, object recognition and NLP hinged on faster computing, GPUs, and large datasets
•  In speech, vision and NLP applications we tend to find that, as Ilya Sutskever would say, BIGGER IS BETTER, because deep learning is EASY TO REGULARIZE while it is MORE DIFFICULT TO AVOID UNDERFITTING

Scaling up computation: we still have a long way to go in raw computational power
[Figure: Ian Goodfellow]

Low-Precision Training of Deep Nets
Courbariaux, Bengio & David, ICLR 2015 workshop
•  See (Gupta et al., arXiv Feb. 2015) for a recent review
•  Previous work showed that it was possible to quantize weights after training to obtain low-precision implementations of trained deep nets (8 bits or even less if you retrain and keep high precision at the top layer)
•  This work is about training with low precision
•  How many bits are required?

Number of bits for computations
[Figure: fixed point vs. dynamic fixed point; 10 bits were selected, with dynamic fixed point]

Number of bits for updating and storing weights
[Figure: fixed point vs. dynamic fixed point; 12 bits were selected]

NIPS'2015: Single-Bit Weights
BinaryConnect, Courbariaux, David & Bengio, NIPS'2015
•  Using stochastic rounding and 16-bit precision operations, we are able to train deep nets with weights quantized to 1 bit, i.e., we can get rid of most (2/3) of the multiplications
•  This could have a drastic impact on hardware implementations, especially for low-power devices…
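To make the idea concrete, here is a minimal NumPy sketch of stochastic weight binarization in the spirit of BinaryConnect; the hard-sigmoid probability, the clipping to [-1, 1] and the toy shapes are illustrative assumptions, not the authors' exact code.

    import numpy as np

    rng = np.random.default_rng(0)

    def binarize_stochastic(W):
        # Probability of +1 is a "hard sigmoid" of the real-valued weight,
        # so that E[binary weight] = clip(W, -1, 1): the expectation is preserved.
        p = np.clip((W + 1.0) / 2.0, 0.0, 1.0)
        return np.where(rng.random(W.shape) < p, 1.0, -1.0)

    # Real-valued weights are kept and updated in higher precision;
    # the binary copy is what the propagations use, which turns most
    # multiplications into sign flips and additions.
    W_real = rng.normal(scale=0.1, size=(4, 3))
    W_bin = binarize_stochastic(W_real)

    x = rng.normal(size=(1, 4))
    h = x @ W_bin          # with +/-1 weights this is just signed accumulation
    print(W_bin)
    print(h)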
Results on MNIST

Method                    Validation error rate (%)    Test error rate (%)
No regularizer            1.21 ± 0.04                  1.30 ± 0.04
BinaryConnect (det.)      1.17 ± 0.02                  1.29 ± 0.08
BinaryConnect (stoch.)    1.12 ± 0.03                  1.18 ± 0.04
50% Dropout               0.94 ± 0.04                  1.01 ± 0.04
Maxout networks [26]      –                            0.94
Deep L2-SVM [27]          –                            0.87

Table 1: Error rates of an MLP trained on MNIST depending on the method. Methods using unsupervised pretraining are not taken into account. We see that in spite of using only a single bit per weight during propagation, performance is not worse than an ordinary (no regularizer) MLP; it is actually better, especially with the stochastic version, suggesting that BinaryConnect acts as a regularizer.
Getting Rid of the Remaining Multiplications
•  The main remaining multiplications (1/3) are due to the weight update, of the form ΔW_ij ∝ (∂C/∂a_i) · h_j
•  They can be eliminated by quantizing h_j to powers of 2 (Simard & Graf 1994): the multiplication becomes a shift. Similarly for the learning rate.
•  The quantization can also be stochastic, to preserve the expected value of the update: E[h̃_j] = h_j
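As a rough illustration of stochastic power-of-two quantization (a sketch under simplifying assumptions, not the paper's exact scheme): each magnitude is rounded to one of its two neighbouring powers of 2, with probabilities chosen so that the expectation is preserved, and multiplying by the quantized value then amounts to a bit shift plus sign handling.

    import numpy as np

    rng = np.random.default_rng(0)

    def quantize_pow2_stochastic(h):
        # Round |h_j| to one of the two neighbouring powers of 2, choosing the
        # upper one with probability (|h_j| - lo) / (hi - lo) so that the
        # expected value of the quantized magnitude equals |h_j|.
        sign = np.sign(h)
        mag = np.where(h == 0, 1e-12, np.abs(h))   # avoid log2(0)
        lo = 2.0 ** np.floor(np.log2(mag))
        hi = 2.0 * lo
        p_hi = (mag - lo) / (hi - lo)
        mag_q = np.where(rng.random(h.shape) < p_hi, hi, lo)
        return sign * mag_q

    h = np.array([0.3, -0.7, 1.6, 0.05])
    samples = np.stack([quantize_pow2_stochastic(h) for _ in range(10000)])
    print(samples.mean(axis=0))   # close to h: the expected update is preserved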
Neural Networks with Few Multiplications
•  arXiv paper, 2015: Lin, Courbariaux, Memisevic & Bengio
•  Works! Slows down training a bit but improves results by regularizing

The Language Understanding Challenge
Learning Word Semantics: a Success
•  Bengio et al. 2000 introduced neural language models and word vectors (word embeddings)
[Figure: a neural language model maps an input sequence w1 … w6 to embeddings R(w1) … R(w6) and then to an output]
•  Mikolov showed these embeddings capture analogical relations, e.g., King − Man + Woman ≈ Queen
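As a toy illustration of what such analogical relations mean (hand-made 3-dimensional vectors standing in for real learned embeddings; everything below is hypothetical): the offset King − Man + Woman lands closest to Queen under cosine similarity.

    import numpy as np

    # Toy, hand-made vectors standing in for learned word embeddings;
    # real R(w) vectors would come from a trained neural language model.
    E = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "queen": np.array([0.9, 0.1, 0.8]),
        "man":   np.array([0.1, 0.9, 0.1]),
        "woman": np.array([0.1, 0.1, 0.9]),
    }

    def closest(v):
        # Word whose embedding has the highest cosine similarity with v.
        cos = lambda a, b: float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return max(E, key=lambda w: cos(E[w], v))

    print(closest(E["king"] - E["man"] + E["woman"]))   # -> queen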
The Next Challenge: Rich Semantic Representations for Word Sequences
•  First challenge: machine translation
•  Second challenge: document understanding and question answering

Attention-Based Neural Machine Translation
•  Related to earlier Graves 2013 for generating handwriting
•  (Bahdanau, Cho & Bengio, arXiv Sept. 2014)
•  (Jean, Cho, Memisevic & Bengio, arXiv Dec. 2014)
[Figure: attention-based encoder-decoder. Annotation vectors h_j are weighted by attention weights a_j (with Σ a_j = 1) computed by an attention mechanism, feeding the recurrent state z_i and the word sample u_i. Example: e = (Economic, growth, has, slowed, down, in, recent, years, .), f = (La, croissance, économique, s'est, ralentie, ces, dernières, années, .)]
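A minimal sketch of the content-based soft attention in the figure, in the spirit of Bahdanau et al. but with a simplified scorer and random placeholder weights (all names and shapes below are assumptions): the attention weights a_j are a softmax over source positions and the context is the weighted sum of the annotation vectors h_j.

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attend(z_prev, H, Wa, Ua, va):
        # One relevance score per source position, from the previous decoder
        # state and each annotation vector (simplified additive scorer).
        scores = np.tanh(z_prev @ Wa + H @ Ua) @ va
        a = softmax(scores)        # attention weights a_j, summing to 1
        c = a @ H                  # context = sum_j a_j h_j
        return a, c

    T, m, n, d = 9, 6, 4, 5        # source length, annotation dim, state dim, scorer dim
    H = rng.normal(size=(T, m))    # stand-ins for encoder annotations h_j
    z = rng.normal(size=n)         # stand-in for the current decoder state z_i
    Wa = rng.normal(size=(n, d))
    Ua = rng.normal(size=(m, d))
    va = rng.normal(size=d)

    a, c = attend(z, H, Wa, Ua, va)
    print(a.round(3), a.sum())     # one weight per source word, summing to 1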
End-to-End Machine Translation with Recurrent Nets and Attention Mechanism
•  Reached the state-of-the-art in one year, from scratch
•  How far can we go with a very large target vocabulary?

(a) English→French (WMT-14)
          NMT(A)    Google
NMT       32.68     30.6*
+Cand     33.28     –
+UNK      33.99     32.7°
+Ens      36.71     36.9°
P-SMT: 37.03•

(b) English→German (WMT-15)
BLEU    Note
24.8    Neural MT
24.0    U.Edinburgh, Syntactic SMT
23.6    LIMSI/KIT
22.8    U.Edinburgh, Phrase SMT
22.7    KIT, Phrase SMT

(c) English→Czech (WMT-15)
BLEU    Note
18.3    Neural MT
18.2    JHU, SMT+LM+OSM+Sparse
17.6    CU, Phrase SMT
17.4    U.Edinburgh, Phrase SMT
16.1    U.Edinburgh, Syntactic SMT

NMT(A): (Bahdanau et al., 2014; Jean et al., 2014); (*): (Sutskever et al., 2014); (°): (Luong et al., 2014); (•): (Durrani et al., 2014).
Image-to-Text: Caption Generation
(Xu et al., 2015), (Yao et al., 2015)
[Figure: a convolutional neural network produces annotation vectors h_j; an attention mechanism computes attention weights a_j (with Σ a_j = 1) that feed the recurrent state z_i and the word sample u_i. Example caption: f = (a, man, is, jumping, into, a, lake, .)]
The Good
And the Bad

Attention Mechanisms for Memory Access
•  Neural Turing Machines (Graves et al. 2014) and Memory Networks (Weston et al. 2014)
•  Use a form of attention mechanism to control the read and write access into a memory
•  The attention mechanism outputs a softmax over memory locations
[Figure: read / write access to the memory]
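A rough sketch of what a softmax over memory locations can look like (content-based addressing only, with made-up shapes and a simplified soft write; the actual Neural Turing Machine / Memory Network mechanisms are richer):

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def read(memory, key, beta=5.0):
        # Softmax over memory locations, by cosine similarity between the key
        # and each memory row; the read vector is the weighted sum of rows.
        sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
        w = softmax(beta * sims)
        return w, w @ memory

    def write(memory, w, erase, add):
        # Soft write: each slot is erased and updated in proportion to its weight.
        return memory * (1.0 - np.outer(w, erase)) + np.outer(w, add)

    N, d = 8, 4                              # number of memory slots, slot width
    M = rng.normal(size=(N, d))              # external memory
    key = M[3] + 0.1 * rng.normal(size=d)    # query close to the content of slot 3

    w, r = read(M, key)
    print(w.round(2))                        # weights typically concentrate on slot 3
    M = write(M, w, erase=np.ones(d), add=key)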
The Next Frontier: Reasoning and Question Answering
•  Currently working on artificial tasks, with memory networks: from "Memory Networks", Weston et al. ICLR 2015; "End-to-end memory networks", Sukhbaatar et al. NIPS'2015

Ongoing Project: Knowledge Extraction
•  Learn to fill the memory network from natural language descriptions of facts
•  Force the neural net to understand language
•  Extract knowledge from documents into a usable form
[Figure: read / write access to the memory]

Why does it work? Pushing off the Curse of Long-Term Dependencies
•  Whereas LSTM memories always decay exponentially (even if slowly), a mental state stored in an external memory can stay for arbitrarily long durations, until overwritten.
[Figure: passive copy / access]
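A tiny numerical illustration of that point (toy numbers, not a real LSTM): a value carried by an LSTM-style cell whose forget gate is even slightly below 1 decays geometrically, whereas a slot of an external memory keeps its value exactly until an explicit write.

    T = 1000
    forget_gate = 0.99        # even a forget gate close to 1 leaks information
    cell = 1.0
    for _ in range(T):
        cell = forget_gate * cell     # LSTM-style carry: c_t = f * c_{t-1}
    print(cell)                       # 0.99**1000 ~ 4e-5: the stored value has decayed

    memory = [0.0] * 8
    memory[3] = 1.0                   # store a value in an external memory slot
    # ... T steps in which nothing writes to slot 3 ...
    print(memory[3])                  # still exactly 1.0, until it is overwritten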
The Unsupervised Learning Challenge

Why Unsupervised Learning?
•  Recent progress mostly in supervised DL
•  Real technical challenges for unsupervised DL
•  Potential benefits:
   •  Exploit tons of unlabeled data
   •  Answer new questions about the variables observed
   •  Regularizer – transfer learning – domain adaptation
   •  Easier optimization (local training signal)
   •  Structured outputs

How do humans generalize from very few examples?
•  Intelligence (good generalization) needs knowledge
•  Humans transfer knowledge from previous learning:
   •  Representations
   •  Explanatory factors
•  Previous learning from: unlabeled data + labels for other tasks

Unsupervised and Transfer Learning Challenge + Transfer Learning Challenge: Won by Unsupervised Deep Learning
ICML'2011 workshop on Unsup. & Transfer Learning; NIPS'2011 Transfer Learning Challenge; Paper: ICML'2012
[Figure: challenge results from raw data and from 1, 2, 3 and 4 layers of representation]

Intractable (Exponential) Barriers
•  Statistical curse of dimensionality:
   •  Intractable number of configurations of variables, in high dimension
•  Computational curse of dimensionality:
   •  Intractable normalization constants
   •  Intractable (non-convex) optimization?
   •  Intractable inference

Deep Generative Learning: the Hot Frontier
•  Many very different approaches being explored to bypass these intractabilities
•  Exploratory mode …
•  Exciting area of research …
•  Connect to brains: bridge the gap to biology
[Figures: DRAW (DeepMind); LAPGAN (NYU/Facebook)]

From "DRAW: A Recurrent Neural Network For Image Generation", Table 3, Experimental Hyper-Parameters:

Task                             #glimpses   LSTM #h   #z    Read Size   Write Size
100 × 100 MNIST Classification   8           256       –     12 × 12     –
MNIST Model                      64          256       100   2 × 2       5 × 5
SVHN Model                       32          800       100   12 × 12     12 × 12
CIFAR Model                      64          400       200   5 × 5       5 × 5

Learning Multiple Levels of Abstraction
•  The big payoff of deep learning is to allow learning higher levels of abstraction
•  Higher-level abstractions disentangle the factors of variation, which allows much easier generalization and transfer

MILA: Montreal Institute for Learning Algorithms