Scaling up Deep Learning towards AI
Transcription
Slide 1: Scaling up Deep Learning towards AI
Yoshua Bengio, October 13, 2015, IBM Cognitive Colloquium, San Francisco

Slide 2: Breakthrough
• Deep Learning: machine learning algorithms inspired by brains, based on learning a composition of multiple transformations = levels of representation / abstraction.

Slide 3: Impact
Deep learning has revolutionized:
• Speech recognition
• Object recognition
More on the way, including other areas of computer vision, NLP, dialogue, reinforcement learning, robotics, control…

Slide 4: Challenges to Scale towards AI
• Computational challenge
• Reasoning, natural language understanding and knowledge representation
• Large-scale unsupervised learning

Slide 5: Computational Scaling
• Recent breakthroughs in speech, object recognition and NLP hinged on faster computing, GPUs, and large datasets
• In speech, vision and NLP applications we tend to find that, as Ilya Sutskever would say, BIGGER IS BETTER, because deep learning is EASY TO REGULARIZE while it is MORE DIFFICULT TO AVOID UNDERFITTING

Slide 6: Scaling up computation: we still have a long way to go in raw computational power
[Figure: Ian Goodfellow]

Slide 7: Low-Precision Training of Deep Nets
Courbariaux, Bengio & David, ICLR 2015 workshop
• See (Gupta et al., arXiv Feb. 2015) for a recent review
• Previous work showed that it was possible to quantize weights after training to obtain low-precision implementations of trained deep nets (8 bits or even less if you retrain and keep high precision at the top layer)
• This work is about training with low precision
• How many bits are required?

Slide 8: Number of bits for computations
[Figure: training curves, dynamic fixed point vs. fixed point]
10 bits were selected, with dynamic fixed point

Slide 9: Number of bits for updating and storing weights
[Figure: training curves, dynamic fixed point vs. fixed point]
12 bits were selected

Slide 10: NIPS'2015: Single-Bit Weights
BinaryConnect, Courbariaux, David & Bengio, NIPS'2015
• Using stochastic rounding and 16-bit precision operations, we are able to train deep nets with weights quantized to 1 bit, i.e., we can get rid of most (2/3) of the multiplications
• This could have a drastic impact on hardware implementations, especially for low-power devices…

Slide 11: Results on MNIST

Method                  Validation error rate (%)   Test error rate (%)
No regularizer          1.21 ± 0.04                 1.30 ± 0.04
BinaryConnect (det.)    1.17 ± 0.02                 1.29 ± 0.08
BinaryConnect (stoch.)  1.12 ± 0.03                 1.18 ± 0.04
50% Dropout             0.94 ± 0.04                 1.01 ± 0.04
Maxout networks [26]                                0.94
Deep L2-SVM [27]                                    0.87

Table 1: Error rates of an MLP trained on MNIST depending on the method. Methods using unsupervised pretraining are not taken into account. We see that in spite of using only a single bit per weight during propagation, performance is not worse than an ordinary (no regularizer) MLP; it is actually better, especially with the stochastic version, suggesting that BinaryConnect acts as a regularizer.

Slide 12: Getting Rid of the Remaining Multiplications
• The main remaining multiplications (1/3) are due to the weight update, which has the form ΔW_ij ∝ h_j · ∂C/∂a_i
• They can be eliminated by quantizing h_j to powers of 2 (Simard & Graf 1994): the multiplication becomes a shift. Similarly for the learning rate.
• The quantization can also be stochastic, so as to preserve the expected value of the update: E[h̃_j] = h_j

Slide 13: Neural Networks with Few Multiplications
• arXiv paper, 2015: Lin, Courbariaux, Memisevic & Bengio
• Works! Slows down training a bit but improves results by regularizing.
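The two tricks behind slides 10-12 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation; the function names and the plain-array setting are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize_stochastic(W):
    """Stochastically binarize real-valued weights to {-1, +1} (BinaryConnect-style):
    P(+1) is the hard sigmoid of the weight, so E[W_b] tracks W for W in [-1, 1]."""
    p = np.clip((W + 1.0) / 2.0, 0.0, 1.0)        # hard sigmoid
    return np.where(rng.random(W.shape) < p, 1.0, -1.0)

def quantize_pow2_stochastic(h):
    """Stochastically round positive activations h_j to a power of two, so that
    multiplying by h_j becomes a bit shift, while keeping E[h_q] = h."""
    h = np.maximum(h, 1e-12)
    lo = 2.0 ** np.floor(np.log2(h))              # nearest power of two below h
    hi = 2.0 * lo                                 # nearest power of two above h
    p_up = (h - lo) / (hi - lo)                   # chosen so the expectation equals h
    return np.where(rng.random(h.shape) < p_up, hi, lo)

# Toy usage on random weights and activations.
W = rng.uniform(-1.0, 1.0, size=(3, 4))
h = rng.uniform(0.1, 4.0, size=(4,))
print(binarize_stochastic(W))
print(quantize_pow2_stochastic(h))
```

With binary weights the forward and backward products need no multiplications (only sign flips and additions), and with power-of-two activations the weight update ΔW_ij ∝ h_j · ∂C/∂a_i reduces to shifts.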
Slide 14: The Language Understanding Challenge

Slide 15: Learning Word Semantics: a Success
• Bengio et al. 2000 introduced neural language models and word vectors (word embeddings)
[Figure: a neural language model maps an input word sequence w1 … w6 through embeddings R(w1) … R(w6) to an output]
• Mikolov showed these embeddings capture analogical relations
[Figure: parallel vector offsets between King/Queen and Man/Woman]

Slide 16: The Next Challenge: Rich Semantic Representations for Word Sequences
• First challenge: machine translation
• Second challenge: document understanding and question answering

Slide 17: Attention-Based Neural Machine Translation
Related to earlier Graves 2013 for generating handwriting
• (Bahdanau, Cho & Bengio, arXiv Sept. 2014)
• (Jean, Cho, Memisevic & Bengio, arXiv Dec. 2014)
[Figure: the encoder produces annotation vectors h_j; an attention mechanism computes weights a_j (with Σ_j a_j = 1) over them, feeding the recurrent state z_i from which each word u_i is sampled. Example: f = (La, croissance, économique, s'est, ralentie, ces, dernières, années, .), e = (Economic, growth, has, slowed, down, in, recent, years, .)]

Slide 18: End-to-End Machine Translation with Recurrent Nets and Attention Mechanism
• Reached the state-of-the-art in one year, from scratch
• How far can we go with a very large target vocabulary?
[Table: BLEU scores]
(a) English→French (WMT-14): NMT(A) 32.68, +Cand 33.28, +UNK 33.99, +Ens 36.71; Google 30.6*, 32.7°, 36.9°; P-SMT 37.03•
(b) English→German (WMT-15): Neural MT 24.8, U.Edinburgh Syntactic SMT 24.0, LIMSI/KIT 23.6, U.Edinburgh Phrase SMT 22.8, KIT Phrase SMT 22.7
(c) English→Czech (WMT-15): Neural MT 18.3, JHU SMT+LM+OSM+Sparse 18.2, CU Phrase SMT 17.6, U.Edinburgh Phrase SMT 17.4, U.Edinburgh Syntactic SMT 16.1
NMT(A): (Bahdanau et al., 2014; Jean et al., 2014); (*): (Sutskever et al., 2014); (°): (Luong et al., 2014); (•): (Durrani et al., 2014)

Slide 19: Image-to-Text: Caption Generation
(Xu et al., 2015), (Yao et al., 2015)
[Figure: a convolutional neural network produces annotation vectors h_j; an attention mechanism computes weights a_j (Σ_j a_j = 1) over them, feeding the recurrent state z_i that samples each word u_i. Example: f = (a, man, is, jumping, into, a, lake, .)]

Slide 20: The Good

Slide 21: And the Bad

Slide 22: Attention Mechanisms for Memory Access
• Neural Turing Machines (Graves et al. 2014) and Memory Networks (Weston et al. 2014)
• Use a form of attention mechanism to control the read and write access into a memory
• The attention mechanism outputs a softmax over memory locations
[Figure: read/write access to a memory]

Slide 23: The Next Frontier: Reasoning and Question Answering
• Currently working on artificial tasks, with memory networks: from "Memory Networks", Weston et al., ICLR 2015; "End-to-end memory networks", Sukhbaatar et al., NIPS'2015

Slide 24: Ongoing Project: Knowledge Extraction
• Learn to fill the memory network from natural language descriptions of facts
• Force the neural net to understand language
• Extract knowledge from documents into a usable form
[Figure: read/write access to a memory]

Slide 25: Why does it work? Pushing off the Curse of Long-Term Dependencies
• Whereas LSTM memories always decay exponentially (even if slowly), a mental state stored in an external memory can stay for arbitrarily long durations, until overwritten.
[Figure: passive copy vs. access]
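The attention mechanism that recurs on slides 17-24 is a softmax over annotation vectors or memory locations, followed by a weighted sum. Below is a minimal NumPy sketch using a plain dot-product score; the models on the slides learn the scoring function (e.g. a small MLP over the query and each annotation vector), so the names and the scoring choice here are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max()                   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def attention_read(query, memory):
    """Soft content-based read: score each memory slot (annotation vector h_j)
    against the query, normalize the scores into attention weights a_j that
    sum to 1, and return the weighted sum of the slots."""
    scores = memory @ query           # one score per memory location
    weights = softmax(scores)         # a_j >= 0, sum_j a_j = 1
    read = weights @ memory           # convex combination of memory slots
    return read, weights

# Toy usage: 5 annotation vectors of dimension 4 and one query vector.
rng = np.random.default_rng(0)
memory = rng.normal(size=(5, 4))
query = rng.normal(size=4)
read, weights = attention_read(query, memory)
print(weights, weights.sum())         # attention distribution over locations
```

Because the weights are a differentiable function of the query and the memory, the whole read operation can be trained end to end by backpropagation, which is what makes attention usable inside translation, captioning and memory-network models.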
Slide 26: The Unsupervised Learning Challenge

Slide 27: Why Unsupervised Learning?
• Recent progress mostly in supervised DL
• Real technical challenges for unsupervised DL
• Potential benefits:
  • Exploit tons of unlabeled data
  • Answer new questions about the variables observed
  • Regularizer – transfer learning – domain adaptation
  • Easier optimization (local training signal)
  • Structured outputs

Slide 28: How do humans generalize from very few examples?
• Intelligence (good generalization) needs knowledge
• Humans transfer knowledge from previous learning:
  • Representations
  • Explanatory factors
• Previous learning from: unlabeled data + labels for other tasks

Slide 29: Unsupervised and Transfer Learning Challenge + Transfer Learning Challenge: Won by Unsupervised Deep Learning
• ICML'2011 workshop on Unsup. & Transfer Learning
• NIPS'2011 Transfer Learning Challenge
• Paper: ICML'2012
[Figure: results on raw data and with 1, 2, 3 and 4 layers of unsupervised features]

Slide 30: Intractable (Exponential) Barriers
• Statistical curse of dimensionality:
  • Intractable number of configurations of variables, in high dimension
• Computational curse of dimensionality:
  • Intractable normalization constants
  • Intractable (non-convex) optimization?
  • Intractable inference

Slide 31: Deep Generative Learning: the Hot Frontier
• Many very different approaches being explored to bypass these intractabilities
• Exploratory mode
• Exciting area of research
• Connect to brains: bridge the gap to biology
[Figure: samples and hyper-parameter table from "DRAW: A Recurrent Neural Network For Image Generation" (DeepMind), and samples from LAPGAN (NYU/Facebook)]

Slide 32: Learning Multiple Levels of Abstraction
• The big payoff of deep learning is to allow learning higher levels of abstraction
• Higher-level abstractions disentangle the factors of variation, which allows much easier generalization and transfer

Slide 33: MILA: Montreal Institute for Learning Algorithms