Deep Learning for Language and Image Understanding
Transcription
Deep Learning for Language and Image Understanding
Richard Socher. Joint work with the MetaMind team: Romain Paulus, Elliot English, Brian Pierce and Mohit Iyyer, Andrej Karpathy and Chris Manning. (Recurring slide footer: Socher, Manning, Ng.)

Single Word Representations
• Many learning algorithms can represent and label single words.
• Continuous vector representations can capture more information.
[Visualizations: word vectors cluster by type, e.g. nouns, verbs, countries.]

What About Larger Semantic Units?
• How can we know when larger units are similar in meaning?
– "Two senators received contributions engineered by lobbyist Jack Abramoff in return for political favors."
– "Jack Abramoff attempted to bribe two legislators."
• People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements.

Potential Solution for Longer Phrases: Bag of Words
• A count vector of vocabulary size; word order is ignored.
• Good for information retrieval and topic modeling.
• ☺ "white blood cells destroying an infection" vs. ☹ "an infection destroying white blood cells"
[Illustration: both sentences map to the same count vector over the vocabulary (aardvark, an, blood, bold, …, weird, yes, zebra).]
• ☹ "This film doesn't care about cleverness, wit or any other kind of intelligent humor."

Potential Solution for Longer Phrases: Windows
• Good for part-of-speech tagging, named entity tagging and phoneme classification in speech (Collobert et al., 2011; Hinton et al., 2012).
• Example classified with label ☺: "If you enjoy being rewarded by a script that assumes you aren't very bright, then BloodWork is for you."

Potential Solution: Discrete Phrase Representations
• Formal logic and λ-calculus (Montague, 1974; Zettlemoyer, 2007).
• Discrete categories such as noun phrase or prepositional phrase:
– Many subcategories (Petrov et al., 2006)
– Lexicalized subcategories (Collins, 2003)
– Manually designed subcategories (Klein and Manning, 2003)
– Many careful features (Taskar et al., 2004; Finkel et al., 2008)
[Diagram: the phrase "a cat" labeled as an NP, or as a lexicalized NP(cat).]

New Proposal: Represent Phrases as Vectors
[Plot: a 2-D vector space in which "the country of my birth" lies near "the place where I was born", Germany near France, and Monday near Tuesday.]
• If the vector space captures syntactic and semantic information, the vectors can be used as features.

How should we map phrases into a vector space?
• Use the principle of compositionality: the meaning (vector) of a sentence is determined by (1) the meanings of its words and (2) the rules that combine them.
• The model jointly learns compositional vector representations and the tree structure, e.g. for "the country of my birth".
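To make the compositionality principle concrete, here is a minimal sketch added to this transcript (not code from the talk): a single matrix W merges two child vectors into a parent vector, applied recursively over a given binary tree. The vocabulary, dimensionality, random initialization and the bracketing of the example phrase are toy assumptions; in the models discussed next, both the parameters and the tree structure are learned.

```python
import numpy as np

np.random.seed(0)
d = 4                                          # toy embedding size (assumption)
vocab = {"the": 0, "country": 1, "of": 2, "my": 3, "birth": 4}
E = np.random.randn(len(vocab), d) * 0.1       # word vectors (would be learned)
W = np.random.randn(d, 2 * d) * 0.1            # composition matrix (would be learned)
b = np.zeros(d)

def compose(left, right):
    """Merge two child vectors into a parent vector: p = tanh(W [c1; c2] + b)."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

def phrase_vector(tree):
    """Recursively embed a binary tree given as nested tuples of words."""
    if isinstance(tree, str):
        return E[vocab[tree]]
    left, right = tree
    return compose(phrase_vector(left), phrase_vector(right))

# "the country of my birth" with an assumed (not gold) bracketing
tree = (("the", "country"), ("of", ("my", "birth")))
print(phrase_vector(tree))                     # a d-dimensional phrase vector
```

The slides that follow replace this plain composition function with richer ones: the RNTN's tensor interactions, the GB-RNN's second, top-down pass, and dependency-tree RNNs for question answering.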
Composition is Everywhere
[Image slides, followed by a knowledge-base example:]
• Composing the facts "Richard works-at MetaMind" and "MetaMind is-in Palo Alto" yields "Richard works-in Palo Alto" (True).

Outline: Recursive Deep Learning
• Goal: learning models that capture compositional meaning and jointly learn structure, feature representations and prediction tasks.
1. Sentiment Analysis
2. Question Answering
3. Grounding Meaning in the Visual World

Models for Composition
• Model family: Recursive Neural Network.
• Inputs: the representations of two candidate children.
• Outputs: (1) the semantic feature vector representing the two nodes, and (2) a sentiment prediction.
[Diagram: over the phrase "was not great", the network merges the vectors of two words into a phrase vector, then merges that with the remaining word, producing a vector and a sentiment score at each node.]

Recursive Neural Tensor Network (RNTN)
• Goal: a function that composes two vectors.
• More expressive than any other RNN so far.
• Idea: allow both additive and mediated multiplicative interactions of the vectors.

Applying an RNTN to a Sentence Tree
[Built up over several slides: the RNTN is applied bottom-up to the parse of "Not bad, pretty smart actually". Every node receives a vector and a sentiment label; "bad" is negative on its own, "pretty smart" is very positive, and the full sentence comes out positive.]

Positive/Negative Results on Treebank
• Classifying sentences: accuracy improves to 85.4.
• Since then, a deep RNN by Irsoy and Cardie has gotten the best performance.
[Bar chart: BiNB, RNN, MV-RNN and RNTN, trained with sentence labels only vs. with the full Sentiment Treebank.]

Experimental Result on Treebank
• The RNTN can capture "X but Y" constructions (131 cases in the dataset).
• The RNTN obtains an accuracy of 41%, compared to the MV-RNN (37), RNN (36) and biNB (27).

Results on Negating Negatives
• But how about negating negatives, e.g. "not bad"?
• The sentiment should not flip, but the positive activation should increase.

But not all sentiment is context independent!
• Example of the problem for sentiment classification that uses only a word's own vector: in "Android beats iOS", the word "Android" is neutral in isolation but becomes sentiment-bearing in context.

Instead: Global Belief RNN
[Figure 2: Propagation steps of the GB-RNN. Step 1 is the standard RNN feedforward process, in which the vector representation of "Android" is independent of the rest of the document. Step 2 computes additional vectors at each node, using information from the higher-level nodes in the tree, allowing "Android" and "iOS" to have different representations given the context.]
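Here is a rough sketch of the two-pass idea described for Figure 2, added to this transcript; the parameter shapes, the zero vector fed to the root, and the exact form of the top-down step are assumptions of the sketch, not the GB-RNN equations from the paper. Step 1 is the ordinary bottom-up pass; step 2 walks back down the tree so that every node, including the leaf "Android", also gets a vector that depends on its context.

```python
import numpy as np

np.random.seed(1)
d = 4
# toy word vectors (assumed); in practice these are learned embeddings
words = {w: np.random.randn(d) * 0.1 for w in ["Android", "beats", "iOS"]}
W_up = np.random.randn(d, 2 * d) * 0.1     # bottom-up composition
W_down = np.random.randn(d, 2 * d) * 0.1   # top-down step (assumed form)

class Node:
    def __init__(self, word=None, children=()):
        self.word, self.children = word, children
        self.h_up = None      # step 1: context-independent vector
        self.h_down = None    # step 2: context-dependent vector

def forward(node):
    """Step 1: standard recursive feedforward pass, from the leaves to the root."""
    if node.word is not None:
        node.h_up = words[node.word]
    else:
        kids = [forward(c) for c in node.children]
        node.h_up = np.tanh(W_up @ np.concatenate(kids))
    return node.h_up

def backward(node, parent_down):
    """Step 2: top-down pass; each node sees information from higher-level nodes."""
    node.h_down = np.tanh(W_down @ np.concatenate([parent_down, node.h_up]))
    for c in node.children:
        backward(c, node.h_down)

# (Android (beats iOS)) as a toy binary tree
root = Node(children=(Node("Android"),
                      Node(children=(Node("beats"), Node("iOS")))))
forward(root)
backward(root, np.zeros(d))    # the root receives an all-zero "context from above"
# the leaf "Android" now has both a context-free and a context-dependent vector
print(root.children[0].h_up, root.children[0].h_down)
```

A per-node sentiment classifier can then look at both vectors together, which is how "Android" and "iOS" can end up with different labels in the same tweet.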
[On-slide excerpt from the GB-RNN paper (related work): "… [20] unfold the same autoencoder multiple times, which gives it more representational power with the same number of parameters. Our model is different in that it takes into consideration more information at each step and can eventually make better local predictions by using global context. Sentiment analysis has been the subject of research for some time [4, 2, 3, 6, 1, 23]. Most approaches in sentiment analysis use bag-of-words representations that do not take the phrase structure into account but learn from word-level features. We explore our model's ability to determine contextual sentiment on Twitter, a social media platform." (Followed by Section 3: Global Belief Recursive Neural Networks.)]

Instead: Global Belief RNN
• SemEval 2013 Sentiment Competition

Table 1: Comparison to the best SemEval 2013 Task 2 systems, their feature sets and F1 results on each dataset for predicting sentiment of phrases in context. The GB-RNN obtains state-of-the-art performance on both datasets.

Classifier | Feature set | Twitter 2013 (F1) | SMS 2013 (F1)
SVM | stemming, word cluster, SentiWordNet score, negation | 85.19 | 88.37
SVM | POS, lexicon, negations, emoticons, elongated words, scores, syntactic dependency, PMI | 87.38 | 85.79
SVM | punctuation, word n-grams, emoticons, character n-grams, elongated words, upper case, stopwords, phrase length, negation, phrase position, large sentiment lexicons, microblogging features | 88.93 | 88.00
GB-RNN | parser, unsupervised word vectors (ensemble) | 89.41 | 88.40

Baseline (partial table): Bigram Naive Bayes – Twitter 2013: 80.45, SMS 2013: 78.53.

[Table 3 (fragment): F1 comparison of semantic word vectors (d = 100) and hybrid word vectors (100 + 34), forward-only vs. forward + backward propagation, on the SemEval 2013 Task 2 test set; the reported values are 85.67, 86.80, 84.70 and 87.15, with the best result coming from the hybrid vectors with the backward pass.]

[Figure 5: Change in sentiment predictions for the tweet "chelski want this so bad that it makes me even happier thinking we may beat them twice in 4 days at SB" between the RNN (left) and the GB-RNN (right).]

Outline: Recursive Deep Learning
• Goal: learning models that capture compositional meaning and jointly learn structure, feature representations and language prediction tasks.
1. Sentiment Analysis
2. Question Answering
3. Grounding Meaning in the Visual World

Question Answering: Quiz Bowl Competition
• QUESTION: He left unfinished a novel whose title character forges his father's signature to get out of school and avoids the draft by feigning desire to join. A more famous work by this author tells of the rise and fall of the composer Adrian Leverkühn. Another of his novels features the Jesuit Naphta and his opponent Settembrini, while his most famous work depicts the aging writer Gustav von Aschenbach.
Name this German author of The Magic Mountain and Death in Venice.
• ANSWER: Thomas Mann

Discussion: Compositional Structure
• Use dependency-tree Recursive Neural Networks, which capture more semantic structure.
[Figure 2: Dependency parse of a sentence from a question about Sparta: "This city's economy depended on subjugated peasants called helots."]
• Given a parse tree, leaf representations are computed first. For example, the hidden representation of "helots" is
  h_helots = f(W_v · x_helots + b),   (1)
  where f is a non-linear activation function such as tanh and b is a bias term.
• The composition equation for any node n with children K(n) and word vector x_w is
  h_n = f(W_v · x_w + b + Σ_{k ∈ K(n)} W_{R(n,k)} · h_k),
  where R(n, k) is the dependency relation between node n and child k. This extends compositionality over the standard RNN model by taking relation identity into account along with tree structure; the additional d × d matrix W_v incorporates the word vector x_w at a node into the node vector h_n. For example, the parent computation for "depended", whose children are "economy" (NSUBJ) and "on" (PREP), is
  h_depended = f(W_NSUBJ · h_economy + W_PREP · h_on + W_v · x_depended + b).
• Answers co-occur in question text (a question about World War II might mention the Battle of the Bulge and vice versa), so word vectors associated with answers can be trained in the same vector space as question text, enabling the model to capture relationships between answers instead of assuming incorrectly that all answers are independent. To take advantage of this, both the answers and the questions are trained jointly in a single model, departing from Socher et al. (2014), rather than training each separately and holding embeddings fixed.

Training
• Train on ranked pairs of sentences and correct entity vectors. Each sentence s has a hidden vector h_s, and the error for a sentence set S is
  C(S, θ) = Σ_{s ∈ S} Σ_{z ∈ Z} L(rank(c, s, Z)) · max(0, 1 − x_c · h_s + x_z · h_s),   (5)
  where the function rank(c, s, Z) gives the rank of the correct answer c with respect to the incorrect answers Z. Following Usunier et al. (2009), this rank is transformed into a loss that optimizes the top of the ranked list: L(r) = Σ_{i=1}^{r} 1/i.
• Since rank(c, s, Z) is expensive to compute, it is approximated by randomly sampling K incorrect answers until a violation is observed (x_c · h_s < 1 + x_z · h_s), and rank(c, s, Z) is set from the number of samples needed.
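The training objective above can be sketched as follows; this is an added illustration with toy stand-ins (random answer embeddings, the sampling budget K, and the rank estimate), not the QANTA training code. The idea: sample incorrect answers until one violates the margin, estimate the rank of the correct answer from how many samples that took, and weight the margin loss by L(rank).

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_answers = 50, 1000
answers = rng.normal(scale=0.1, size=(num_answers, d))   # answer vectors x_z (toy)

def weight(rank):
    """L(r) = sum_{i=1..r} 1/i -- pushes the correct answer toward the top of the list."""
    return sum(1.0 / i for i in range(1, rank + 1))

def sentence_loss(h_s, correct, wrong_ids, K=100):
    """Approximate ranking loss for one sentence representation h_s.

    Sample up to K wrong answers until one violates the margin
    (x_c . h_s < 1 + x_z . h_s), estimate the rank from the number of
    samples needed, then apply the weighted margin loss.
    """
    score_c = answers[correct] @ h_s
    for k, z in enumerate(rng.choice(wrong_ids, size=K, replace=False), start=1):
        score_z = answers[z] @ h_s
        if score_c < 1.0 + score_z:                        # margin violated
            approx_rank = max(1, len(wrong_ids) // k)      # rough rank estimate (assumption)
            return weight(approx_rank) * max(0.0, 1.0 - score_c + score_z)
    return 0.0                                             # no violation found: zero loss

h_s = rng.normal(scale=0.1, size=d)    # would come from the dependency-tree RNN
wrong = np.array([z for z in range(num_answers) if z != 42])
print(sentence_loss(h_s, correct=42, wrong_ids=wrong))
```

In the full model, the gradient of this loss is backpropagated through the dependency tree so that word vectors, relation matrices and answer vectors are trained jointly.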
While past work on RNNs has been restricted to the sentence or phrase level, sentence-level representations can easily be combined: the simplest approach is just to average the representations of all sentences of the question seen so far. As shown in Section 4 of the paper, this is powerful and performs better than the baselines. The resulting model is called QANTA: a question answering neural network with trans-sentential averaging.

[Slide: heatmap of scores between plot words (no named entities) and candidate authors; top answers: Thomas Mann, Henrik Ibsen, Joseph Conrad, Henry James, Franz Kafka.]

Results
• QANTA: question answering neural network with trans-sentential averaging.

Table 1: Accuracy for history and literature at the first two sentence positions of each question and the full question. The top half of the table compares models trained on questions only, while the IR models in the bottom half have access to Wikipedia. QANTA outperforms all baselines that are restricted to just the question data, and it substantially improves an IR model with access to Wikipedia despite being trained on much less data.

Model | History Pos 1 | History Pos 2 | History Full | Literature Pos 1 | Literature Pos 2 | Literature Full
bow | 27.5 | 51.3 | 53.1 | 19.3 | 43.4 | 46.7
bow-dt | 35.4 | 57.7 | 60.2 | 24.4 | 51.8 | 55.7
ir-qb | 37.5 | 65.9 | 71.4 | 27.4 | 54.0 | 61.9
fixed-qanta | 38.3 | 64.4 | 66.2 | 28.9 | 57.7 | 62.3
qanta | 47.1 | 72.1 | 73.7 | 36.4 | 68.2 | 69.1
ir-wiki | 53.7 | 76.6 | 77.5 | 41.8 | 74.0 | 73.3
qanta+ir-wiki | 59.8 | 81.8 | 82.3 | 44.7 | 78.7 | 76.6

Pushing Facts into Entity Vectors
[Image slide.]

QANTA Model Can Defeat Human Players
[Figure 4: Comparison of qanta+ir-wiki to human quiz bowl players; each bar represents an individual human.]

Literature Questions are Hard!

Outline: Recursive Deep Learning
• Goal: learning models that capture compositional meaning and jointly learn structure, feature representations and language prediction tasks.
1. Sentiment Analysis
2. Question Answering
3. Grounding Meaning in the Visual World

Visual Grounding of Full Sentences
• Idea: map sentences and images into a joint space.

Compositional Sentence Structure
• Use the same dependency-tree RNNs, which capture more semantic structure.

Convolutional Neural Network for Images
• CNN trained on ImageNet (Le et al. 2013).
• The RNN is trained to give large inner products between sentence and image vectors.

Results
[Qualitative examples: retrieved sentences and images, marked correct (✓) or incorrect (✗).]

Model | Describing Images (mean rank) | Image Search (mean rank)
Random | 92.1 | 52.1
Bag of Words | 21.1 | 14.6
CT-RNN | 23.9 | 16.1
Recurrent Neural Network | 27.1 | 19.2
Kernelized Canonical Correlation Analysis | 18.0 | 15.9
DT-RNN | 16.9 | 12.5

Grounded sentence-image search
• Demo: Image-Sentence!
• Demo: Train your own Deep Vision Model!
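The sentence-image objective can be sketched like this; it is the generic joint-embedding recipe with made-up dimensions and projection matrices, not the exact DT-RNN/CNN training setup from the talk. Project CNN image features and sentence vectors into a shared space, score pairs by their inner product, and use a max-margin loss so that matching pairs outscore mismatched ones.

```python
import numpy as np

rng = np.random.default_rng(2)
d_img, d_sent, d_joint = 4096, 50, 100   # toy dimensions (assumptions)
W_img = rng.normal(scale=0.01, size=(d_joint, d_img))    # CNN feature -> joint space
W_sent = rng.normal(scale=0.01, size=(d_joint, d_sent))  # sentence vector -> joint space

def score(img_feat, sent_vec):
    """Inner product of the two projections: large for matching image/sentence pairs."""
    return (W_img @ img_feat) @ (W_sent @ sent_vec)

def ranking_loss(img_feat, true_sent, wrong_sents, margin=1.0):
    """Max-margin objective: the matching sentence should outscore mismatched ones."""
    s_true = score(img_feat, true_sent)
    return sum(max(0.0, margin - s_true + score(img_feat, s)) for s in wrong_sents)

img = rng.normal(size=d_img)             # stand-in for CNN features of one image
true_sent = rng.normal(size=d_sent)      # stand-in for the DT-RNN vector of its caption
wrong = [rng.normal(size=d_sent) for _ in range(5)]
print(ranking_loss(img, true_sent, wrong))
```

At retrieval time the same score is all that is needed: image search ranks images for a query sentence, and describing images ranks sentences for a query image; the mean rank of the gold item is the number reported in the table above.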
Conclusion
• Deep Learning can learn grounded representations.
• Recursive Deep Learning can learn compositional representations.
• The combination can be employed in a variety of tasks requiring world or visual knowledge.

[On-slide figures from the QANTA paper:]
[Figure 5: t-SNE 2-D projections of 451 answer vectors divided into six major clusters. The blue cluster is predominantly populated by U.S. presidents; the zoomed plot reveals temporal clustering among the presidents based on the years they spent in office.]
[Figure 6: A question on the German novelist Thomas Mann that contains no named entities, along with the five top answers as scored by qanta (Thomas Mann, Henrik Ibsen, Joseph Conrad, Henry James, Franz Kafka). Each cell in the heatmap corresponds to the score (inner product) between a node in the parse tree and the given answer, and the dependency parse of the sentence is shown on the left. All of the baselines, including ir-wiki, are wrong, while qanta uses the plot description to predict the correct answer.]
[On-slide paper excerpt: a sentence's meaning comes from the meaning of the words it contains as well as the syntax that glues those words together; many computational models of compositionality focus on learning vector spaces (Zanzotto et al., 2010; Erk, 2012; Grefenstette et al., 2013; Yessenalina and Cardie, 2011).]