Natural Language Inference, Reading Comprehension and Deep Learning
Christopher Manning  @chrmanning • @stanfordnlp
Stanford University
SIGIR 2016

Machine Comprehension
Tested by question answering (Burges):
"A machine comprehends a passage of text if, for any question regarding that text that can be answered correctly by a majority of native speakers, that machine can provide a string which those speakers would agree both answers that question, and does not contain information irrelevant to that question."

IR needs language understanding
• There were some things that kept IR and NLP apart
  • IR was heavily focused on efficiency and scale
  • NLP was way too focused on form rather than meaning
• Now there are compelling reasons for them to come together
  • Taking IR precision and recall to the next level
    • [car parts for sale]
    • Should match: Selling automobile and pickup engines, transmissions
    • Example from Jeff Dean's WSDM 2016 talk
  • Information retrieval / question answering in mobile contexts
    • Web snippets no longer cut it on a watch!

Menu
1. Natural logic: A weak logic over human languages for inference
2. Distributed word representations
3. Deep, recursive neural network language understanding

How can information retrieval be viewed more as theorem proving (than matching)?

AI2 4th Grade Science Question Answering [Angeli, Nayak & Manning, ACL 2016]
Our "knowledge": Ovaries are the female part of the flower, which produces eggs that are needed for making seeds.
The question: Which part of a plant produces the seeds?
The answer choices: the flower / the leaves / the stem / the roots

How can we represent and reason with broad-coverage knowledge?
1. Rigid-schema knowledge bases with well-defined logical inference
2. Open-domain knowledge bases (OpenIE) – no clear ontology or inference [Etzioni et al. 2007ff]
3. Human language text KB – no rigid schema, but with "natural logic" we can do formal inference over human language text [MacCartney and Manning 2008]

Natural Language Inference [Dagan 2005; MacCartney & Manning 2009]
Does a piece of text follow from or contradict another?
  Two senators received contributions engineered by lobbyist Jack Abramoff in return for political favors.
  Jack Abramoff attempted to bribe two legislators.
  Follows
Here, try to prove or refute according to a large text collection:
1. The flower of a plant produces the seeds
2. The leaves of a plant produces the seeds
3. The stem of a plant produces the seeds
4. The roots of a plant produces the seeds

Text as Knowledge Base
Storing knowledge as text is easy!
Doing inferences over text might be hard:
  Don't want to run inference over every fact!
  Don't want to store all the inferences!
Inferences … on demand from a query … [Angeli and Manning 2014] … using text as the meaning representation

Natural Logic: logical inference over text
We are doing logical inference:
  The cat ate a mouse ⊨ ¬ No carnivores eat animals
We do it with natural logic:
  If I mutate a sentence in this way, do I preserve its truth?
  Post-Deal Iran Asks if U.S. Is Still 'Great Satan,' or Something Less ⊨ A Country Asks if U.S. Is Still 'Great Satan,' or Something Less
• A sound and complete weak logic [Icard and Moss 2014]
• Expressive for common human inferences
• "Semantic" parsing is just syntactic parsing
• Tractable: polynomial-time entailment checking
• Plays nicely with lexical matching back-off methods

#1. Common sense reasoning
Polarity in Natural Logic
We order phrases in partial orders
  Simplest one: is-a-kind-of
  Also: geographical containment, etc.
Polarity: In a certain context, is it valid to move up or down in this order?
Example inferences:
  Quantifiers determine the polarity of phrases
  Valid mutations consider polarity
Successful toy inference:
• All cats eat mice ⊨ All house cats consume rodents
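To make the polarity idea concrete, here is a minimal sketch (my own toy illustration, not NaturalLI's implementation) of monotonicity-guided mutation over a tiny is-a order: under "All", the restrictor position is downward-monotone, so it may be specialized, while the body is upward-monotone, so it may be generalized. The hierarchy and helper names below are invented for illustration.

```python
# Toy sketch of polarity-guided mutation in natural logic (illustrative only;
# the hierarchy and helper names are invented, not NaturalLI's implementation).
ISA = {
    "house cat": "cat",
    "cat": "carnivore",
    "mouse": "rodent",
    "rodent": "animal",
}

def generalize(term):
    """Move up the is-a order: valid in an upward-monotone (positive-polarity) position."""
    return ISA.get(term)

def specialize(term):
    """Move down the is-a order: valid in a downward-monotone (negative-polarity) position."""
    for child, parent in ISA.items():
        if parent == term:
            return child
    return None

def mutate_all(restrictor, body):
    """For 'All <restrictor> eat <body>': the restrictor is downward-monotone,
    the body is upward-monotone, so these single-edit mutations preserve truth."""
    edits = []
    down = specialize(restrictor)
    if down is not None:
        edits.append((down, body))        # All house cats eat mice
    up = generalize(body)
    if up is not None:
        edits.append((restrictor, up))    # All cats eat rodents
    return edits

print(mutate_all("cat", "mouse"))
# [('house cat', 'mouse'), ('cat', 'rodent')]
```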
"Soft" Natural Logic
• We also want to make likely (but not certain) inferences
• Same motivation as Markov logic, probabilistic soft logic, etc.
• Each mutation edge template feature fi has a cost θi ≥ 0
• Cost of an edge is θi · fi
• Cost of a path is θ · f
• Can learn the parameters θ
• Inference is then graph search (a sketch follows at the end of this part)

#2. Dealing with real sentences
Natural logic works with facts like these in the knowledge base:
  Obama was born in Hawaii
But real-world sentences are complex and long:
  Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he served as president of the Harvard Law Review.
Approach:
1. A classifier divides long sentences into entailed clauses
2. Natural logic inference can shorten these clauses

Universal Dependencies (UD)
http://universaldependencies.github.io/docs/
A single level of typed dependency syntax that
(i) works for all human languages
(ii) gives a simple, human-friendly representation of sentences
Dependency syntax is better than a phrase-structure tree for machine interpretation – it's almost a semantic network
UD aims to be linguistically better across languages than earlier representations, such as CoNLL dependencies

Generation of minimal clauses
1. Classification problem: given a dependency edge, does it introduce a clause?
2. Is it missing a controlled subject from the subject/object?
3. Shorten clauses while preserving validity, using natural logic!
• All young rabbits drink milk ⊭ All rabbits drink milk
• OK: SJC, the Bay Area's third largest airport, often experiences delays due to weather.
  Often better: SJC often experiences delays.

#3. Add a lexical alignment classifier
• Sometimes we can't quite make the inferences that we would like to make
• We use a simple lexical match back-off classifier with features:
  • Matching words, mismatched words, unmatched words
• These always work pretty well
• This was the lesson of RTE evaluations, and perhaps of IR in general

The full system
• We run our usual search over split-up, shortened clauses
• If we find a premise, great!
• If not, we use the lexical classifier as an evaluation function
• We work to do this quickly at scale
  • Visit 1M nodes/second; don't re-featurize, just delta
  • 32-byte search states (thanks Gabor!)

Solving NY State 4th grade science (Allen AI Institute datasets)
Multiple-choice questions from real 4th grade science exams:
  Which activity is an example of a good health habit?
  (A) Watching television (B) Smoking cigarettes (C) Eating candy (D) Exercising every day
In our corpus knowledge base:
• Plasma TVs can display up to 16 million colors ... great for watching TV ... also make a good screen.
• Not smoking or drinking alcohol is good for health, regardless of whether clothing is worn or not.
• Eating candy for diner is an example of a poor health habit.
• Healthy is exercising

Solving 4th grade science (Allen AI NDMC)
  System                                                    Dev   Test
  KnowBot [Hixon et al. NAACL 2015]                          45    –
  KnowBot (augmented with human in loop)                     57    –
  IR baseline (Lucene)                                       49    42
  NaturalLI                                                  52    51
  More data + IR baseline                                    62    58
  More data + NaturalLI                                      65    61
  NaturalLI + lexical classifier                             74    67
  Aristo [Clark et al. 2016], 6 systems, even more data       –    71
Test set: New York Regents 4th Grade Science exam multiple-choice questions from AI2.
Training: basic is Barron's study guide; "more data" is the SciText corpus from AI2. Score: % correct.

Natural Logic
• Can we just use text as a knowledge base?
• Natural logic provides a useful, formal (weak) logic for textual inference
• Natural logic is easily combinable with lexical matching methods, including neural net methods
• The resulting system is useful for:
  • Common-sense reasoning
  • Question answering
  • Open information extraction
    • i.e., getting out relation triples from text
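As a concrete illustration of "inference is then graph search" above, here is a minimal sketch of cost-weighted (uniform-cost) search from a query fact toward facts stored as text. It is only a sketch under assumed interfaces: the `mutations` generator and `edge_cost` function stand in for the real mutation-edge templates and their learned costs θ · f, and the actual NaturalLI search is far more heavily engineered (1M nodes/second, 32-byte states).

```python
# Minimal sketch of "soft" natural logic inference as cost-weighted graph search.
# Illustrative only: `mutations` and `edge_cost` are placeholders for the real
# mutation-edge templates and their learned costs theta_i * f_i (all >= 0).
import heapq

def natural_logic_search(query, knowledge_base, mutations, edge_cost, max_cost=10.0):
    """Uniform-cost search from a query fact toward any fact stored as text.

    mutations(fact) yields (next_fact, edge_features) pairs; edge_cost(edge_features)
    returns the non-negative cost of taking that mutation edge.
    Returns the cheapest path cost at which a KB fact is reached, or None.
    """
    frontier = [(0.0, query)]
    best = {query: 0.0}
    while frontier:
        cost, fact = heapq.heappop(frontier)
        if fact in knowledge_base:
            return cost                      # a premise supports the query
        if cost > best.get(fact, float("inf")) or cost > max_cost:
            continue                         # stale heap entry, or too implausible
        for next_fact, features in mutations(fact):
            new_cost = cost + edge_cost(features)
            if new_cost < best.get(next_fact, float("inf")):
                best[next_fact] = new_cost
                heapq.heappush(frontier, (new_cost, next_fact))
    return None                              # no premise found: fall back to the lexical classifier
```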
Can information retrieval benefit from distributed representations of words?

From symbolic to distributed representations (Sec. 9.2.2)
The vast majority of rule-based or statistical NLP and IR work regarded words as atomic symbols: hotel, conference, walk
In vector space terms, this is a vector with one 1 and a lot of zeroes:
  [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
We now call this a "one-hot" representation.

From symbolic to distributed representations (Sec. 9.2.2)
Its problem:
• If a user searches for [Dell notebook battery size], we would like to match documents with "Dell laptop battery capacity"
• If a user searches for [Seattle motel], we would like to match documents containing "Seattle hotel"
But
  motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]ᵀ · hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
Our query and document vectors are orthogonal
There is no natural notion of similarity in a set of one-hot vectors

Capturing similarity
There are many things you can do about similarity, many well known in IR:
  Query expansion with synonym dictionaries
  Learning word similarities from large corpora
But a word representation that encodes similarity wins:
  Fewer parameters to learn (per word, not per pair)
  More sharing of statistics
  More opportunities for multi-task learning

Distributional similarity-based representations
You can get a lot of value by representing a word by means of its neighbors
"You shall know a word by the company it keeps" (J.R. Firth 1957: 11)
One of the most successful ideas of modern NLP
  government debt problems turning into banking crises as has happened in
  saying that Europe needs unified banking regulation to replace the hodgepodge
These words will represent "banking"

Basic idea of learning neural network word embeddings
We define some model that aims to predict a word based on other words in its context, e.g. choose
  argmax_w  w · ((w_{j−1} + w_{j+1}) / 2)
which has a loss function, e.g.
  J = 1 − w_j · ((w_{j−1} + w_{j+1}) / 2)
(unit-norm vectors)
We look at many samples from a big language corpus
We keep adjusting the vector representations of words to minimize this loss (a toy sketch follows below)

With distributed, distributional representations, syntactic and semantic similarity is captured
  currency = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]

Distributional representations can solve the fragility of NLP tools
Standard NLP systems – here, the Stanford Parser – are incredibly fragile because of symbolic representations
  Crazy sentential complement, such as for "likes [(being) crazy]"

Distributional representations can capture the long tail of IR similarity
Google's RankBrain
Not necessarily as good for the head of the query distribution, but great for seeing similarity in the tail
3rd most important ranking signal (we're told …)

LSA (Latent Semantic Analysis) vs. word2vec (Sec. 18.3)
LSA: Count! models
• Factorize a (maybe weighted, often log-scaled) term-document (Deerwester et al. 1990) or word-context matrix (Schütze 1992) into UΣVᵀ
• Retain only k singular values, in order to generalize
[Cf. Baroni: Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL 2014]

LSA vs. word2vec (Sec. 18.3)
word2vec CBOW / Skip-gram: Predict! models
[Mikolov et al. 2013]: Simple predict models for learning word vectors
• Train word vectors to try to either:
  • Predict a word given its bag-of-words context (CBOW); or
  • Predict a context word (position-independent) from the center word
• Update word vectors until they can do this prediction well

word2vec encodes semantic components as linear relations

COALS model (count-modified LSA) [Rohde, Gonnerman & Plaut, ms., 2005]
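As a toy illustration of the predict-style objective sketched above (not word2vec itself, which optimizes a softmax or negative-sampling objective), the snippet below nudges unit-norm word vectors toward the average of their neighbours' vectors under the loss J = 1 − w_j · ((w_{j−1} + w_{j+1})/2). The corpus, dimensionality, and learning rate are all invented for illustration.

```python
# Toy sketch of learning word vectors by predicting a word from its neighbours.
# Illustrative only: word2vec optimizes a softmax / negative-sampling objective,
# not this simplified loss J = 1 - w_j . ((w_{j-1} + w_{j+1}) / 2).
import numpy as np

corpus = ("banking crises hit europe . debt crises hit europe . "
          "unified banking regulation . unified debt regulation .").split()
dim, lr, epochs = 25, 0.1, 200
rng = np.random.default_rng(0)
vec = {w: rng.normal(size=dim) for w in sorted(set(corpus))}

def unit(v):
    return v / np.linalg.norm(v)

for w in vec:
    vec[w] = unit(vec[w])

for _ in range(epochs):
    for j in range(1, len(corpus) - 1):
        word = corpus[j]
        context = (vec[corpus[j - 1]] + vec[corpus[j + 1]]) / 2
        # The gradient of J with respect to w_j is -context: step toward the
        # context average, then renormalize to keep unit-norm vectors.
        vec[word] = unit(vec[word] + lr * context)

# "banking" and "debt" occur in similar contexts here, so their vectors should
# end up more similar than those of words with unrelated contexts.
print(round(float(vec["banking"] @ vec["debt"]), 3),
      round(float(vec["banking"] @ vec["europe"]), 3))
```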
Count-based vs. direct prediction
Count-based: LSA, HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)
• Fast training
• Efficient usage of statistics
• Primarily used to capture word similarity
• May not use the best methods for scaling counts
Direct prediction: NNLM, HLBL, RNN, word2vec Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)
• Scales with corpus size
• Inefficient usage of statistics
• Generate improved performance on other tasks
• Can capture complex patterns beyond word similarity

Encoding meaning in vector differences [Pennington, Socher & Manning, EMNLP 2014]
Crucial insight: Ratios of co-occurrence probabilities can encode meaning components
                              x = solid   x = gas   x = water   x = random
  P(x | ice)                    large       small     large       small
  P(x | steam)                  small       large     large       small
  P(x | ice) / P(x | steam)     large       small     ~1          ~1

Encoding meaning in vector differences [Pennington, Socher & Manning, EMNLP 2014]
Crucial insight: Ratios of co-occurrence probabilities can encode meaning components
                              x = solid    x = gas     x = water   x = fashion
  P(x | ice)                   1.9×10⁻⁴    6.6×10⁻⁵    3.0×10⁻³    1.7×10⁻⁵
  P(x | steam)                 2.2×10⁻⁵    7.8×10⁻⁴    2.2×10⁻³    1.8×10⁻⁵
  P(x | ice) / P(x | steam)    8.9         8.5×10⁻²    1.36        0.96

Encoding meaning in vector differences
Q: How can we capture ratios of co-occurrence probabilities as meaning components in a word vector space?
A: A log-bilinear model, w_i · w_j = log P(i | j), so that with vector differences
   w_x · (w_a − w_b) = log ( P(x | a) / P(x | b) )

GloVe word similarities [Pennington et al., EMNLP 2014]
Nearest words to frog:
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus
http://nlp.stanford.edu/projects/glove/

GloVe visualizations
http://nlp.stanford.edu/projects/glove/

GloVe visualizations: Company - CEO

Named Entity Recognition performance
  Model on CoNLL     CoNLL '03 dev   CoNLL '03 test   ACE2   MUC7
  Categorical CRF        91.0            85.4          77.4   73.4
  SVD (log tf)           90.5            84.8          73.6   71.5
  HPCA                   92.6            88.7          81.7   80.7
  C&W                    92.2            87.4          81.7   80.2
  CBOW                   93.1            88.2          82.2   81.1
  GloVe                  93.2            88.3          82.9   82.2
F1 score of a CRF trained on CoNLL-2003 English with 50-dimensional word vectors

Word embeddings: Conclusion
GloVe translates meaningful relationships between word-word co-occurrence counts into linear relations in the word vector space
GloVe shows the connection between Count! work and Predict! work – appropriate scaling of counts gives the properties and performance of Predict! models
A lot of other important work in this line of research:
  [Levy & Goldberg 2014]
  [Arora, Li, Liang, Ma & Risteski 2015]
  [Hashimoto, Alvarez-Melis & Jaakkola 2016]
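As a quick, hands-on illustration of these properties, the sketch below loads pre-trained GloVe vectors in the plain-text format distributed at http://nlp.stanford.edu/projects/glove/ (one word followed by its values per line) and runs a nearest-neighbour query and a vector-difference analogy. The particular file name is an assumption about which download you use, and the brute-force loop is kept simple rather than fast.

```python
# Minimal sketch: nearest neighbours and vector-difference analogies with GloVe.
# Assumes glove.6B.50d.txt (downloaded from nlp.stanford.edu/projects/glove/)
# is in the working directory; adjust the path to whichever file you use.
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

def nearest(vectors, query_vec, exclude=(), k=5):
    """Brute-force cosine nearest neighbours (simple, not fast)."""
    sims = []
    for word, vec in vectors.items():
        if word in exclude:
            continue
        sim = query_vec @ vec / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
        sims.append((sim, word))
    return [w for _, w in sorted(sims, reverse=True)[:k]]

glove = load_glove("glove.6B.50d.txt")

# Word similarity: neighbours of "frog" should include toad, lizard, ...
print(nearest(glove, glove["frog"], exclude={"frog"}))

# Linear relations via vector differences: king - man + woman ≈ queen
target = glove["king"] - glove["man"] + glove["woman"]
print(nearest(glove, target, exclude={"king", "man", "woman"}))
```

With the 50-dimensional 6B vectors, the frog query should surface amphibian neighbours like those on the slide, and the king/man/woman query should rank "queen" at or near the top.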
Can we use neural networks to understand, not just word similarities, but language meaning in general?

Compositionality
Artificial intelligence requires being able to understand bigger things from knowing about smaller parts

We need more than word embeddings!
How can we know when larger linguistic units are similar in meaning?
  The snowboarder is leaping over the mogul
  A person on a snowboard jumps into the air
People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements

Beyond the bag of words: Sentiment detection
Is the tone of a piece of text positive, negative, or neutral?
• The received wisdom is that sentiment is "easy"
• Detection accuracy for longer documents is ~90%, BUT
  …… loved …………… great ……………… impressed ……………… marvelous …………

Stanford Sentiment Treebank
• 215,154 phrases labeled in 11,855 sentences
• Can train and test compositions
http://nlp.stanford.edu:8080/sentiment/

Tree-Structured Long Short-Term Memory Networks [Tai et al., ACL 2015]

Tree-structured LSTM
Generalizes the sequential LSTM to trees with any branching factor

Positive/Negative Results on Treebank
[Bar chart: accuracy (roughly 75 to 95%) for BiNB, RNN, MV-RNN, RNTN, and TreeLSTM, comparing training with sentence labels vs. training with the Treebank]

Experimental Results on Treebank
• The TreeRNN can capture constructions like "X but Y"
• Biword Naïve Bayes is only 58% on these

Stanford Natural Language Inference Corpus
http://nlp.stanford.edu/projects/snli/
570K Turker-judged pairs, based on an assumed picture:
  A man rides a bike on a snow covered road. / A man is outside.  ENTAILMENT
  2 female babies eating chips. / Two female babies are enjoying chips.  NEUTRAL
  A man in an apron shopping at a market. / A man in an apron is preparing dinner.  CONTRADICTION

NLI with Tree-RNNs [Bowman, Angeli, Potts & Manning, EMNLP 2015]
Approach: We would like to work out the meaning of each sentence separately – a pure compositional model
Then we compare them with a NN and classify for inference
[Diagram: learned word vectors ("man", "in", "snow", "outside") are combined bottom-up by a composition NN layer into phrase vectors ("man outside" vs. "man in snow"); comparison NN layer(s) and a softmax classifier then output, e.g., P(Entail) = 0.8]

Tree recursive NNs (TreeRNNs)
Theoretically appealing
Very empirically competitive
But:
  Prohibitively slow
  Usually require an external parser
  Don't exploit the complementary linear structure of language

A recurrent NN allows efficient batched computation on GPUs
TreeRNN: input-specific structure undermines batched computation

The Shift-reduce Parser-Interpreter NN (SPINN) [Bowman, Gauthier et al. 2016]
Base model equivalent to a TreeRNN, but …
  supports batched computation: 25× speedups
Plus:
  Effective new hybrid that combines linear and tree-structured context
  Can stand alone without a parser

Beginning observation: binary trees = transition sequences
  SHIFT SHIFT REDUCE SHIFT SHIFT REDUCE REDUCE
  SHIFT SHIFT SHIFT SHIFT REDUCE REDUCE REDUCE
  SHIFT SHIFT SHIFT REDUCE SHIFT REDUCE REDUCE

The Shift-reduce Parser-Interpreter NN (SPINN)
The model includes a sequence LSTM RNN:
• This acts as a simple parser by predicting SHIFT or REDUCE
• It also gives left sequence context as input to composition

Implementing the stack
• Naïve implementation: simulate the stacks in a batch with a fixed-size multidimensional array at each time step
  • Backpropagation requires that each intermediate stack be maintained in memory
  • ⇒ Large amount of data copying and movement required
• Efficient implementation:
  • Have only one stack array for each example
  • At each time step, augment it with the current head of the stack
  • Keep a list of backpointers for REDUCE operations
  • Similar to zipper data structures employed elsewhere

A thinner stack
  Step  Array               Backpointers
  1     Spot                1
  2     sat                 1 2
  3     down                1 2 3
  4     (sat down)          1 4
  5     (Spot (sat down))   5
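Here is a minimal sketch of the thin-stack idea (my own illustration, not the released SPINN code): composed values go into an append-only array, and backpointers record what sits below each element, so a REDUCE never copies the whole stack and each step adds only O(1) new state.

```python
# Minimal sketch of SPINN's "thin stack": an append-only array of values plus
# backpointers, so each step stores O(1) new state instead of copying the stack.
def run_transitions(tokens, transitions, compose=lambda l, r: f"({l} {r})"):
    buffer = list(tokens)      # words waiting to be shifted
    values = []                # append-only array of all values ever computed
    prev = []                  # prev[i] = index of the element below values[i] on the stack
    tops = []                  # tops[t] = index in `values` of the stack top after step t
    for op in transitions:
        if op == "SHIFT":
            values.append(buffer.pop(0))
            prev.append(tops[-1] if tops else None)
        else:                  # REDUCE: combine the top two stack elements
            right = tops[-1]
            left = prev[right]
            values.append(compose(values[left], values[right]))
            prev.append(prev[left])   # the new element sits where the left child sat
        tops.append(len(values) - 1)
    return values[tops[-1]]           # value at the top of the stack = the root

print(run_transitions(
    ["Spot", "sat", "down"],
    ["SHIFT", "SHIFT", "SHIFT", "REDUCE", "REDUCE"]))
# (Spot (sat down))
```

In SPINN itself, `compose` would be a (Tree)LSTM composition function rather than string concatenation, and many such arrays are advanced in lockstep to get the batched computation described above.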
Using SPINN for natural language inference
[Diagram: each sentence ("the cat sat down" → ((the cat) (sat down)); "the cat is angry" → ((the cat) (is angry))) is encoded by SPINN into sentence vectors o₁ and o₂, which are then compared and classified]

SNLI Results
  Model                                                       % Accuracy (test set)
  Feature-based classifier                                        78.2
  Previous SOTA sentence encoder [Mou et al. 2016]                82.1
  LSTM RNN sequence model                                         80.6
  Tree LSTM                                                       80.9
  SPINN                                                           83.2
  SOTA (sentence pair alignment model) [Parikh et al. 2016]       86.8

Successes for SPINN over LSTM
Examples with negation:
• P: The rhythmic gymnast completes her floor exercise at the competition.
• H: The gymnast cannot finish her exercise.
Long examples (> 20 words):
• P: A man wearing glasses and a ragged costume is playing a Jaguar electric guitar and singing with the accompaniment of a drummer.
• H: A man with glasses and a disheveled outfit is playing a guitar and singing along with a drummer.

Envoi
• There are very good reasons for wanting to represent meaning with distributed representations
• So far, distributional learning has been most effective for this
  • But cf. [Young, Lai, Hodosh & Hockenmaier 2014] on denotational representations, using visual scenes
• However, we want not just word meanings, but also:
  • Meanings of larger units, calculated compositionally
  • The ability to do natural language inference
• The SPINN model is fast – close to recurrent networks!
• Its hybrid sequence/tree structure is psychologically plausible and outperforms other sentence composition methods

Final Thoughts
[Timeline: deep learning arriving in speech (2011), vision (2013), NLP (2015), and IR (2017?); "You are here", between NLP and IR]

Final Thoughts
I'm certain that deep learning will come to dominate SIGIR over the next couple of years … just like speech, vision, and NLP before it. This is a good thing. Deep learning provides some powerful new techniques that are just being amazingly successful on many hard applied problems. However, we should realize that there is also currently a huge amount of hype about deep learning and artificial intelligence. We should not let a genuine enthusiasm for important and successful new techniques lead to irrational exuberance or a diminished appreciation of other approaches.
Finally, despite the efforts of a number of people, in practice there has been a considerable division between the human language technology fields of IR, NLP, and speech. Partly this is due to organizational factors and partly that at one time the subfields each had a very different focus. However, recent changes in emphasis – with IR people wanting to understand the user better and NLP people much more interested in meaning and context – mean that there are a lot of common interests, and I would encourage much more collaboration between NLP and IR in the next decade.