Natural Language Inference, Reading Comprehension and Deep Learning
Christopher Manning
@chrmanning • @stanfordnlp
Stanford University
SIGIR 2016
Machine Comprehension
Tested by question answering (Burges)
“A machine comprehends a passage of text if, for any question regarding that text that can be answered correctly by a majority of native speakers, that machine can provide a string which those speakers would agree both answers that question, and does not contain information irrelevant to that question.”
IR needs language understanding
• There were some things that kept IR and NLP apart
  • IR was heavily focused on efficiency and scale
  • NLP was way too focused on form rather than meaning
• Now there are compelling reasons for them to come together
• Taking IR precision and recall to the next level
  • [car parts for sale]
  • Should match: Selling automobile and pickup engines, transmissions
  • Example from Jeff Dean’s WSDM 2016 talk
• Information retrieval / question answering in mobile contexts
  • Web snippets no longer cut it on a watch!
Menu
1. Natural logic: A weak logic over human languages for inference
2. Distributed word representations
3. Deep, recursive neural network language understanding
How can information retrieval
be viewed more as theorem
proving (than matching)?
AI2 4th Grade Science Question
Answering
[Angeli, Nayak, & Manning,
ACL 2016]
Our “knowledge”:
Ovaries are the female part of the flower, which produces eggs that are needed for making seeds.
The question:
Which part of a plant produces the seeds?
The answer choices:
the flower, the leaves, the stem, the roots
How can we represent and reason with
broad-coverage knowledge?
1. Rigid-schema knowledge bases with well-defined logical inference
2. Open-domain knowledge bases (OpenIE) – no clear ontology or inference [Etzioni et al. 2007ff]
3. Human language text KB – no rigid schema, but with “Natural logic” we can do formal inference over human language text [MacCartney and Manning 2008]
Natural Language Inference
[Dagan 2005, MacCartney & Manning, 2009]
Does a piece of text follow from or contradict another?
Two senators received contributions engineered by lobbyist Jack Abramoff in return for political favors.
Jack Abramoff attempted to bribe two legislators.
Follows
Here we try to prove or refute according to a large text collection:
1. The flower of a plant produces the seeds
2. The leaves of a plant produces the seeds
3. The stem of a plant produces the seeds
4. The roots of a plant produces the seeds
Text as Knowledge Base
Storing knowledge as text is easy!
Doing inferences over text might be hard
Don’t want to run inference over every fact!
Don’t want to store all the inferences!
Inferences … on demand from a query … [Angeli and Manning 2014]
… using text as the meaning representation
Natural Logic: logical inference
over text
We are doing logical inference
The cat ate a mouse ⊨ ¬ No carnivores eat animals
We do it with natural logic
If I mutate a sentence in this way, do I preserve its truth?
Post-Deal Iran Asks if U.S. Is Still ‘Great Satan,’ or Something Less ⊨
A Country Asks if U.S. Is Still ‘Great Satan,’ or Something Less
• A sound and complete weak logic [Icard and Moss 2014]
• Expressive for common human inferences
• “Semantic” parsing is just syntactic parsing
• Tractable: polynomial-time entailment checking
• Plays nicely with lexical matching back-off methods
#1. Common sense reasoning
Polarity in Natural Logic
We order phrases in partial orders
• Simplest one: is-a-kind-of
• Also: geographical containment, etc.
Polarity: In a certain context, is it valid to move up or down in this order?
Example inferences
• Quantifiers determine the polarity of phrases
• Valid mutations consider polarity
• Successful toy inference (sketched below):
  All cats eat mice ⊨ All house cats consume rodents
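To make the polarity idea concrete, here is a minimal, hypothetical Python sketch: a toy is-a-kind-of order plus a check of whether a lexical mutation is truth-preserving given the monotonicity (polarity) of its position. The ontology and function names are illustrative assumptions, not the actual NaturalLI implementation.

```python
# Minimal sketch (illustrative only, not the NaturalLI system): a toy is-a-kind-of
# partial order and a polarity check for lexical mutations.
IS_A = {
    "house cat": "cat", "cat": "carnivore", "carnivore": "animal",
    "mouse": "rodent", "rodent": "animal",
}

def more_general(a, b):
    """True if b lies at or above a in the is-a order (a is a kind of b)."""
    while a != b and a in IS_A:
        a = IS_A[a]
    return a == b

def valid_mutation(old, new, polarity):
    """Is replacing `old` with `new` truth-preserving in a position of this polarity?"""
    if polarity == "up":      # upward-monotone position: may move up to a more general term
        return more_general(old, new)
    if polarity == "down":    # downward-monotone position: may move down to a more specific term
        return more_general(new, old)
    return old == new

# "All cats eat mice" entails "All house cats consume rodents":
# the restrictor of "all" is downward monotone, its scope is upward monotone.
print(valid_mutation("cat", "house cat", "down"))   # True: cats -> house cats
print(valid_mutation("mouse", "rodent", "up"))      # True: mice -> rodents
```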
“Soft” Natural Logic
• We also want to make likely (but not certain) inferences
• Same motivation as Markov logic, probabilistic soft logic, etc.
• Each mutation edge template feature fᵢ has a cost θᵢ ≥ 0
• Cost of an edge is θᵢ · fᵢ
• Cost of a path is θ · f
• Can learn parameters θ
• Inference is then graph search (a minimal sketch follows)
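As a concrete illustration of “inference is then graph search”, here is a minimal sketch of uniform-cost search from a query over mutation edges, where each edge template carries a learned non-negative cost. The mutation generator, cost table, and toy knowledge base are hypothetical stand-ins, not the actual system.

```python
import heapq

def soft_inference(query, knowledge_base, mutations, theta, max_cost=10.0):
    """Uniform-cost search over mutation edges; the cost of a path is the sum of the
    per-template costs theta[t] >= 0 of the edges it uses (the theta . f of the slide)."""
    frontier, best = [(0.0, query)], {query: 0.0}
    while frontier:
        cost, fact = heapq.heappop(frontier)
        if fact in knowledge_base:            # reached a premise supporting the query
            return fact, cost
        if cost >= max_cost:
            break
        for template, new_fact in mutations(fact):
            new_cost = cost + theta[template]
            if new_cost < best.get(new_fact, float("inf")):
                best[new_fact] = new_cost
                heapq.heappush(frontier, (new_cost, new_fact))
    return None, float("inf")

# Toy usage: one synonym edge and one hyponym edge with made-up costs.
theta = {"synonym": 0.1, "hyponym": 0.5}
edges = {"cats consume rodents": [("synonym", "cats eat rodents")],
         "cats eat rodents": [("hyponym", "cats eat mice")]}
mutations = lambda fact: edges.get(fact, [])
print(soft_inference("cats consume rodents", {"cats eat mice"}, mutations, theta))
```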
#2. Dealing with real sentences
Natural logic works with facts like these in the knowledge base:
Obama was born in Hawaii
But real-world sentences are complex and long:
Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he served as president of the Harvard Law Review.
Approach:
1. Classifier divides long sentences into entailed clauses
2. Natural logic inference can shorten these clauses
Universal Dependencies (UD)
http://universaldependencies.github.io/docs/
A single level of typed dependency syntax that
(i) works for all human languages
(ii) gives a simple, human-friendly representation of a sentence
Dependency syntax is better than a phrase-structure tree for machine interpretation – it’s almost a semantic network
UD aims to be linguistically better across languages than earlier representations, such as CoNLL dependencies
Generation of minimal clauses
1. Classification problem: given a dependency edge, does it introduce a clause?
2. Is it missing a controlled subject from subj/object?
3. Shorten clauses while preserving validity, using natural logic!
• All young rabbits drink milk ⊭ All rabbits drink milk
• OK: SJC, the Bay Area’s third largest airport, often experiences delays due to weather.
  Often better: SJC often experiences delays.
#3. Add a lexical alignment classifier
• Sometimes we can’t quite make the inferences that we would like to make
• We use a simple lexical match back-off classifier with features:
  • Matching words, mismatched words, unmatched words
• These always work pretty well
  • This was the lesson of RTE evaluations, and perhaps of IR in general
The full system
• We run our usual search over split-up, shortened clauses
• If we find a premise, great!
• If not, we use the lexical classifier as an evaluation function
• We work to do this quickly at scale
  • Visit 1M nodes/second, don’t re-featurize, just delta
  • 32-byte search states (thanks Gabor!)
Solving NY State 4th grade science
(Allen AI Institute datasets)
Multiple choice questions from real 4th grade science exams
Which activity is an example of a good health habit?
(A) Watching television (B) Smoking cigarettes (C) Eating candy (D) Exercising every day
In our corpus knowledge base:
• Plasma TVs can display up to 16 million colors ... great for watching TV ... also make a good screen.
• Not smoking or drinking alcohol is good for health, regardless of whether clothing is worn or not.
• Eating candy for dinner is an example of a poor health habit.
• Healthy is exercising
Solving 4th grade science
(Allen AI NDMC)
System                                                    Dev   Test
KnowBot [Hixon et al. NAACL 2015]                         45    –
KnowBot (augmented with human in loop)                    57    –
IR baseline (Lucene)                                      49    42
NaturalLI                                                 52    51
More data + IR baseline                                   62    58
More data + NaturalLI                                     65    61
NaturalLI + 🔔 + (lex. classifier)                         74    67
Aristo [Clark et al. 2016] (6 systems, even more data)    –     71

Test set: New York Regents 4th Grade Science exam multiple-choice questions from AI2
Training: Basic is Barron’s study guide; more data is the SciText corpus from AI2. Score: % correct
Natural Logic
• Can we just use text as a knowledge base?
• Natural logic provides a useful, formal (weak) logic for textual inference
• Natural logic is easily combinable with lexical matching methods, including neural net methods
• The resulting system is useful for:
  • Common-sense reasoning
  • Question Answering
  • Open Information Extraction, i.e., getting out relation triples from text
Can information retrieval
benefit from distributed
representations of words?
Sec. 9.2.2
From symbolic to distributed
representations
The vast majority of rule-based or statistical NLP and IR work regarded words as atomic symbols: hotel, conference, walk
In vector space terms, this is a vector with one 1 and a lot of zeroes
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
We now call this a “one-hot” representation.
Sec. 9.2.2
From symbolic to distributed
representations
Its problem:
• If a user searches for [Dell notebook battery size], we would like to match documents with “Dell laptop battery capacity”
• If a user searches for [Seattle motel], we would like to match documents containing “Seattle hotel”
But
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]ᵀ
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
Our query and document vectors are orthogonal
There is no natural notion of similarity in a set of one-hot vectors
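A tiny numpy illustration of the point: the one-hot vectors above have zero dot product, so “motel” and “hotel” look maximally dissimilar (the vocabulary size and index positions are just the toy values from the slide).

```python
import numpy as np

motel = np.zeros(15); motel[10] = 1   # one-hot vector for "motel"
hotel = np.zeros(15); hotel[7] = 1    # one-hot vector for "hotel"
print(motel @ hotel)                  # 0.0 -- orthogonal, so no similarity signal at all
```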
Capturing similarity
There are many things you can do about similarity, many well known in IR
• Query expansion with synonym dictionaries
• Learning word similarities from large corpora
But a word representation that encodes similarity wins:
• Fewer parameters to learn (per word, not per pair)
• More sharing of statistics
• More opportunities for multi-task learning
Distributional similarity-based
representations
You can get a lot of value by representing a word by means of its neighbors
“You shall know a word by the company it keeps” (J. R. Firth 1957: 11)
One of the most successful ideas of modern NLP
government debt problems turning into banking crises as has happened in
saying that Europe needs unified banking regulation to replace the hodgepodge
These words will represent banking
Basic idea of learning neural network
word embeddings
We define some model that aims to predict a word based on other words in its context, e.g., choose
    argmax_w  w · ((w_{j−1} + w_{j+1}) / 2)
which has a loss function, e.g.,
    J = 1 − w_j · ((w_{j−1} + w_{j+1}) / 2)
(for unit-norm vectors)
We look at many samples from a big language corpus
We keep adjusting the vector representations of words to minimize this loss
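Here is a minimal numpy sketch of that toy objective: predict each word from the average of its two neighbours and nudge unit-norm word vectors to reduce J = 1 − w_j · ((w_{j−1} + w_{j+1}) / 2). It is illustrative only (tiny corpus, center-word updates only), not word2vec or GloVe.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = {w: i for i, w in enumerate(sorted(set(corpus)))}
dim, lr = 16, 0.1
W = rng.normal(size=(len(vocab), dim))
W /= np.linalg.norm(W, axis=1, keepdims=True)             # unit-norm word vectors

for epoch in range(50):
    for j in range(1, len(corpus) - 1):
        context = (W[vocab[corpus[j - 1]]] + W[vocab[corpus[j + 1]]]) / 2
        i = vocab[corpus[j]]
        W[i] += lr * context                              # gradient of J w.r.t. w_j is -context
        W[i] /= np.linalg.norm(W[i])                      # project back to unit norm
        # (a full model would also update the context word vectors)
```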
With distributed, distributional representations,
syntactic and semantic similarity is captured
currency = [ 0.286  0.792  −0.177  −0.107  0.109  −0.542  0.349  0.271 ]
Distributional representations can
solve the fragility of NLP tools
Standard NLP systems – here, the Stanford Parser – are incredibly fragile because of symbolic representations
Crazy sentential complement, such as for “likes [(being) crazy]”
Distributional representations can
capture the long tail of IR similarity
Google’s RankBrain
• Not necessarily as good for the head of the query distribution, but great for seeing similarity in the tail
• 3rd most important ranking signal (we’re told …)
Sec. 18.3
LSA (Latent Semantic Analysis) vs.
word2vec
LSA: Count! models
• Factorize a (maybe weighted, often log-scaled) term-document (Deerwester et al. 1990) or word-context matrix (Schütze 1992) into UΣVᵀ
• Retain only k singular values, in order to generalize
[Cf. Baroni: Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL 2014]
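A minimal numpy sketch of the Count! recipe: build a (log-scaled) word-context count matrix, factorize it with SVD, and keep only k singular values. The counts are made up for illustration.

```python
import numpy as np

X = np.array([[10., 0., 3.],     # toy word-context counts (rows: words, cols: contexts)
              [ 8., 1., 2.],
              [ 0., 9., 7.],
              [ 1., 8., 6.]])
X = np.log1p(X)                                    # log-scaling of the counts
U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(S) V^T
k = 2
word_vectors = U[:, :k] * S[:k]                    # retain only k singular values
print(word_vectors)
```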
Sec. 18.3
LSA vs. word2vec
word2vec CBOW / Skip-gram: Predict!
[Mikolov et al. 2013]: Simple predict models for learning word vectors
• Train word vectors to try to either:
  • Predict a word given its bag-of-words context (CBOW); or
  • Predict a context word (position-independent) from the center word
• Update word vectors until they can do this prediction well
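For reference, a minimal sketch of training such Predict! vectors with the gensim library (assuming gensim ≥ 4; the corpus here is a toy stand-in and far too small for meaningful vectors):

```python
from gensim.models import Word2Vec

sentences = [                      # tokenized toy corpus; a real run needs far more text
    ["seattle", "hotel", "rooms"],
    ["seattle", "motel", "rooms"],
    ["dell", "laptop", "battery"],
]
# sg=1 trains skip-gram (predict context from center word); sg=0 trains CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1, epochs=50)
print(model.wv.most_similar("hotel", topn=3))
```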
word2vec encodes semantic
components as linear relations
COALS model (count-modified LSA)
[Rohde, Gonnerman & Plaut, ms., 2005]
Count based vs. direct prediction
Count based: LSA, HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)
• Fast training
• Efficient usage of statistics
• Primarily used to capture word similarity
• May not use the best methods for scaling counts

Direct prediction: NNLM, HLBL, RNN, word2vec Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)
• Scales with corpus size
• Inefficient usage of statistics
• Generate improved performance on other tasks
• Can capture complex patterns beyond word similarity
Encoding meaning in vector differences
[Pennington, Socher, and Manning, EMNLP 2014]
Crucial insight: Ratios of co-occurrence probabilities can encode meaning components

                            x = solid   x = gas   x = water   x = random
P(x | ice)                  large       small     large       small
P(x | steam)                small       large     large       small
P(x | ice) / P(x | steam)   large       small     ~1          ~1
Encoding meaning in vector differences
[Pennington, Socher, and Manning, EMNLP 2014]
Crucial insight: Ratios of co-occurrence probabilities can encode meaning components

                            x = solid    x = gas      x = water   x = fashion
P(x | ice)                  1.9 × 10⁻⁴   6.6 × 10⁻⁵   3.0 × 10⁻³  1.7 × 10⁻⁵
P(x | steam)                2.2 × 10⁻⁵   7.8 × 10⁻⁴   2.2 × 10⁻³  1.8 × 10⁻⁵
P(x | ice) / P(x | steam)   8.9          8.5 × 10⁻²   1.36        0.96
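A quick arithmetic check of the ratio row, using the probabilities shown above (small differences from the slide’s 8.9 and 0.96 are just rounding in the displayed probabilities):

```python
p_ice   = {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3, "fashion": 1.7e-5}
p_steam = {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3, "fashion": 1.8e-5}
for x in p_ice:
    print(x, round(p_ice[x] / p_steam[x], 2))   # solid 8.64, gas 0.08, water 1.36, fashion 0.94
```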
Encoding meaning in vector differences
Q: How can we capture ratios of co-occurrence probabilities as meaning components in a word vector space?
A: Log-bilinear model:  w_i · w_j = log P(i | j)
   with vector differences:  w_x · (w_a − w_b) = log ( P(x | a) / P(x | b) )
GloVe Word similarities
[Pennington et al., EMNLP 2014]
Nearest words to frog:
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus
http://nlp.stanford.edu/projects/glove/
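A minimal sketch of reproducing such a nearest-neighbour list from the pretrained vectors distributed on the GloVe project page (the file name is from the glove.6B download; adjust to whichever file you use):

```python
import numpy as np

def load_glove(path):
    """Each line of the GloVe text files is a word followed by its vector components."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.array(values, dtype=np.float32)
    return vectors

def nearest(vectors, word, k=7):
    """Rank all other words by cosine similarity to `word`."""
    q = vectors[word]
    sims = {w: (v @ q) / (np.linalg.norm(v) * np.linalg.norm(q))
            for w, v in vectors.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:k]

# vectors = load_glove("glove.6B.50d.txt")
# print(nearest(vectors, "frog"))
```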
GloVe Visualizations
http://nlp.stanford.edu/projects/glove/
GloVe Visualizations: Company – CEO
Named Entity Recognition Performance
Model on CoNLL      CoNLL ’03 dev   CoNLL ’03 test   ACE2   MUC7
Categorical CRF     91.0            85.4             77.4   73.4
SVD (log tf)        90.5            84.8             73.6   71.5
HPCA                92.6            88.7             81.7   80.7
C&W                 92.2            87.4             81.7   80.2
CBOW                93.1            88.2             82.2   81.1
GloVe               93.2            88.3             82.9   82.2

F1 score of CRF trained on CoNLL 2003 English with 50-dim word vectors
Word embeddings: Conclusion
GloVe translates meaningful relationships between word-word co-occurrence counts into linear relations in the word vector space
GloVe shows the connection between Count! work and Predict! work – appropriate scaling of counts gives the properties and performance of Predict! models
A lot of other important work in this line of research:
[Levy & Goldberg, 2014]
[Arora, Li, Liang, Ma & Risteski, 2015]
[Hashimoto, Alvarez-Melis & Jaakkola, 2016]
Can we use neural networks to
understand, not just word similarities,
but language meaning in general?
Compositionality
Artificial Intelligence requires being able
to understand bigger things from knowing
about smaller parts
We need more than word embeddings!
How can we know when larger linguistic units are similar in meaning?
The snowboarder is leaping over the mogul
A person on a snowboard jumps into the air
People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements
Beyond the bag of words: Sentiment
detection
Is the tone of a piece of text positive, negative, or neutral?
• The conventional wisdom is that sentiment is “easy”
• Detection accuracy for longer documents is ~90%, BUT
  …… loved …………… great ……………… impressed ……………… marvelous ……………
Stanford Sentiment Treebank
• 215,154 phrases labeled in 11,855 sentences
• Can train and test compositions
http://nlp.stanford.edu:8080/sentiment/
Tree-Structured Long Short-Term Memory Networks
[Tai et al., ACL 2015]
Tree-structured LSTM
Generalizes the sequential LSTM to trees with any branching factor
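To make “trees with any branching factor” concrete, here is a minimal numpy sketch of a Child-Sum Tree-LSTM node in the spirit of Tai et al. (2015): gates are computed from the node’s input and the sum of its children’s hidden states, with one forget gate per child. Weights are random stand-ins; a real model learns them by backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ChildSumTreeLSTM:
    def __init__(self, x_dim, h_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = {g: rng.normal(scale=0.1, size=(h_dim, x_dim)) for g in "ifou"}
        self.U = {g: rng.normal(scale=0.1, size=(h_dim, h_dim)) for g in "ifou"}
        self.b = {g: np.zeros(h_dim) for g in "ifou"}

    def node(self, x, children):
        """children: list of (h, c) pairs from this node's children (any number of them)."""
        h_sum = sum((h for h, _ in children), np.zeros_like(self.b["i"]))
        i = sigmoid(self.W["i"] @ x + self.U["i"] @ h_sum + self.b["i"])
        o = sigmoid(self.W["o"] @ x + self.U["o"] @ h_sum + self.b["o"])
        u = np.tanh(self.W["u"] @ x + self.U["u"] @ h_sum + self.b["u"])
        c = i * u
        for h_k, c_k in children:                       # one forget gate per child
            f_k = sigmoid(self.W["f"] @ x + self.U["f"] @ h_k + self.b["f"])
            c = c + f_k * c_k
        return o * np.tanh(c), c                        # (h, c) passed up to the parent

cell = ChildSumTreeLSTM(x_dim=4, h_dim=3)
leaf = cell.node(np.ones(4), [])                        # a leaf has no children
root = cell.node(np.ones(4), [leaf, leaf])              # binary node here; any arity works
print(root[0].shape)                                    # (3,)
```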
Positive/Negative Results on Treebank
[Bar chart: binary sentiment accuracy (y-axis 75–95) for BiNB, RNN, MV-RNN, RNTN, and TreeLSTM, comparing training with sentence labels only vs. training with the full Treebank]
Experimental Results on Treebank
• TreeRNN can capture constructions like X but Y
• Biword Naïve Bayes is only 58% on these
Stanford Natural Language Inference
Corpus
http://nlp.stanford.edu/projects/snli/
570K Turker-judged pairs, based on an assumed picture
A man rides a bike on a snow covered road.
A man is outside.  ENTAILMENT
2 female babies eating chips.
Two female babies are enjoying chips.  NEUTRAL
A man in an apron shopping at a market.
A man in an apron is preparing dinner.  CONTRADICTION
NLI with Tree-RNNs
[Bowman, Angeli, Potts & Manning, EMNLP 2015]
Approach: We would like to work out the meaning of each sentence separately – a pure compositional model
Then we compare them with a NN & classify for inference
[Architecture diagram: learned word vectors (“man”, “outside”; “man”, “in”, “snow”) are combined by composition NN layers into phrase vectors (“man outside”, “man in snow”), which are compared by comparison NN layer(s) and fed to a softmax classifier, e.g. P(Entail) = 0.8]
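A minimal sketch of that sentence-pair setup: encode premise and hypothesis separately, combine the two sentence vectors in a comparison layer, and classify with a softmax. The averaging encoder and random weights are toy stand-ins for the learned composition layers.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(word_vectors, sentence):
    """Stand-in for the compositional encoder (TreeRNN / SPINN): average of word vectors."""
    return np.mean([word_vectors[w] for w in sentence.split()], axis=0)

def classify(word_vectors, premise, hypothesis, W, b):
    u, v = encode(word_vectors, premise), encode(word_vectors, hypothesis)
    features = np.concatenate([u, v, np.abs(u - v), u * v])   # comparison-layer input
    return softmax(W @ features + b)        # P(entail), P(neutral), P(contradict)

rng = np.random.default_rng(0)
dim = 8
vocab = {w: rng.normal(size=dim) for w in "a man rides bike is outside on snow".split()}
W, b = rng.normal(scale=0.1, size=(3, 4 * dim)), np.zeros(3)
print(classify(vocab, "a man rides a bike on snow", "a man is outside", W, b))
```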
Tree recursive NNs (TreeRNNs)
• Theoretically appealing
• Very empirically competitive
But:
• Prohibitively slow
• Usually require an external parser
• Don’t exploit the complementary linear structure of language
A recurrent NN allows efficient
batched computation on GPUs
TreeRNN: Input-specific structure
undermines batched computation
The Shift-reduce Parser-Interpreter NN
(SPINN)
[Bowman, Gauthier et al. 2016]
Base model equivalent to a TreeRNN, but…
• supports batched computation: 25× speedups
Plus:
• Effective new hybrid that combines linear and tree-structured context
• Can stand alone without a parser
Beginning observation:
binary trees = transition sequences
SHIFT SHIFT REDUCE SHIFT SHIFT REDUCE REDUCE
SHIFT SHIFT SHIFT SHIFT REDUCE REDUCE REDUCE
SHIFT SHIFT SHIFT REDUCE SHIFT REDUCE REDUCE
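The observation is easy to verify in code: a binary tree and its shift-reduce transition sequence are interconvertible, which is what lets SPINN process trees as flat sequences. A minimal sketch (nested pairs as the tree representation):

```python
def tree_to_transitions(tree):
    """Leaves become SHIFT; each internal (binary) node becomes a REDUCE after its children."""
    if isinstance(tree, str):
        return ["SHIFT"]
    left, right = tree
    return tree_to_transitions(left) + tree_to_transitions(right) + ["REDUCE"]

def transitions_to_tree(tokens, transitions):
    """Rebuild the tree by running the transitions over a buffer of tokens and a stack."""
    buffer, stack = list(tokens), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        else:                                 # REDUCE: combine the top two stack elements
            right, left = stack.pop(), stack.pop()
            stack.append((left, right))
    return stack[0]

tree = ("Spot", ("sat", "down"))
print(tree_to_transitions(tree))              # ['SHIFT', 'SHIFT', 'SHIFT', 'REDUCE', 'REDUCE']
print(transitions_to_tree(["Spot", "sat", "down"], tree_to_transitions(tree)) == tree)  # True
```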
The Shift-reduce Parser-Interpreter NN
(SPINN)
The model includes a sequence LSTM RNN
• This acts as a simple parser by predicting SHIFT or REDUCE
• It also gives left sequence context as input to composition
Implementing the stack
• Naïve implementation: simulates stacks in a batch with a fixed-size multidimensional array at each time step
  • Backpropagation requires that each intermediate stack be maintained in memory
  • ⇒ Large amount of data copying and movement required
• Efficient implementation:
  • Have only one stack array for each example
  • At each time step, augment with the current head of the stack
  • Keep a list of backpointers for REDUCE operations
  • Similar to zipper data structures employed elsewhere
A thinner stack
     Array                 Backpointers
1    Spot                  1
2    sat                   1 2
3    down                  1 2 3
4    (sat down)            1 4
5    (Spot (sat down))     5
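A minimal sketch of the thinner stack on the example above: a single append-only array per example plus a list of indices into it (the backpointers), so REDUCE never copies intermediate stacks. String concatenation stands in for the learned TreeRNN composition function.

```python
def run_thin_stack(tokens, transitions, compose=lambda l, r: f"({l} {r})"):
    array = []            # append-only: one new entry per time step, never overwritten
    stack = []            # indices into `array` giving the current stack contents
    trace = []            # (new array entry, backpointers) per step, as in the table above
    buffer = list(tokens)
    for t in transitions:
        if t == "SHIFT":
            array.append(buffer.pop(0))
        else:                                              # REDUCE via backpointers
            right, left = array[stack.pop()], array[stack.pop()]
            array.append(compose(left, right))
        stack.append(len(array) - 1)
        trace.append((array[-1], [i + 1 for i in stack]))  # 1-based, to match the table
    return array[-1], trace

result, trace = run_thin_stack(["Spot", "sat", "down"],
                               ["SHIFT", "SHIFT", "SHIFT", "REDUCE", "REDUCE"])
print(result)                                              # (Spot (sat down))
for step, (entry, backpointers) in enumerate(trace, 1):
    print(step, entry, backpointers)
```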
Using SPINN for natural language
inference
[Diagram: two SPINN encoders process “the cat sat down” → (the cat) (sat down) and “the cat is angry” → (the cat) (is angry), producing sentence vectors o₁ and o₂ for the pair classifier]
SNLI Results
Model                                                        % Accuracy (Test set)
Feature-based classifier                                     78.2
Previous SOTA sentence encoder [Mou et al. 2016]             82.1
LSTM RNN sequence model                                      80.6
Tree LSTM                                                    80.9
SPINN                                                        83.2
SOTA (sentence pair alignment model) [Parikh et al. 2016]    86.8
Successes for SPINN over LSTM
Examples with negation
• P: The rhythmic gymnast completes her floor exercise at the competition.
• H: The gymnast cannot finish her exercise.
Long examples (> 20 words)
• P: A man wearing glasses and a ragged costume is playing a Jaguar electric guitar and singing with the accompaniment of a drummer.
• H: A man with glasses and a disheveled outfit is playing a guitar and singing along with a drummer.
Envoi
• There are very good reasons for wanting to represent meaning with distributed representations
• So far, distributional learning has been most effective for this
  • But cf. [Young, Lai, Hodosh & Hockenmaier 2014] on denotational representations, using visual scenes
• However, we want not just word meanings, but also:
  • Meanings of larger units, calculated compositionally
  • The ability to do natural language inference
• The SPINN model is fast – close to recurrent networks!
• Its hybrid sequence/tree structure is psychologically plausible and outperforms other sentence composition methods
Final Thoughts
[Timeline graphic: deep learning arriving in speech (2011), vision (2013), NLP (2015), and IR (2017) – “You are here”, between NLP and IR]
Final Thoughts
I’m certain that deep learning will come to dominate SIGIR over the next couple of years … just like speech, vision, and NLP before it. This is a good thing. Deep learning provides some powerful new techniques that are just being amazingly successful on many hard applied problems. However, we should realize that there is also currently a huge amount of hype about deep learning and artificial intelligence. We should not let a genuine enthusiasm for important and successful new techniques lead to irrational exuberance or a diminished appreciation of other approaches. Finally, despite the efforts of a number of people, in practice there has been a considerable division between the human language technology fields of IR, NLP, and speech. Partly this is due to organizational factors and partly that at one time the subfields each had a very different focus. However, recent changes in emphasis – with IR people wanting to understand the user better and NLP people much more interested in meaning and context – mean that there are a lot of common interests, and I would encourage much more collaboration between NLP and IR in the next decade.