Christopher Manning CS300 talk – Fall 2000
Transcription
Slide 1 – Christopher Manning CS300 talk – Fall 2000
[email protected]
http://nlp.stanford.edu/~manning/

Slide 2 – Research areas of interest: NLP/CL
• Statistical NLP models: combining linguistic and statistical sophistication
• NLP and ML methods for extracting meaning relations from webpages, medical texts, etc.
• Information extraction and text mining
• Lexical and structural acquisition from raw text
• Using robust NLP: dialect/style, readability, …
• Using pragmatics, genre, NLP in web searching
• Computational lexicography and the visualization of linguistic information

Slide 3 – Models for language
• What is the motivation for statistical models for understanding language?
• From the beginning, logics and logical reasoning were invented for handling natural language understanding
• Logics have a language-like form that draws from and meshes well with natural languages
• Where are the numbers?

Slide 4 – Sophisticated grammars for NL
• From NP → Det Adj* N
• there developed precise and sophisticated grammar formalisms (such as LFG, HPSG)

Slide 5 – The Problem of Ambiguity
• Any broad-coverage grammar is hugely ambiguous (often hundreds of parses for 20+ word sentences).
• Making the grammar more comprehensive only makes the ambiguity problem worse.
• Traditional (symbolic) NLP methods don't provide a solution.
  – Selectional restrictions fail because creative/metaphorical use of language is everywhere:
    • I swallowed his story
    • The supernova swallowed up the planet

Slide 6 – The problem of ambiguity close up
• "The post office will hold out discounts and service concessions as incentives."
• 12 words. Real language. At least 83 parses.

Slide 7 – [figure only; no text recoverable from the transcription]

Slide 8 – Statistical NLP methods
• P(to | Sarah drove)
• P(time is verb | Time flies like an arrow)
• P(NP → Det Adj N | mother = VP[drive])
• Statistical NLP methods:
  – Estimate grammar parameters by gathering counts from texts or structured analyses of texts
  – Assign probabilities to various things to determine the likelihood of word sequences, sentence structure, and interpretation

Slide 9 – Probabilistic Context-Free Grammars
• Rule probabilities:
    NP → Det N      0.4
    NP → NP_poss N  0.1
    NP → Pronoun    0.2
    NP → NP PP      0.1
    NP → N          0.2
• Subtree shown: NP → NP PP, with the inner NP → Det N
• P(subtree above) = 0.1 × 0.4 = 0.04
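To make the arithmetic above concrete, here is a minimal sketch in Python (my illustration, not code from the talk; the nested-tuple tree encoding and the handling of rules outside the grammar fragment are assumptions). A tree's probability under a PCFG is simply the product of the probabilities of the rules used to build it:

    # Minimal PCFG sketch: P(tree) is the product of the probabilities
    # of the rules the tree uses. Rule fragment from the slide; the
    # nested-tuple tree encoding is an assumption for illustration.
    RULE_PROBS = {
        ("NP", ("Det", "N")):     0.4,
        ("NP", ("NP_poss", "N")): 0.1,
        ("NP", ("Pronoun",)):     0.2,
        ("NP", ("NP", "PP")):     0.1,
        ("NP", ("N",)):           0.2,
    }

    def tree_prob(tree):
        """Tree format: (label, child, child, ...); a bare string is a word."""
        if isinstance(tree, str):
            return 1.0                          # words contribute probability 1
        label, *children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = RULE_PROBS.get((label, rhs), 1.0)   # rules outside the fragment -> 1
        for child in children:
            p *= tree_prob(child)
        return p

    # The subtree from the slide: NP -> NP PP, with the inner NP -> Det N
    subtree = ("NP", ("NP", ("Det",), ("N",)), ("PP",))
    print(tree_prob(subtree))                   # 0.1 * 0.4 = 0.04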
Slide 10 – Why Probabilistic Grammars?
• The predictions about grammaticality and ambiguity of categorical grammars are not in accord with human perceptions or engineering needs.
• Categorical grammars aren't predictive
  – They don't tell us what "sounds natural"
• Probabilistic grammars model error tolerance, online lexical acquisition, … and have been amazingly successful as an engineering tool
• They capture a lot of world knowledge for free
• Relevant to linguistic change and variation, too!

Slide 11 – Example: near
• In Middle English, near was an adjective [Maling]
• But today, is it an adjective or a preposition?
  – The near side of the moon
  – We were near the station
• Not just a word with multiple parts of speech! There is evidence of blending:
  – We were nearer the bus stop than the train
  – He has never been nearer the center of the financial establishment

Slide 12 – Research aim
• Most current statistical models are quite simple (linguistically and also statistically)
• Aim: to combine the good features of statistical NLP methods with the sophistication of rich linguistic analyses.

Slide 13 – Lexicalising a CFG
• A lexicalized CFG can capture probabilistic dependencies between words
• [tree diagram, head words in brackets:
    (VP[looked] (V[looked] looked)
                (PP[inside] (P[inside] inside)
                            (NP[box] (D[the] the) (N[box] box))))]

Slide 14 – Left-corner parsing
• The memory requirements of standard parsers do not match human linguistic processing.
• What humans find hardest – center embedding:
  – *The man that the woman the priest met knows couldn't help
• is really the bread-and-butter of standard CFG parsing:
  – (((a + b)))
• As an alternative, left-corner parsing does capture this asymmetry.

Slide 15 – Parsing and (stack) complexity
• She ruled that the contract between the union and company dictated that claims from both sides should be bargained over or arbitrated.

Slide 16 – Tree geometry vs. stack depth
Maximum stack depth by parsing strategy: top-down (TD), left-corner (LC), bottom-up (BU)

                                                      TD  LC  BU
    Kim thinks Sandy knows she likes green apples.     1   1   7
    The rat that the cat that Kim likes chased died.   3   3   7
    Kim's friend's mother's car smells.                5   1   1
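The contrast in the table above can be reproduced with a small simulation. The following Python sketch is my illustration (not code from the talk, and the tree shapes are assumptions for the example): it computes the maximum stack depth of an idealized top-down parser (pop a category, push its children) and an idealized bottom-up shift-reduce parser (shift words, reduce completed constituents) on a known parse tree. Left-corner parsing is not simulated here; as the slides say, it stays shallow on both branching directions and grows only under center embedding.

    # Maximum stack depth of idealized parsing strategies on a known tree.
    # Trees are (label, child, ...); a bare string is a word.
    # Illustrative sketch only; the tree shapes below are assumptions.

    def td_depth(node, pending=0):
        """Top-down: pop a category, push its children (leftmost on top).
        `pending` counts symbols already stacked beneath this node."""
        if isinstance(node, str):
            return pending + 1        # the word sits on the stack, then is matched
        _, *kids = node
        k = len(kids)
        # while parsing kid i, its k-1-i right siblings remain on the stack
        return max([pending + 1] +
                   [td_depth(c, pending + k - 1 - i) for i, c in enumerate(kids)])

    def bu_depth(node, below=0):
        """Bottom-up shift-reduce: shift words; reduce a constituent once
        all of its children are on top of the stack."""
        if isinstance(node, str):
            return below + 1
        _, *kids = node
        # kid i is parsed with i completed sisters already on the stack
        return max(bu_depth(c, below + i) for i, c in enumerate(kids))

    right_branching = ("S", "Kim",
        ("VP", "thinks", ("S", "Sandy",
            ("VP", "knows", ("S", "she",
                ("VP", "likes", ("NP", "green", "apples")))))))
    left_branching = ("S",
        ("NP", ("NP", ("NP", ("NP", "Kim", "'s"), "friend", "'s"),
                "mother", "'s"), "car"),
        "smells")

    print(td_depth(right_branching), bu_depth(right_branching))  # shallow TD, deep BU
    print(td_depth(left_branching),  bu_depth(left_branching))   # deep TD, shallow BU

The exact numbers depend on how the trees are drawn, but the asymmetry is the slide's point: right-branching structures are cheap top-down and expensive bottom-up, left-branching structures the reverse.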
Slide 17 – Probabilistic Left-Corner Grammars
• Use richer probabilistic conditioning
  – Left corner and goal category rather than just the parent
  – e.g. P(NP → Det Adj N | Det, S)
• Allow left-to-right online parsing (which can hope to explain how people build partial interpretations online)
• Easy integration with lexicalization, part-of-speech tagging models, etc.

Slide 18 – Probabilistic Head-driven Grammars
• The heads of phrases are the source of the main constraining information about a sentence structure
• We work out from heads by following the dependency order of the sentence
• The crucial property is that we have always built – and have available to us for conditioning – all governing heads and all less oblique dependents of the same head
• We can also easily integrate phrase length

Slide 19 – Information from the web: The problem
• When people see web pages, they understand their meaning
  – By and large. To the extent that they don't, there's a gradual degradation
• When computers see web pages, they get only character strings and HTML tags

Slide 20 – The human view
[screenshot of a web page as a person sees it; no text recoverable]

Slide 21 – The intelligent agent view

    <HTML>
    <HEAD>
    <TITLE>Ford Motor Company - Home Page</title>
    <META NAME="Keywords" CONTENT="cars, automobiles, trucks, SUV, mazda, volvo,
      lincoln, mercury, jaguar, aston martin, ford">
    <META NAME="description" CONTENT="Ford Motor Company corporate home page">
    <SCRIPT LANGUAGE="JavaScript1.2"> … </SCRIPT>
    <!-- Trustmark code --><DIV ID=trustmarkDiv>
    <TABLE BORDER="0" CELLPADDING=0 CELLSPACING=0 WIDTH=768>
    <TR><TD WIDTH=768 ALIGN=CENTER>
    <A HREF="default.asp?pageid=473"
       onmouseover="logoOver('fordscript');rolloverText('ht0')"
       onmouseout="logoOut('fordscript');rolloverText('ht0')"><img border="0"
       src="images/homepage/fordscript.gif"
       ALT="Learn more about Ford Motor Company" WIDTH="521" HEIGHT="39"></A><br>
    …
    </TD></TR></TABLE></DIV>
    </BODY></HTML>

Slide 22 – The problem (cont.)
• We'd like computers to see meanings as well, so that computer agents could more intelligently process the web
• These desires have led to XML, RDF, agent markup languages, and a host of other proposals and technologies which attempt to impose more syntax and semantics on the web – in order to make life easier for agents.

Slide 23 – Thesis
• The problem can't and won't be solved by mandating a universal semantics for the web
• The solution is rather agents that can 'understand' the human web by text and image processing

Slide 24 – (1) The semantics
• Are there adequate and adequately understood methods for marking up pages with such a consistent semantics, in such a way that it would support simple reasoning by agents?
• No.

Slide 25 – What are some AI people saying?
"Anyone familiar with AI must realize that the study of knowledge representation—at least as it applies to the "commonsense" knowledge required for reading typical texts such as newspapers—is not going anywhere fast. This subfield of AI has become notorious for the production of countless non-monotonic logics and almost as many logics of knowledge and belief, and none of the work shows any obvious application to actual knowledge-representation problems. Indeed, the only person who has had the courage to actually try to create large knowledge bases full of commonsense knowledge, Doug Lenat …, is believed by everyone save himself to be failing in his attempt." (Charniak 1993: xvii–xviii)

Slide 26 – (2) Pragmatics not semantics
• pragmatic: relating to matters of fact or practical affairs, often to the exclusion of intellectual or artistic matters
• pragmatics: linguistics concerned with the relationship of the meaning of sentences to their meaning in the environment in which they occur
• A lot of the meaning in web pages (as in any communication) derives from the context – what is referred to in the philosophy of language tradition as pragmatics
• Communication is situated

Slide 27 – Pragmatics on the web
• Information supplied is incomplete – humans will interpret it
  – Numbers are often missing units
  – A "rubber band" for sale at a stationery site is a very different item from a rubber band on a metal lathe
  – A "sidelight" means something different to a glazier than to a regular person
• Humans will evaluate content using information about the site, and the style of writing – value filtering

Slide 28 – (3) The world changes
• The way in which business is being done is changing at an astounding rate – or at least that's what the ads from e-business companies scream at us
• Semantic needs and usages evolve (like languages) more rapidly than standards (cf. the Académie française)
• People use words that aren't in the dictionary. Their listeners understand them.

Slide 29 – (4) Interoperation
• Ontology: a shared formal conceptualization of a particular domain
• Meaning transfer frequently has to occur across the subcommunities that are currently designing *ML languages; then all the problems reappear, and the current proposals don't do much to help

Slide 30 – Many products cross industries
http://www.interfilm-usa.com/Polyester.htm
• Interfilm offers a complete range of SKC's Skyrol® brand polyester films for use in a wide variety of packaging and industrial processes.
• Gauges: 48 - 1400
• Typical End Uses: Packaging, Electrical, Labels, Graphic Arts, Coating and Laminating
  – labels: milk jugs, beer/wine, combination forms, laminated coupons, …

Slide 31 – (5) Pain but no gain
• A lot of the time people won't put in information according to standards for semantic/agent markup, even if they exist. Three reasons…
  – Laziness: only 0.3% of sites currently use the (simple) Dublin Core metadata standard
  – Profits: having an easily robot-crawlable site is a recipe for turning what you sell into a commodity, and hence making little profit
  – Cheats: there are people out there who will abuse any standard, if it's profitable

Slide 32 – (6) Less structure to come
• "the convergence of voice and data is creating the next key interface between people and their technology. By 2003, an estimated $450 billion worth of e-commerce transactions will be voice-commanded.*" (Intel ad, NYT, 28 Sep 2000; *data source: Forrester Research)
• Question: will these customers speak XML tags?

Slide 33 – The connection to language
• Decker et al., IEEE Internet Computing (2000): "The Web is the first widely exploited many-to-many data-interchange medium, and it poses new requirements for any exchange format:
  – Universal expressive power
  – Syntactic interoperability
  – Semantic interoperability"
• But human languages have all these properties, and maintain superior expressivity and interoperability through their flexibility and context dependence

Slide 34 – NLP and information access
• Solution: use robust natural language processing and machine learning techniques
• NLP comes into its own when you want to do more than just standard IR.
• E.g., defined information needs over text:
  – "An apartment with 2 bedrooms in Menlo Park for less than $1,500."
  – "Where was there an airline accident today?"
  – "What proteins is this gene known to regulate?"

Slide 35 – Example of extracting textual relations: Real Estate Ads
• System starts with plain text of ads
  – These are hardly exactly "English"
• But an unstructured information source, close to English
  – Chosen as lowest common denominator
• Output: database records
  – A variety of tables giving information about:
    • the property: bedrooms, garages, price
    • the real estate agency
    • inspection times

Slide 36 – Real Estate Ads: Input

    <ADNUM>2067206v1</ADNUM>
    <DATE>March 02, 1998</DATE>
    <ADTITLE>MADDINGTON $89,000</ADTITLE>
    <ADTEXT>
    OPEN 1.00 - 1.45<BR>
    U 11 / 10 BERTRAM ST<BR>
    NEW TO MARKET Beautiful<BR>
    3 brm freestanding<BR>
    villa, close to shops & bus<BR>
    Owner moved to Melbourne<BR>
    ideally suit 1st home buyer,<BR>
    investor & 55 and over.<BR>
    Brian Hazelden 0418 958 996<BR>
    R WHITE LEEMING 9332 3477
    </ADTEXT>

Slide 37 – Real Estate Ads: Output
• Output is database tables
• But the general idea, in slot-filler format:

    SUBURB:      MADDINGTON
    ADDRESS:     (11, 10, BERTRAM, ST)
    INSPECTION:  (1.00, 1.45, 11/Nov/98)
    BEDROOMS:    3
    TYPE:        HOUSE
    AGENT:       BRIAN HAZELDEN
    BUS PHONE:   9332 3477
    MOB PHONE:   0418 958 996

[Manning & Whitelaw, U. Sydney 1998; in daily use at News Corp.]
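For a flavor of how such slot values can be pulled out of the ad text, here is a toy Python sketch of my own (hand-written regular expressions keyed to this ad genre's conventions; the patterns are assumptions for illustration, nothing like the actual Manning & Whitelaw system):

    import re

    # Toy slot extraction over the sample ad from the slides. The regular
    # expressions are illustrative assumptions, not the deployed system.
    AD = ("MADDINGTON $89,000\n"
          "OPEN 1.00 - 1.45\n"
          "U 11 / 10 BERTRAM ST\n"
          "NEW TO MARKET Beautiful\n"
          "3 brm freestanding\n"
          "villa, close to shops & bus\n"
          "Brian Hazelden 0418 958 996\n"
          "R WHITE LEEMING 9332 3477")

    def extract(ad):
        slots = {}
        m = re.match(r"([A-Z][A-Z ]*) \$([\d,]+)", ad)       # title line: SUBURB $price
        if m:
            slots["SUBURB"], slots["PRICE"] = m.group(1), m.group(2)
        m = re.search(r"(\d+)\s*brms?\b", ad)                # "3 brm" -> bedrooms
        if m:
            slots["BEDROOMS"] = int(m.group(1))
        m = re.search(r"OPEN\s+([\d.]+)\s*-\s*([\d.]+)", ad) # inspection window
        if m:
            slots["INSPECTION"] = (m.group(1), m.group(2))
        # Australian mobiles start with 04; other numbers -> office phone
        for num in re.findall(r"\b(?:\d{4} \d{3} \d{3}|\d{4} \d{4})\b", ad):
            slots["MOB PHONE" if num.startswith("04") else "BUS PHONE"] = num
        return slots

    print(extract(AD))
    # {'SUBURB': 'MADDINGTON', 'PRICE': '89,000', 'BEDROOMS': 3,
    #  'INSPECTION': ('1.00', '1.45'),
    #  'MOB PHONE': '0418 958 996', 'BUS PHONE': '9332 3477'}

Even this toy makes the point of the following slides: the slots come out only because the patterns encode genre knowledge (the title-line format, "brm", the shape of Australian phone numbers), and anything less regular (suburbs named by allusion, prices given as sale history) needs a little real NLP.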
Slides 38–39 – [figures only; no text recoverable from the transcription]

Slide 40 – One needs a little NLP
• There is no semantic coding to use
• Standard IR doesn't work:
  – suburbs
    • the Paddington of the west
    • one hour's drive from Sydney
    • real estate agent
  – prices
    • recently sold for $x. Was $y, now $z. Rent.
  – bedrooms
  – multi-property ads

Slide 41 – Text Segmentation
• Real-estate ads have a hierarchical text structure!

    SOUTHPORT UNIT SPECIALS
    $58,900 o.n.o. 2 brm close to water and shops.
    $114,000 "Grandview", excellent value, good returns
    LJ Coleman Real Estate   Contact Steve 5527 0572

    GLEBE 2br yd $250; 4br yd $430
    COOGEE 3br yd $320; 1br $150
    BALMAIN 1br $180
    H.R. Licensed FEE 9516-3211

Slide 42 – The End