Scott Martens on databases

Databases for NLP
Development of applications using NLP tools
Scott Martens, 17 July 2013

Outline
•  Preliminaries
   –  Sorting and binary search
   –  Big O calculus
•  Key-value data
•  Tables & the Relational Model
   –  SQL
•  Trees & the Hierarchical Model
   –  XPath & XQuery
•  Triplestores & the Network Model
   –  SPARQL

Warning
This lecture contains math*
You will need to know that:
   log n ≪ n ≪ n^2 ≪ C^n (∀C > 1) as n → ∞
*although not much
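To make the ordering on the warning slide concrete, the sketch below tabulates the four functions for a few values of n. It is only an illustration: the class name GrowthRates, the choice C = 2 and the sample sizes are assumptions, not part of the lecture.

```java
import java.math.BigInteger;

// Illustrative only: tabulates log2 n, n, n^2 and C^n (assuming C = 2).
class GrowthRates {

    // Smallest k with 2^k >= n, i.e. log2 n rounded up.
    static int ceilLog2(long n) {
        int steps = 0;
        long reach = 1;
        while (reach < n) {
            reach *= 2;
            steps++;
        }
        return steps;
    }

    // n^2, computed exactly.
    static BigInteger square(long n) {
        return BigInteger.valueOf(n).multiply(BigInteger.valueOf(n));
    }

    // 2^n overflows a long very quickly, so BigInteger is used.
    static BigInteger twoToThe(int n) {
        return BigInteger.valueOf(2).pow(n);
    }

    public static void main(String[] args) {
        int[] sizes = {10, 100, 1000};
        for (int n : sizes) {
            System.out.println("n = " + n
                    + ", log2 n ~ " + ceilLog2(n)
                    + ", n^2 = " + square(n)
                    + ", 2^n has " + twoToThe(n).toString().length() + " digits");
        }
    }
}
```

For n = 1000 the first three quantities are 10, 1000 and 1,000,000, while 2^n already runs to hundreds of digits.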
Preliminaries
•  A database is a purposeful structured collection of data, usually stored on a computer.
   –  First documented use in 1962 as "data base"
   –  Spelled as one word since the 1980s.
   –  Major skills area in computer employment.
   –  Database programming and operation is the most neglected area of NLP. (IMHO)

Databases are a big deal
•  NLP is driven heavily by database technologies.
•  Internet technologies (esp. Web 2.0/3.0) are all about databases.
•  Databases are in the news a lot lately.
   –  (pic related)
   –  Special IP protection for databases in European law.
   –  Special laws concerning databases of personal information.

Dictionaries are databases
•  Structured, purposeful collections of information.
•  Used by finding individual words and definitions.

Sorting
Unsorted dictionary
•  Entries are in a random (or at least, unhelpful) order.
•  To find a word:
   –  Start at the beginning of the book and read until you find the word you want.
Alphabetical order dictionary
•  Entries are alphabetically ordered.
•  To find a word:
   –  Estimate where in the dictionary it should approximately be and open the dictionary there.
   –  Determine if the page you are on is before or after the one you want.
   –  Flip some distance forward or backwards, depending.
   –  Repeat until you find the word.

Search in unsorted dictionaries
•  Dictionary has n entries.
•  It takes at most time t to find a word in this dictionary (the time to find the last word).
•  Then in another dictionary with 2n entries, it will take at most time 2t to find a word in it.
•  Searching time: t = C·n (C is some constant) ∴ t is proportional to n
In Big O notation: "Search time is O(n)"

Search in sorted dictionaries
•  Dictionary has n entries.
•  If n = 1000, we have to look in the dictionary at most 10 times to find any word.
   –  For any n, the number of times we have to look is about log_2 n (rounded up).
•  If the dictionary has 2n entries, we only have to look in it log_2 2n = log_2 n + 1 times.
•  Searching time: t = C log_2 n ∴ t is proportional to log n (note that the base of the log is irrelevant!)
In Big O notation: "Search time is O(log n)"

Indexes
•  An index is:
   –  an ordered list of discrete elements designed for searching.
   –  each entry in the index points to a data object somewhere else.
•  Searching in an index usually takes O(log n) time, for an index with n elements.
•  Most common indexing structure is the B-tree.
   –  B-trees work best when index entries are all about the same size.

Hash functions (briefly)
•  Not all data is intrinsically sortable.
•  A hash function maps data objects to (probably) unique strings, so that they can be sorted.
•  Given a hash function H:
   1) If A and B are functionally identical data objects (i.e. equal to each other for your purposes) then: H(A) = H(B)
   2) If A and B are not functionally identical data objects then: H(A) ≠ H(B) (at least probably)
   3) For all A, H(A) will be about the same size.
   4) All hash function outputs can be sorted. Either H(A) = H(B), or H(A) > H(B), or H(A) < H(B)

Key-value stores
•  Key-value store: the simplest kind of database
   –  a.k.a. associative arrays, maps, sometimes called dictionaries
•  In Java, the Map interface is a key-value store
   –  most common implementing class is HashMap

    import java.util.HashMap;

    public class WebsterDemo {
        public static void main(String[] args) {
            // Map each headword (key) to its definition (value).
            HashMap<String, String> webster1913dict = new HashMap<String, String>();
            webster1913dict.put("ametropia", "A visual impairment resulting from faulty "
                    + "refraction of light rays in the eye.");
            webster1913dict.put("ametropic", "Of or pertaining to ametropia.");
            webster1913dict.put("Amharic", "Of or pertaining to Amhara, a division of Abyssinia.");

            // Look up a definition by its key.
            String word = "ametropic";
            String defn = webster1913dict.get(word);
            System.out.println(word + ": " + defn);
        }
    }

Key-value stores
•  Storing key-value pairs on disk is a common requirement.
•  Many libraries and methods exist.
   –  Berkeley DB
   –  DBM, GDBM, NDBM
   –  "NoSQL" (several things use this name)
   –  CouchDB (distributable)
   –  Apache Cassandra (distributed)

Making dictionary databases
•  Simplest way:
   –  Use a key-value database
   –  Make each headword a key
   –  Make the rest of the entry a value
•  Will this work?
   –  Sort of
•  Does a database like this cover all the ways you might use a dictionary?

Making dictionary databases
•  Key-value stores fail when:
   –  You want to search for things other than the keys.
      •  Searching etymologies, synonyms, definitions, etc.
   –  You have structured information in the values that is important for querying or analysis.
      •  e.g., search for nouns that have Latin roots beginning with the letter "A".
   –  You have more than one data object with the same key.
      •  Homonyms
•  It's just a big dumb indexed data store.
   –  Fetching an entry given a key is O(log n) time.
   –  Anything else takes O(n) time or more for cross-reference searches.

Relational Databases
•  Data is organized into tables
   –  with labeled columns.
•  One column is chosen as a key
   –  Key column contains unique values.
•  Usually implemented as many key-value databases
   –  Table with n columns yields n - 1 key-value stores.

Key | Headword | PoS | Loan? | Definition
abadon1 | Abadon | n. | FALSE | The destroyer, or angel of the bottomless pit; -- the same as Apollyon and Asmodeus.
abadon2 | Abadon | n. | FALSE | Hell; the bottomless pit.
abaft1 | Abaft | prep. | FALSE | Behind; toward the stern from; as, abaft the wheelhouse.
abaft2 | Abaft | adv. | FALSE | Toward the stern; aft; as, to go abaft.
abaisance | Abaisance | n. | FALSE | Obeisance.
abaiser | Abaiser | n. | FALSE | Ivory black or animal charcoal.
abaist | Abaist | p.p. | FALSE | Abashed; confounded; discomfited.
abalienate1 | Abalienate | v.t. | FALSE | To transfer the title of from one to another; to alienate.
abalienate2 | Abalienate | v.t. | FALSE | To estrange; to withdraw.
abalienate3 | Abalienate | v.t. | FALSE | To cause alienation of (mind).
abalientation | Abalienation | n. | FALSE | The act of abalienating; alienation; estrangement.
abalone | Abalone | n. | TRUE | A univalve mollusk of the Genus Haliotis. The shell is lined with mother-of-pearl, and used for ornamental purposes; the sea-ear.

Relational Databases

Authors
Author | Full name | Born | Died
Jonson | Ben Jonson | 1574 | 1637
Milton | John Milton | 1608 | 1674
Sandys | George Sandys | 1577 | 1643

Headwords
Headword | Loan | Etymology | EntryIDs
Abadon | FALSE | Heb. ābaddōn destruction, abyss, fr. ābad to be lost, to perish. | abadon1, abadon2
Abaft | FALSE | Pref. a- on + OE. baft, baften, biaften, AS. beæftan; be by + æftan behind. | abaft1, abaft2
Abaisance | FALSE | For obeisance; confused with F. abaisser, E. abase. | abaisance
Abaiser | FALSE | | abaiser
Abaist | FALSE | | abaist
Abalienate | FALSE | L. abalienatus, p. p. of abalienare; ab + alienus foreign, alien. | abalienate1, abalienate2, abalienate3
Abalienation | FALSE | L. abalienatio: cf. F. abaliénation. | abalientation
Abalone | TRUE | | abalone

Definitions
Key | PoS | Usage | Source | Definition
abadon1 | n. | | | The destroyer, or angel of the bottomless pit; -- the same as Apollyon and Asmodeus.
abadon2 | n. | Poetic | Milton | Hell; the bottomless pit.
abaft1 | prep. | Naut. | | Behind; toward the stern from; as, abaft the wheelhouse.
abaft2 | adv. | | | Toward the stern; aft; as, to go abaft.
abaisance | n. | Obs. | Jonson | Obeisance.
abaiser | n. | | | Ivory black or animal charcoal.
abaist | p.p. | Obs. | | Abashed; confounded; discomfited.
abalienate1 | v.t. | Civil Law | | To transfer the title of from one to another; to alienate.
abalienate2 | v.t. | Obs. | | To estrange; to withdraw.
abalienate3 | v.t. | | Sandys | To cause alienation of (mind).
abalientation | n. | Obs. | | The act of abalienating; alienation; estrangement.
abalone | n. | Zoöl. | | A univalve mollusk of the Genus Haliotis. The shell is lined with mother-of-pearl, and used for ornamental purposes; the sea-ear.

SQL

    SELECT Headwords.headword, Headwords.EntryIDs, Definitions.key,
           Definitions.usage, Definitions.source, Authors.author, Authors.died
    FROM Headwords, Definitions, Authors
    WHERE Authors.died < 1700
      AND Authors.author = Definitions.source
      -- schematic membership test: EntryIDs holds a list of keys
      AND Definitions.key IN Headwords.EntryIDs
      AND Definitions.usage != 'Obs.'
    ORDER BY Headwords.headword

Relational Databases
•  Much more interesting queries than key-value tables.
   –  Efficient for cross-reference searches
•  Widespread, widely supported, high quality and reliable implementations, including some open source.
   –  Goes back to IBM System/38 in the 70s. The relational model itself was published by E. F. Codd at IBM in 1970.
   –  Original basis of Oracle's business.
   –  Used to run all kinds of databases, including distributed banking transaction systems.
•  SQL and RDBMS theory are widely taught, but implementations do not fully adhere to the standards.

Relational Databases
•  Finding an entry in a table is O(log n)
   –  No worse than simple key-value data.
•  Filtering linked entries on multiple tables is usually no worse than O(n log n).
   –  Because if a query matches n entries in one table, each of them may need its own O(log n) lookup in another table.
   –  Clever programmers keep worst case performance under O(n) in almost all cases.
   –  It's possible to make terrible queries that take O(C^n) time, but only if you really try.

Relational Database Models
•  Relational databases require a model.
•  Model must be specified before entering any data.
•  Model cannot be easily changed after the database is populated.
•  For example:
   –  Every headword has just one etymology.
   –  Every definition has zero or one usage marks.
•  If this is ever not true, the whole thing breaks.
   –  Good because it checks that your data is consistent
   –  Bad because it's inflexible

Hierarchical databases
•  All data fits in a tree structure
   –  equivalently: a system of nested sets.
•  Queries are over relations in the tree.
•  Originally devised at IBM in the 60s for NASA mainframes to manage complex inventory for rockets.
•  Started to become important in the 90s with XML.

XPath and XQuery
<p><hw>A*bad"don</hw> <pr>(&adot_;*b&abreve;d"d&ubreve;n)</pr>,
<pos>n.</pos> <ety>[Heb. <ets>&amacr;badd&omacr;n</ets>
destruction, abyss, fr. <ets>&amacr;bad</ets> to be lost, to
perish.]</ety> <sn>1.</sn> <def>The destroyer, or angel of the
bottomless pit; -- the same as Apollyon and Asmodeus.</def><br/>
[<source>1913 Webster</source>]</p>
<p><sn>2.</sn> <def>Hell; the bottomless pit.</def>
<mark>[Poetic]</mark><br/> [<source>1913 Webster</source>]</p>
<p><q>In all her gates, <qex>Abaddon</qex> rues<br/>
Thy bold attempt.</q> <rj><qau>Milton.</qau></rj><br/>
[<source>1913 Webster</source>]</p>
<p><hw>A*baft"</hw> <pr>(&adot_;*b&adot_;ft")</pr>,
<pos>prep.</pos> <ety>[Pref. <ets>a-</ets> on + OE. <ets>baft</ets>,
<ets>baften</ets>, <ets>biaften</ets>, AS. <ets>be&aelig;ftan</ets>;
<ets>be</ets> by + <ets>&aelig;ftan</ets> behind. See <er>After</er>,
<er>Aft</er>, <er>By</er>.]</ety> <fld>(Naut.)</fld> <def>Behind;
toward the stern from;
<as>as, <ex>abaft</ex> the wheelhouse</as>.</def><br/>
[<source>1913 Webster</source>]</p>
<p><cs><col><b>Abaft the beam</b></col>. <cd>See under
<er>Beam</er>.</cd></cs><br/> [<source>1913 Webster</source>]</p>
<p><hw>A*baft"</hw>, <pos>adv.</pos> <fld>(Naut.)</fld>
<def>Toward the stern; aft; <as>as, to go <ex>abaft</ex></as>.</def><br/>
[<source>1913 Webster</source>]</p>
<p><hw>A*bai"sance</hw> <pr>(&adot_;*b&amacr;"s&aitalic_;ns)</pr>,
<pos>n.</pos> <ety>[For <ets>obeisance</ets>; confused with
F. <ets>abaisser</ets>, E. abase.]</ety> <def>Obeisance.</def>
<mark>[Obs.]</mark> <rj><au>Jonson.</au></rj><br/>
[<source>1913 Webster</source>]</p>
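Markup like the sample above can be queried from Java with the standard javax.xml.xpath API. The sketch below is a hedged illustration, not the lecture's own code: the class name XPathSketch, the `<dict>` root element and the cut-down entries are assumptions made so that the fragment parses on its own (the real file uses entities such as `&adot_;` that would need a DTD). It selects every `<p>` that does not carry the usage mark `[Obs.]`, which in the sample sits in a `<mark>` element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: a simplified, hand-made fragment in the style of the
// dictionary markup (entities and most tags omitted; <dict> root is assumed).
class XPathSketch {

    static final String XML =
        "<dict>"
        + "<p><hw>Abaddon</hw> <pos>n.</pos> <def>Hell; the bottomless pit.</def>"
        + " <mark>[Poetic]</mark></p>"
        + "<p><hw>Abaft</hw> <pos>adv.</pos> <fld>(Naut.)</fld>"
        + " <def>Toward the stern; aft.</def></p>"
        + "<p><hw>Abaisance</hw> <pos>n.</pos> <def>Obeisance.</def>"
        + " <mark>[Obs.]</mark> <rj><au>Jonson.</au></rj></p>"
        + "</dict>";

    // Returns the headwords of all entries selected by the given XPath.
    static List<String> headwords(String entryPath) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList entries = (NodeList) xpath.evaluate(
                    entryPath, new InputSource(new StringReader(XML)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < entries.getLength(); i++) {
                // "hw" is evaluated relative to each selected <p> node.
                result.add(xpath.evaluate("hw", entries.item(i)));
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + entryPath, e);
        }
    }

    public static void main(String[] args) {
        // Entries that do not carry the usage mark "[Obs.]".
        System.out.println(headwords("//p[not(.//mark/text()='[Obs.]')]"));
    }
}
```

With this fragment, the call in main selects the Abaddon and Abaft entries and skips Abaisance. Note that XPath compares with a single `=`, and that an entry with no `<mark>` at all still passes the `not(...)` test.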
XPath:

    //p[not(.//mark/text()="[Obs.]")]

XQuery:

    for $au in //au
    for $entry in $au/ancestor::p
    for $hw in $entry//hw
    for $usage in $entry//mark
    for $aname in //authorlist//author
    where $aname/name[text() = $au/text()]
      and $aname/death < 1700
      and $usage != "[Obs.]"
    return $entry

Hierarchical databases
•  Mostly sold as add-ons to relational databases
   –  IBM IMS, pureXML, Oracle 11g XML, some features of MS SQL Server.
•  Some new DBs designed specially for XML:
   –  BaseX, eXist, MarkLogic, Sedna
•  XPath is widespread.
•  XQuery is still developing.
•  Other vendor-specific query languages exist.

Hierarchical databases
•  Does not require a model in advance.
   –  "Semi-structured data" = data is its own model.
   –  Parallels "well-formed XML".
•  Much more complex to query efficiently.
•  Queries are harder to optimize.
   –  Diverse implementations are better at some kinds of queries than others.
•  Very poor at data correlation.
   –  Relational DBs are better optimized for correlation.
•  Few consistency checks – only checks that the data is in a hierarchy.
•  Good support for large data, distributed data and parallel processing.
•  Very well suited to natural language data, and many other kinds of "organic" data.

Hierarchical databases
•  Worst case performance can be up to O(n^2)
•  Using indexes usually keeps performance from getting worse than O(n) on most queries.
•  Simple queries can be as fast as O(log n).
•  But if you have bad optimization or no indexing, O(n^2) is easy to get on simple-seeming queries.
•  Very hard to make queries that take O(C^n) time (but not impossible!)

Network and Graph Databases
•  Oldest style of database – goes back to 1959.
•  Incredibly simple, robust and powerful.
   –  Provably the most powerful formulation of database ideas, since it supports all finite structures.
   –  Excellent scaling, distributed data supported easily.
•  Far too hard to use in the 60s.
•  Weak or non-existent notions of typing or data modeling.
•  IBM didn't want to do it, so it wasn't done.

Network and Graph Databases
WordNet: Dictionary data seen as a collection of nodes with relations between them.

Network and Graph Databases
•  Most common formulation of network databases is the triplestore.
•  Triplestore: A collection of triplets – ordered lists of exactly three symbols.
   –  "brother" "kindof" "relative"
   –  "relative" "kindof" "person"
   –  "body" "partof" "person"
   –  etc.

Network and Graph Databases
•  Simple to formulate.
•  No modeling required to store data.
•  Easily distributed, easy to store.
•  Widely used in Web 3.0 and AI
   –  RDF (Resource Description Framework) is a W3C standard for representing data as triples.
   –  OWL (Web Ontology Language)
   –  Protégé, SparkleDB, BigData platforms
   –  DBpedia, Freebase are triplestores

SPARQL
PREFIX : <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?name ?birth ?death ?person WHERE {
?person dbo:birthPlace :Berlin .
?person dbo:birthDate ?birth .
?person foaf:name ?name .
?person dbo:deathDate ?death .
FILTER (?birth < "1900-01-01"^^xsd:date) .
}
ORDER BY ?name
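Each line inside the WHERE clause above is a triple pattern: some positions are fixed and others are variables to be filled in. A toy in-memory version of that idea can be sketched as follows. This is an assumption-laden illustration, not how a real SPARQL engine or RDF store works: the class TinyTripleStore and its methods are hypothetical, null stands in for a variable, and every pattern is answered by a plain linear scan rather than an index.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy in-memory triplestore; all names are hypothetical.
class TinyTripleStore {

    // A triple is an ordered list of exactly three symbols.
    static final class Triple {
        final String subject, predicate, object;
        Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }
        @Override public String toString() { return subject + " " + predicate + " " + object; }
    }

    private final List<Triple> triples = new ArrayList<>();

    void add(String s, String p, String o) { triples.add(new Triple(s, p, o)); }

    // Pattern match: null acts as a variable. Linear scan, O(n) per pattern.
    List<Triple> match(String s, String p, String o) {
        List<Triple> out = new ArrayList<>();
        for (Triple t : triples) {
            if ((s == null || s.equals(t.subject))
                    && (p == null || p.equals(t.predicate))
                    && (o == null || o.equals(t.object))) {
                out.add(t);
            }
        }
        return out;
    }

    // Follows "kindof" links upward; assumes the links contain no cycles.
    boolean isKindOf(String x, String y) {
        for (Triple t : match(x, "kindof", null)) {
            if (t.object.equals(y) || isKindOf(t.object, y)) return true;
        }
        return false;
    }

    // The three example triples from the triplestore slide.
    static TinyTripleStore sample() {
        TinyTripleStore store = new TinyTripleStore();
        store.add("brother", "kindof", "relative");
        store.add("relative", "kindof", "person");
        store.add("body", "partof", "person");
        return store;
    }

    public static void main(String[] args) {
        TinyTripleStore store = sample();
        System.out.println(store.match(null, "kindof", null));   // all "kindof" triples
        System.out.println(store.isKindOf("brother", "person")); // follows two links
    }
}
```

Because nothing constrains which triples may be stored, a query such as isKindOf only works if the caller already knows that the data uses "kindof" links, and a chain of such lookups can end up rescanning the whole store many times.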
Network and Graph Databases
•  No model to store means you have to know the data's structure to make queries.
•  Robust support for very large databases.
•  No consistency checking.
   –  Good because the real world is vast, messy, and inconsistent.
   –  Bad because you can't be sure how to find anything.

Network and Graph Databases
•  Worst case performance can be up to O(C^n)
   –  Very, very, very bad
   –  Queries can easily require traversing most of the database many times if badly posed.
   –  Can be equivalent to the "travelling salesman" problem
•  Comparable performance to relational and hierarchical databases on the same kinds of queries.

Discussion
•  What kind of database is well suited to this class and its dataset? (the letters)
•  To different annotation styles? (inline and stand-off)
•  To data of unclear accuracy? (OCR with errors, automatic annotation)
•  To ambiguous annotations like genre?