Text Analytics
Part-Of-Speech Tagging using Hidden Markov Models
Ulf Leser
Modeling a Language
• Given a prefix of a sentence: Predict the next word
– "At 5 o'clock, we usually drink …"
• "tea" – quite likely
• "beer" – quite unlikely ("drink a beer" would be slightly more likely)
• "biscuits" – semantically wrong
• "the windows need cleaning" – syntactically wrong
– Similar to Shannon’s Game: Given a series of characters, predict the next
one (used in communication theory, entropy, …)
• Useful for speech/character recognition, translation, T9 on mobiles
– T9: Use prediction of next word as a-priori (background) probability
enhancing interactive prediction
– Helps to disambiguate between different options
– Helps to make useful suggestions
– Helps to point to likely errors
• Language models evidently are highly language dependent
N-Grams for Language Modeling
• Assume the sentence prefix has the n-1 words <w1,…,wn-1>
• Look up the counts of all n-grams starting with <w1,…,wn-1>
• Count the frequencies of all existing continuations
• Choose the most frequent continuation
• More formally
– Compute, for every possible wn,
  p(wn) = p(wn | w1,…,wn-1) = p(w1,…,wn) / p(w1,…,wn-1)
– Choose the wn which maximizes p(wn)
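A minimal sketch of this lookup in Python; the tri-gram counts and words below are invented for illustration, real counts would come from a large corpus:

```python
from collections import Counter

# Hypothetical tri-gram counts; in practice collected from a large corpus
trigram_counts = Counter({
    ("usually", "drink", "tea"): 120,
    ("usually", "drink", "beer"): 15,
    ("usually", "drink", "coffee"): 200,
})

def predict_next(prefix):
    """Return the continuation wn that maximizes p(wn | prefix)."""
    # p(wn | w1..wn-1) is proportional to count(w1..wn), since the
    # denominator count(prefix) is the same for every candidate wn.
    candidates = {ngram[-1]: c for ngram, c in trigram_counts.items()
                  if ngram[:-1] == tuple(prefix)}
    if not candidates:
        return None  # unseen prefix -> needs smoothing (see below)
    return max(candidates, key=candidates.get)

print(predict_next(["usually", "drink"]))  # -> 'coffee' with these toy counts
```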
Markov-Models and n
• Which n should we choose?
• Consider language generation as a stochastic process
• At each stage, the process generates a new word
• Question: How long is its memory? How many previous words does it use to produce the next one?
– 0: Markov chain of order 0: No memory at all
– 1: Markov chain of order 1: Next word only depends on previous word
– 2: Markov chain of order 2: Next word only depends on two previous words
– …
• In language modeling, one usually chooses n=3-4
– That seems small, but most language effects are local
• But not all: "Dan swallowed the large, shiny, red …" (Car? Pill? Strawberry?)
– Furthermore: We cannot estimate good parameters for higher orders (later)
• Not enough training data
Visualization
• Since every state emits exactly one word, we can (for now) merge
states and words
• State transition graph
– Nodes are states
– Arcs are transitions with non-zero probability
• Example
– "I go home", "I go shopping", "I am shopping", "Go shopping"
[State transition graph: states I, am, go, home, shopping; transitions I→go 0.66, I→am 0.33, go→home 0.33, go→shopping 0.66, am→shopping 1]
Example
[Same state transition graph as above: I→go 0.66, I→am 0.33, go→home 0.33, go→shopping 0.66, am→shopping 1]
• p("I go home") = p(w1="I" | w0) * p(w2="go" | w1="I") * p(w3="home" | w2="go")
  = 1 * 0.66 * 0.33 ≈ 2/9
• Problem: Pairs we have not seen in the training data get
probability 0
– With this small “corpus”, almost all sequences get p=0
Higher Order Markov Models
• Markov Models of order k
– The probability of being in state s after n steps depends on the k predecessor states sn-1,…,sn-k
  p(wn=sn | wn-1=sn-1, wn-2=sn-2,…, w1=s1) = p(wn=sn | wn-1=sn-1,…, wn-k=sn-k)
• We can transform any order k model M (k>1) into a
Markov Model of order 1 (M’)
– M' has |M|^k states (all combinations of states of length k)
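A small sketch of this transformation, assuming the order-2 model is given as a dictionary of conditional probabilities p(next word | pair of previous words); all words and numbers below are made up:

```python
# Order-2 transitions: p(next | (prev2, prev1)), toy numbers for illustration
order2 = {
    ("I", "go"): {"home": 0.5, "shopping": 0.5},
    ("go", "home"): {"<end>": 1.0},
}

# Equivalent order-1 model: states are pairs of words, so the next state
# depends only on the current (pair) state.
order1 = {}
for (w1, w2), nexts in order2.items():
    for w3, p in nexts.items():
        order1.setdefault((w1, w2), {})[(w2, w3)] = p

print(order1[("I", "go")])   # {('go', 'home'): 0.5, ('go', 'shopping'): 0.5}
```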
Predicting the Next State
• Our problem at hand is a bit different
– And simpler
• We do not want to reason about an entire sequence, but
only about the next state, given some previous states
– We borrow terminology from Markov models, but not (yet) algs
p(wn) = p(wn | w1,…,wn-1)
      = p(wn | wn-1)
      = p(wn-1, wn) / p(wn-1)
      ~ p(wn-1, wn)
– The maximizing wn is thus the most frequent bi-gram with prefix wn-1
[Same state transition graph as above]
Problem
• We have no infinitely large corpus
• How many n-grams do exist?
– Assume a language of 20,000 words
– n=1: 20,000, n=2: 4E8, n=3: 8E12, n=4: 1.6E17, …
• In "normal" corpora, almost all n-grams (n>4) are non-existent
– This does not mean that they are wrong
– Our model cannot adequately cope with the data sparsity
• Classical trade-off
– Large n: More expressive model, but too many parameters
– Small n: Less expressive model, but easier to learn
• Note: The exponential growth of the number of n-grams in n cannot be balanced by "simply using larger corpora"
Smoothing I: Laplace‘s Law
• We need to give some of the probability mass to unseen events
• Oldest (and simplest) suggestion: “Adding 1”
pLAP(w1,…,wn) = (count(w1,…,wn) + 1) / (N + B)
– Where B is the number of possible n-grams, i.e., K^n
• Clearly, this assigns some probability mass to unseen events
– All bi-grams get a probability≠0
• Actually, it assigns far too much
– Estimates for seen n-grams are scaled down dramatically (B is huge)
– Estimates for unseen n-grams are all small, but there are very many of them
– In a corpus of 40M words with K ≈ 400,000, 99.7% of the total probability mass is spent on unseen events
Smoothing II: Lidstone‘s Law
• Laplace not suitable if there are many events of which few are seen
• Lidstone’s law gives less probability mass to unseen events
pLID(w1,…,wn) = (count(w1,…,wn) + λ) / (N + λ*B)
– Small λ: More mass is given to seen events
– Typical estimate is λ=0.5
– Better values can be learned empirically from (sub-)corpora (next slide)
• However: Estimate of seen events is always linear in the MLE estimate
– No good approximation of empirical distributions
• More advanced techniques: Good-Turing Estimator, n-gram
interpolations
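A sketch of both estimators following the formulas above; here B is taken as K^n as on the previous slide, N is assumed to be the total number of n-gram tokens in the training corpus, and all numbers are illustrative only:

```python
def lidstone(count, N, B, lam=0.5):
    """Lidstone's law: p = (count + lambda) / (N + lambda * B).
    lambda = 1 gives Laplace's 'adding 1'."""
    return (count + lam) / (N + lam * B)

# Toy setting: vocabulary of 20,000 words, bi-grams, 1,000,000 observed tokens
K, n, N = 20_000, 2, 1_000_000
B = K ** n
print(lidstone(42, N, B, lam=1.0))   # Laplace estimate of a seen bi-gram
print(lidstone(0, N, B, lam=0.5))    # Lidstone estimate of an unseen bi-gram
```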
Content of this Lecture
• Part-Of-Speech (POS)
• Simple methods for POS tagging
• Hidden Markov Models
• Closing Remarks
• Most material from
– [MS99], Chapter 9/10
– Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998). "Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids". Cambridge University Press.
– Rabiner, L. R. (1989). "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE 77(2): 257-286.
Part-of-Speech (POS)
• In a sentence, each word has a grammatical class
• Simplest case: Noun, verb, adjective, adverb, article, …
[Figure: the words "the koala put the keys on the table", each linked to one of the tags Determiner, Noun, Verb, Particle]
• Usual format
– The/D koala/N put/V the/D keys/N on/P the/D table/N
Tag Sets
• A (POS-)tag set is a set of terms used to describe POS classes
– Simple tag sets: only broad word classes
– Complex tag sets: Include morphological information
• Noun: Gender, case, number
• Verb: Tense, number, person
• Adjective: normal, comparative, superlative
• …
• Example
– The/D koala/N-s put/V-past-3rd the/D keys/N-p on/P the/D table/N-s
• Important tag sets
– English: Brown/U-Penn tag set
– German: STTS (Stuttgart-Tübingen Tagset)
Brown Tag Set
• Some important tags (see below)
• Has 87 tags
• Definition of "class" is not at all fixed
– London-Lund Corpus of Spoken English: 197 tags
– Lancaster-Oslo/Bergen: 135 tags

BEZ      "is"
DT       Determiner
IN       Preposition
JJ       Adjective
NN       Noun, singular
NNP      Proper Noun
NNS      Noun, plural
PERIOD   ".", "?", "!"
PN       Personal pronoun
RB       Adverb
TO       "to"
VB       Verb, base form
VBZ      Verb, 3rd singular present
VBD      Verb, past tense
WDT      Wh-determiner
…        …
U-Penn TreeBank Tag Set (45 tags)
Tagged Sentences
• Simple tag set
– The/DET koala/N put/V the/DET keys/N on/P the/DET table/N
• Including morphological information
– The/DET koala/N-s put/V-past-3rd the/DET keys/N-p on/P the/DET
table/N-s
• Encoded in Penn tag set
– The/DT koala/NN put/VBN the/DT keys/NNS on/P the/DT table/NN
The   koala    put          the   keys    on   the   table
D     N        V            D     N       P    D     N
D     N-sing   V-past-3rd   D     N-plu   P    D     N-sing
DT    NN       VBN          DT    NNS     P    DT    NN
POS Tagging
• We might assume that each term has an intrinsic grammatical class
– Peter, buy, school, the, …
– POS tagging would be trivial: Collect the class of each word in a dictionary and look up each word in a sentence
• Homonyms
– One term can represent many words (senses), depending on the context
– The different senses can be of different word classes
– “ist modern” – “Balken modern”, “We won a grant” – “to grant access”
• Words intentionally used in different word classes
– “We flour the pan”, “Put the buy here”, “the buy back of the first tranche”
– Words usually have a preferred class, but exceptions are not rare (>10%)
– In German, this usually requires a suffix etc.: kaufen – Einkauf, gabeln –
Gabelung
• Of course, there are exceptions: wir essen – das Essen
Problems
• Words very often have more than one possible POS
– The back door = JJ
– On my back = NN
– Win the voters back = RB
– Promised to back the bill = VB
• Structure of sentences may be ambiguous
– The representative put chairs on the table
• The/DT representative/NN put/VBD chairs/NNS on/IN the/DT table/NN
• The/DT representative/JJ put/NN chairs/VBZ on/IN the/DT table/NN
– Presumably the first is more probable than the second
• Unseen words
– Recall Zipf’s law – there will always be unseen words
A Real Issue
Source: Jurafsky / Martin
Why POS Tagging
• Parsing a sentence usually starts with POS tagging
• Finding phrases (shallow parsing) requires POS tagging
– Noun phrases: “The <large crowd of people> went away”
– Verb phrases
• Applications in all areas of Text Mining
– NER: ~10% performance boost when using POS tags as features
for single-token entities
– NER: ~20% performance boost when using POS tags during post-processing of multi-token entities
– POS tags are a natural source for word sense disambiguation
• Solvable
– Simple and quite accurate (96-99%)
– Many taggers available (BRILL, TNT, MedPost, …)
Content of this Lecture
• Part-Of-Speech (POS)
• Simple methods for POS tagging
– Syntagmatic structure information
– Most frequent class
• Hidden Markov Models
• Closing Remarks
Syntagmatic Structure Information
• Syntagmatic: „the relationship between linguistic units in a
construction or sequence“
– [http://www.thefreedictionary.com]
• Here: Look at the surrounding POS tags
– Some POS-tag sequences are frequent, others impossible
– DT JJ NN versus DT JJ VBZ
• We can count the frequency of POS-patterns using (large)
tagged corpora
– Count all tag bi-grams
– Count all tri-grams
– Count regular expressions (DT * NN versus DT * VBZ)
– … (many ways to define a pattern)
Application
• Tagging with frequent POS tag combinations
– Start with words with unique POS tags (the, to, where, lady, …)
• But: “The The”
– Apply the most frequent patterns that partially match
• Assume <DT JJ> and <JJ NN> are frequent
• “The blue car” -> DT * * -> DT JJ * -> DT JJ NN
• But: "The representative put chairs" -> DT * * * -> DT JJ * * -> DT JJ NN *
– Need to resolve conflicts: “the bank in” -> DT * IN
• Assume frequent bi-grams <DT JJ> and <VBZ IN>
– Pattern-cover algorithm
• Results usually are bad (<80% accuracy)
– Exceptions are too frequent
Even Simpler: Most Frequent Class
• Words usually have a preferred POS
– “Adjektiviertes Verb”, “adjektiviertes Nomen”, “a noun being used
as an adjective”
– The POS tag which a word most often gets assigned to
• Method: Tag each word with its preferred POS
• Charniak, 1993: Reaches 90% accuracy for English
• The observation of “preferred POS plus (less frequent)
exceptions” calls for a probabilistic approach
– Preferred POS has a-priori high probability
– Exceptions may win in certain contexts
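A sketch of this baseline, assuming the training corpus is given as a flat list of (word, tag) pairs; the data and the default tag below are illustrative only:

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_corpus):
    """Learn the preferred POS tag of every word from a tagged corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(sentence, preferred, default="NN"):
    """Tag each word with its preferred POS; fall back to a default tag."""
    return [(w, preferred.get(w.lower(), default)) for w in sentence]

# Illustrative training data and usage
corpus = [("the", "DT"), ("koala", "NN"), ("put", "VBD"), ("the", "DT"),
          ("keys", "NNS"), ("on", "IN"), ("the", "DT"), ("table", "NN")]
model = train_most_frequent_tag(corpus)
print(tag(["the", "koala", "put", "the", "keys"], model))
```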
Content of this Lecture
• Part-Of-Speech (POS)
• Simple methods for POS tagging
• Hidden Markov Models
– Definition and Application
– Learning the Model
– Tagging
• Closing Remarks
General Approach
• We build a sequential probabilistic model
• Recall Markov Models
– A Markov Model is a sequential process with states s1, …, sn with …
• Every state may emit exactly one symbol from Σ
• No two states emit the same symbol
• p(wn=sn | wn-1=sn-1, wn-2=sn-2,…, w1=s1) = p(wn=sn | wn-1=sn-1)
• That doesn’t help much
– If every word is a state, it can emit only one symbol (POS tag)
• We need an extension like this
– We assume one state per POS class
– Each state may emit any word with a given probability
– But: When seeing a new sentence, we can only observe the
sequence of emissions, but not the underlying sequence of states
– This is a Hidden Markov Model
Example
[Figure: HMM with states DT, JJ, NN, VBZ emitting the words "the representative put chairs"]
• Impossible (p=0) emissions omitted
• Several possible paths, each with individual probability
– DT – JJ – NN – VBZ
• p(DT|start) * p(JJ|DT) * p(NN|JJ) * p(VBZ|NN) *
p(the|DT) * p(representative|JJ) * p(put|NN) * p(chairs|VBZ)
– DT – NN – VBZ – NN
• p(DT|start) * p(NN|DT) * p(VBZ|NN) * p(NN|VBZ) *
p(the|DT) * p(representative|NN) * p(put|VBZ) * p(chairs|NN)
Example (continued)
[Same figure and path computations as on the previous slide, now annotated: the p(state | previous state) factors are transition probabilities, the p(word | state) factors are emission probabilities]
Definition
• Definition
A Hidden Markov Model is a sequential stochastic process
with k states s1, …, sk with
– Every state s emits every symbol x∈Σ with probability p(x|s)
– The sequence of states is an order 1 Markov Model:
p(zt=st | zt-1=st-1, zt-2=st-2,…, z0=s0) = p(zt=st | zt-1=st-1) = at-1,t
– The a0,1 are called start probabilities
– The at-1,t are called transition probabilities
– The es(x) = p(x|s) are called emission probabilities
• Note
– A given sequence of symbols can be emitted by many different
sequences of states
– These have individual probabilities depending on the transition
probabilities and the emission probabilities in the state sequence
Example
[Figure: HMM over states DT, JJ, NN, VBZ emitting "the representative put chairs"; transition probabilities p(JJ|DT)=0.3, p(NN|DT)=0.4, p(NN|JJ)=0.6, p(VBZ|NN)=0.2, p(NN|VBZ)=0.1; emission probabilities as used below]
• We omit start probabilities (they are equal for both paths)
• DT – JJ – NN – VBZ
– p(DT|start) * p(JJ|DT) * p(NN|JJ) * p(VBZ|NN) *
  p(the|DT) * p(representative|JJ) * p(put|NN) * p(chairs|VBZ)
– = 1 * 0.3 * 0.6 * 0.2 * 1 * 0.9 * 0.05 * 0.3
• DT – NN – VBZ – NN
– p(DT|start) * p(NN|DT) * p(VBZ|NN) * p(NN|VBZ) *
  p(the|DT) * p(representative|NN) * p(put|VBZ) * p(chairs|NN)
– = 1 * 0.4 * 0.2 * 0.1 * 1 * 0.1 * 0.95 * 0.6
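The two products can be checked in a few lines; the probabilities below are exactly the ones given above, everything else is just bookkeeping:

```python
# Transition and emission probabilities taken from the example above
trans = {("start", "DT"): 1.0, ("DT", "JJ"): 0.3, ("DT", "NN"): 0.4,
         ("JJ", "NN"): 0.6, ("NN", "VBZ"): 0.2, ("VBZ", "NN"): 0.1}
emit = {("DT", "the"): 1.0, ("JJ", "representative"): 0.9,
        ("NN", "representative"): 0.1, ("NN", "put"): 0.05,
        ("VBZ", "put"): 0.95, ("VBZ", "chairs"): 0.3, ("NN", "chairs"): 0.6}

def path_probability(states, words):
    """p(states, words) = product of transition and emission probabilities."""
    p, prev = 1.0, "start"
    for s, w in zip(states, words):
        p *= trans[(prev, s)] * emit[(s, w)]
        prev = s
    return p

words = ["the", "representative", "put", "chairs"]
print(path_probability(["DT", "JJ", "NN", "VBZ"], words))   # ~ 4.86e-4
print(path_probability(["DT", "NN", "VBZ", "NN"], words))   # ~ 4.56e-4
```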
HMM: Classical Problems
• Decoding / parsing: Given a sequence S of symbols and a
HMM M: Which sequence of states did most likely emit S?
– This is our tagging problem once we have the model
– Solution: Viterbi algorithm
• Evaluation: Given a sequence S of symbols and a HMM M:
With which probability did M emit S?
– Different problem, as more than one sequence may have emitted S
– Solution: Forward/Backward algorithm (skipped)
• Learning: Given a sequence S of symbols with tags and a
set of states: Which HMM emits S using the given state
sequence with the highest probability?
– We need to learn (start), emission, and transition probabilities
– Solution: Maximum Likelihood Estimate
– Learning problem without tags: Baum-Welch algorithm (skipped)
Another Example: The Dishonest Casino
A casino has two dice:
• Fair die
p(1)=p(2)=p(3)=p(4)=p(5)=p(6)=1/6
• Loaded die
p(1)=p(2)=p(3)=p(4)=p(5)=1/10
p(6) = 1/2
Casino occasionally switches between dice
(and you want to know when)
Game:
1. You bet $1
2. You roll (always with a fair die)
3. You may bet more or surrender
4. Casino player rolls (with some die…)
5. Highest number wins
Source: Batzoglou, Stanford
The dishonest casino model
[Figure: two-state HMM; each state stays in itself with probability 0.95 and switches to the other state with probability 0.05]
FAIR:    P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
LOADED:  P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2
Question # 1 – Decoding
GIVEN A sequence of rolls by the casino player
62146146136136661664661636616366163616515615115146123562344
QUESTION What portion of the sequence was generated
with the fair die, and what portion with the loaded die?
This is the DECODING question in HMMs
Question # 2 – Evaluation
GIVEN A sequence of rolls by the casino player
62146146136136661664661636616366163616515615115146123562344
QUESTION How likely is this sequence, given our model of
how the casino works?
This is the EVALUATION problem in HMMs
Question # 3 – Learning
GIVEN A sequence of rolls by the casino player
6146136136661664661636616366163616515615115146123562344
QUESTION
How “loaded” is the loaded die? How “fair” is the fair
die? How often does the casino player change from
fair to loaded, and back?
This is the LEARNING question in HMMs
[Note: We need to know how many dice there are!]
Content of this Lecture
• Part-Of-Speech (POS)
• Simple methods for POS tagging
• Hidden Markov Models
– Definition and Application
– Learning the Model
– Tagging
• Closing Remarks
Learning a HMM
• We always assume the set of states as fixed
– These are the POS tags
• We need to learn the emission and the transition
probabilities
• Assuming a large corpus, this can be reasonably achieved
using a Maximum Likelihood Estimate
– Read: Counting relative frequencies of all interesting events
– Events are emissions and transitions
Maximum Likelihood Estimates for Learning a HMM
• Count the frequencies of all state transitions s → t
• Transform them into relative frequencies for each outgoing state
– Let Ast be the number of transitions s → t
• Thus
  ast = p(t | s) = Ast / Σt'∈M Ast'
• Count the frequencies of all emissions es(x) over all symbols x and states s
• Transform them into relative frequencies for each state
– Let Es(x) be the number of times that state s emits symbol x
• Thus
  es(x) = Es(x) / Σx'∈Σ Es(x')
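A sketch of these counts, assuming the training data is a list of tagged sentences (each a list of (word, tag) pairs) and that start probabilities are modeled as transitions from a pseudo state "<start>"; smoothing is omitted here:

```python
from collections import defaultdict

def train_hmm(tagged_sentences):
    """MLE estimates of transition and emission probabilities from tagged data."""
    A = defaultdict(lambda: defaultdict(int))   # A[s][t] = # transitions s -> t
    E = defaultdict(lambda: defaultdict(int))   # E[s][x] = # times state s emits x
    for sentence in tagged_sentences:
        prev = "<start>"
        for word, tag in sentence:
            A[prev][tag] += 1
            E[tag][word] += 1
            prev = tag
    # Normalize counts to relative frequencies per outgoing state / per state
    a = {s: {t: c / sum(ts.values()) for t, c in ts.items()} for s, ts in A.items()}
    e = {s: {x: c / sum(xs.values()) for x, c in xs.items()} for s, xs in E.items()}
    return a, e

# Illustrative usage with a single tagged sentence
a, e = train_hmm([[("the", "DT"), ("koala", "NN"), ("put", "VBD")]])
print(a["<start>"], e["NN"])
```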
Extensions
• Overfitting
– We have a data sparsity problem
• Not so bad for the state transitions
– Not too many POS Tags
– Exception: rare classes
• But very bad for emission probabilities
– Need to apply smoothing
• Background knowledge
– We know that many things are impossible
• Impossible state sequences, words with only one POS tag, …
– We want to include this knowledge
– Option 1: Do not smooth these probabilities
– Option 2: Use a Bayesian approach for probability estimation
• Estimates the probability of the corpus given the background
knowledge
Content of this Lecture
• Part-Of-Speech (POS)
• Simple methods for POS tagging
• Hidden Markov Models
– Definition and Application
– Learning the Model
– Tagging
• Closing Remarks
Viterbi Algorithm
• Definition
Let M be a HMM and S a sequence of symbols. The parsing problem is the problem of finding the state sequence of M which generates S with the highest probability
– Very often, we call a sequence of states a path
• Naïve solution
– Let’s assume that aij>0 and ei(x)>0 for all x,i,j and i,j≤k
– Then there exist kn paths
– We cannot look at all of them
• Viterbi-Algorithm
– Viterbi, A. J. (1967). "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm." IEEE Transactions on Information Theory IT-13: 260-269.
Basic Idea
• Dynamic programming
– Every potential state s at position i in S is reachable by many paths
– However, one of those must be the most probable one
– Any continuation of a path through s only needs this highest probability over all paths reaching s (at i)
– Thus, we can compute those probabilities iteratively for all
sequence positions
[Figure: trellis with one row per POS state (…, JJ, NN, …) and one column per word ("fat blue cat was …")]
Viterbi: Dynamic Programming
• We compute optimal (= most probable) paths for
increasingly long prefixes of S ending in the states of M
• Let vs(i) be the probability of the optimal path for S[..i]
ending in state s
• We want to express vs(i) using only the vs(i-1) values
• Once we have found this formula, we may iteratively
compute vs(1), vs(2), …, vs(|S|) (for all s)
[Same trellis figure as above]
Recursion
• Let vs(i) be the probability of the optimal path for S[..i]
ending in state s
• Assume we proceed from s in position i to t in position i+1
• What is the probability of the path ending in t passing through s before?
– The probability of s (= vs(i))
– * the transition probability from s to t (= ast)
– * the probability that t emits S[i+1] (= et(S[i+1]))
• Of course, we may reach t from any state at position i
• This gives
  vt(i+1) = et(S[i+1]) * maxs∈M( vs(i) * ast )
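A compact sketch of the resulting dynamic program, in probability space (the log transform comes later); it reuses the nested-dictionary representation a[s][t], e[s][x] and the pseudo start state "<start>" assumed in the learning sketch above:

```python
def viterbi(words, states, a, e, start="<start>"):
    """Return the most probable state sequence for the observed words.
    a[s][t]: transition probabilities, e[s][x]: emission probabilities."""
    # v[i][t] holds the probability of the best path for words[..i] ending in t
    v = [{t: a.get(start, {}).get(t, 0.0) * e.get(t, {}).get(words[0], 0.0)
          for t in states}]
    back = [{}]
    for i in range(1, len(words)):
        v.append({})
        back.append({})
        for t in states:
            # v_t(i) = e_t(S[i]) * max_s( v_s(i-1) * a_st )
            best_s = max(states, key=lambda s: v[i-1][s] * a.get(s, {}).get(t, 0.0))
            v[i][t] = (e.get(t, {}).get(words[i], 0.0)
                       * v[i-1][best_s] * a.get(best_s, {}).get(t, 0.0))
            back[i][t] = best_s
    # Trace back from the best state in the right-most column
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Usage: viterbi(["the", "representative", "put", "chairs"],
#                ["DT", "JJ", "NN", "VBZ"], a, e)  with a, e from train_hmm
```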
Tabular Computation
        (start)  The  fat  blue  …
S0      1        0    0    0
DT      0        1    0    0
JJ      0        0    …    …
NN      0        0    …    …
NNS     0        0    …    …
VB      0        0    …    …
VBZ     0        0    …    …
…
• We use a table for storing the vs(i)
• Special start state with start probability 1; all other states have start probability 0
• Compute column-wise
• Every cell can be reached from every cell in the previous column
• If a state never emits a certain symbol, all probabilities in columns with this symbol will be 0
Result
        (start)  The  fat  blue  …  cake
S0      1        0    0    0     …  0
DT      0        1    0    0     …  0.004
JJ      0        0    …    …     …  0.034
NN      0        0    …    …     …  0.0012
NNS     0        0    …    …     …  0.0001
VB      0        0    …    …     …  0.002
VBZ     0        0    …    …     …  0.013
…                                   0.008
• The probability of the most probable parse is the largest value in the right-most column
• The most probable tag sequence is determined by trace-back
Complexity
• Let |S|=n, |M|=k (states)
• This gives
– The table has n*k cells
– For computing a cell value, we need to access all potential
predecessor states (=k)
– Together: O(n*k2)
Numerical Difficulties
• Naturally, the numbers are getting extremely small
– We are multiplying small probabilities (all <<1)
• We need to take care of not running into problems with
computational accuracy
• Solution: Use logarithms
– Instead of
  vt(i+1) = et(S[i+1]) * maxs∈M( vs(i) * ast )
– Compute (with vs(i) now holding log probabilities)
  vt(i+1) = log(et(S[i+1])) + maxs∈M( vs(i) + log(ast) )
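Under the same assumptions as the Viterbi sketch above (nested dictionaries a and e), the inner update could be switched to log space roughly like this; logp and NEG_INF are helper names introduced only for this sketch:

```python
import math

NEG_INF = float("-inf")          # stands in for log(0)

def logp(x):
    """Safe logarithm that maps probability 0 to -infinity."""
    return math.log(x) if x > 0 else NEG_INF

def log_update(v_prev, t, word, states, a, e):
    """Log-space Viterbi update:
    v_t(i+1) = log e_t(S[i+1]) + max_s( v_s(i) + log a_st )."""
    return logp(e.get(t, {}).get(word, 0.0)) + max(
        v_prev[s] + logp(a.get(s, {}).get(t, 0.0)) for s in states)
```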
Unknown Words
• We have no good emission probabilities for unknown
words
• Thus, their tags are estimated only by the transition
probabilities, which is not very accurate
• The treatment of unknown words is one of the major
differentiating features in different POS taggers
• Information one may use
– Morphological clues: suffixes (-ed mostly is past tense of a verb)
– Likelihood of a POS class of allowing a new word
• Some classes are closed: Determiner, pronouns, …
– Special characters, “Greek” syllables, … (hint to proper names)
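As an illustration of the morphological-clue idea, a sketch of a suffix-based fallback for unseen words; the suffix table below is invented for illustration, real taggers (e.g. TnT) estimate tag probabilities per suffix from the training corpus:

```python
# Invented suffix -> tag preferences; a real tagger would learn p(tag | suffix)
SUFFIX_TAGS = {"ed": "VBD", "ing": "VBG", "ly": "RB", "s": "NNS"}

def guess_tag_for_unknown(word, default="NN"):
    """Guess a POS tag for an unseen word from its suffix; capitalized
    unknown words are treated as proper nouns."""
    if word[0].isupper():
        return "NNP"
    for suffix, tag in sorted(SUFFIX_TAGS.items(), key=lambda kv: -len(kv[0])):
        if word.endswith(suffix):
            return tag
    return default

print(guess_tag_for_unknown("phosphorylated"))   # -> 'VBD'
print(guess_tag_for_unknown("Leser"))            # -> 'NNP'
```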
Content of this Lecture
• Part-Of-Speech (POS)
• Simple methods for POS tagging
• Hidden Markov Models
• Closing Remarks
Wrap-Up
• Advantages of HMM
– Clean framework
– Rather simple math
– Good performance (for tri-grams)
• Disadvantages
– Cannot capture non-local dependencies
• Tri-grams sometimes are not enough
– Cannot condition tags on preceding words (but only on tags)
– Such extensions cannot be built into a HMM easily: space of
parameters to learn explodes
Transformation-Based Tagger
• Brill: "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging", Computational Linguistics, 1995.
• Idea: Identify „typical situations“ in a trained corpus
– After “to”, there usually comes a verb
– Situations may combine words, tags, morphological information,
etc.
• Abstract into transformation rules
• Apply when seeing untagged text
Transformation-Based Tagging
• Learning rules
– We simulate the real case
– "Untag" the tagged corpus
– Tag each word with its most probable POS tag
– Find the most typical differences between the original (tagged) text and the retagged text
• These are the most typical errors one performs when using only the
most probable classes
• Their correction (using the gold standard) is learned
• Tagging
– Assign each word its most probable POS tag
– Apply transformation rules to rewrite tag sequence into a hopefully
more correct one
– Issues: Order of application of rules? Termination?
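A minimal sketch of the tagging phase under one common Brill rule template ("change tag A to tag B when the previous tag is Z"); the two rules below are invented for illustration, real rules are learned from the corpus:

```python
# Invented rules of the template "change FROM to TO if the previous tag is PREV"
RULES = [
    ("NN", "VB", "TO"),     # e.g. "to/TO back/NN" -> "to/TO back/VB"
    ("VBD", "VBN", "VBZ"),  # e.g. after "is/VBZ", a past-tense tag becomes a participle
]

def apply_rules(tags):
    """Rewrite an initial tag sequence with the transformation rules, in order."""
    tags = list(tags)
    for frm, to, prev in RULES:            # rules are applied in their learned order
        for i in range(1, len(tags)):
            if tags[i] == frm and tags[i - 1] == prev:
                tags[i] = to
    return tags

# Initial tagging by most frequent class, then rule-based correction
print(apply_rules(["TO", "NN"]))   # -> ['TO', 'VB']
```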
POS Tagging Today
• A number of free taggers are available: Brill tagger, TnT, TreeTagger, OpenNLP MAXENT tagger, …
• When choosing a tagger
– Which corpus was used to learn the model?
• Some can be retrained, some come with a fixed model
– Treatment of unknown words?
• Some figures
– Brill tagger has ~87% accuracy on Medline abstracts
• When learned on Brown corpus
• A very large fraction of Medline terms do not occur in newspaper corpora
– Performance of >97% accuracy is possible
• MedPost: HMM-based, with a dictionary of fixed (word / POS-tag) assignments for the 10,000 most frequent "unknown" Medline terms
• TnT / MaxEnt taggers reach 95-98% accuracy on newspaper corpora
– Further improvements hit the inter-annotator agreement borders
• And depend on the tag set – the richer, the more difficult
References
• Brill, E. (1992). "A simple rule-based part of speech tagger". Conf. Applied Natural Language Processing (ANLP92), Trento, Italy. pp. 152-155.
• Brants, T. (2000). "TnT - a statistical part-of-speech tagger". Conf
Applied Natural Language Processing (ANLP00), Seattle, USA.
– TnT = Trigrams‘n‘Tags
• Ratnaparkhi, A. (1996). "A Maximum Entropy Model for Part-Of-Speech
Tagging". Conference on Empirical Methods in Natural Language
Processing: 133-142.
– We will discuss Maximum Entropy models later
• Smith, L., Rindflesch, T. and Wilbur, W. J. (2004). "MedPost: a part-of-speech tagger for biomedical text." Bioinformatics 20(14): 2320-2321.
