Taal- en spraaktechnologie (Language and Speech Technology)

Lecture 3

Sophia Katrenko
(thanks to R. Navigli and S. P. Ponzetto)
Utrecht University, the Netherlands
Outline

1. Covered so far
2. Today
   - Unsupervised Word Sense Disambiguation (WSD)
   - Lexical acquisition
Recap

Last time, we discussed WSD resources (WordNet, SemCor, SemEval competitions), and also methods:
- dictionary-based WSD (Lesk, 1986)
- supervised WSD (Gale et al., 1992)
- minimally supervised WSD (Yarowsky, 1995)
- noun categorization
Today

Today we discuss Chapter 19 (Jurafsky), and more precisely:
1. unsupervised word sense disambiguation
2. lexical acquisition
WSD methods: an overview

(Figure: an overview of WSD methods. Source: Navigli and Ponzetto, 2010.)
Unsupervised WSD

Most methods we have discussed so far treated WSD as classification, where the number of senses is fixed in advance.

Noun categorization already shifted the focus to unsupervised learning: the learning itself was unsupervised, while the evaluation was done as for supervised systems.

We will now move further into unsupervised learning and discuss clustering (as a mechanism) in more detail.
Unsupervised WSD

The sense of a word can never be taken in isolation: the same sense of a word will have similar neighboring words.

"You shall know a word by the company it keeps" (Firth, 1957).

"For a large class of cases though not for all in which we employ the word 'meaning' it can be defined thus: the meaning of a word is its use in the language." (Wittgenstein, Philosophical Investigations, 1953)
Unsupervised WSD

Unsupervised WSD relies on the observations above:
1. take word occurrences in some (possibly predefined) contexts
2. cluster them
3. assign new words to one of the clusters

The noun categorization task followed only the first two steps (no assignment of new words).
Clustering

- Clustering is a type of unsupervised machine learning which aims at grouping similar objects together.
- There is no a priori output (i.e., no labels).
- A cluster is a collection of objects which are similar (in some way).
Clustering

Types of clustering:
- EXCLUSIVE CLUSTERING: each datum belongs to exactly one cluster (no overlapping clusters)
- OVERLAPPING CLUSTERING: uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership
- HIERARCHICAL CLUSTERING: explores the union between the two nearest clusters
- PROBABILISTIC CLUSTERING
Clustering

Hierarchical clustering is in turn of two types:
- BOTTOM-UP (AGGLOMERATIVE)
- TOP-DOWN (DIVISIVE)
Clustering

(Figure: hierarchical clustering for Dutch text. Source: van de Cruys, 2006.)
Clustering

(Figure: hierarchical clustering for Dutch dialects. Source: Wieling and Nerbonne, 2010.)

The data come from the Goeman-Taeldeman-Van Reenen project: 1876 phonetically transcribed items for 613 dialect varieties in the Netherlands and Flanders.
Clustering

Now...
- Clustering problems are NP-hard; it is infeasible to try all possible clustering solutions.
- Clustering algorithms therefore look at only a small fraction of all possible partitions of the data.
- The portions of the search space that are considered depend on the kind of algorithm used.
Clustering

What is a good clustering solution?
- The intra-cluster similarity is high, and the inter-cluster similarity is low.
- The quality of the clusters depends on the definition and the representation of clusters.
- The quality of the clustering depends on the similarity measure.
Clustering

AGGLOMERATIVE CLUSTERING works as follows (a code sketch follows the steps):
1. Assign each object to a separate cluster.
2. Evaluate all pair-wise distances between clusters.
3. Construct a distance matrix using the distance values.
4. Look for the pair of clusters with the shortest distance.
5. Remove the pair from the matrix and merge them.
6. Evaluate all distances from this new cluster to all other clusters, and update the matrix.
7. Repeat until the distance matrix is reduced to a single element.
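A minimal sketch of this procedure in Python, assuming single-link distances between clusters and Euclidean distances between points (both are assumptions; the steps above leave the distance definitions open):

    import numpy as np

    def agglomerative(points):
        # Step 1: every object starts in its own cluster.
        clusters = [[i] for i in range(len(points))]
        dist = lambda i, j: np.linalg.norm(points[i] - points[j])
        merges = []
        while len(clusters) > 1:
            # Steps 2-4: evaluate all pair-wise (single-link) cluster
            # distances and find the closest pair.
            a, b = min(
                ((a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))),
                key=lambda p: min(dist(i, j)
                                  for i in clusters[p[0]] for j in clusters[p[1]]))
            # Steps 5-6: merge the pair; distances involving the new cluster
            # are recomputed from scratch on the next iteration.
            merges.append((clusters[a], clusters[b]))
            clusters[a] = clusters[a] + clusters[b]
            del clusters[b]
        return merges

    points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
    print(agglomerative(points))  # merges (0,1) and (2,3) first, then the rest

In practice one would use scipy.cluster.hierarchy.linkage, which implements the same idea efficiently.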
Clustering

K-means algorithm

Partitions n samples (objects) into k clusters. Each cluster c is represented by its centroid:

\mu(c) = \frac{1}{|c|} \sum_{x \in c} x

The algorithm converges to stable cluster centroids, i.e. it minimizes the sum of the squared distances to the cluster centers:

E = \sum_{i=1}^{k} \sum_{x \in c_i} \|x - \mu_i\|^2
Clustering

K-means algorithm (a code sketch follows the steps)
1. INITIALIZATION: select k points in the space represented by the objects that are being clustered (seed points)
2. ASSIGNMENT: assign each object to the cluster that has the closest centroid (mean)
3. UPDATE: after all objects have been assigned, recalculate the positions of the k centroids (means)
4. TERMINATION: go back to (2) until the centroids no longer move, i.e. there are no more new assignments
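A minimal NumPy sketch of these four steps (seeding from randomly chosen data points and Euclidean distance are assumptions; the steps leave both open, and empty clusters are not handled):

    import numpy as np

    def kmeans(X, k, seed=0):
        rng = np.random.default_rng(seed)
        # Initialization: pick k of the objects as seed centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        while True:
            # Assignment: each object goes to the closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update: recompute each centroid as the mean of its cluster.
            new = np.array([X[labels == i].mean(axis=0) for i in range(k)])
            # Termination: stop once the centroids no longer move.
            if np.allclose(new, centroids):
                return labels, centroids
            centroids = new

    X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
    print(kmeans(X, k=2))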
K-means: limitations

- sensitive to the initial seed points (the algorithm does not specify how to initialize the means; this is often done randomly)
- the number of clusters k must be specified in advance (how do we choose the value of k?)
- unable to handle noisy data and outliers
- unable to model the uncertainty in cluster assignment
Clustering

(Figures: a "good" choice of seeds vs. a "bad" choice of seeds.)
Back to unsupervised WSD

1. CONTEXT CLUSTERING
   - Each occurrence of a target word in a corpus is represented as a context vector.
   - Vectors are then clustered into groups, each identifying a sense of the target word.
2. WORD CLUSTERING
   - Clustering words which are semantically similar and can thus convey a specific meaning.
3. CO-OCCURRENCE GRAPHS
   - Apply graph algorithms to co-occurrence graphs, i.e. graphs connecting pairs of words which co-occur in a syntactic relation, in the same paragraph, or in a larger context.
Context clustering

A first proposal is based on the notion of word space (Schütze, 1992): a vector space whose dimensions are words.

(Figure: the architecture proposed by Schütze, 1998.)
Context clustering

So what is a word space? Represent a word with a word vector: a co-occurrence vector which counts the number of times the word co-occurs with other words.

    word vector | dimension "legal" | dimension "clothes"
    judge       | 300               | 75
    robe        | 133               | 200
Context clustering

Now, what can we do with all the vectors?
- compute the so-called dot product (or inner product) A · B
- measure their magnitudes |A| and |B| (Euclidean distance)

A \cdot B = x_1 x_2 + y_1 y_2   (1)

|A| = d_{AC} = \sqrt{(x_1 - x_0)^2 + (y_1 - y_0)^2}   (2)
Context clustering

Word vectors capture the "topical dimensions" of a word. Given the word vector space, the similarity between two words v and w can be measured geometrically, e.g. by cosine similarity:

sim(v, w) = \frac{v \cdot w}{|v||w|} = \frac{\sum_{i=1}^{m} v_i w_i}{\sqrt{\sum_{i=1}^{m} v_i^2} \sqrt{\sum_{i=1}^{m} w_i^2}}   (3)
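A quick check of equation (3) on the toy judge/robe table above:

    import math

    def cosine(v, w):
        # Equation (3): dot product divided by the product of magnitudes.
        dot = sum(vi * wi for vi, wi in zip(v, w))
        norm = lambda u: math.sqrt(sum(ui * ui for ui in u))
        return dot / (norm(v) * norm(w))

    judge = [300, 75]   # co-occurrence counts with (legal, clothes)
    robe = [133, 200]
    print(round(cosine(judge, robe), 3))  # 0.739

Despite the different counts, judge and robe come out fairly similar (≈ 0.74).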
Context clustering

Problem: word vectors conflate the senses of a word, so we need to include information from the context.

Context vector: the centroid (or sum) of the word vectors occurring in the context, weighted according to their discriminating potential.
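A sketch of a context vector as a weighted centroid (the weight function is assumed given, e.g. idf-style weights standing in for "discriminating potential"):

    import numpy as np

    def context_vector(context_words, word_vectors, weight):
        # Weighted centroid of the word vectors occurring in the context;
        # words missing from the vector space are skipped.
        vecs = [weight(w) * word_vectors[w]
                for w in context_words if w in word_vectors]
        return np.mean(vecs, axis=0)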
Context clustering

Finally, sense vectors are derived by clustering the context vectors into a predefined number of clusters. A sense is a group of similar contexts.
Word clustering

Lin's approach (1998)

Extract dependency triples from a text corpus, e.g. for "John eats a yummy kiwi":
(eat subj John)
(John subj-of eat)
(eat obj kiwi)
(kiwi obj-of eat)
(kiwi adj-mod yummy)
(yummy adj-mod-of kiwi)
(kiwi det a)
(a det-of kiwi)
Word clustering

Define a measure of similarity between two words:
- The occurrence of a dependency triple (w, r, w') can be seen as the co-occurrence of three events: A (a randomly selected word is w), B (a randomly selected dependency type is r), and C (a randomly selected word is w').
- Assume that A and C are conditionally independent given B: P(A, B, C) = P(B) P(A|B) P(C|B).
- Compute the information content IC(A, B, C) = -log P(A, B, C) based on this independence assumption.
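A sketch of the information-content computation from triple counts (the four-triple toy corpus is illustrative; Lin estimates these probabilities from a full parsed corpus):

    import math
    from collections import Counter

    triples = [("eat", "obj", "kiwi"), ("eat", "obj", "apple"),
               ("peel", "obj", "kiwi"), ("eat", "subj", "John")]

    n = len(triples)
    rel = Counter(r for _, r, _ in triples)              # counts of B
    word_rel = Counter((w, r) for w, r, _ in triples)    # counts of (A, B)
    rel_word = Counter((r, w2) for _, r, w2 in triples)  # counts of (B, C)

    def ic(w, r, w2):
        # P(A,B,C) = P(B) * P(A|B) * P(C|B), estimated from the counts.
        p = (rel[r] / n) * (word_rel[(w, r)] / rel[r]) * (rel_word[(r, w2)] / rel[r])
        return -math.log2(p)

    print(ic("eat", "obj", "kiwi"))  # -log2(3/4 * 2/3 * 2/3) ≈ 1.58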
Word clustering

Use the similarity scores to create a similarity tree (sketched in code below):
- Let w_1, ..., w_n be a list of words in descending order of their similarity to a given word w_0.
- Initialize the similarity tree with the single root node w_0.
- For i = 1, ..., n, insert w_i as a child of the w_j that is most similar to w_i among {w_0, ..., w_{i-1}}.
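A minimal sketch of the tree construction (the sim function is assumed given, e.g. Lin's distributional similarity):

    def similarity_tree(w0, words, sim):
        # Rank words by descending similarity to the root w0.
        ranked = sorted(words, key=lambda w: sim(w0, w), reverse=True)
        parent = {w0: None}
        for i, wi in enumerate(ranked):
            # Attach wi under the most similar word seen so far.
            parent[wi] = max([w0] + ranked[:i], key=lambda wj: sim(wi, wj))
        return parent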
Word clustering

(Figure: an example of Lin's output.)
Word clustering

Clustering By Committee (Lin and Pantel, 2002)
(1) Parse the entire corpus using a dependency parser.
(2) Represent each word as a feature vector (features express the syntactic contexts in which the word occurs).
(3) Create a similarity matrix S such that S_ij is the similarity between w_i and w_j.
(4) Cluster the words using group-average clustering:
    - not all words are clustered at the first iteration
    - residue words are clustered at later iterations
(5) Disambiguate:
    - Find the cluster centroids: word committees.
    - For non-centroid words, match their pattern features to the committee words' features; matched features are then removed from the word representation (to allow new assignments of the same word).
Graph-based methods

Co-occurrence graphs

Based on the notion of a co-occurrence graph: a graph G = (V, E) where
- V is the set of vertices, i.e. words
- E is the set of edges, typically:
  - simple co-occurrence relations (e.g. within the same sentence or paragraph)
  - syntactic relations between pairs of co-occurring words

Given a target ambiguous word w, a graph is built of the words co-occurring with w.
Graph-based methods

(Figure: an example of a co-occurrence graph.)
Graph-based methods

CURVATURE CLUSTERING (Dorow et al., 2005)
- Based on curvature (the clustering coefficient).
- Based on the notion of a triangle: a triple of vertices {v, v', v''} such that {v, v'}, {v', v''}, {v'', v} ∈ E.
- Quantifies the ratio of interconnections of a node with its neighbors:

curv(v) = \frac{\#A}{\#B}   (4)

where #A is the number of triangles including v, and #B is the number of possible triangles including v, i.e. \binom{deg(v)}{2}.
The algorithm (a code sketch follows the steps):
1. Build the co-occurrence graph of a target word w.
2. Calculate the curvature of each node in the graph.
3. Remove nodes whose curvature is below a threshold.
4. Each connected component then constitutes a meaning (i.e., a sense) of the target word w.
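A sketch with networkx (the toy graph is an illustrative miniature of the jaguar example below; curvature equals the local clustering coefficient, which networkx computes directly):

    import networkx as nx

    # Two tight word communities plus the ambiguous hub "jaguar".
    G = nx.Graph([("feline", "tiger"), ("tiger", "jungle"), ("jungle", "feline"),
                  ("mac", "os"), ("os", "unix"), ("unix", "mac")])
    G.add_edges_from(("jaguar", v) for v in list(G.nodes))

    def curvature_senses(G, threshold=0.5):
        # curv(v) = triangles through v / (deg(v) choose 2).
        curv = nx.clustering(G)
        kept = [v for v in G if curv[v] >= threshold]
        # Each connected component of the pruned graph is one sense.
        return [sorted(c) for c in nx.connected_components(G.subgraph(kept))]

    print(curvature_senses(G))
    # jaguar itself has curvature 6/15 = 0.4 and is pruned, leaving the
    # components ['feline', 'jungle', 'tiger'] and ['mac', 'os', 'unix'].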
Graph-based methods

CURVATURE CLUSTERING: a worked example (figures in the original slides):
- Step 1: Build the co-occurrence graph of the target word jaguar.
- Step 2: Calculate the curvature of each node in the graph.
- Step 3: Remove nodes with curvature below a threshold, e.g. < 0.5.
- Step 4: Output a meaning for each connected component: { ict, os, mac, unix }, { car, engine }, { feline, tiger, jungle }
Graph-based methods

Word Sense Induction (WSI)
- Actually performs word sense discrimination.
- Aims to divide the occurrences of a word into a number of classes.
- Makes objective evaluation more difficult if not embedded into an application.
- But WSI and WSD are closely related: the clusters produced can be used to sense-tag new word occurrences.
Evaluation

How to evaluate WSI?
- Manual evaluation
- Gold standard clustering
- Mapping to an existing sense inventory
- Mapping to an annotated corpus + supervised WSD
- Pseudowords
Manual evaluation: people are asked to judge the quality of a clustering.
- How would you assess the clustering shown for jaguar?
- On a sample basis, each evaluator is asked to judge the similarity of pairs of words from the same cluster and from different clusters (without being given that information).
Gold standard clustering:
- Given a gold standard clustering, compare it with the output clustering.
Clusters are compared (as in the previous lecture) using:
- PURITY: to compute purity, each cluster is assigned to the class which is most frequent in the cluster; the accuracy of this assignment is then measured by counting the number of correctly assigned objects and dividing by the total number of objects clustered.
- ENTROPY: measures cluster homogeneity; lower entropy means more homogeneous clusters.

Ideally, purity should be 1 and entropy 0. A sketch of both measures follows.
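A minimal sketch of both measures over induced cluster ids and gold class labels (the toy data are illustrative):

    import math
    from collections import Counter

    def per_cluster(clusters, labels):
        members = {}
        for c, y in zip(clusters, labels):
            members.setdefault(c, []).append(y)
        return members.values()

    def purity(clusters, labels):
        # Each cluster votes for its most frequent gold class.
        return sum(Counter(m).most_common(1)[0][1]
                   for m in per_cluster(clusters, labels)) / len(labels)

    def entropy(clusters, labels):
        # Size-weighted average of per-cluster label entropies.
        total = 0.0
        for m in per_cluster(clusters, labels):
            probs = [n / len(m) for n in Counter(m).values()]
            total += len(m) / len(labels) * -sum(p * math.log2(p) for p in probs)
        return total

    clusters = [0, 0, 0, 1, 1]                  # induced cluster ids
    labels = ["cat", "cat", "os", "os", "os"]   # gold senses
    print(purity(clusters, labels), entropy(clusters, labels))  # 0.8, ≈0.55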
Evaluation via mapping to an existing sense inventory:
- Clusters are mapped to the senses of an existing sense inventory (e.g. WordNet).
- Lin and Pantel (2002) automatically map clusters to WordNet synsets.
- The similarity between a cluster c and a synset s is

SimC(c, s) = \frac{\sum_{w \in c} SimW(s, w)}{|c|}   (5)

- A cluster c is correct if a synset s exists such that SimC(c, s) is at least a fixed threshold.
Evaluation via pseudowords (Schütze, 1992):
- Generates new words with artificial ambiguity.
- First, select two or more monosemous words, e.g. pizza and blog.
- Given all their occurrences in a corpus:
  "Yesterday we ate a pizza at the restaurant."
  "Margherita: pizza with mozzarella and tomato."
  "I am writing a new post on my blog."
  "How many blogs are there on-line?"
- Then replace them with a pseudoword obtained by joining the monosemous words, e.g. pizzablog (a code sketch follows):
  "Yesterday we ate a pizzablog at the restaurant."
  "Margherita: pizzablog with mozzarella and tomato."
  "I am writing a new post on my pizzablog."
  "How many pizzablogs are there on-line?"
- The original words then serve as gold senses for the induced clusters.
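A tiny sketch of the corpus transformation (the regex also catches the plain plural seen above; everything else follows the slide's example):

    import re

    def pseudoword_corpus(sentences, words, pseudo):
        # Replace each monosemous word (and its plural) with the pseudoword,
        # remembering the original word as the gold "sense".
        pat = re.compile(r"\b(" + "|".join(words) + r")(s?)\b", re.IGNORECASE)
        return [(pat.sub(pseudo + r"\2", s),
                 [m.group(1).lower() for m in pat.finditer(s)]) for s in sentences]

    sents = ["Yesterday we ate a pizza at the restaurant.",
             "How many blogs are there on-line?"]
    print(pseudoword_corpus(sents, ["pizza", "blog"], "pizzablog"))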
To consider:
- hard vs. soft clustering
- baselines:
  - All-in-one: group all words into one big cluster
  - Random: produce a random set of clusters
SemEval 2007:
- Coarse-grained WSD reaches a performance of over 80% accuracy.
- Lexical sample WSD even reaches almost 90%.
Senseval/SemEval: findings
- The performance variations are quite consistent with the hardness of the tasks (e.g., fine-grained all-words tasks are the hardest).
- Among supervised systems, instance-based approaches and SVMs perform best.
- The most frequent sense baseline is a real challenge in the all-words WSD setting (though not for lexical sample tasks).
- Knowledge-based methods achieve performance similar to the baseline; however, some of them can provide justifications for their sense choices (e.g. SSI; Navigli and Velardi, 2005).
Lexical acquisition
Similarity measures

A note on measure vs. metric: a metric on a set X is a function d : X × X → R with the following properties:
- d(x, y) ≥ 0 (non-negativity)
- d(x, y) = 0 iff x = y (identity of indiscernibles)
- d(x, y) = d(y, x) (symmetry)
- d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
Similarity between two lexical items can be measured in many ways, e.g.:
- using distributional information (corpus counts)
- using the WordNet structure
For two binary vectors, viewed as the sets X and Y of dimensions with value 1, the most common measures are as follows:

    measure               | definition
    matching coefficient  | |X ∩ Y|
    Dice coefficient      | 2|X ∩ Y| / (|X| + |Y|)
    Jaccard coefficient   | |X ∩ Y| / |X ∪ Y|
    Overlap coefficient   | |X ∩ Y| / min(|X|, |Y|)
    cosine                | |X ∩ Y| / √(|X| × |Y|)
If we move to frequency counts, each word is represented by its counts over contexts:

    word | context_1 | context_2 | ... | context_n
    w    | w_1       | w_2       | ... | w_n
    v    | v_1       | v_2       | ... | v_n

Dice coefficient:

d_{Dice} = \frac{2|X \cap Y|}{|X| + |Y|}   (6)

d_{Dice} = \frac{2 \sum_{i=1}^{n} \min(w_i, v_i)}{\sum_{i=1}^{n} w_i + \sum_{i=1}^{n} v_i}   (7)
Jaccard coefficient:

d_{Jaccard} = \frac{|X \cap Y|}{|X \cup Y|}   (8)

d_{Jaccard} = \frac{\sum_{i=1}^{n} \min(w_i, v_i)}{\sum_{i=1}^{n} \max(w_i, v_i)}   (9)
Manhattan and Euclidean distances:

d_{Manhattan} = \sum_{i=1}^{n} |w_i - v_i|   (10)

d_{Euclidean} = \sqrt{\sum_{i=1}^{n} (w_i - v_i)^2}   (11)
Cosine similarity (a code sketch of the frequency-count measures follows):

d_{cosine} = \frac{\sum_{i=1}^{n} w_i v_i}{\sqrt{\sum_{i=1}^{n} w_i^2} \sqrt{\sum_{i=1}^{n} v_i^2}}   (12)
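A minimal sketch of the frequency-count measures (7), (9), (10), and (11); cosine (12) is the same function as in the sketch after equation (3):

    import math

    def dice(w, v):
        return 2 * sum(map(min, w, v)) / (sum(w) + sum(v))

    def jaccard(w, v):
        return sum(map(min, w, v)) / sum(map(max, w, v))

    def manhattan(w, v):
        return sum(abs(wi - vi) for wi, vi in zip(w, v))

    def euclidean(w, v):
        return math.sqrt(sum((wi - vi) ** 2 for wi, vi in zip(w, v)))

    w, v = [300, 75, 0], [133, 200, 1]   # illustrative context counts
    for f in (dice, jaccard, manhattan, euclidean):
        print(f.__name__, round(f(w, v), 3))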
WordNet-based measures

How can we use WordNet to measure relatedness/similarity? The following notions are used:
- The path between two synsets c_1 and c_2, pathlen(c_1, c_2): the number of edges in the shortest path in the thesaurus graph between the sense nodes c_1 and c_2.
- The lowest common subsumer lcs(c_1, c_2): the lowest node in the hierarchy that subsumes (is a hypernym of) both c_1 and c_2.
(Figure: part of the WordNet hierarchy, containing the nodes artifact, instrumentation, implement, device, tool, trap, drill, and net.)
Additional notions:
- The probability that a randomly selected word in a corpus is an instance of concept c, P(c) (Resnik, 1995):

P(c) = \frac{\sum_{w \in words(c)} count(w)}{N}   (13)

where words(c) is the set of words subsumed by concept c, and N is the total number of words in the corpus that are also present in the thesaurus.
- The information content:

IC(c) = -\log P(c)   (14)
DEFINITIONS

Leacock and Chodorow, 1998 (lch):

sim_{path}(c_1, c_2) = -\log pathlen(c_1, c_2)   (15)

Resnik measure (Resnik, 1995) (res):

sim_{resnik}(c_1, c_2) = -\log P(lcs(c_1, c_2))   (16)
Wu and Palmer, 1998 (wup):

sim_{wup}(c_1, c_2) = \frac{2 \cdot dep(lcs(c_1, c_2))}{len(c_1, lcs(c_1, c_2)) + len(c_2, lcs(c_1, c_2)) + 2 \cdot dep(lcs(c_1, c_2))}
Lin (1998) compares two objects A and B given their
- COMMONALITY: the more information A and B have in common, the more similar they are (IC(common(A, B))).
- DIFFERENCE: the more differences there are between the information in A and B, the less similar they are (IC(description(A, B)) − IC(common(A, B))).

sim_{Lin}(A, B) = \frac{\log P(common(A, B))}{\log P(description(A, B))}   (17)
How to apply this to WordNet?

sim_{Lin}(c_1, c_2) = \frac{2 \log P(lcs(c_1, c_2))}{\log P(c_1) + \log P(c_2)}   (18)

Jiang-Conrath distance (Jiang and Conrath, 1997):

dist_{JC}(c_1, c_2) = 2 \log P(lcs(c_1, c_2)) - (\log P(c_1) + \log P(c_2))   (19)
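Most of these measures are available off the shelf in NLTK's WordNet interface; a short sketch (assuming the wordnet and wordnet_ic data packages are installed; note that NLTK's lch adds depth scaling to (15), and its jcn returns the inverse of distance (19)):

    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
    brown_ic = wordnet_ic.ic('ic-brown.dat')  # corpus counts for P(c)

    print(dog.path_similarity(cat))           # inverse path length
    print(dog.lch_similarity(cat))            # Leacock-Chodorow, cf. (15)
    print(dog.wup_similarity(cat))            # Wu-Palmer
    print(dog.res_similarity(cat, brown_ic))  # Resnik (16)
    print(dog.lin_similarity(cat, brown_ic))  # Lin (18)
    print(dog.jcn_similarity(cat, brown_ic))  # inverse Jiang-Conrath, cf. (19)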
Measures

So, which measure is the best?
- There is no a priori best measure (just as there is no machine learning method that always performs best: the so-called no-free-lunch theorem).
- Different applications may require different measures.
Measures

L. Lee. Measures of Distributional Similarity. In Proceedings of the 37th ACL, 1999.
- DATA: verb-object co-occurrence pairs in the 1988 Associated Press newswire (the 1000 most frequent nouns); various distributional measures (cosine, Euclidean, and others).
- GOAL: improving probability estimation for unseen co-occurrences: "replaced each noun-verb pair (n, v1) with a noun-verb-verb triple (n, v1, v2) such that P(v2) ≈ P(v1). The task for the language model under evaluation was to reconstruct which of (n, v1) and (n, v2) was the original cooccurrence."
(Figure: results from Lee, 1999.)
WordNet measures

S. Katrenko et al. Using Local Alignments for Relation Recognition. JAIR, 2010.
To summarize (1)

Today, we have looked at:
- clustering methods
- unsupervised WSD methods
To summarize (2)

ToDo: read at home (if you haven't done so yet) Chapter 19 from Jurafsky.