European Seventh Framework Programme
FP7-218086-Collaborative Project
D4.15 Framework for combining user supplied
knowledge from diverse sources
© INDECT Consortium, www.indect-project.eu
The INDECT Consortium
AGH — University of Science and Technology, AGH, Poland
Gdansk University of Technology, GUT, Poland
InnoTec DATA GmbH & Co. KG, INNOTEC, Germany
IP Grenoble (Ensimag), INP, France
MSWiA — General Headquarters of Police (Polish Police), GHP, Poland
Moviquity, MOVIQUITY, Spain
Products and Systems of Information Technology, PSI, Germany
Police Service of Northern Ireland, PSNI, United Kingdom
Poznan University of Technology, PUT, Poland
Universidad Carlos III de Madrid, UC3M, Spain
Technical University of Sofia, TU-SOFIA, Bulgaria
University of Wuppertal, BUW, Germany
University of York, UoY, Great Britain
Technical University of Ostrava, VSB, Czech Republic
Technical University of Kosice, TUKE, Slovakia
X-Art Pro Division G.m.b.H., X-art, Austria
Fachhochschule Technikum Wien, FHTW, Austria
© Copyright 2013, the Members of the INDECT Consortium
Document Information
Contract Number: 218086
Deliverable Name: Framework for combining user supplied knowledge from diverse sources
Deliverable number: D4.15
Editor(s): Suresh Manandhar, University of York, [email protected]
Author(s): Suraj Jung Pandey, University of York, [email protected]
Reviewer(s):
Ethical Review: Andreas Pongratz (X-art)
Security Review: Petr Machnik (VSB)
End-users review: Sergejs Amplejevs (Latvian Police), Michael Ross (PSNI)
Scientific Review: Nick Pears (UoY)
Dissemination level: Public
Contractual date of delivery: December 2012
Delivery date: July 2013
Status: Final version
Keywords: Entity resolution, Kernel methods
This project is funded under 7th Framework Program
Contents
Document Information
1 Executive Summary
2 Introduction
2.1 Objectives
2.2 List of participants & roles
2.3 Ethical Issues
3 Method for entity linking
3.1 Model Creation
3.1.1 Feature Generation
3.1.2 Learning a model
3.2 Candidate Generation
3.2.1 Query Expansion
3.3 Candidate Ranking
4 Training and Evaluation
4.1 Experimental Setting and Datasets
5 Conclusion
List of Figures
1 Framework for combining user supplied knowledge.
2 Summary of steps required to ensure ethical use of WP4 software. Critical steps involving human intervention are shaded.
3 Conceptual architecture of the proposed method
4 Document 1 for entity "SFAX"
5 Document 2 for entity "SFAX"
6 Assigning named entity labels using Stanford's NER. The labels are either PERSON or ORGANISATION or LOCATION.
List of Tables
1 Necessary security measures for use of software within WP4 and their benefits.
2 Squared distance matrix - square and symmetric
3 Dissimilarity between words from WordNet
4 Reduced dimension. N words from Table 3 is reduced to size 5
5 Number of training and test instances.
6 Accuracy results of test entities present (positive) in KB and test entities not present (NIL) in KB.
7 Document Updates
1. Executive Summary
Security is becoming a weak point of energy and communications infrastructures, commercial stores,
conference centers, airports and sites with high person traffic in general. Practically any crowded place
is vulnerable, and the risks should be controlled and minimised as much as possible. Access control and
rapid response to potential dangers are properties that every security system for such environments
should have. The INDECT project is aiming to develop new tools and techniques that will help the
potential end users in improving their methods for crime detection and prevention thereby offering
more security to the citizens of the European Union.
In the context of the INDECT project, Work Package 4 (WP4) is responsible for the Extraction of
Information for Crime Prevention by Unstructured Data. This document (D4.15) presents a novel
method for entity resolution which is used to successfully locate and merge entities in records.
2. Introduction
The general aim of WP4 is the development of key technologies that facilitate the building of an
intelligence gathering system by combining and extending current state-of-the-art methods in
Natural Language Processing (NLP). One of the specific goals of WP4 is to propose NLP and machine
learning methods that learn relationships between people and organisations through websites and
social networks.
The task of combining user supplied knowledge essentially amounts to maintaining a database of records, or
knowledge base, which contains descriptions of different entities (persons, locations, organisations). First,
it is necessary to group the documents within the knowledge base so that each group refers to the same
unique entity; for example, separating all documents that refer to Michael Jordan the basketball
player from documents about other people named Michael Jordan. Next, the user may add or delete records in the
database. The new information can take the form of a single document or of a whole database of records.
When new information is added, the consistency of the knowledge base should be maintained:
if the entity already exists in the knowledge base, the new information should
be merged with the existing information about that entity; if the entity does not exist in the
knowledge base, a new ID for the entity should be created. Figure 1 shows
the framework for combining user supplied knowledge, with the entity disambiguation system acting as the
central control of such a system.
[Figure 1: Framework for combining user supplied knowledge. Diagram: an Entity Disambiguation System sits between the inputs (a single document or database records, each with add/delete record requests) and the Knowledge Base, which stores a description for each entity (Entity 1 to Entity 4) and is changed through add/delete/update operations.]
In this document, we describe an entity resolution (also called entity linking or entity disambiguation) method that uses relations between entities as the key feature. An entity can be any subject of interest, generally a mention of a person, place or organisation. Given a target entity mention, the article in which it appears and a knowledge base containing real-world data, the task of entity resolution is to associate the target entity with its corresponding, unique knowledge base entry, provided that it exists in the knowledge base.
Entity resolution is a highly challenging problem with a variety of applications. For instance, entity resolution is essential from a security point of view, where the goal would be to identify the different entity mentions (first names, surnames, nicknames) of persons or organisations that are the subject of an investigation. Additionally, the disambiguation of entities can help to identify different types of relations between persons, organisations, locations and other types of entities, in effect providing useful information for a variety of security-related tasks.
The disambiguation of an entity is a difficult task for two main reasons. Firstly, an entity can be mentioned in a variety of ways. For example, many organisations and people are often referred to by their initials, e.g. BA for British Airways or MJ for Michael Jackson. Secondly, and most importantly, entity mentions are ambiguous, i.e. they might refer to different entities. For example, BA might also refer to Bosnian Airlines, while MJ might also refer to Michael Jordan. In the same vein, Washington might refer to the capital of the USA, a newspaper (Washington daily), or George Washington.
The entity resolution method described in this document leverages relational information between entities as the key feature. The evaluation of our method is performed using a subset of the TAC KBP dataset [1].
2.1. Objectives
The objective of this report is to provide a thorough description of our novel method for entity
resolution and to evaluate its performance.
2.2. List of participants & roles
This report has been produced by the University of York (UoY).
2.3. Ethical Issues
Since entity resolution creates relations between different entities, its misuse may affect citizens' privacy.
Hence, care must be taken to prevent improper use of such software. For the purpose of developing our
scientific methods we have explicitly avoided any prejudicial stereotypes and have only used publicly
available data from various sources. Any use of the methodology in real situations will require careful
monitoring to ensure compliance with ethical and legal standards.
Organisations that use the methods and software described or produced within WP4 and the INDECT project in general should be aware of the security risks associated with the use of such software/methods. Software/methods produced within WP4 can be employed to automatically analyse
text documents or web pages to extract names of people, organisations, locations, dates etc., extract
their relationships and identify behavioural patterns. Such software can process massive amounts of
data continuously.
At least the following set of security measures needs to be in place for any software described or
produced within WP4: (1) the use of the software in a secure environment, (2) secure management
of data sources by encrypting the knowledge bases used, (3) a secure access policy, (4) secure storage of
access logs and (5) anonymisation of Ids (email, URL, username etc.) in the data.
The security risk associated with using software within WP4 can be minimised by following important
security protocols. A carefully designed security protocol prevents critical software, like behavioural
profiling software, from being misused and compromised. Table 1 provides a summary of some of the
benefits associated with the corresponding security measures.
Security Measures | Benefits
Use of software in secure environment | Safeguards critical application and data.
Encrypting data | Protects sensitive information in case of data misplacement. Also, access of data over network becomes secure.
Secure access policy | Includes authentication protocols which provide necessary restriction for untrusted use. Also maintains logs of software access.
Anonymisation of Ids | Protects privacy of the owner of the document.
Table 1: Necessary security measures for use of software within WP4 and their benefits.
Figure 2 shows the stepwise use of WP4 software in a secure environment. A brief explanation of each of
the steps is given below:
[Figure 2: Summary of steps required to ensure ethical use of WP4 software. Critical steps involving human intervention are shaded. The flowchart connects, through Steps 1 to 13: input data (chat to be analysed, seed suspicious websites and textual data for relation mining), Id (email, URL, username etc.) anonymisation, the WP4 software, de-anonymisation and search over the Internet, encrypted analysed content (suspicious websites, suspicious chat, relation graphs), human verification of potential data of interest, encrypted human verified data, the secure and legal authorisation checks, the encrypted access, browsing and verification logs, and user Id de-anonymisation.]
Step 1 Before supplying the software with input data, any user identification present within the data
is anonymised. This step is necessary to protect the identity of the user or website in the
subsequent steps.
Step 2 The WP4 software processes input data.
Step 3 When additional data needs to be fetched from World Wide Web relating to an anonymised user
or website, de-anonymisation is necessary. However, the de-anonymisation and search module
is separated from the WP4 software. The de-anonymisation and search is performed in a secure
location (e.g. police servers). The newly extracted content is again fed back to Step 1 for
anonymisation. The WP4 software will only have access to anonymised content providing an
additional layer of security.
Step 4 The output of the software can be suspicious websites or suspicious chat or relation graphs
depending on the type of software used. The output is stored in an encrypted storage. The
encryption of the output is necessary as it prevents any unauthorised access to processed data.
Step 5 A human expert uses a secure access mechanism to view the processed data.
Step 6 The access logs are stored in a secure encrypted storage to prevent unauthorised viewing of
access logs.
Step 7 Although the WP4 software exhibits reasonable accuracy it is necessary to have human verification to remove false positives i.e. data that has been incorrectly classified by the software as
being potential data of interest.
In this step, a human expert will analyse all processed data and identify potential data of interest.
Step 8 The human verified data is stored in a secure storage.
Step 9 All activities of the human expert are logged and these are stored in a secure storage.
Step 10 At the point of need, a legal authorisation is given to a human expert to de-anonymise some of
the data requiring serious investigation.
Step 11 The human expert gains legally authorised access and this is logged in a secure storage.
Step 12 The selected data is de-anonymised.
Step 13 The de-anonymised data is revealed to the human expert for further investigation.
The above steps for the usage of the software are needed to minimise the risk associated with unauthorised use of the software and to ensure that the use of the software is strictly within the law. Steps
5, 7, 10 and 12 are flagged as critical steps as they involve human access to sensitive data. These steps should
only be executed once appropriate legal authorisation has been obtained as indicated above.
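The anonymisation of Step 1 and the later authorised de-anonymisation of Step 12 can be illustrated with a small sketch. It is not the WP4 implementation: the regular expression, the `ID_n` pseudonym format and both function names are assumptions made for illustration, and the mapping would in practice have to be kept in the secure location mentioned in Step 3.

```python
import re

# Assumed, deliberately simple pattern for URLs and e-mail addresses.
# It is not a complete grammar for either kind of identifier.
ID_PATTERN = re.compile(r"https?://\S+|[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def anonymise(text, mapping=None):
    """Replace each identifier with a stable pseudonym (hypothetical ID_n format).

    Returns the anonymised text and the identifier-to-pseudonym mapping.
    """
    mapping = {} if mapping is None else mapping

    def substitute(match):
        original = match.group(0)
        if original not in mapping:
            mapping[original] = f"ID_{len(mapping) + 1}"
        return mapping[original]

    return ID_PATTERN.sub(substitute, text), mapping


def deanonymise(text, mapping):
    """Restore the original identifiers; only meant to run after authorisation."""
    reverse = {pseudonym: original for original, pseudonym in mapping.items()}
    return re.sub(r"ID_\d+", lambda m: reverse.get(m.group(0), m.group(0)), text)


if __name__ == "__main__":
    anonymised, table = anonymise("Contact alice@example.org or visit http://example.org/page today")
    print(anonymised)
    print(deanonymise(anonymised, table))
```

Because the same identifier always maps to the same pseudonym, relations between anonymised users remain analysable without exposing who they are.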
3. Method for entity linking
Figure 3 shows the conceptual architecture of our method for entity linking that associates a target
ambiguous entity mention to its corresponding and unique knowledge base (KB) entry.
The task of entity linking can be divided into three sub-processes: (1) Candidate Generation, (2) Candidate Ranking and (3) Entry Selection. Given a target entity mention and the article in which the target entity mention appears, the proposed method first queries Lucene (http://lucene.apache.org/, accessed 04/30/2013) in order to get a list of candidate knowledge base entries. In the next step, the Candidate Ranking module ranks the candidate entries according to the likelihood of the entry being represented by the target entity mention. Finally, according to a threshold parameter, the knowledge base entry with the highest likelihood is either selected or rejected as the link for the target entity. If the knowledge base entry with the highest likelihood is rejected, we conclude that the target entity is not present in the knowledge base.
[Figure 3: Conceptual architecture of the proposed method. During training, the articles of each knowledge base (KB) mention (Mention 1 to Mention n) are used to train one model per mention (Model 1 to Model n). At query time, the article mention is passed to the Lucene search engine for candidate generation, and the resulting candidate models (Candidate Model 1 to Candidate Model m) are ranked to select the winning model.]
In our approach, as shown in Figure 3, we create a model for each unique target entity mention in the
knowledge base. Each model is used to classify whether a new entity is the same as the one represented by
the model. In the following sections we explain each step of our process, starting with the
process of model creation.
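The final selection step of this pipeline can be sketched as follows. This is a minimal illustration, assuming that candidate ranking yields one numeric likelihood per candidate entry; the function name, the score dictionary and the threshold values are hypothetical.

```python
def select_entry(candidate_scores, threshold):
    """Return the id of the best-scoring KB entry, or None (NIL) if it is rejected.

    candidate_scores: dict mapping a KB entry id to its likelihood score (assumed input).
    threshold: minimum score required to accept the best candidate (assumed parameter).
    """
    if not candidate_scores:
        return None  # no candidates were generated, so the mention is treated as NIL
    best_id, best_score = max(candidate_scores.items(), key=lambda item: item[1])
    return best_id if best_score >= threshold else None


if __name__ == "__main__":
    scores = {"KB_entry_A": 0.2, "KB_entry_B": 0.9}
    print(select_entry(scores, threshold=0.5))   # best candidate is accepted
    print(select_entry(scores, threshold=0.95))  # best candidate is rejected
```

Returning `None` corresponds to concluding that the target entity is not in the knowledge base.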
3.1. Model Creation
Each unique entity in the knowledge base is represented by a different model. A model is created for
each entity by training a classifier based on the features extracted from documents of the corresponding
entity. Each model is then used for classification of new entities. The positive result of the classification
asserts that the new entity is the same as the one represented by the model. Model creation can be
divided into two steps, namely Feature Generation and Training.
3.1.1. Feature Generation
Sfaxien outclassed Astres Douala of Cameroon 3-0 in the
Mediterranean town of Sfax via goals from Congolese Blaise 'Lelo'
Mbele, Ivorian Blaise Kouassi and Hamza Younes after leading 1-0
at half-time.
Figure 4: Document 1 for entity "SFAX"
Sfaxien overcame Astres 3-0 in the Mediterraean city of
Sfax
Saturday through goals from Blaise 'Lelo' Mbele, Blaise
Kouassi and Hamza Younes to remain top of the
standings with 10 points, one more than Mazembe.
Figure 5: Document 2 for entity "SFAX"
In our setting, each entity is represented by its corresponding documents. For example, Figure 4 and Figure 5 are two different documents for the same entity "SFAX". An entity can be disambiguated by inspecting how it relates to other entities: if two entities are the same, they exhibit similar relations to similar entities. For example, given an entity Xa represented by 'Xa works for Z' and another entity Xb represented by 'Xb is employed at Z', the chance that Xa and Xb are the same entity is very high, as they show a similar relation to a common entity Z. Similarly, the relation 'Sfax goals Hamza Younes', found in both Figure 4 and Figure 5, increases the probability that "SFAX" in the two documents represents the same entity. It should be noted that a relation does not need to be expressed by adjoining words; as in the previous example, relations can span non-contiguous words within a sentence.
Following this intuition, feature generation in our setting begins with the extraction of triplets of the form 'Target Entity -> Relation -> Named Entity' from the given set of documents. The Target Entity is the representative entity of the document, the Named Entity is an entity mention present in the document, and the Relation is the word relation shared between the Target Entity and the Named Entity.
Sfaxien outclassed Astres Douala of Cameroon [ORGANISATION]
3-0 in the
Mediterranean[LOCATION] town of Sfax [LOCATION] via goals from
Congolese Blaise [ORGANISATION] 'Lelo'
Mbele [PERSON], Ivorian Blaise Kouassi [PERSON] and Hamza
Younes [PERSON] after leading 1-0
at half-time.
Figure 6: Assigning named entity labels using Stanford’s NER. The labels are either PERSON or
ORGANISATION or LOCATION.
Triplets Extraction
To extract the triplets from a document, the text in the document is first processed with Stanford's Named Entity Recognition (NER) tool [2]. The tagger assigns a label of 'PERSON', 'ORGANISATION' or 'LOCATION' to all the entities within the document. An example of named entity labelling can be seen in Figure 6; the labels are adjacent to the words and are within brackets '[]'. Additionally, we use Stanford's coreference tagger [3, 4, 5] to disambiguate entities within a document. Coreference resolution also replaces pronouns with their representative mention whenever possible. This step enables us to create a unique identity for multiple mentions that refer to the same entity.
Next, we extract sub-sequences from the document. A sub-sequence is a sequence that is obtained by deleting one or more terms from a given sequence while preserving the order of the original sequence. The sub-sequences are extracted under the constraint that they begin with the target entity of the document and end with a named entity. In this way we can extract all the relations that the target entity shares with other entities in the document.
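One possible reading of the sub-sequence constraint is sketched below. It is a simplification, not the report's extractor: it assumes the text has already been tagged (the `(text, label)` token format is hypothetical, with `None` for ordinary words), and it only keeps sub-sequences of length three, i.e. target entity, one intervening word, named entity.

```python
def extract_triplets(tokens, target):
    """Extract (target, relation, named_entity) triplets from tagged tokens.

    tokens: list of (text, label) pairs; label is None for ordinary words and
            'PERSON', 'ORGANISATION' or 'LOCATION' for named entities (assumed format).
    target: surface form of the target entity of the document.
    """
    triplets = []
    for start, (text, _label) in enumerate(tokens):
        if text.lower() != target.lower():
            continue
        relation_words = []  # ordinary words seen since the target mention
        for word, label in tokens[start + 1:]:
            if label is None:
                relation_words.append(word.lower())
            else:
                # Every earlier word forms an order-preserving sub-sequence
                # target ... word ... entity that ends at this named entity.
                triplets.extend((target, rel, word.lower()) for rel in relation_words)
    return triplets


if __name__ == "__main__":
    tagged = [
        ("Sfax", "LOCATION"), ("beat", None), ("Astres", "ORGANISATION"),
        ("via", None), ("goals", None), ("from", None), ("Hamza Younes", "PERSON"),
    ]
    for triplet in extract_triplets(tagged, "Sfax"):
        print(triplet)
```

Note that the relation word and the named entity do not have to be adjacent, which matches the observation that relations can span non-contiguous words.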
Feature Vector
A feature vector is created for each target entity. In each feature vector, a feature is a Relation and its value is the corresponding Named Entity. For example, if the triplets for the document in Figure 4 are extracted as:
{ Sfax->outclassed->sfaxien, Sfax->of->astres douala, Sfax->in->cameroon, Sfax->town->sfaxien, Sfax->via->blaise kouassi, Sfax->goals->blaise kouassi, Sfax->from->blaise kouassi, ... }
then the feature vector of the document with "Sfax" as target entity will be:
{ outclassed : sfaxien, of : astres douala, in : cameroon, town : sfaxien, via : blaise kouassi, goals : blaise kouassi, from : blaise kouassi, ... }
where the feature vector is of the form {feature : value}. Similarly, for the document in Figure 5 the
feature vector will be:
{ overcame : sfaxien, city : astres, city : sfaxien, through : blaise kouassi, goals : blaise kouassi, than
: mazembe, ... }
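The mapping from triplets to a {feature : value} vector can be sketched as below. The function name and the dictionary-of-lists representation are assumptions; a list is used because one relation can carry several values, as with 'city' above.

```python
def build_feature_vector(triplets):
    """Group triplets (target, relation, named_entity) into {relation: [values]}."""
    features = {}
    for _target, relation, entity in triplets:
        values = features.setdefault(relation, [])
        if entity not in values:  # keep each value once per relation
            values.append(entity)
    return features


if __name__ == "__main__":
    triplets = [
        ("Sfax", "overcame", "sfaxien"),
        ("Sfax", "city", "astres"),
        ("Sfax", "city", "sfaxien"),
        ("Sfax", "goals", "blaise kouassi"),
    ]
    print(build_feature_vector(triplets))
```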
Problems with lexical feature matching
Once the feature vectors are created, entities can be disambiguated by analysing pairs of feature vectors. If a sufficient number of lexically matching relations have lexically matching values, then the entities represented by the two vectors can be regarded as the same. Although lexical matching yields accurate results in some ideal cases, it will mostly fail, since perfect lexical matches rarely occur. The problem is compounded because semantically similar but lexically different words are not matched either. For example, in the previous feature representation, even though words like "outclassed" and "overcame" are similar in meaning, they are represented as two different features. We can overcome both problems of lexical feature matching by replacing lexical matching with similarity matching: we extract similarity scores between pairs of features and use the scores to collapse lexically different features with similar meaning into one single feature. For example, since "outclassed" and "overcame" have similar meanings, we can represent both features with a single identifier, say, "F1".
Dimensionality Reduction
To overcome the problems created by lexical matching of features, we employ dimensionality reduction
on feature set, so that similar features get collapsed to a single feature. We use Multidimensional
Scaling (MDS) for dimensionality reduction.
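The effect of collapsing similar features can be seen in a toy comparison. This sketch only illustrates the motivation: the alias table that maps "outclassed" and "overcame" to "F1" is written by hand here, whereas the report obtains the collapsed features through MDS; both function names are hypothetical.

```python
def lexical_matches(vector_a, vector_b):
    """Count relations present in both vectors that share at least one value."""
    return sum(
        1
        for relation, values in vector_a.items()
        if set(values) & set(vector_b.get(relation, []))
    )


def collapse(vector, aliases):
    """Rename relations through an alias table, merging the values of merged relations."""
    collapsed = {}
    for relation, values in vector.items():
        collapsed.setdefault(aliases.get(relation, relation), []).extend(values)
    return collapsed


if __name__ == "__main__":
    doc1 = {"outclassed": ["sfaxien"], "goals": ["blaise kouassi"]}
    doc2 = {"overcame": ["sfaxien"], "goals": ["blaise kouassi"]}
    aliases = {"outclassed": "F1", "overcame": "F1"}  # hand-written for illustration
    print(lexical_matches(doc1, doc2))
    print(lexical_matches(collapse(doc1, aliases), collapse(doc2, aliases)))
```

With purely lexical matching only the 'goals' relation agrees; after collapsing, the 'F1' relation agrees as well.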
MDS
Multidimensional scaling (MDS) [6] is the problem of finding an n-dimensional set of points, y, which are most closely consistent (under some cost function) with a measured set of dissimilarities {δij} of objects x. In other words, MDS transforms a distance matrix into a set of coordinates such that the (Euclidean) distances computed from these coordinates approximate the original distances. The basic idea is to convert the distance matrix into a cross-product matrix and to compute its eigen-decomposition, which gives a Principal Component Analysis. Dimension reduction is done by discarding the components with the smallest eigenvalues. More formally, if D is a squared distance matrix containing squared distances between points, as shown in Table 2, then the task is to find a set of points which have, under the Euclidean distance measure, the same squared distance matrix as D. Let X be the matrix of these points; then, using the Euclidean distance:
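\[
D = \begin{pmatrix}
0 & d_{12}^2 & d_{13}^2 & \cdots & d_{1n}^2 \\
d_{21}^2 & 0 & d_{23}^2 & \cdots & d_{2n}^2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
d_{n1}^2 & d_{n2}^2 & d_{n3}^2 & \cdots & 0
\end{pmatrix}
\]
Table 2: Squared distance matrix - square and symmetric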
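\[ d_{ij}^2 = (x_i - x_j)^2 \]
If
\[ X = (x_1^T \ldots x_n^T)^T \]
then
\[ d_{ij}^2 = (x_i - x_j)^2 = x_i^2 + x_j^2 - 2\, x_i \cdot x_j \]
We now introduce the kernel matrix K:
\[ K = X X^T, \qquad k_{ij} = x_i \cdot x_j \]
Thus,
\[ D_{ij} = d_{ij}^2 = k_{ii} + k_{jj} - 2 k_{ij} \]
which gives us, with I the identity matrix and J the n × n matrix of ones:
\[ K = -\frac{1}{2} \left( I - \frac{J}{n} \right) D \left( I - \frac{J}{n} \right) \]
Now, since we know K, we can find X from K as:
\[ K = X X^T \]
\[ K = U \Lambda U^T = U \Lambda^{1/2} \Lambda^{1/2} U^T \]
Thus,
\[ X = U \Lambda^{1/2} \]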
For dimensionality reduction, from eigen decomposition of K, we set all negative eigenvalues and small
positive values to zero.
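The derivation above corresponds to classical MDS and can be sketched in a few lines. This is an illustration under the assumption that NumPy is available; the function name and the example points are hypothetical, and the input is expected to be a symmetric matrix of squared distances.

```python
import numpy as np


def classical_mds(squared_distances, n_components):
    """Recover coordinates X = U * Lambda^(1/2) from a squared distance matrix D."""
    D = np.asarray(squared_distances, dtype=float)
    n = D.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n       # I - J/n
    K = -0.5 * centering @ D @ centering              # kernel (cross-product) matrix
    eigenvalues, eigenvectors = np.linalg.eigh(K)     # K is symmetric
    order = np.argsort(eigenvalues)[::-1][:n_components]  # largest eigenvalues first
    kept = np.clip(eigenvalues[order], 0.0, None)     # negative eigenvalues set to zero
    return eigenvectors[:, order] * np.sqrt(kept)


if __name__ == "__main__":
    points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
    D = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    X = classical_mds(D, n_components=2)
    recovered = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    print(np.allclose(recovered, D))
```

The recovered coordinates generally differ from the original points by a rotation, reflection or translation, but the pairwise distances are preserved.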
Using MDS to reduce feature dimension
If we assume our feature dictionary is a subset of the words in WordNet, then for N words in WordNet
we have a pairwise dissimilarity matrix R, where \( R_{ij} = 1 - sim_{WN}(word_i, word_j) \).
    | r1  | r2  | r3  | ... | rn
r1  | 0   | 0.6 | 0   | ... | 0.7
r2  | 0.3 | 0   | 0.3 | ... | 0.9
r3  | 0.9 | 0.2 | 0   | ... | 0.3
... | ... | ... | ... | ... | ...
rn  | 0.5 | 0.3 | 0.6 | ... | 0
Table 3: Dissimilarity between words from WordNet
Table 3 shows the dissimilarity between n words in WordNet. Next, we have to reduce the n dimensions
to m dimensions, where m ≪ n. After the application of MDS we represent the features in the
reduced dimensions.
    | d7 | d8 | d9 | d10 | d11
r1  | 1  | 3  | 3  | 0   | 1
r2  | 4  | 1  | 0  | 1   | 0
r3  | 7  | 0  | 5  | 0   | 1
... | .  | .  | .  | .   | .
rn  |    |    |    |     |
Table 4: Reduced dimension. The N words from Table 3 are reduced to size 5
Table 4 shows the reduced dimensions. The n-dimensional feature space has been reduced to size 5. Each
word ri is now represented as a linear combination of d7, d8, d9, d10 and d11; for example, r3 = 7d7 + 5d9 + 1d11.
Reduced feature vector
Now, the feature representation will be in the form of the reduced dimensions. Say the feature representation in N dimensions was:
For Entity 1:
< label r1 : v11 , r2 : v21 , r3 : v31 , ..., rn : vn1 >
If all values except v11 and v31 are zero, then:
< label r1 : v11 , r3 : v31 >
Similarly for Entity 2:
< label r1 : v12 , r2 : v22 >
Then the feature representation in the new reduced dimensions from Table 4 will be:
For Entity 1:
< label (d7 + 3d8 + 3d9 + d11) : v11 , (7d7 + 5d9 + d11) : v31 >
= < label d7 : (v11 , 7v31 ), d8 : (3v11 ), d9 : (3v11 , 5v31 ), d11 : (v11 , v31 ) >
Similarly for Entity 2:
< label (d7 + 3d8 + 3d9 + d11) : v12 , (4d7 + d8 + d10) : v22 >
= < label d7 : (v12 , 4v22 ), d8 : (3v12 , v22 ), d9 : (3v12 ), d10 : (v22 ), d11 : (v12 ) >
Thus, in the reduced dimensions each feature can have multiple values, and each value is weighted by
the weight of its corresponding feature.
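The re-expression of a feature vector in the reduced dimensions can be sketched as follows. The weights are copied from rows r1 to r3 of Table 4; the function name and the `(weight, value)` pair representation are assumptions made for illustration.

```python
# Rows r1 to r3 of Table 4: each original relation as a weighted sum of reduced features.
# Zero entries are omitted.
REDUCED = {
    "r1": {"d7": 1, "d8": 3, "d9": 3, "d11": 1},
    "r2": {"d7": 4, "d8": 1, "d10": 1},
    "r3": {"d7": 7, "d9": 5, "d11": 1},
}


def reduce_features(vector, reduced=REDUCED):
    """Map {relation: value} to {reduced_feature: [(weight, value), ...]}."""
    result = {}
    for relation, value in vector.items():
        for dimension, weight in reduced[relation].items():
            result.setdefault(dimension, []).append((weight, value))
    return result


if __name__ == "__main__":
    entity_1 = {"r1": "v11", "r3": "v31"}
    for dimension, weighted_values in sorted(reduce_features(entity_1).items()):
        print(dimension, weighted_values)
```

For Entity 1 this gives d7 the values v11 and v31 with weights 1 and 7, d8 the value v11 with weight 3, and so on, matching the expansion shown for Entity 1.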
3.1.2. Learning a model
Once the feature vectors have been created, we need to learn a generalised model for each target entity.
Learning is done by matching the values of the same features and then creating a generalised model. Here
too, perfect lexical matches will be rare, and semantically similar values would
be ignored if they do not match lexically. We therefore need a similarity kernel that matches two
values according to the similarity score of the pair.
Defining a kernel
Suppose we have relations for Entity 1 as: {lives->UK and stays->Fulford}. If 'lives' and 'stays' have
been reduced to a feature d7, then:
<d7:(UK, Fulford)> is the feature vector.
For Entity 2, if the relations are: {lives->UK and stays->Broadway}, then:
<d7:(UK, Broadway)> is the feature vector.
We have a function which returns a similarity value between two words, for example, sim(UK, UK)=1,
sim(Fulford, Broadway)=0.4, sim(UK, Broadway)=0.3 and sim(Broadway, UK)=0.3. In the general case we
want the kernel to return the maximum match score between the values. Thus, the kernel is defined
as:
\[ K(x, Y) = \max_{y_i \in Y} \mathrm{sim}(x, y_i) \]
\[ K(X, Y) = \frac{1}{2} \left[ \frac{\sum_{x \in X} K(x, Y)}{|X|} + \frac{\sum_{y \in Y} K(X, y)}{|Y|} \right] \]
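The kernel can be evaluated on the UK/Fulford/Broadway example with the sketch below. The similarity values for (UK, Broadway) and (Fulford, Broadway) are those given in the text; sim(UK, Fulford) is not given and is assumed here to be 0.3. The function names and the lookup-table similarity are hypothetical stand-ins for a real similarity function.

```python
# Pairwise similarities. The first two are from the text, the third is an assumption.
SIMILARITY = {
    frozenset(("UK", "Broadway")): 0.3,
    frozenset(("Fulford", "Broadway")): 0.4,
    frozenset(("UK", "Fulford")): 0.3,  # assumed value, not given in the text
}


def sim(a, b):
    """Symmetric similarity lookup; identical values have similarity 1."""
    return 1.0 if a == b else SIMILARITY.get(frozenset((a, b)), 0.0)


def value_set_kernel(x, Y, sim):
    """K(x, Y): best match of value x against the set of values Y."""
    return max(sim(x, y) for y in Y)


def set_kernel(X, Y, sim):
    """K(X, Y): average best-match score, symmetrised over both directions."""
    forward = sum(value_set_kernel(x, Y, sim) for x in X) / len(X)
    backward = sum(value_set_kernel(y, X, sim) for y in Y) / len(Y)
    return 0.5 * (forward + backward)


if __name__ == "__main__":
    entity_1 = ["UK", "Fulford"]
    entity_2 = ["UK", "Broadway"]
    print(set_kernel(entity_1, entity_2, sim))  # about 0.7 under the assumed similarities
```

Most of the score comes from the exact (UK, UK) match, even though the more specific places differ.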
Intuitively the kernel defined above seems correct, but it has a major drawback.
In the example with UK, Fulford and Broadway there is always a bias caused by
matching of (UK, UK) pairs, as UK is a common and general place. Ideally, we want the kernel to
reflect the matching of less common places. The intuition is that two entities can both live in the
UK and still be different if they live in different places within the UK. Thus, we need to weight our
feature values with regard to their popularity, in the sense that less popular places are weighted higher.
Values Weighting
The desired type of weighting can be achieved by Term Frequency-Inverse Document Frequency (TF-IDF) weighting. To calculate the tf-idf score we first need to generate a corpus. For each feature, we
generate a separate document and calculate tf-idf scores over these documents. The intuition is that the same value can be shared by
different features and may have varied importance between features. For a feature, say R, a document
is the collection of all values of R from our collection of original documents. We should note that the
feature R is expressed in terms of the reduced dimensions. The corpus is then the collection of the documents of all
features. The tf-idf score can then be calculated as:
\[ tf(t, d) = \frac{f(t, d)}{\max\{ f(w, d) : w \in d \}} \]
\[ idf(t, D) = \log \frac{|D|}{|\{ d \in D : t \in d \}|} \]
\[ tfidf(t, d, D) = tf(t, d) \cdot idf(t, D) \]
where f(t, d) is the frequency of t in d. A high tf-idf score shows that the value has higher importance for a
feature when distinguishing entities. 'UK' will have a lower tf-idf score: even though its frequency might be
high within a document, it occurs across many documents, which lowers its idf value.
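The three formulas can be sketched directly in code. This is a minimal illustration: each document is assumed to be the list of values collected for one feature, the toy corpus is invented, and a term is assumed to occur in at least one document so that the idf denominator is not zero.

```python
import math
from collections import Counter


def tf(term, document):
    """Frequency of the term, normalised by the most frequent term in the document."""
    counts = Counter(document)
    return counts[term] / max(counts.values())


def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / containing)  # assumes containing > 0


def tfidf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)


if __name__ == "__main__":
    # One document per (reduced) feature; the values are invented for illustration.
    corpus = [
        ["uk", "uk", "fulford"],
        ["uk", "broadway"],
        ["uk", "york"],
    ]
    print(tfidf("uk", corpus[0], corpus))
    print(tfidf("fulford", corpus[0], corpus))
```

In this toy corpus 'uk' occurs in every document, so its idf and therefore its tf-idf weight is zero, while the rarer 'fulford' receives a positive weight.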
Redefining the kernel
With the weights of values now calculated, the kernel can now be readjusted as:
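\[ K(x, Y) = \max_{y_i \in Y} \left\{ \frac{W_x + W_{y_i}}{2} \, \mathrm{sim}(x, y_i) \right\} \]
\[ K(X, Y) = \frac{1}{2} \left[ \frac{\sum_{x \in X} K(x, Y)}{|X|} + \frac{\sum_{y \in Y} K(X, y)}{|Y|} \right] \]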
Where, Wx and Wy are tfidf weights of x and y respectively.
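The weighted kernel can be sketched in the same way as the unweighted one. The tf-idf weights below are invented to mimic a common value ('UK') and two rarer ones, and sim(UK, Fulford) = 0.3 is again an assumption; the function names are hypothetical.

```python
# Similarities: (UK, Broadway) and (Fulford, Broadway) are from the text,
# (UK, Fulford) is an assumed value.
SIMILARITY = {
    frozenset(("UK", "Broadway")): 0.3,
    frozenset(("Fulford", "Broadway")): 0.4,
    frozenset(("UK", "Fulford")): 0.3,
}

# Hypothetical tf-idf weights: the common value 'UK' is weighted low.
WEIGHT = {"UK": 0.1, "Fulford": 0.9, "Broadway": 0.9}


def sim(a, b):
    return 1.0 if a == b else SIMILARITY.get(frozenset((a, b)), 0.0)


def weighted_value_set_kernel(x, Y, sim, weight):
    """K(x, Y): best match of x in Y, scaled by the mean tf-idf weight of the pair."""
    return max(0.5 * (weight[x] + weight[y]) * sim(x, y) for y in Y)


def weighted_set_kernel(X, Y, sim, weight):
    forward = sum(weighted_value_set_kernel(x, Y, sim, weight) for x in X) / len(X)
    backward = sum(weighted_value_set_kernel(y, X, sim, weight) for y in Y) / len(Y)
    return 0.5 * (forward + backward)


if __name__ == "__main__":
    same_place = weighted_set_kernel(["UK", "Fulford"], ["UK", "Fulford"], sim, WEIGHT)
    other_place = weighted_set_kernel(["UK", "Fulford"], ["UK", "Broadway"], sim, WEIGHT)
    print(same_place, other_place)
```

With these weights the shared 'UK' value contributes little, so agreement on the less common place dominates the score.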
3.2. Candidate Generation
Once we have created a model for each entity, to disambiguate a new entity we need a list of
models that could potentially represent the new entity mention. Since
each model is associated with a set of documents (the documents on which the model was trained), we can
first generate a list of documents and then map the documents to their respective models. Given that searching
for the entity mention in the text of all the KB articles is a time-consuming task, we use Lucene to
index these articles in order to speed up the search process. All relations and named entities extracted
during feature extraction from each document in the KB are indexed using Lucene. Any words not extracted
as relations or entities for a given target entity during feature extraction are not used for indexing.
3.2.1. Query Expansion
The input query to the Lucene search engine is the input entity mention. Although the input entity
mention is an important query for finding related articles, it may not be sufficient on its own to
achieve high recall. For example, if the given entity is BA, we will not retrieve articles that mention
British Airways but not BA.
The article accompanying the input entity will contain other entities that have an important relation
to the input entity. For example, the article for BA might contain the entity British Airways or
Waterside (the location of British Airways). Including these entities in the query can increase recall:
both British Airways and Waterside are highly likely to retrieve an article that mentions British
Airways. Thus, in addition to the input entity, we augment the query with all the entity mentions
(persons, organisations and locations) in the accompanying document that show a direct relation with
the given target entity (i.e. entities extracted with a sub-sequence).
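The query augmentation described above can be sketched in a few lines. The function name `build_query` and the example entity list are hypothetical; the sketch assumes that the directly related entities have already been extracted from the accompanying document.

```python
def build_query(mention, related_entities):
    """Return the target mention followed by its directly related entities, without duplicates."""
    terms = [mention]
    for entity in related_entities:
        if entity not in terms:
            terms.append(entity)
    return terms

# Entities assumed to have been extracted from the document that accompanies 'BA'.
query = build_query("BA", ["British Airways", "Waterside", "BA"])
print(query)  # ['BA', 'British Airways', 'Waterside']
```

Keeping the original mention first makes it easy to weight it differently from the added entities later, should that prove useful.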
Specifically, the input to the Lucene search engine is the augmented query set for the target entity,
and the output is a list of documents, L = D_0, ..., D_n, ranked by their relevance to the input query.
The score of KB article D_i for entity mention E is the cosine similarity between the document and
query vectors. The features of these vectors are the stemmed words occurring in the document and the
query, weighted by Term Frequency/Inverse Document Frequency (TF.IDF). We only consider the top 10
models in the next step; that is, from all the documents retrieved through Lucene, we extract only the
10 highest-ranked representative models.
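The retrieval step can be approximated without Lucene by a small pure-Python sketch: documents are scored against the query by cosine similarity over TF-IDF vectors, and the ranked documents are then mapped to at most k distinct models. This is an illustration only and does not reproduce Lucene's actual scoring; the idf variant, the pre-tokenised (unstemmed) inputs, the identifiers and the toy data are all assumptions.

```python
import math
from collections import Counter

def idf_table(docs):
    """idf(t) = log(N / df(t)) over the indexed documents (assumed variant)."""
    n = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    return {t: math.log(n / c) for t, c in df.items()}

def vectorize(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in the index get weight 0."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_models(query_tokens, docs, doc_to_model, k=10):
    """Rank matching documents by cosine similarity, then keep the first k distinct models."""
    idf = idf_table(docs)
    q = vectorize(query_tokens, idf)
    scores = {d: cosine(q, vectorize(toks, idf)) for d, toks in docs.items()}
    # Documents with no overlap with the query are not returned at all.
    ranked = sorted((d for d in scores if scores[d] > 0), key=scores.get, reverse=True)
    models = []
    for d in ranked:
        model = doc_to_model[d]
        if model not in models:
            models.append(model)
    return models[:k]

# Hypothetical index: document id -> tokens, and document id -> model it was used to train.
docs = {
    "d1": ["british", "airways", "waterside"],
    "d2": ["bachelor", "arts", "degree"],
    "d3": ["british", "airways", "heathrow"],
}
doc_to_model = {"d1": "British_Airways", "d2": "Bachelor_of_Arts", "d3": "British_Airways"}
print(top_models(["ba", "british", "airways", "waterside"], docs, doc_to_model))
```

Because several documents can belong to the same model, the list of candidate models is usually shorter than the list of retrieved documents.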
3.3. Candidate Ranking
The aim at this stage is to rank each model according to the score it provides for the classification
of the new entity. Since we use binary classification, the score takes the form of 'true' or 'false'.
For a query to be correctly identified, it should be classified as 'true' by its representative model
and as 'false' by the remaining nine models. We use LIBSVM [7] for both training and classification.
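The correctness criterion can be written down as a short check. The sketch assumes that the binary outputs of the candidate models are already available as a dictionary (the call to the classifier itself is omitted), and it also covers the case of an entity that has no model, which should be rejected by every candidate; the function name and the model identifiers are hypothetical.

```python
def is_correct(predictions, gold):
    """predictions: {model_id: bool}, one binary decision per candidate model.
    gold: id of the representative model, or None if the entity has no model."""
    accepted = {m for m, label in predictions.items() if label}
    if gold is None:
        return not accepted        # every candidate must answer 'false'
    return accepted == {gold}      # only the representative model answers 'true'

# Hypothetical outputs of three candidate models for one query.
predictions = {"model_a": True, "model_b": False, "model_c": False}
print(is_correct(predictions, "model_a"))  # True
print(is_correct(predictions, "model_b"))  # False
print(is_correct(predictions, None))       # False
```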
4. Training and Evaluation
4.1. Experimental Setting and Datasets
For the evaluation, we created our own training and testing data from the TAC KBP test dataset [1]. The
dataset consists of 3904 entity mentions, of which 560 are distinct entities. We train each model with
5 training examples. Thus, to perform at least 3-fold cross-validation, we extracted only entities
that have at least 15 representative documents. In total, we extracted 37 entities with 880
representative documents. For each entity, 3 folds of training and testing data were created, with
5 training examples in each fold. As negative training examples for an entity, we use all the training
documents of the remaining entities. The total number of instances used for training and testing is
shown in Table 5.
                  For individual entity      For total entities
Type of entity    Training     Testing       Training     Testing
Positive          5            >=10          185          695
NIL               -            -             -            129
Table 5: Number of training and test instances.
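One way to build folds that is consistent with the counts in Table 5 is sketched below. It assumes that each fold trains on a disjoint block of 5 documents and tests on all remaining documents of the entity; the report does not spell out the exact split, so this is an assumption, and `make_folds` is a hypothetical helper.

```python
def make_folds(doc_ids, n_folds=3, n_train=5):
    """Return (train, test) pairs: each fold trains on a disjoint block of
    n_train documents and tests on the entity's remaining documents."""
    if len(doc_ids) < n_folds * n_train:
        raise ValueError("entity needs at least n_folds * n_train documents")
    folds = []
    for i in range(n_folds):
        start, end = i * n_train, (i + 1) * n_train
        folds.append((doc_ids[start:end], doc_ids[:start] + doc_ids[end:]))
    return folds

# An entity with the minimum of 15 representative documents.
docs = [f"doc{i}" for i in range(15)]
for train, test in make_folds(docs):
    print(len(train), len(test))  # 5 10

# Totals over all 37 entities with 880 documents, per fold.
print(37 * 5, 880 - 37 * 5)  # 185 695
```

Under this assumption the per-fold totals are 37 x 5 = 185 positive training documents and 880 - 185 = 695 positive test documents, which matches Table 5.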
After following the pipeline explained in the previous sections, we obtained the results shown in Table 6.

Fold   Type of test entity   Number of test entities   Number correctly classified   Accuracy
1      Positive              695                       462                           66.47%
1      NIL                   129                       81                            62.79%
2      Positive              695                       501                           72.08%
2      NIL                   129                       88                            68.21%
3      Positive              695                       481                           69.20%
3      NIL                   129                       68                            52.71%
Table 6: Accuracy results for test entities present in the KB (positive) and test entities not present
in the KB (NIL).

In Table 6, positive test entities are the entities for which we have built a model. Note that the test
entities were not used during training of the models. Across all three folds, positive entities were
assigned to their representative entities with an average accuracy of 69.25%. NIL entities are entities
not present in the knowledge base. A NIL entity is correctly classified if all the selected models
classify the entity as 'false'. This is an important evaluation, as the system should be able to detect
entities not present in the knowledge base. The system clearly depends on the number of training
examples used. Since examples of NIL entities were never seen during training, the resulting
classification accuracy is low (61% on average), which shows how strongly entity classification depends
on the training examples. Additionally, the similarity kernel returns zero when matching abbreviated
names. For example, even though 'Bill Clinton' and 'B. Clinton' refer to the same person, the WordNet
similarity between the two phrases is zero. We have tackled this problem by using co-reference
resolution, but because of the low accuracy of the co-reference resolution tool the problem persists.
In future, the system could be tested with different similarity measures.
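The averages quoted above can be recomputed from the counts in Table 6 with a short check. The fold-3 positive count of 481 is an inference from the reported 69.20% of 695 test entities and should be treated as an assumption. Small differences from the quoted averages arise because the per-fold percentages in the table appear to be truncated rather than rounded.

```python
# (correctly classified, total test entities) per fold, from Table 6.
positive = [(462, 695), (501, 695), (481, 695)]  # 481 inferred from 69.20% of 695
nil = [(81, 129), (88, 129), (68, 129)]

def mean_accuracy(results):
    """Mean of the per-fold accuracies, in percent."""
    return 100.0 * sum(correct / total for correct, total in results) / len(results)

print(round(mean_accuracy(positive), 2))  # close to the reported 69.25%
print(round(mean_accuracy(nil), 2))       # close to the reported 61%
```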
5. Conclusion
This report presents a method for entity disambiguation. Entities can be disambiguated by observing
the relations they show with other entities (Person, Location and Organisation). Thus, relations
between entities are used as features, and the related entities are used as the values of those features.
Results using relations as features are encouraging, although accuracy depends on the number of
training instances used. To further improve the system, a mechanism for handling NIL entities needs to
be incorporated.
References
[1] P. McNamee and H. T. Dang, “Overview of the TAC 2009 knowledge base population track,” in
Proceedings of the 2009 Text Analysis Conference. National Institute of Standards and Technology,
Nov. 2009.
[2] J. R. Finkel, T. Grenager, and C. Manning, “Incorporating non-local information into information
extraction systems by Gibbs sampling,” in Proceedings of the 43rd Annual Meeting on Association for
Computational Linguistics, ser. ACL ’05. Stroudsburg, PA, USA: Association for Computational
Linguistics, 2005, pp. 363–370. [Online]. Available: http://dx.doi.org/10.3115/1219840.1219885
[3] H. Lee, A. Chang, Y. Peirsman, N. Chambers, M. Surdeanu, and D. Jurafsky, “Deterministic
coreference resolution based on entity-centric, precision-ranked rules,” Computational Linguistics,
vol. 39, no. 4, 2013.
[4] H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky, “Stanford’s multi-pass
sieve coreference resolution system at the CoNLL-2011 shared task,” in Proceedings of the CoNLL-2011
Shared Task, 2011.
[5] K. Raghunathan, H. Lee, S. Rangarajan, N. Chambers, M. Surdeanu, D. Jurafsky, and C. Manning,
“A multi-pass sieve for coreference resolution,” in EMNLP-2010, Boston, USA, 2010.
[6] F. Wickelmaier, “An introduction to MDS,” Reports from the Sound Quality Research Unit (SQRU),
No. 7, 2003.
[7] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions
on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011, software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Document Updates
Table 7: Document Updates
Version    Date         Updates and Revision History                              Author
20130325   25/03/2013   Introduction                                              Suraj Jung Pandey
20130326   26/03/2013   Description of method                                     Suraj Jung Pandey
20130327   27/03/2013   Description of method                                     Suraj Jung Pandey
20130425   25/04/2013   Various corrections                                       Suraj Jung Pandey
20130426   26/04/2013   Additional material for MDS                               Suraj Jung Pandey
20130506   06/05/2013   Evaluation and conclusion                                 Suraj Jung Pandey
20130508   08/05/2013   Evaluation and conclusion                                 Suraj Jung Pandey
20130508   08/05/2013   Changes as suggested by Suresh Manandhar                  Suraj Jung Pandey
20130530   30/05/2013   Changes as suggested by Ethical and Security Reviewers    Suraj Jung Pandey
20130702   02/07/2013   Updated list of reviewers                                 Suraj Jung Pandey
