D4.15 Framework for combining user supplied knowledge from diverse sources
European Seventh Framework Programme
FP7-218086-Collaborative Project

© INDECT Consortium, www.indect-project.eu

The INDECT Consortium

AGH University of Science and Technology, AGH, Poland
Gdansk University of Technology, GUT, Poland
InnoTec DATA GmbH & Co. KG, INNOTEC, Germany
IP Grenoble (Ensimag), INP, France
MSWiA, General Headquarters of Police (Polish Police), GHP, Poland
Moviquity, MOVIQUITY, Spain
Products and Systems of Information Technology, PSI, Germany
Police Service of Northern Ireland, PSNI, United Kingdom
Poznan University of Technology, PUT, Poland
Universidad Carlos III de Madrid, UC3M, Spain
Technical University of Sofia, TU-SOFIA, Bulgaria
University of Wuppertal, BUW, Germany
University of York, UoY, Great Britain
Technical University of Ostrava, VSB, Czech Republic
Technical University of Kosice, TUKE, Slovakia
X-Art Pro Division G.m.b.H., X-art, Austria
Fachhochschule Technikum Wien, FHTW, Austria

© Copyright 2013, the Members of the INDECT Consortium

Document Information

Contract Number: 218086
Deliverable Name: Framework for combining user supplied knowledge from diverse sources
Deliverable number: D4.15
Editor(s): Suresh Manandhar, University of York, [email protected]
Author(s): Suraj Jung Pandey, University of York, [email protected]
Reviewer(s):
  Ethical Review: Andreas Pongratz (X-art)
  Security Review: Petr Machnik (VSB)
  End-users review: Sergejs Amplejevs (Latvian Police), Michael Ross (PSNI)
  Scientific Review: Nick Pears (UoY)
Dissemination level: Public
Contractual date of delivery: December 2012
Delivery date: July 2013
Status: Final version
Keywords: Entity resolution, Kernel methods

This project is funded under 7th Framework Program
Contents

Document Information
1 Executive Summary
2 Introduction
  2.1 Objectives
  2.2 List of participants & roles
  2.3 Ethical Issues
3 Method for entity linking
  3.1 Model Creation
    3.1.1 Feature Generation
    3.1.2 Learning a model
  3.2 Candidate Generation
    3.2.1 Query Expansion
  3.3 Candidate Ranking
4 Training and Evaluation
  4.1 Experimental Setting and Datasets
5 Conclusion

List of Figures

1 Framework for combining user supplied knowledge.
2 Summary of steps required to ensure ethical use of WP4 software. Critical steps involving human intervention are shaded.
3 Conceptual architecture of the proposed method
4 Document 1 for entity "SFAX"
5 Document 2 for entity "SFAX"
6 Assigning named entity labels using Stanford's NER. The labels are either PERSON or ORGANISATION or LOCATION.

List of Tables

1 Necessary security measures for use of software within WP4 and their benefits.
2 Squared distance matrix - square and symmetric
3 Dissimilarity between words from WordNet
4 Reduced dimension. N words from Table 3 is reduced to size 5
5 Number of training and test instances.
6 Accuracy results of test entities present (positive) in KB and test entities not present (NIL) in KB.
7 Document Updates

1. Executive Summary

Security is becoming a weak point of energy and communications infrastructures, commercial stores, conference centers, airports and sites with high person traffic in general. Practically any crowded place is vulnerable, and the risks should be controlled and minimised as much as possible. Access control and rapid response to potential dangers are properties that every security system for such environments should have.
The INDECT project aims to develop new tools and techniques that will help potential end users improve their methods for crime detection and prevention, thereby offering more security to the citizens of the European Union. In the context of the INDECT project, Work Package 4 (WP4) is responsible for the Extraction of Information for Crime Prevention by Unstructured Data. This document (D4.15) presents a novel method for entity resolution, which is used to locate and merge entities in records.

2. Introduction

The general aim of WP4 is the development of key technologies that facilitate the building of an intelligence gathering system by combining and extending the current state-of-the-art methods in Natural Language Processing (NLP). One of the specific goals of WP4 is to propose NLP and machine learning methods that learn relationships between people and organisations through websites and social networks.

The task of combining user supplied knowledge essentially amounts to maintaining a database of records, or knowledge base, which contains descriptions of different entities (person, location, organisation). First, it is necessary to group the documents within the knowledge base so that each group refers to the same unique entity. For example, all documents referring to Michael Jordan the basketball player must be separated from the documents of other people named Michael Jordan. Next, the user would add or delete records in the database. The new information could be in the form of a single document or could itself be a database of records. During the addition of new information, the consistency of the knowledge base should be maintained: if the entity already exists in the knowledge base, then the new information about the entity should be merged with the existing information about that entity, and if the new entity does not exist in the knowledge base, then a new ID for the entity should be created in the knowledge base. Figure 1 shows the framework for combining user supplied knowledge, with 'entity disambiguation' working as the central control of such a system.

[Figure 1: Framework for combining user supplied knowledge. Single documents and database records are passed, through add/delete operations, to the entity disambiguation system, which adds, deletes or updates the entity descriptions (Entity 1 to Entity 4) held in the knowledge base.]

In this document, we describe an entity resolution/linking/disambiguation method that uses the relations between entities as the key feature. An entity can be any subject of interest, generally mentions of persons, places or organisations. Given a target entity mention, its corresponding article and a knowledge base containing real-world data, the task of entity resolution is to associate the target entity with its corresponding and unique knowledge base entry, provided that it exists in the knowledge base.

Entity resolution is a highly challenging problem with a variety of applications. For instance, entity resolution is essential from a security point of view, where the target would be to identify the different entity mentions (first names, surnames, nicknames) of persons or organisations that are the subject of an investigation. Additionally, the disambiguation of entities could be beneficial for identifying different types of relations between persons, organisations, locations and other types of entities, in effect providing useful information for a variety of security-related tasks.
The disambiguation of an entity is a difficult task for two main reasons. Firstly, an entity can be mentioned in a variety of ways. For example, many organisations and people are often referred to by their initials, e.g. BA for British Airways or MJ for Michael Jackson. Secondly, and most importantly, entity mentions are ambiguous, i.e. they might refer to different entities. For example, BA might also refer to Bosnian Airlines, while MJ might also refer to Michael Jordan. In the same vein, Washington might refer to the capital of the USA, a newspaper (Washington daily), or George Washington.

The entity resolution method described in this document leverages relational information between entities as the key feature. The evaluation of our method is performed using a subset of the TAC KBP dataset [1].

2.1. Objectives

The objective of this report is to provide a thorough description of our novel method for entity resolution and to evaluate its performance.

2.2. List of participants & roles

This report has been produced by the University of York (UoY).

2.3. Ethical Issues

Since entity resolution creates relations between different entities, its misuse may affect citizens' privacy. Hence, care must be taken to prevent improper use of such software. For the purpose of developing our scientific methods we have explicitly avoided any prejudicial stereotypes and have only used publicly available data from various sources. Any use of the methodology in real situations will require careful monitoring to ensure compliance with ethical and legal standards. Organisations that use the methods and software described or produced within WP4, and the INDECT project in general, should be aware of the security risks associated with the use of such software/methods.
Software/methods produced within WP4 can be employed to automatically analyse text documents or web pages to extract names of people, organisations, locations, dates etc., extract their relationships and identify behavioural patterns. Such software can process massive amounts of data continuously. At least the following set of security measures needs to be in place for any software described or produced within WP4: (1) use of the software in a secure environment, (2) secure management of data sources by encrypting the knowledge bases used, (3) a secure access policy, (4) secure storage of access logs and (5) anonymisation of Ids (email, URL, username etc.) in the data.

The security risk associated with using software within WP4 can be minimised by following important security protocols. A carefully designed security protocol prevents critical software, like behavioural profiling software, from being misused and compromised. Table 1 provides a summary of some of the benefits associated with the corresponding security measures.

| Security Measures | Benefits |
|---|---|
| Use of software in secure environment | Safeguards critical application and data. |
| Encrypting data | Protects sensitive information in case of data misplacement. Also, access to data over the network becomes secure. |
| Secure access policy | Includes authentication protocols which provide the necessary restriction against untrusted use. Also maintains logs of software access. |
| Anonymisation of Ids | Protects the privacy of the owner of the document. |

Table 1: Necessary security measures for use of software within WP4 and their benefits.

Figure 2 shows the stepwise use of WP4 software in a secure environment.
A brief explanation of each of the steps is given below.

[Figure 2: Summary of steps required to ensure ethical use of WP4 software. Critical steps involving human intervention are shaded.]

Step 1: Before supplying the software with input data, any user identification present within the data is anonymised. This step is necessary to protect the identity of the user or website in the subsequent steps.

Step 2: The WP4 software processes the input data.

Step 3: When additional data relating to an anonymised user or website needs to be fetched from the World Wide Web, de-anonymisation is necessary. However, the de-anonymisation and search module is separated from the WP4 software. The de-anonymisation and search is performed in a secure location (e.g. police servers). The newly extracted content is fed back to Step 1 for anonymisation. The WP4 software only has access to anonymised content, which provides an additional layer of security.

Step 4: The output of the software can be suspicious websites, suspicious chat or relation graphs, depending on the type of software used. The output is stored in encrypted storage. The encryption of the output is necessary as it prevents any unauthorised access to processed data.

Step 5: A human expert uses a secure access mechanism to view the processed data.

Step 6: The access logs are stored in secure encrypted storage to prevent unauthorised viewing of access logs.

Step 7: Although the WP4 software exhibits reasonable accuracy, human verification is necessary to remove false positives, i.e. data that has been incorrectly classified by the software as potential data of interest. In this step, a human expert analyses all processed data and identifies potential data of interest.

Step 8: The human verified data is stored in secure storage.

Step 9: All activities of the human expert are logged and the logs are stored in secure storage.

Step 10: At the point of need, legal authorisation is given to a human expert to de-anonymise some of the data requiring serious investigation.

Step 11: The human expert gains legally authorised access and this is logged in secure storage.

Step 12: The selected data is de-anonymised.

Step 13: The de-anonymised data is revealed to the human expert for further investigation.

The above steps for the usage of the software are needed to minimise the risk associated with unauthorised use of the software and to ensure that the use of the software is strictly within the law. Steps 5, 7, 10 and 12 are flagged as critical steps as they involve human access to sensitive data. These steps should only be executed once appropriate legal authorisation has been obtained as indicated above.

3. Method for entity linking

Figure 3 shows the conceptual architecture of our method for entity linking, which associates a target ambiguous entity mention with its corresponding and unique knowledge base (KB) entry.
The task of entity linking can be divided into three sub-processes: (1) Candidate Generation, (2) Candidate Ranking and (3) Entry Selection. Given a target entity mention and a corresponding article in which the target entity mention appears, the proposed method first queries Lucene (http://lucene.apache.org/, accessed 04/30/2013) in order to get a list of candidate knowledge base entries. In the next step, the Candidate Ranking module ranks the candidate entries according to the likelihood of the entry being represented by the target entity mention. Finally, according to a threshold parameter, the knowledge base entry with the highest likelihood is either selected or rejected for linking with the target entity. If the knowledge base entry with the highest likelihood is rejected, then we conclude that the target entity is not mentioned in the knowledge base.

[Figure 3: Conceptual architecture of the proposed method. For each entity mention in the knowledge base (Mention 1 to Mention n), the associated articles are used to train a model (Model 1 to Model n). For a new article mention, candidate generation through the Lucene search engine yields candidate models (1 to m), which are ranked to give the winning model.]

In our approach, as shown in Figure 3, we create a model for each unique target entity mention in the knowledge base. Models are used to classify whether a new entity is the same as the one represented by the model. In the following sections we explain each step of our process. We start by explaining the process of model creation.

3.1. Model Creation

Each unique entity in the knowledge base is represented by a different model.
A model is created for each entity by training a classifier on the features extracted from the documents of the corresponding entity. Each model is then used for the classification of new entities. A positive classification result asserts that the new entity is the same as the one represented by the model. Model creation can be divided into two steps, namely Feature Generation and Training.

3.1.1. Feature Generation

    Sfaxien outclassed Astres Douala of Cameroon 3-0 in the Mediterranean town of Sfax via goals from Congolese Blaise 'Lelo' Mbele, Ivorian Blaise Kouassi and Hamza Younes after leading 1-0 at half-time.

Figure 4: Document 1 for entity "SFAX"

    Sfaxien overcame Astres 3-0 in the Mediterraean city of Sfax Saturday through goals from Blaise 'Lelo' Mbele, Blaise Kouassi and Hamza Younes to remain top of the standings with 10 points, one more than Mazembe.

Figure 5: Document 2 for entity "SFAX"

In our setting, each entity is represented by its corresponding documents. For example, Figure 4 and Figure 5 are two different documents of the same entity "SFAX". An entity can be disambiguated by inspecting how it relates to other entities. If two entities are the same, then they exhibit similar relations to similar entities. For example, given an entity Xa represented by 'Xa works for Z' and another entity Xb represented by 'Xb is employed at Z', the chances that Xa and Xb are the same entity are very high, as they show a similar relation to a common entity Z. Similarly, the relation 'Sfax goals Hamza Younes' in Figure 4 and the relation 'Sfax goals Hamza Younes' in Figure 5 increase the probability that "SFAX" in the two different documents represents the same entity.
It should be noted that the relation does not need to be represented by adjoining words; rather, as in the previous example, relations can span non-contiguous words within a sentence.

Following the above intuition, feature generation in our setting begins with the extraction of triplets of the form 'Target Entity -> Relation -> Named Entity' from the given sets of documents. The Target Entity is the representative entity of the document, the Named Entity is an entity mention present in the document and the Relation is the word relation shared between the Target Entity and the Named Entity.

    Sfaxien outclassed Astres Douala of Cameroon [ORGANISATION] 3-0 in the Mediterranean [LOCATION] town of Sfax [LOCATION] via goals from Congolese Blaise [ORGANISATION] 'Lelo' Mbele [PERSON], Ivorian Blaise Kouassi [PERSON] and Hamza Younes [PERSON] after leading 1-0 at half-time.

Figure 6: Assigning named entity labels using Stanford's NER. The labels are either PERSON or ORGANISATION or LOCATION.

Triplets Extraction

To extract the triplets from a document, the text in the document is first parsed using Stanford's Named Entity Recognition (NER) tool [2]. The tool assigns a label of either 'PERSON', 'ORGANISATION' or 'LOCATION' to all the entities within the document. An example of named entity labelling can be seen in Figure 6; the labels are adjacent to the words and are within brackets '[]'. Additionally, we use Stanford's coreference tagger [3, 4, 5] to disambiguate entities within a document. Coreference disambiguation also replaces pronouns with their representative mention whenever possible. This step enables us to create a unique identity for multiple mentions referring to the same entity. Next, we extract sub-sequences from the document.
A sub-sequence is a sequence obtained by deleting one or more terms from a given sequence while preserving the order of the original sequence. The sub-sequences are extracted under the constraint that each should begin with the target entity of the document and end with a named entity. This way we can extract all the relations that the target entity shares with other entities in the document.

Feature Vector

A feature vector is created for each target entity. In each feature vector, a feature is the Relation and its value is the corresponding Named Entity. For example, if the triplets for the document in Figure 4 are extracted as:

{ Sfax->outclassed->sfaxien, Sfax->of->astres douala, Sfax->in->cameroon, Sfax->town->sfaxien, Sfax->via->blaise kouassi, Sfax->goals->blaise kouassi, Sfax->from->blaise kouassi, .. }

then the feature vector of the document with "Sfax" as target entity will be:

{ outclassed : sfaxien, of : astres douala, in : cameroon, town : sfaxien, via : blaise kouassi, goals : blaise kouassi, from : blaise kouassi, ..other }

where the feature vector is of the form {feature:value}. Similarly, for the document in Figure 5 the feature vector will be:

{ overcame : sfaxien, city : Astres, city : sfaxien, through : blaise kouassi, goals : blaise kouassi, than : mazembe, ... }

Problems with lexical feature matching

Once the feature vectors are created, entities can be disambiguated by analysing pairs of feature vectors. If a sufficient number of lexically matching relations have lexically matching values, then the entities represented by both vectors can be regarded as the same. Although lexical matching will yield accurate results in some ideal cases, it will mostly fail, since perfect lexical matching rarely occurs. Additionally, the problem is compounded because semantically similar but lexically different words are not matched.
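The limitation of lexical matching can be made concrete with a minimal Python sketch. It assumes the triplets have already been extracted (the lists below copy the two example feature vectors from the text), and the helper names `feature_vector` and `lexical_matches` are hypothetical names chosen for illustration, not part of the described system.

```python
from collections import defaultdict

def feature_vector(triplets):
    """Map each relation word to the set of named entities it links to."""
    features = defaultdict(set)
    for _target, relation, entity in triplets:
        features[relation.lower()].add(entity.lower())
    return dict(features)

def lexical_matches(fv_a, fv_b):
    """Relations found in both vectors that share at least one identical value."""
    return {rel for rel in fv_a.keys() & fv_b.keys() if fv_a[rel] & fv_b[rel]}

# Triplets copied from the two example feature vectors in the text.
doc1 = [("Sfax", "outclassed", "sfaxien"), ("Sfax", "of", "astres douala"),
        ("Sfax", "in", "cameroon"), ("Sfax", "town", "sfaxien"),
        ("Sfax", "via", "blaise kouassi"), ("Sfax", "goals", "blaise kouassi"),
        ("Sfax", "from", "blaise kouassi")]
doc2 = [("Sfax", "overcame", "sfaxien"), ("Sfax", "city", "Astres"),
        ("Sfax", "city", "sfaxien"), ("Sfax", "through", "blaise kouassi"),
        ("Sfax", "goals", "blaise kouassi"), ("Sfax", "than", "mazembe")]

fv1 = feature_vector(doc1)
fv2 = feature_vector(doc2)
matches = lexical_matches(fv1, fv2)
print(sorted(matches))  # ['goals']
```

Only one relation matches lexically, although several other pairs ("outclassed"/"overcame", "town"/"city", "via"/"through") carry the same value and a similar meaning.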
For example, in the feature vectors above, even though words like "outclassed" and "overcame" are similar in meaning, they are represented as two different features. We can overcome both problems of lexical feature matching by replacing lexical matching with similarity matching. We can extract similarity scores between pairs of features and then use the scores to collapse lexically different features with similar meaning into one single feature. For example, since "outclassed" and "overcame" have a similar meaning, we can represent both features with a single identifier, say, "F1".

Dimensionality Reduction

To overcome the problems created by lexical matching of features, we employ dimensionality reduction on the feature set, so that similar features are collapsed into a single feature. We use Multidimensional Scaling (MDS) for dimensionality reduction.

MDS

Multidimensional scaling (MDS) [6] is the problem of finding an n-dimensional set of points, y, which are most closely consistent (under some cost function) with a measured set of dissimilarities $\{\delta_{ij}\}$ of objects x. In other words, MDS transforms a distance matrix into a set of coordinates such that the (Euclidean) distances computed from these coordinates approximate the original distances. The basic idea is to convert the distance matrix into a cross-product matrix and to find its eigen-decomposition, which gives a Principal Component Analysis. Dimension reduction is done by eliminating the components with low scores.

More formally, if D is a squared distance matrix containing the squared distances between points, as shown in Table 2, then the task is to find a set of points which have, under the Euclidean distance measure, the same squared distance matrix as D.

$$D = \begin{pmatrix} 0 & d_{12}^2 & \cdots & d_{1n}^2 \\ d_{21}^2 & 0 & \cdots & d_{2n}^2 \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1}^2 & d_{n2}^2 & \cdots & 0 \end{pmatrix}$$

Table 2: Squared distance matrix - square and symmetric

Let X be the matrix of these points. Then, using the Euclidean distance:

$$d_{ij}^2 = (x_i - x_j)^2$$

If $X = (x_1^T \ldots x_n^T)^T$, then

$$d_{ij}^2 = (x_i - x_j)^2 = x_i^2 + x_j^2 - 2\, x_i \cdot x_j$$

We now introduce the kernel matrix K:

$$K = X X^T, \qquad k_{ij} = x_i \cdot x_j$$

Thus,

$$D_{ij} = d_{ij}^2 = k_{ii} + k_{jj} - 2 k_{ij}$$

which gives us

$$K = -\frac{1}{2}\left(I - \frac{J}{n}\right) D \left(I - \frac{J}{n}\right)$$

where I is the identity matrix and J is the n x n matrix of ones. Now, since we know K, we can find X from K as:

$$K = X X^T = U \Lambda U^T = U \Lambda^{1/2} \Lambda^{1/2} U^T$$

Thus,

$$X = U \Lambda^{1/2}$$

For dimensionality reduction, from the eigen-decomposition of K, we set all negative eigenvalues and small positive eigenvalues to zero.

Using MDS to reduce feature dimension

If we assume that our feature dictionary is a subset of the words in WordNet, then for N words in WordNet we have a pairwise dissimilarity matrix R, where

$$R_{ij} = 1 - \mathrm{sim}_{WN}(word_i, word_j)$$

|     | r1  | r2  | r3  | ... | rn  |
|-----|-----|-----|-----|-----|-----|
| r1  | 0   | 0.6 | 0   | ... | 0.7 |
| r2  | 0.3 | 0   | 0.3 | ... | 0.9 |
| r3  | 0.9 | 0.2 | 0   | ... | 0.3 |
| ... | ... | ... | ... | ... | ... |
| rn  | 0.5 | 0.3 | 0.6 | ... | 0   |

Table 3: Dissimilarity between words from WordNet

Table 3 shows the dissimilarity between n words in WordNet. Next, we have to reduce the n dimensions to m dimensions, where $m \ll n$. After the application of MDS, we represent the features in the reduced dimension.

|     | d7  | d8  | d9  | d10 | d11 |
|-----|-----|-----|-----|-----|-----|
| r1  | 1   | 3   | 3   | 0   | 1   |
| r2  | 4   | 1   | 0   | 1   | 0   |
| r3  | 7   | 0   | 5   | 0   | 1   |
| ... | ... | ... | ... | ... | ... |
| rn  |     |     |     |     |     |

Table 4: Reduced dimension. N words from Table 3 is reduced to size 5

Table 4 shows the reduced dimension. The n-dimensional feature space has been reduced to size 5. Now, each word ri is represented as a linear combination of d7, d8, d9, d10 and d11; for example, r3 = 7d7 + 5d9 + 1d11.

Reduced feature vector

Now, the feature representation will be in the form of the reduced dimension.
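As a hedged illustration of how a dissimilarity matrix can be turned into reduced coordinates, the sketch below follows the classical MDS derivation (double centring, eigen-decomposition, $X = U \Lambda^{1/2}$) in plain Python. Several things here are assumptions made for the example: the function names are hypothetical, power iteration with deflation stands in for a library eigen-solver, K is assumed to be close to positive semi-definite, and the input is the squared distance matrix of four corners of a 3 x 4 rectangle rather than WordNet data.

```python
import math
import random

def double_center(D):
    """K = -1/2 (I - J/n) D (I - J/n) for a squared-distance matrix D."""
    n = len(D)
    row = [sum(D[i]) / n for i in range(n)]
    col = [sum(D[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[-0.5 * (D[i][j] - row[i] - col[j] + grand) for j in range(n)]
            for i in range(n)]

def matvec(K, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(k * x for k, x in zip(row, v)) for row in K]

def top_eigenpairs(K, m, iters=500, seed=0):
    """Approximate the m dominant eigenpairs by power iteration with deflation."""
    rng = random.Random(seed)
    K = [row[:] for row in K]  # work on a copy
    n = len(K)
    pairs = []
    for _ in range(m):
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):
            w = matvec(K, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        lam = sum(a * b for a, b in zip(v, matvec(K, v)))  # Rayleigh quotient
        pairs.append((max(lam, 0.0), v))  # negative eigenvalues are set to zero
        for i in range(n):  # deflate the component just found
            for j in range(n):
                K[i][j] -= lam * v[i] * v[j]
    return pairs

def classical_mds(D, m):
    """Return n points in m dimensions whose squared distances approximate D."""
    n = len(D)
    coords = [[0.0] * m for _ in range(n)]
    for k, (lam, u) in enumerate(top_eigenpairs(double_center(D), m)):
        for i in range(n):
            coords[i][k] = math.sqrt(lam) * u[i]
    return coords

# Squared distances between the corners (0,0), (3,0), (3,4), (0,4).
D = [[0, 9, 25, 16],
     [9, 0, 16, 25],
     [25, 16, 0, 9],
     [16, 25, 9, 0]]
coords = classical_mds(D, 2)
```

In this toy case two dimensions reproduce the distances exactly, because the input really is Euclidean; for word dissimilarities such as those in Table 3, the truncation to m dimensions would only approximate the original matrix.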
Say the feature representation in N dimensions was, for Entity 1:

$$\langle \text{label};\ r_1 : v_{11},\ r_2 : v_{21},\ r_3 : v_{31},\ \ldots,\ r_n : v_{n1} \rangle$$

If all values except $v_{11}$ and $v_{31}$ are zero, this becomes:

$$\langle \text{label};\ r_1 : v_{11},\ r_3 : v_{31} \rangle$$

Similarly, for Entity 2:

$$\langle \text{label};\ r_1 : v_{12},\ r_2 : v_{22} \rangle$$

Then the feature representation in the new reduced dimension from Table 4 will be, for Entity 1:

$$\langle \text{label};\ (d_7 + 3d_8 + 3d_9 + d_{11}) : v_{11},\ (7d_7 + 5d_9 + d_{11}) : v_{31} \rangle$$
$$= \langle \text{label};\ d_7 : (v_{11}, 7v_{31}),\ d_8 : (3v_{11}),\ d_9 : (3v_{11}, 5v_{31}),\ d_{11} : (v_{11}, v_{31}) \rangle$$

Similarly, for Entity 2:

$$\langle \text{label};\ (d_7 + 3d_8 + 3d_9 + d_{11}) : v_{12},\ (4d_7 + d_8 + d_{10}) : v_{22} \rangle$$
$$= \langle \text{label};\ d_7 : (v_{12}, 4v_{22}),\ d_8 : (3v_{12}, v_{22}),\ d_9 : (3v_{12}),\ d_{10} : (v_{22}),\ d_{11} : (v_{12}) \rangle$$

Thus, in the reduced dimension each feature can have multiple values, and each value is weighted by the weight of its corresponding feature.

3.1.2. Learning a model

Once the feature vectors have been created, we need to learn a generalised model for each target entity. Learning is done by matching the values of the same features and then creating a generalised model. In this case too, perfect lexical matching will be rare, and semantically similar values will be ignored if they do not match lexically. We therefore need a similarity kernel that matches two values according to the similarity score of the pair.

Defining a kernel

Suppose we have the relations {lives->UK, stays->Fulford} for Entity 1. If 'lives' and 'stays' have been reduced to a feature d7, then <d7:(UK, Fulford)> is the feature vector. For Entity 2, if the relations are {lives->UK, stays->Broadway}, then <d7:(UK, Broadway)> is the feature vector. We have a function which returns a similarity value between two words, for example, sim(UK, UK)=1, sim(Fulford, Broadway)=0.4, sim(UK, Broadway)=0.3 and sim(Broadway, UK)=0.3.
In the general case, we want the kernel to return the maximum match score between the values. Thus, the kernel is defined as:

$$K(x, Y) = \max_{y_i \in Y} \mathrm{sim}(x, y_i)$$

$$K(X, Y) = \frac{1}{2}\left[\frac{\sum_{x \in X} K(x, Y)}{|X|} + \frac{\sum_{y \in Y} K(X, y)}{|Y|}\right]$$

where K(X, y) is defined analogously to K(x, Y), as the maximum similarity between y and any value in X.

Intuitively, the kernel defined above seems correct, but there is a major drawback associated with it. In the example of UK, Fulford and Broadway, there is always a bias caused by the matching of (UK, UK) pairs, as UK is a common and general place. Ideally, we want the kernel to reflect the matching of less common places. The intuition is that two entities can both live in the UK and still be different if they live in different places within the UK. Thus, we need to weight our feature values with regard to their popularity, in the sense that less popular places are weighted higher.

Values Weighting

The desired type of weighting can be achieved by Term Frequency-Inverse Document Frequency (TF-IDF) weighting. To calculate the tf-idf score we first need to generate a corpus. For each feature, we generate a separate document and calculate tf-idf scores. The intuition is that the same value can be shared by different features and may have varied importance between features. For a feature, say R, a document is the collection of all values of R from our collection of original documents. Note that the feature R is expressed in terms of the reduced dimension. The corpus is then the collection of the documents of all features. The tf-idf score can then be calculated as:

$$tf(t, d) = \frac{f(t, d)}{\max\{f(w, d) : w \in d\}}$$

$$idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$

$$tfidf(t, d, D) = tf(t, d) \cdot idf(t, D)$$

where f(t, d) is the frequency of t in d. A high tf-idf shows that the value has higher importance for a feature for distinguishing entities.
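The unweighted kernel and the tf-idf weighting above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the similarity table reuses the values quoted in the text plus one assumed value (sim(Fulford, UK) = 0.3, which the source does not give), and the small per-feature corpus is invented for the example.

```python
import math
from collections import Counter

# Similarity values quoted in the text. sim(Fulford, UK) is NOT given in the
# source; 0.3 is an assumption made only so that the example is complete.
SIM = {("UK", "UK"): 1.0, ("Fulford", "Broadway"): 0.4,
       ("UK", "Broadway"): 0.3, ("Fulford", "UK"): 0.3}

def sim(a, b):
    """Symmetric lookup into the toy similarity table (0.0 if unknown)."""
    if a == b:
        return 1.0
    return SIM.get((a, b), SIM.get((b, a), 0.0))

def k_value_set(x, Y):
    """K(x, Y): best similarity between value x and any value in Y."""
    return max(sim(x, y) for y in Y)

def k_set_set(X, Y):
    """K(X, Y): average of the best matches, taken in both directions."""
    left = sum(k_value_set(x, Y) for x in X) / len(X)
    right = sum(k_value_set(y, X) for y in Y) / len(Y)
    return 0.5 * (left + right)

def tfidf(term, feature, corpus):
    """tf-idf of a value inside one feature's document.

    corpus maps each (reduced) feature to the list of all its values.
    """
    counts = Counter(corpus[feature])
    tf = counts[term] / max(counts.values())
    containing = sum(1 for values in corpus.values() if term in values)
    if containing == 0:
        return 0.0
    return tf * math.log(len(corpus) / containing)

# Entity 1 and Entity 2 from the text, both under the reduced feature d7.
score = k_set_set(["UK", "Fulford"], ["UK", "Broadway"])

# Invented corpus: one document of values per feature.
corpus = {"d7": ["UK", "UK", "Fulford", "Broadway"],
          "d8": ["UK", "London"],
          "d9": ["UK", "York"]}
w_uk = tfidf("UK", "d7", corpus)
w_fulford = tfidf("Fulford", "d7", corpus)
```

With these inputs the kernel score is dominated by the (UK, UK) pair, while the tf-idf weight of "UK" is zero because it appears in every feature document; "Fulford" receives a positive weight.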
'UK' will have a low tf-idf score: even though its frequency within a document might be high, it occurs across many documents, which lowers its idf value.

Redefining the kernel

With the weights of the values now calculated, the kernel can be readjusted as:

\[ K(x, Y) = \max_{y_i \in Y} \left\{ \frac{W_x + W_{y_i}}{2} \, \mathrm{sim}(x, y_i) \right\} \]

\[ K(X, Y) = \frac{1}{2} \left[ \frac{\sum_{x \in X} K(x, Y)}{|X|} + \frac{\sum_{y \in Y} K(X, y)}{|Y|} \right] \]

where $W_x$ and $W_{y_i}$ are the tf-idf weights of $x$ and $y_i$ respectively.

3.2. Candidate Generation

Once we have created models for each entity, to disambiguate a new entity we need a list of models that could potentially represent the new entity mention. Since each model is associated with a set of documents (the documents on which the model was trained), we can first generate a list of documents and then map the documents to their respective models. Given that searching for the entity mention in the text of all the KB articles is time-consuming, we use Lucene to index these articles in order to speed up the search. All relations and named entities extracted during feature extraction from each document in the KB are indexed using Lucene. Words that were not extracted as a relation or an entity for a given target entity during feature extraction are not used for indexing.

3.2.1. Query Expansion

The input query to the Lucene search engine is the input entity mention. Although the input entity mention is an important query for finding related articles, on its own it might not achieve high recall. For example, if the given entity is BA, we will not retrieve articles that mention British Airways but not BA. The article accompanying the input entity will contain other entities with an important relation to it. For example, the article for BA might contain the entity British Airways or Waterside (the location of British Airways). Including these entities in the query can increase recall.
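The effect of query expansion on recall can be illustrated with a toy inverted index standing in for Lucene. The sketch below uses no Lucene API; the article ids, entity lists and function names (`build_index`, `search`, `expand_query`) are invented for illustration.

```python
from collections import defaultdict


def build_index(articles):
    """Map each lower-cased entity string to the ids of articles mentioning it."""
    index = defaultdict(set)
    for doc_id, entities in articles.items():
        for entity in entities:
            index[entity.lower()].add(doc_id)
    return index


def search(index, query_terms):
    """Return the ids of all articles matching at least one query term."""
    hits = set()
    for term in query_terms:
        hits |= index.get(term.lower(), set())
    return hits


def expand_query(mention, related_entities):
    """Augment the mention with entities related to it, dropping duplicates."""
    query = []
    for term in [mention, *related_entities]:
        if term not in query:
            query.append(term)
    return query


# Invented knowledge-base articles, each reduced to its indexed entities.
articles = {
    "kb1": ["British Airways", "Waterside"],
    "kb2": ["BA", "Heathrow"],
    "kb3": ["Bank of America"],
}
index = build_index(articles)

plain = search(index, ["BA"])
expanded = search(index, expand_query("BA", ["British Airways", "Waterside"]))
print(sorted(plain), sorted(expanded))
```

With the bare mention only the article that literally contains "BA" is found; the expanded query also reaches the article that uses the full name.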
For example, both British Airways and Waterside can, with high probability, retrieve an article that mentions British Airways. Thus, in addition to the input entity, we augment the query with all the entity mentions (persons, organisations and locations) present in the accompanying document of the input entity that show a direct relation with the given target entity (i.e. entities extracted with sub-sequence). Specifically, the input to the Lucene search engine is the augmented query set for the target, while the output of Lucene is a list of documents, L = D0 ... Dn, ranked by their relevance to the input query. The score of KB article Di for entity mention E is the cosine similarity between the document and query vectors. The features of these vectors are the stemmed words occurring in the document and the query, weighted by Term Frequency/Inverse Document Frequency (TF.IDF). We only consider the top 10 models for the next step; i.e. from all documents retrieved through Lucene, we only extract the top 10 representative models.

3.3. Candidate Ranking

The goal at this stage is to rank each model according to the score it provides for the classification of the new entity. Since we use binary classification, the score takes the form 'true' or 'false'. For a query to be correctly identified, it should be classified as 'true' by its representative model and as 'false' by the remaining 9 models. We use LIBSVM [7] for both training and classification.

4. Training and Evaluation

4.1. Experimental Setting and Datasets

For the evaluation we create our own training and testing data from the TAC KBP test dataset [1]. The dataset consists of 3904 entity mentions, of which 560 are distinct entities. We train each model with 5 training examples.
Thus, to perform at least 3-fold cross-validation, we only extracted entities that have at least 15 representative documents. In total, we extracted 37 entities with a total of 880 representative documents. For each entity, 3 folds of training and testing data were created, with 5 training examples in each fold. As negative training examples for each entity, all the training documents of the remaining entities are used. The total number of instances used for training and testing is shown in Table 5.

Type of entity    For individual entity      For total entities
                  Training    Testing        Training    Testing
Positive          5           >=10           185         695
NIL               -           -              -           129

Table 5: Number of training and test instances.

After following the pipeline explained in the previous sections, we obtain the results shown in Table 6. In Table 6, positive test entities are the entities for which we have built a model. Note that the test entities were not used during the training of the models. Across all three folds, entities were assigned to their representative entities with an average accuracy of 69.25%. NIL entities are entities not present in the knowledge base. A NIL entity is correctly classified if all the selected models classify the entity as 'false'. This is an important evaluation, as the system should be able to detect entities not present in the knowledge base.

Fold    Type of test entity    Number of test entities    Number correctly classified    Accuracy
1       Positive               695                        462                            66.47%
1       NIL                    129                        81                             62.7%
2       Positive               695                        501                            72.08%
2       NIL                    129                        88                             68.21%
3       Positive               695                        481                            69.20%
3       NIL                    129                        68                             52.71%

Table 6: Accuracy results of test entities present (positive) in KB and test entities not present (NIL) in KB.

It is clear that our system depends on the number of training examples used. Since examples of NIL were never seen during training, the resulting classification accuracy is low (61% on average).
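The correctness criteria and the reported averages can be checked with a short script. This is a sketch under stated assumptions: the vote dictionary is invented, `is_correct` and `mean_accuracy` are hypothetical helpers, and the fold 3 positive count is taken as 481, the value implied by the reported accuracy of 69.20% on 695 test entities.

```python
def is_correct(votes, gold):
    """Apply the correctness criteria described in the text.

    votes: {model_id: bool} from the candidate models' binary classifiers.
    gold:  id of the true model, or None for a NIL entity.
    """
    if gold is None:
        # NIL is correct only when every candidate model says 'false'.
        return not any(votes.values())
    others = [v for m, v in votes.items() if m != gold]
    return votes.get(gold, False) and not any(others)


# (fold, type, number of test entities, number correctly classified),
# copied from Table 6. The fold 3 positive count is assumed to be 481.
RESULTS = [
    (1, "Positive", 695, 462), (1, "NIL", 129, 81),
    (2, "Positive", 695, 501), (2, "NIL", 129, 88),
    (3, "Positive", 695, 481), (3, "NIL", 129, 68),
]


def mean_accuracy(kind):
    """Average per-fold accuracy (in percent) for one type of test entity."""
    per_fold = [100.0 * ok / n for _, k, n, ok in RESULTS if k == kind]
    return sum(per_fold) / len(per_fold)


votes = {"m1": False, "m2": True, "m3": False}
print(is_correct(votes, "m2"), is_correct(votes, None))
print(round(mean_accuracy("Positive"), 2), round(mean_accuracy("NIL"), 2))
```

The recomputed means land within a few hundredths of a percentage point of the reported 69.25% for positive entities and at roughly 61% for NIL entities.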
The low NIL accuracy shows how strongly the classification of entities depends on the training examples. Additionally, the similarity kernel returns zero when matching abbreviated names. For example, even though 'Bill Clinton' and 'B. Clinton' are the same person, the WordNet similarity will return zero similarity between the two phrases. We have tackled the problem using co-reference resolution, but owing to the low accuracy of the co-reference resolution tool the problem persists. In future, the system could be tested with different similarity measures.

5. Conclusion

This report presents a method for entity disambiguation. The method rests on the observation that an entity can be disambiguated by the relations it shows with other entities (Person, Location and Organisation). Thus, relations between entities are used as features and the related entities are used as the values of those features. Results using relations as features are encouraging, although accuracy depends on the number of training instances used. To further improve the system, a mechanism for handling NIL entities needs to be incorporated.

References

[1] P. McNamee and H. T. Dang, "Overview of the TAC 2009 knowledge base population track," in Proceedings of the 2009 Text Analysis Conference. National Institute of Standards and Technology, Nov. 2009.
[2] J. R. Finkel, T. Grenager, and C. Manning, "Incorporating non-local information into information extraction systems by Gibbs sampling," in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ser. ACL '05. Stroudsburg, PA, USA: Association for Computational Linguistics, 2005, pp. 363–370. [Online].
Available: http://dx.doi.org/10.3115/1219840.1219885
[3] H. Lee, A. Chang, Y. Peirsman, N. Chambers, M. Surdeanu, and D. Jurafsky, "Deterministic coreference resolution based on entity-centric, precision-ranked rules," Computational Linguistics, vol. 39, no. 4, 2013.
[4] H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky, "Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 shared task," in Proceedings of the CoNLL-2011 Shared Task, 2011.
[5] K. Raghunathan, H. Lee, S. Rangarajan, N. Chambers, M. Surdeanu, D. Jurafsky, and C. Manning, "A multi-pass sieve for coreference resolution," in EMNLP-2010, Boston, USA, 2010.
[6] F. Wickelmaier, "An introduction to MDS," Reports from the Sound Quality Research Unit (SQRU), No. 7, 2003.
[7] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Document Updates

Table 7: Document Updates

Version     Date          Updates and Revision History                              Author
20130325    25/03/2013    Introduction                                              Suraj Jung Pandey
20130326    26/03/2013    Description of method                                     Suraj Jung Pandey
20130327    27/03/2013    Description of method                                     Suraj Jung Pandey
20130425    25/04/2013    Various corrections                                       Suraj Jung Pandey
20130426    26/04/2013    Additional material for MDS                               Suraj Jung Pandey
20130506    06/05/2013    Evaluation and conclusion                                 Suraj Jung Pandey
20130508    08/05/2013    Evaluation and conclusion                                 Suraj Jung Pandey
20130508    08/05/2013    Changes as suggested by Suresh Manandhar                  Suraj Jung Pandey
20130530    30/05/2013    Changes as suggested by Ethical and Security Reviewers    Suraj Jung Pandey
20130702    02/07/2013    Updated list of reviewers                                 Suraj Jung Pandey