D9.35 Framework for combining user supplied knowledge from diverse sources
European Seventh Framework Programme
FP7-218086-Collaborative Project

© INDECT Consortium, www.indect-project.eu

The INDECT Consortium

AGH University of Science and Technology, AGH, Poland
Gdansk University of Technology, GUT, Poland
InnoTec DATA GmbH & Co. KG, INNOTEC, Germany
IP Grenoble (Ensimag), INP, France
MSWiA, General Headquarters of Police (Polish Police), GHP, Poland
Moviquity, MOVIQUITY, Spain
Products and Systems of Information Technology, PSI, Germany
Police Service of Northern Ireland, PSNI, United Kingdom
Poznan University of Technology, PUT, Poland
Universidad Carlos III de Madrid, UC3M, Spain
Technical University of Sofia, TU-SOFIA, Bulgaria
University of Wuppertal, BUW, Germany
University of York, UoY, Great Britain
Technical University of Ostrava, VSB, Czech Republic
Technical University of Kosice, TUKE, Slovakia
X-Art Pro Division G.m.b.H., X-art, Austria
Fachhochschule Technikum Wien, FHTW, Austria

© Copyright 2013, the Members of the INDECT Consortium

Document Information

Contract Number: 218086
Deliverable Name: Framework for combining user supplied knowledge from diverse sources
Deliverable number: D9.35
Editor(s): Suresh Manandhar, University of York, [email protected]
Author(s): Suraj Jung Pandey, University of York, [email protected]
Reviewer(s):
  Ethical Review: Andreas Pongratz (X-art)
  Security Review: Petr Machnik (VSB)
  End-users review: Dmitrijs Apanasovics, Michael Ross (PSNI)
  Scientific Review: Nick Pears (UoY)
Dissemination level: Public
Contractual date of delivery: December 2012
Delivery date: July 2013
Status: Final version
Keywords: Entity resolution, Kernel methods

This project is funded under 7th Framework Program

Contents

1 Executive Summary
2 Introduction
  2.1 Objectives
  2.2 List of participants & roles
  2.3 Ethical Issues
3 General Steps for Entity Linking
  3.1 Generating similar entities
  3.2 Comparing entities
    3.2.1 Building a model representing entity
  3.3 Entity Selection
4 Datasets and Results

List of Figures

1 Framework for combining user supplied knowledge.
2 Both superstars are often referred to as 'MJ'. Sources: http://en.wikipedia.org/wiki/Michael_Jackson, http://en.wikipedia.org/wiki/Michael_Jordan
3 Summary of steps required to ensure ethical use of WP4 software
4 Example pipeline for the task of entity linking
5 Model creation by training with positive and negative examples
6 Entities associated with unknown entity 'MJ', Michael Jackson and Michael Jordan. In this case, MJ is probably Michael Jordan.
7 Document 1 for entity "SFAX"
8 Document 2 for entity "SFAX"
9 Assigning named entity labels using Stanford's NER. The labels are either PERSON or ORGANISATION or LOCATION.
10 Pipeline for feature vector creation.
11 Example of dimensionality reduction. Semantically similar words are collapsed to a single feature label.

List of Tables

1 Necessary security measures for use of software within WP4 and their benefits.
2 Document Updates

1. Executive Summary

Security is becoming a weak point of energy and communications infrastructures, commercial stores, conference centers, airports and sites with high person traffic in general. Practically any crowded place is vulnerable, and the risks should be controlled and minimised as much as possible. Access control and rapid response to potential dangers are properties that every security system for such environments should have. The INDECT project aims to develop new tools and techniques that will help potential end users improve their methods for crime detection and prevention, thereby offering more security to the citizens of the European Union. In the context of the INDECT project, Work Package 4 (WP4) is responsible for the Extraction of Information for Crime Prevention by Unstructured Data.
One of the tasks in WP4 is combining user supplied knowledge. This essentially means maintaining a database of records, or knowledge base, that contains descriptions of different entities (persons, locations, organisations). First, the documents within the knowledge base must be grouped so that each group refers to one unique entity, for example separating all documents that refer to Michael Jordan the basketball player from documents about other people named Michael Jordan. Next, the user adds or deletes records in the database. The new information can take the form of a single document or of a database of records. When new information is added, the consistency of the knowledge base must be maintained: if the entity already exists in the knowledge base, the new information is merged with the existing information about that entity; if it does not exist, a new ID is created for it in the knowledge base. Figure 1 shows the framework for combining user supplied knowledge, with the entity disambiguation system acting as the central control.

Figure 1: Framework for combining user supplied knowledge. (Diagram labels: Single document; Database Records; Add/Delete Records; Entity Disambiguation System; Add/Delete/Update Records; Knowledge Base holding the descriptions of Entity 1 to Entity 4.)

2. Introduction

In this document, we describe a method for entity resolution, also referred to as entity linking or entity disambiguation. Entity resolution can be defined as a method that identifies whether two entity mentions refer to the same entity. Entities can be mentions of persons, places or organisations. Entity resolution has a variety of applications.
For instance, entity resolution is essential from a security point of view, where the goal is to identify the different mentions (first names, surnames, nicknames) of persons or organisations that are the subject of an investigation. It can also be used to remove duplicates from a large database by merging identical entities.

Disambiguating an entity is difficult for two main reasons. Firstly, an entity can be mentioned in a variety of ways. For example, many organisations and people are often referred to by their initials, e.g. BA for British Airways or MJ for Michael Jackson. Secondly, and most importantly, entity mentions are ambiguous, i.e. the same mention might refer to different entities. For example, BA might also refer to Bosnian Airlines, while MJ might also refer to Michael Jordan. In the same vein, Washington might refer to the capital of the USA, a newspaper (Washington daily), or George Washington.

In this document we describe a system that can identify whether a given entity is present in the database. The most important features for disambiguating two entities are the relations that the given entity has with other entities. For example, as can be seen in Figure 2, the two 'MJ's can be disambiguated from information about their occupation, birth date, place etc.

Figure 2: Both superstars are often referred to as 'MJ'. (a) Michael Jackson's profile; (b) Michael Jordan's profile. Sources: http://en.wikipedia.org/wiki/Michael_Jackson, http://en.wikipedia.org/wiki/Michael_Jordan

2.1. Objectives

The objective of this report is to provide an introduction to and motivation for entity resolution. The report also briefly explains the steps necessary for entity resolution in a knowledge base (database), and gives key insights into the entity resolution method developed by UoY.

2.2. List of participants & roles

This report has been produced by the University of York (UoY).

2.3. Ethical Issues

Since entity resolution creates relations between different entities, its misuse may affect citizens' privacy. Hence, care must be taken to prevent improper use of such software. For the purpose of developing our scientific methods we have explicitly avoided any prejudicial stereotypes and have only used publicly available data from various sources. Any use of the methodology in real situations will require careful monitoring to ensure compliance with ethical and legal standards.

Organisations that use the methods and software described or produced within WP4, and the INDECT project in general, should be aware of the security risks associated with the use of such software and methods. Software and methods produced within WP4 can be employed to automatically analyse text documents or web pages to extract names of people, organisations, locations, dates etc., extract their relationships and identify behavioural patterns. Such software can process massive amounts of data continuously.

At least the following set of security measures needs to be in place for any software described or produced within WP4: (1) use of the software in a secure environment, (2) secure management of data sources by encrypting the knowledge bases used, (3) a secure access policy, (4) secure storage of access logs and (5) anonymisation of Ids (email, URL, username etc.) in the data.

The security risk associated with using software within WP4 can be minimised by following important security protocols. A carefully designed security protocol prevents critical software, such as behavioural profiling software, from being misused or compromised. Table 1 provides a summary of some of the benefits associated with the corresponding security measures.
Security Measures | Benefits
Use of software in secure environment | Safeguards critical application and data.
Encrypting data | Protects sensitive information in case of data misplacement. Also, access to data over the network becomes secure.
Secure access policy | Includes authentication protocols which provide the necessary restriction against untrusted use. Also maintains logs of software access.
Anonymisation of Ids | Protects the privacy of the owner of the document.

Table 1: Necessary security measures for use of software within WP4 and their benefits.

Figure 3 shows the stepwise use of WP4 software in a secure environment.

Figure 3: Summary of steps required to ensure ethical use of WP4 software. (Flow diagram of Steps 1 to 13, from the input data, such as chat to be analysed, seed suspicious websites and textual data for relation mining, through Id anonymisation, the WP4 software, encrypted storage, human verification and legal authorisation, to user Id de-anonymisation.)

A brief explanation of each of the steps is given below:

Step 1 Before the software is supplied with input data, any user identification present within the data is anonymised. This step is necessary to protect the identity of the user or website in the subsequent steps.
Step 2 The WP4 software processes the input data.
Step 3 When additional data relating to an anonymised user or website needs to be fetched from the World Wide Web, de-anonymisation is necessary. However, the de-anonymisation and search module is separated from the WP4 software. The de-anonymisation and search is performed in a secure location (e.g. police servers). The newly extracted content is fed back to Step 1 for anonymisation. The WP4 software only has access to anonymised content, which provides an additional layer of security.
Step 4 The output of the software can be suspicious websites, suspicious chat or relation graphs, depending on the type of software used. The output is stored in encrypted storage. Encrypting the output is necessary because it prevents unauthorised access to processed data.
Step 5 A human expert uses a secure access mechanism to view the processed data.
Step 6 The access logs are stored in secure encrypted storage to prevent unauthorised viewing of access logs.
Step 7 Although the WP4 software exhibits reasonable accuracy, human verification is necessary to remove false positives, i.e. data that has been incorrectly classified by the software as potential data of interest. In this step, a human expert analyses all processed data and identifies potential data of interest.
Step 8 The human verified data is stored in secure storage.
Step 9 All activities of the human expert are logged and the logs are stored in secure storage.
Step 10 At the point of need, legal authorisation is given to a human expert to de-anonymise some of the data requiring serious investigation.
Step 11 The human expert gains legally authorised access and this is logged in secure storage.
Step 12 The selected data is de-anonymised.
Step 13 The de-anonymised data is revealed to the human expert for further investigation.
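As an illustration of the Id anonymisation in Step 1 and its reversal in Step 12, the following is a minimal sketch and not the mechanism used within WP4. It assumes that identifiers are limited to e-mail addresses and http(s) URLs, that pseudonyms are derived with a keyed hash, and that the lookup table is held only in the secure location. The names ID_PATTERN, anonymise and de_anonymise are hypothetical.

```python
import hashlib
import hmac
import re

# Assumption: only e-mail addresses and http(s) URLs count as identifiers here.
ID_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|https?://[^\s]+")


def anonymise(text, secret_key, lookup):
    """Replace each identifier with a keyed pseudonym and record the mapping."""
    def _swap(match):
        identifier = match.group(0)
        digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
        token = "ID_" + digest.hexdigest()[:12]
        lookup[token] = identifier  # assumed to be stored only in the secure location
        return token

    return ID_PATTERN.sub(_swap, text)


def de_anonymise(text, lookup):
    """Restore original identifiers; unknown tokens are left untouched."""
    return re.sub(
        r"ID_[0-9a-f]{12}",
        lambda m: lookup.get(m.group(0), m.group(0)),
        text,
    )
```

Because the pseudonym depends only on the key and the identifier, the same Id always maps to the same token, so relations between anonymised Ids can still be analysed without revealing them.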
The above steps for the usage of the software are needed to minimise the risk associated with unauthorised use of the software and to ensure that the use of the software is strictly within the law. Steps 5, 7, 10 and 12 are flagged as critical steps because they involve human access to sensitive data. These steps should only be executed once appropriate legal authorisation has been obtained, as indicated above.

3. General Steps for Entity Linking

Given a target entity mention, its corresponding article and the database containing real-world data, the task of entity linking can be divided into three sub-processes:

• Generating similar entities,
• Comparing entities, and
• Entity Selection.

An example pipeline is shown in Figure 4. Given a target entity mention and its corresponding article, the proposed method performs a targeted search in the database. The keywords for the search are taken from the given article. The relatively small list of candidates generated is then ranked by comparing each candidate with the input article. Finally, according to a threshold parameter, the highest ranked candidate is either selected or rejected as a match for the input entity. If the candidate with the highest rank is rejected, we conclude that the input entity is not present in the database.

Figure 4: Example pipeline for the task of entity linking. (Diagram: the input "MJ, Basketball" triggers a targeted search in the database; candidate generation returns "Michael Jordan, Sports", "MJ, politics" and "Michael Jackson, Musician"; candidate ranking and entry selection return the matching entity "Michael Jordan, Sports".)

3.1. Generating similar entities

As we already know from Section 2, if two entities are the same then their corresponding articles will have similar information content.
Thus, to determine whether a given entity is present in the database we need to match the given article with articles in the database. For an input entity there may be only a few entities in the database that show similarity to it. For example, Michael Jordan and Michael Jackson may share similarities (initials, common celebrity friends), but they will not share similarities with Dennis Ritchie (father of the C programming language). Thus, it is not necessary to match the given article with all articles in the database. The match can be done with only selected articles, which can be generated by a targeted search. A search is performed in the database using keywords extracted from the given article. Only those documents which match the keywords are selected for further processing. As keywords we can use the given entity, other entities in the given article and the relations between the given entity and other entities. The extraction of relations is discussed in the following sections. Many freely available tools are optimised to query a large database and return results in a few seconds. In our context we use Lucene (see footnote 1) for searching the database.

3.2. Comparing entities

After we generate candidate articles from the database, we need to assign a confidence value to each article which indicates whether the article represents the same entity as the given input entity. To generate such a classification we associate each entity in the database with a model. Such a model, when presented with new articles representing some entity, is able to classify whether the new articles represent the same entity as the one represented by the model.

3.2.1. Building a model representing entity

As already discussed, each unique entity in the knowledge base is represented by a different model.
A model represents an entity through various features, collected from the articles linked with the entity. A model is created for each entity by training a classifier on the documents that represent the entity and also on documents that represent other entities, as shown in Figure 5. Thus, the model stores cases in which a given entity is the same as the one linked with the model and also cases in which it is not. Each model is then used for the classification of new entities. A positive classification result asserts that the new entity is the same as the one represented by the model. Model creation can be divided into two steps, namely Feature Generation and Learning.

Figure 5: Model creation by training with positive and negative examples. (Diagram: articles for Entity A are provided as positive examples and articles for entities other than A as negative examples; training produces the model for Entity A.)

Footnote 1: http://lucene.apache.org/core/

Feature Generation

An entity can be disambiguated by inspecting how it relates to other entities. If two entities are the same then they exhibit similar relations to similar entities. As shown in Figure 6, the unknown entity 'MJ' has more entities in common with 'Michael Jordan'. In our setting, each entity is represented by its corresponding documents. For example, Figure 7 and Figure 8 are two different documents for the same entity "SFAX". Features are generated by extracting the words that link the given entity and another entity in the document. Disambiguation is performed by matching the entities of the same features. For example, 'Sfax goals Hamza Younes' in Figure 7 and 'Sfax goals Hamza Younes' in Figure 8 increase the probability that "SFAX" in the two documents represents the same entity, because the feature 'goals' contains the same entity 'Hamza Younes' in both.
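The idea of relation features can be sketched as follows. This is a simplified illustration and not the report's implementation: it assumes that entity mentions have already been identified and merged into single tokens (for example "Hamza_Younes"), that the set of words allowed as relation labels is supplied by hand, and that only words to the right of the target entity are considered. The function name relation_features is hypothetical.

```python
from collections import defaultdict


def relation_features(tokens, target, entities, relation_words):
    """Map each linking word to the set of entities it connects to `target`.

    A word counts as linking when it occurs between the target and the entity,
    so the three items need not be adjacent in the sentence.
    """
    if target not in tokens:
        return {}
    features = defaultdict(set)
    between = []  # relation words seen since the target mention
    for token in tokens[tokens.index(target) + 1:]:
        if token in entities:
            for word in between:
                features[word].add(token)
        elif token in relation_words:
            between.append(token)
    return dict(features)


# Hand-tokenised fragments of the two "SFAX" documents (illustrative only).
doc_1 = ["Sfaxien", "outclassed", "Astres_Douala", "in", "the", "town", "of",
         "Sfax", "via", "goals", "from", "Blaise_Mbele", "Blaise_Kouassi",
         "and", "Hamza_Younes"]
doc_2 = ["Sfaxien", "overcame", "Astres", "in", "the", "city", "of",
         "Sfax", "Saturday", "through", "goals", "from", "Blaise_Mbele",
         "Blaise_Kouassi", "and", "Hamza_Younes"]
people = {"Blaise_Mbele", "Blaise_Kouassi", "Hamza_Younes"}
labels = {"goals", "outclassed", "overcame"}

features_1 = relation_features(doc_1, "Sfax", people, labels)
features_2 = relation_features(doc_2, "Sfax", people, labels)
shared = features_1["goals"] & features_2["goals"]
```

Here both documents yield the feature 'goals' with overlapping values, which is the kind of evidence that the two "SFAX" mentions denote the same entity.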
It should be noted that a relation does not need to be represented by adjoining words. Rather, as in the previous example, relations can hold across non-contiguous words within a sentence. Relations across non-contiguous words can be extracted through sub-sequence extraction. Following this intuition, feature generation in our setting begins with pre-processing using Stanford's Named Entity Recognition (NER) tool [1], Stanford's coreference tagger [2, 3, 4] and Stanford's Part of Speech (PoS) tagger [5]. Next, by extracting sub-sequences we extract the relations between the target entity and other entities in the document. Then, we create feature vectors from the extracted sub-sequences. Figure 9 shows the application of NER and Figure 10 shows the pipeline of the feature extraction phase. The feature extraction process generates a feature vector for each entity. The value of each feature is the entity that the relation links to.

Figure 6: Entities associated with unknown entity 'MJ', Michael Jackson and Michael Jordan. In this case, MJ is probably Michael Jordan.
MJ, associated entities: NBA, Basketball, Chicago Bulls, Nike
Michael Jackson, associated entities: Thriller, Songs, Indiana, Sony
Michael Jordan, associated entities: Basketball, Bulls, Nike, New York

Figure 7: Document 1 for entity "SFAX"
Sfaxien outclassed Astres Douala of Cameroon 3-0 in the Mediterranean town of Sfax via goals from Congolese Blaise 'Lelo' Mbele, Ivorian Blaise Kouassi and Hamza Younes after leading 1-0 at half-time.

Learning a model

Once the feature vectors have been created we need to learn a model for each target entity. Learning is done by matching the values of common relations (features) and then creating a generalised model. The most straightforward form of matching is lexical matching, i.e. two relations are the same if they have exactly the same string.
Although lexical matching yields accurate results in some ideal cases, it mostly fails, since perfect lexical matches rarely occur. The problem is compounded because semantically similar but lexically different words are not matched either. For example, in Figure 7 and Figure 8, even though words like "outclassed" and "overcame" are similar in meaning, they are represented as two different features. Thus, to overcome the limitations of lexical matching we use similarity matching, where two words are considered the same if they show a high similarity score.

Figure 8: Document 2 for entity "SFAX"
Sfaxien overcame Astres 3-0 in the Mediterraean city of Sfax Saturday through goals from Blaise 'Lelo' Mbele, Blaise Kouassi and Hamza Younes to remain top of the standings with 10 points, one more than Mazembe.

Figure 9: Assigning named entity labels using Stanford's NER. The labels are either PERSON or ORGANISATION or LOCATION.
Sfaxien outclassed Astres Douala of Cameroon [ORGANISATION] 3-0 in the Mediterranean [LOCATION] town of Sfax [LOCATION] via goals from Congolese Blaise [ORGANISATION] 'Lelo' Mbele [PERSON], Ivorian Blaise Kouassi [PERSON] and Hamza Younes [PERSON] after leading 1-0 at half-time.

Figure 10: Pipeline for feature vector creation.

The similarity score can be obtained from WordNet [6]. A group of words which show high similarity to each other can then be represented by a single feature. Figure 11 shows an example where many semantically similar words are collapsed to a single feature label. This process reduces the number of relations we use and is termed dimensionality reduction. We use a popular algorithm called Multidimensional Scaling (MDS) [7] to group words with high similarity to each other.
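The collapsing of similar words into one feature label can be sketched as below. This is an illustration under stated assumptions rather than the method used in the report: a simple threshold-based grouping (union-find) is swapped in for the MDS-based grouping, and the similarity scores are hand-made stand-ins for WordNet scores. The names group_similar_words, toy_similarity and TOY_SCORES, and the threshold of 0.8, are hypothetical.

```python
def group_similar_words(words, similarity, threshold=0.8):
    """Give the same feature label to words linked by a high similarity score."""
    parent = {word: word for word in words}

    def find(word):
        # Follow parent links to the group representative, compressing the path.
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for i, first in enumerate(words):
        for second in words[i + 1:]:
            if similarity(first, second) >= threshold:
                parent[find(first)] = find(second)

    labels = {}       # group representative -> feature label
    assignment = {}   # word -> feature label
    for word in words:
        root = find(word)
        labels.setdefault(root, "d%d" % (len(labels) + 1))
        assignment[word] = labels[root]
    return assignment


# Hand-made scores standing in for WordNet similarity (assumption).
TOY_SCORES = {
    frozenset(("outclassed", "overcame")): 0.9,
    frozenset(("lives", "stays")): 0.85,
}


def toy_similarity(first, second):
    return TOY_SCORES.get(frozenset((first, second)), 0.0)


feature_of = group_similar_words(
    ["outclassed", "overcame", "lives", "stays", "goals"], toy_similarity
)
```

With these toy scores, "outclassed" and "overcame" share one label, "lives" and "stays" share another, and "goals" keeps a label of its own, so five relation words are reduced to three features.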
Matching values of features

The next step in learning a model is to compare the values of the same feature. Suppose we have the following relations for Entity 1: {lives->UK and stays->Fulford}. If 'lives' and 'stays' have been reduced to a feature d7, then <d7:(UK, Fulford)> is the feature vector. If the relations for Entity 2 are {lives->UK and stays->Broadway}, then <d7:(UK, Broadway)> is the feature vector. The model learns the similarity between Entity 1 and Entity 2 by comparing the value sets of feature 'd7', i.e. (UK, Fulford) and (UK, Broadway).

Figure 11: Example of dimensionality reduction. Semantically similar words are collapsed to a single feature label. (Diagram: six word groups, each mapped to one of Feature 1 to Feature 6.)
- outclass, outdistance, outdo, outhustle, outmatch, outpace, outperform, outplay, outrank, outrival, outrun, outshine, outstrip
- accumulation, aggregation, assemblage, assembly, association, assortment, band, batch, battery, bevy, body, bunch, bundle, cartel, category
- affection, angle, article, aspect, attribute, character, component, constituent, detail, differential, element, facet, factor, gag, gimmick
- aboard, conjoin, consociate, correlate, couple, equate, fasten, get into, hitch on, hook on, hook up, interface, join, join up with, marry, meld with, network with, plug into, relate
- appellation, appellative, class, classification, cognomen, compellation, denomination, description, epithet, identification, key word
- bookish, college, collegiate, erudite, intellectual, learned, pedantic, scholarly, scholastic, studious, university

In this case too, perfect lexical matches will be rare, and semantically similar values will be ignored if they do not match lexically. Thus, to overcome this problem we define a function that returns the similarity score of a matching pair.
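The report does not specify this similarity function, so the following is only a sketch of one possible shape for it. It assumes a character-level string ratio from Python's difflib as a stand-in for the value similarity, and scores two value sets by the symmetric average of each value's best match in the other set. The names value_similarity and value_set_similarity are hypothetical.

```python
from difflib import SequenceMatcher


def value_similarity(first, second):
    """Similarity of two values in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, first.lower(), second.lower()).ratio()


def value_set_similarity(values_a, values_b):
    """Symmetric average of each value's best match in the other set."""
    if not values_a or not values_b:
        return 0.0

    def directed(source, target):
        best = [max(value_similarity(s, t) for t in target) for s in source]
        return sum(best) / len(source)

    return (directed(values_a, values_b) + directed(values_b, values_a)) / 2


# Value sets of feature 'd7' for the two entities in the example above.
entity_1 = {"d7": ["UK", "Fulford"]}
entity_2 = {"d7": ["UK", "Broadway"]}
score = value_set_similarity(entity_1["d7"], entity_2["d7"])
```

A graded score of this kind lets partially overlapping value sets, such as (UK, Fulford) and (UK, Broadway), contribute evidence even though they are not lexically identical. A semantic measure could be substituted for the string ratio without changing the surrounding logic.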
3.3. Entity Selection

This is the final step, in which we decide whether the given entity is present in the database. The article representing the input entity is classified by the model of each candidate article. We say that the input entity is present in the database if its article is classified as 'true' by exactly one candidate model and as 'false' by the rest of the models. If all of the models give a 'false' classification then the given entity is not present in the database.

4. Datasets and Results

The model is trained, using the features and the process described above, on the TAC KBP dataset [8]. The dataset consists of 3904 entity mentions, of which 560 are distinct entities. On the test dataset the system showed an average accuracy of 69.25% for correctly identifying the entities. The average accuracy of identifying entities not present in the dataset is 61%; this is mainly due to the lack of training examples for these entities.

References

[1] J. R. Finkel, T. Grenager, and C. Manning, "Incorporating non-local information into information extraction systems by Gibbs sampling," in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ser. ACL '05. Stroudsburg, PA, USA: Association for Computational Linguistics, 2005, pp. 363-370. [Online]. Available: http://dx.doi.org/10.3115/1219840.1219885
[2] H. Lee, A. Chang, Y. Peirsman, N. Chambers, M. Surdeanu, and D. Jurafsky, "Deterministic coreference resolution based on entity-centric, precision-ranked rules," in Computational Linguistics 39(4), 2013.
[3] H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky, "Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 shared task," in Proceedings of the CoNLL-2011 Shared Task, 2011.
[4] K. Raghunathan, H. Lee, S. Rangarajan, N. Chambers, M. Surdeanu, D. Jurafsky, and C. Manning, "A multi-pass sieve for coreference resolution," in EMNLP-2010, Boston, USA, 2010.
[5] K. Toutanova, D. Klein, C. Manning, and Y. Singer, "Feature-rich part-of-speech tagging with a cyclic dependency network," in HLT-NAACL 2003, pp. 252-259, 2003.
[6] G. A. Miller, "WordNet: A lexical database for English," in Communications of the ACM, Vol. 38, No. 11, pp. 39-41, 1995.
[7] F. Wickelmaier, "An introduction to MDS," in Reports from the Sound Quality Research Unit (SQRU), No. 7, 2003.
[8] P. McNamee and H. T. Dang, "Overview of the TAC 2009 knowledge base population track," in Proceedings of the 2009 Text Analysis Conference. National Institute of Standards and Technology, Nov. 2009.

Document Updates

Table 2: Document Updates

Version | Date | Updates and Revision History | Author
20130508 | 09/05/2013 | Introduction, Evaluation and conclusion | Suraj Jung Pandey
20130530 | 30/05/2013 | Changes as suggested by Ethical and Security Reviewers | Suraj Jung Pandey
20130612 | 12/06/2013 | Addition of figures and changes throughout | Suraj Jung Pandey
20130619 | 19/06/2013 | Changes throughout as suggested by Suresh Manandhar | Suraj Jung Pandey
20130702 | 02/07/2013 | Updated list of reviewers | Suraj Jung Pandey