AGDISTIS - VideoLectures.NET
Transcription
AGDISTIS - Graph-based Disambiguation of Named Entities

Ricardo Usbeck (1,2), Axel-Cyrille Ngonga Ngomo (1), Michael Röder (1,2), Daniel Gerber (1), Sandro Athaide Coelho (3), Sören Auer (4), Andreas Both (2)

1 University of Leipzig, Germany
2 R & D, Unister GmbH, Germany
3 Federal University of Juiz de Fora, Brazil
4 University of Bonn & Fraunhofer IAIS, Germany

Usbeck et al. (AKSW) AGDISTIS 1 / 19

Motivation

1. Every minute on the Document Web ...
   - 278,000 tweets
   - 41,000 Facebook posts
   - 571 new websites
2. In contrast
   - The Linked Data Web is mostly static
   - Most data is encyclopedic
   - Up-to-date information is lacking

Solution: Deploy scalable knowledge extraction to bridge between unstructured and structured data.

Named Entity Disambiguation

Drawbacks
1. Poor performance on Web documents
2. Current approaches rely on exhaustive data mining methods or algorithms with non-polynomial time complexity
3. Some approaches are difficult to port to other languages

Goal
1. Design an accurate, knowledge-base-agnostic approach
2. Ensure polynomial time complexity
3. Provide easy portability to other languages

Overview

[Slide content not captured in the transcript.]

Entity Recognition

Example: Barack Obama arrived this afternoon in Washington, D.C.

By default we use FOX for named entity recognition.

Figure: All token-based.
Figure: All entity-based.
Figure: All dataset.

Candidate Generation

Given: Set of entity labels
Output: Set of candidate resources for each label

Greedy Approach:
- Merge objects of labeling properties (e.g., rdfs:label, skos:prefLabel, ...) and surface forms (if available).
- Select all resources with label similarity larger than θ.

Example: Barack Obama arrived this afternoon in Washington, D.C.

Example (List of Candidates)
- Barack Obama: dbr:Barack Obama; dbr:Barack Obama, Sr.
- Washington, D.C.: dbr:Washington D.C.; dbr:Washington D.C. (novel); ...

Breadth-First Search and HITS

Given: Set of resources for each label, i.e., set of nodes
Output: Highest ranked resource for each label

Method:
- Breadth-first search from each initial resource
- Run the HITS algorithm on the resulting graph
- Choose the resource with the highest authority for each label

Example II

[Figure not captured in the transcript.]

Example III

Node                           x_a
dbr:Barack Obama               0.273
dbr:Barack Obama, Sr.          0.089
dbr:Washington, D.C.           0.093
dbr:Washington, D.C. (novel)   0.000

Experimental Setup

Goal: Measure the accuracy of AGDISTIS on different languages
- Baseline: State-of-the-art frameworks (AIDA, Spotlight, TagMe2)
- Evaluation measure: Micro F-measure
- Nine datasets in three languages (7x English, 1x Chinese, 1x German)
- Default settings: d = 2, θ = 0.82

Evaluation

- Best θ ∈ [0.8, 0.9] across all datasets
- d = 2 best in all experiments

Evaluation: English datasets

Deployed AGDISTIS on DBpedia and YAGO2. All values are F-measures.

Corpus        AGDISTIS   AGDISTIS   AIDA     Spotlight
K             DBpedia    YAGO2      YAGO2    DBpedia
Reuters       0.78       0.60       0.62     0.56
RSS-500       0.75       0.53       0.6      0.56
AIDA-YAGO2    0.73       0.58       0.83     0.57

Evaluation: Benchmark datasets from BAT framework (English)

Dataset            Approach            F1-measure   Precision   Recall
AIDA/CONLL-TestB   TagMe 2             0.565        0.58        0.551
                   DBpedia Spotlight   0.341        0.308       0.384
                   AGDISTIS            0.596        0.642       0.556
AQUAINT            TagMe 2             0.457        0.412       0.514
                   DBpedia Spotlight   0.26         0.178       0.48
                   AGDISTIS            0.547        0.777       0.422
IITB               TagMe 2             0.408        0.416       0.4
                   DBpedia Spotlight   0.46         0.434       0.489
                   AGDISTIS            0.31         0.646       0.204
MSNBC              TagMe 2             0.466        0.431       0.508
                   DBpedia Spotlight   0.331        0.317       0.347
                   AGDISTIS            0.761        0.796       0.729

Chinese AGDISTIS

- Support for a non-European language
- Benchmark: QALD4 queries
  - 200 questions in the training data
  - 50 questions in the test data
- F-measure between 65% (training data) and 70% (test data)

German AGDISTIS

- news.de Dataset (N3 collection)
- Collected from the web news portal news.de
- 53 documents, 627 named entities
- AGDISTIS: 0.87 F1-measure, Spotlight: 0.84 F1-measure

GERBIL

[Architecture diagram. Labels recoverable from the transcript: Annotators; Web service calls; Natural Language Interchange Format; Matching; BAT-Framework for Entity Annotators by Cornolti et al. (WWW ‘13); Experiment type; Online User Interface; NLP Interchange Format (pull via HTTP or HDD lookup); Datasets; OPEN.]

Only results (measures) are transferred, not annotations.

http://github.com/AKSW/GERBIL

Demo

http://agdistis.aksw.org/demo/

Conclusion

Presented AGDISTIS
- Polynomial time complexity
- Greedy and knowledge-base agnostic
- Multilingual (English, German and Chinese, more to come)
- High accuracy on diverse knowledge bases and test datasets

Future Work
- Include graph summarization
- Extension to more languages
- Combination with other approaches

That’s all Folks!

Thank you! Questions?

Axel Ngonga
AKSW Research Group
[email protected]
http://github.com/AKSW/AGDISTIS
http://agdistis.aksw.org/demo
Live Demo: Oct. 21st, Stand 79
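
The Candidate Generation and Breadth-First Search and HITS slides describe a three-step pipeline: select candidates whose label similarity exceeds θ, expand a graph around them to depth d, and pick the candidate with the highest HITS authority per label. The following is a minimal sketch of that idea, not the AGDISTIS implementation. The toy labels, the toy edges (including dbr:United States and dbr:Kenya), the function names, and the use of difflib's ratio as the similarity measure are all assumptions made for illustration; the slides do not say which similarity function is used, and the scores this toy graph produces will not match those in Example III.

```python
from collections import deque
from difflib import SequenceMatcher
from math import sqrt

# Toy data invented for illustration; not real DBpedia content.
RESOURCE_LABELS = {
    "dbr:Barack Obama": ["Barack Obama"],
    "dbr:Barack Obama, Sr.": ["Barack Obama, Sr.", "Barack Obama"],
    "dbr:Washington, D.C.": ["Washington, D.C."],
    "dbr:Washington, D.C. (novel)": ["Washington, D.C. (novel)", "Washington, D.C."],
}
EDGES = {  # hypothetical directed links between resources
    "dbr:Barack Obama": ["dbr:Washington, D.C.", "dbr:United States"],
    "dbr:Washington, D.C.": ["dbr:United States"],
    "dbr:United States": ["dbr:Washington, D.C.", "dbr:Barack Obama"],
    "dbr:Barack Obama, Sr.": ["dbr:Kenya"],
}


def similarity(a, b):
    """Stand-in string similarity in [0, 1] (assumption: difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def generate_candidates(label, resource_labels, theta):
    """Keep every resource that has a label or surface form more similar than theta."""
    return [res for res, names in resource_labels.items()
            if any(similarity(label, name) > theta for name in names)]


def expand_graph(seeds, edges, depth):
    """Breadth-first search from all seeds, following edges up to `depth` hops."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    graph_edges = set()
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue  # do not expand beyond the depth limit
        for target in edges.get(node, ()):
            graph_edges.add((node, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, dist + 1))
    return seen, graph_edges


def normalize(scores):
    """Scale scores to unit Euclidean length (all zeros stay zero)."""
    norm = sqrt(sum(v * v for v in scores.values()))
    return {n: (v / norm if norm else 0.0) for n, v in scores.items()}


def hits_authority(nodes, graph_edges, iterations=20):
    """Plain HITS power iteration; returns the authority score of each node."""
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in graph_edges:
            new_auth[dst] += hub[src]       # authority: sum of hubs pointing in
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in graph_edges:
            new_hub[src] += new_auth[dst]   # hub: sum of authorities pointed to
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth


def disambiguate(labels, resource_labels, edges, theta=0.82, depth=2):
    """Map each label to its candidate with the highest authority (None if no candidate)."""
    candidates = {lab: generate_candidates(lab, resource_labels, theta) for lab in labels}
    seeds = {res for cands in candidates.values() for res in cands}
    nodes, graph_edges = expand_graph(seeds, edges, depth)
    authority = hits_authority(nodes, graph_edges)
    return {lab: (max(cands, key=authority.get) if cands else None)
            for lab, cands in candidates.items()}


result = disambiguate(["Barack Obama", "Washington, D.C."], RESOURCE_LABELS, EDGES)
print(result)
```

The defaults theta=0.82 and depth=2 mirror the default settings reported on the Experimental Setup slide. In this toy graph the two unrelated candidates receive no incoming links, so their authority is zero and the mutually connected pair wins.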
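
The Experimental Setup slide names the micro F-measure as the evaluation measure. As a brief illustration of what that usually means: micro-averaging pools true positives, false positives, and false negatives over all documents before computing precision, recall, and their harmonic mean. The sketch below assumes that standard definition; the function names and per-document counts are hypothetical. As a consistency check, it also recomputes one F1 value from the precision and recall reported in the BAT benchmark table.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_f1(counts):
    """Micro-averaged F1 from per-document (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f1(precision, recall)


# Hypothetical counts for two documents: (true pos., false pos., false neg.)
toy_counts = [(8, 2, 1), (1, 3, 4)]
toy_score = micro_f1(toy_counts)

# AGDISTIS on AIDA/CONLL-TestB: precision 0.642 and recall 0.556 from the table
table_check = round(f1(0.642, 0.556), 3)
print(toy_score, table_check)
```

With the pooled toy counts, precision and recall are both 9/14, so the micro F1 is 9/14 as well. The recomputed table value rounds to the 0.596 reported for AGDISTIS on that dataset.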