AGDISTIS - VideoLectures.NET

AGDISTIS - Graph-based Disambiguation of Named Entities
Ricardo Usbeck (1,2), Axel-Cyrille Ngonga Ngomo (1), Michael Röder (1,2), Daniel Gerber (1), Sandro Athaide Coelho (3), Sören Auer (4), Andreas Both (2)
1 University of Leipzig, Germany
2 R & D, Unister GmbH, Germany
3 Federal University of Juiz de Fora, Brazil
4 University of Bonn & Fraunhofer IAIS, Germany
Usbeck et al. (AKSW)
Motivation
1. Every minute on the Document Web ...
   278,000 tweets
   41,000 Facebook posts
   571 new websites
2. In contrast, the Linked Data Web is mostly static
   Most data is encyclopedic
   Lack of up-to-date information
Solution
Deploy scalable knowledge extraction to bridge between unstructured and structured data
Named Entity Disambiguation
Drawbacks
1. Poor performance on Web documents
2. Current approaches rely on exhaustive data mining methods or algorithms with non-polynomial time complexity
3. Some approaches are difficult to port to other languages
Goal
1. Design an accurate, knowledge-base-agnostic approach
2. Ensure polynomial time complexity
3. Provide easy portability to other languages
Overview
Entity Recognition
Example
Barack Obama arrived this afternoon in Washington, D.C.
By default we use FOX for named entity recognition
Figure: All token-based.
Figure: All entity-based.
Figure: All dataset.
Candidate Generation
Given: Set of entity labels
Output: Set of candidate resources for each label
Greedy Approach: Merge objects of labeling properties (e.g., rdfs:label, skos:prefLabel, ...) and surface forms (if available). Select all resources with label similarity larger than θ
Example
Barack Obama arrived this afternoon in Washington, D.C.
Example (List of Candidates)
Barack Obama: dbr:Barack_Obama, dbr:Barack_Obama,_Sr.
Washington, D.C.: dbr:Washington,_D.C., dbr:Washington,_D.C._(novel), ...
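The greedy candidate step can be sketched as follows. This is a minimal illustration, not the AGDISTIS implementation: it assumes a character-trigram Jaccard similarity (the slide only says "label similarity"), and `label_index`, the function names, and the toy index entries are hypothetical.

```python
def trigrams(text):
    """Return the set of character trigrams of a padded, lower-cased string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Jaccard overlap of the trigram sets (an assumed similarity measure)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def generate_candidates(entity_labels, label_index, theta=0.82):
    """For each entity label, keep every resource that has at least one
    label or surface form with similarity larger than theta."""
    candidates = {}
    for label in entity_labels:
        candidates[label] = sorted(
            resource
            for resource, names in label_index.items()
            if any(similarity(label, name) > theta for name in names)
        )
    return candidates


# Hypothetical merged index: resource -> labels and surface forms.
label_index = {
    "dbr:Barack_Obama": ["Barack Obama"],
    "dbr:Barack_Obama,_Sr.": ["Barack Obama, Sr.", "Barack Obama"],
    "dbr:Michelle_Obama": ["Michelle Obama"],
}

candidates = generate_candidates(["Barack Obama"], label_index)
print(candidates)
```

With the default θ = 0.82 only near-identical labels survive, so "Michelle Obama" is rejected while both resources carrying the surface form "Barack Obama" are kept for the disambiguation step.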
Breadth-First Search and HITS
Given: Set of resources for each label, i.e., set of nodes
Output: Highest ranked resource for each label
Method:
Breadth-first search from each initial resource
Run HITS algorithm on this graph
Choose resource with highest authority for each label
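The three steps of the method can be sketched as follows. This is a minimal sketch under stated assumptions: the knowledge base is a hypothetical in-memory dictionary of directed edges, HITS is the textbook variant with L2 normalisation and an assumed 20 iterations, and the toy edges are invented for illustration, so the scores are not meant to reproduce any numbers from the talk.

```python
from collections import deque
from math import sqrt


def expand_graph(seeds, outgoing, depth=2):
    """Breadth-first search: collect nodes and edges within `depth` hops of any seed."""
    seen = set(seeds)
    edges = set()
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, dist = queue.popleft()
        if dist == depth:
            continue  # do not expand beyond the maximum depth
        for target in outgoing.get(node, ()):
            edges.add((node, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, dist + 1))
    return seen, edges


def normalise(scores):
    """Scale scores to unit L2 norm (left unchanged if all are zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(nodes, edges, iterations=20):
    """Textbook HITS: authority = sum of incoming hubs, hub = sum of outgoing authorities."""
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new_auth = {n: 0.0 for n in nodes}
        for source, target in edges:
            new_auth[target] += hub[source]
        auth = normalise(new_auth)
        new_hub = {n: 0.0 for n in nodes}
        for source, target in edges:
            new_hub[source] += auth[target]
        hub = normalise(new_hub)
    return auth, hub


def disambiguate(candidates, outgoing, depth=2):
    """Pick the candidate with the highest authority score for each label."""
    seeds = [r for resources in candidates.values() for r in resources]
    nodes, edges = expand_graph(seeds, outgoing, depth)
    auth, _ = hits(nodes, edges)
    return {
        label: max(resources, key=lambda r: auth[r])
        for label, resources in candidates.items()
        if resources
    }


# Hypothetical directed edges of a tiny knowledge base.
outgoing = {
    "dbr:Barack_Obama": ["dbr:White_House", "dbr:Washington,_D.C."],
    "dbr:Washington,_D.C.": ["dbr:White_House", "dbr:Barack_Obama"],
    "dbr:White_House": ["dbr:Washington,_D.C.", "dbr:Barack_Obama"],
    "dbr:Barack_Obama,_Sr.": ["dbr:Barack_Obama", "dbr:Kenya"],
}
candidates = {
    "Barack Obama": ["dbr:Barack_Obama", "dbr:Barack_Obama,_Sr."],
    "Washington, D.C.": ["dbr:Washington,_D.C.", "dbr:Washington,_D.C._(novel)"],
}

result = disambiguate(candidates, outgoing)
print(result)
```

In this toy graph the two intended resources link to each other through shared neighbours, so they collect authority, while the candidates without incoming edges stay at zero.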
Example II
Example III
Node                            x_a
dbr:Barack_Obama                0.273
dbr:Barack_Obama,_Sr.           0.089
dbr:Washington,_D.C.            0.093
dbr:Washington,_D.C._(novel)    0.000
Experimental Setup
Goal
Measure the accuracy of AGDISTIS on different languages
Baseline: State-of-the-art frameworks (AIDA, Spotlight, TagMe 2)
Evaluation measure: Micro F-measure
Nine datasets in three languages (7x English, 1x Chinese, 1x German)
Default settings: d = 2, θ = 0.82
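Micro-averaging pools true positives, false positives, and false negatives over all documents before precision, recall, and F-measure are computed. The sketch below illustrates this, assuming annotations are represented as sets of (label, resource) pairs; that representation and the toy documents are hypothetical and not taken from the evaluation.

```python
def micro_f_measure(documents):
    """Compute micro-averaged precision, recall, and F-measure.

    `documents` is a list of (gold, predicted) pairs, each a set of
    (label, resource) annotations for one document.
    """
    tp = fp = fn = 0
    for gold, predicted in documents:
        tp += len(gold & predicted)   # correct annotations
        fp += len(predicted - gold)   # wrong annotations
        fn += len(gold - predicted)   # missed annotations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Two hypothetical documents with gold and predicted annotations.
documents = [
    (
        {("Barack Obama", "dbr:Barack_Obama"),
         ("Washington, D.C.", "dbr:Washington,_D.C.")},
        {("Barack Obama", "dbr:Barack_Obama"),
         ("Washington, D.C.", "dbr:Washington,_D.C._(novel)")},
    ),
    (
        {("Leipzig", "dbr:Leipzig")},
        {("Leipzig", "dbr:Leipzig")},
    ),
]

precision, recall, f1 = micro_f_measure(documents)
print(precision, recall, f1)
```

Because the counts are pooled first, documents with many entities weigh more than short ones, which is the usual reason to report the micro rather than the macro average.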
Evaluation
Best θ ∈ [0.8, 0.9] across all datasets
d = 2 best in all experiments
Evaluation
English datasets
Deployed AGDISTIS on DBpedia and YAGO2
F-measure per corpus:
Corpus        AGDISTIS    AGDISTIS    AIDA     Spotlight
K             DBpedia     YAGO2       YAGO2    DBpedia
Reuters       0.78        0.60        0.62     0.56
RSS-500       0.75        0.53        0.6      0.56
AIDA-YAGO2    0.73        0.58        0.83     0.57
Evaluation
Benchmark datasets from BAT framework (English)
Dataset             Approach             F1-measure    Precision    Recall
AIDA/CONLL-TestB    TagMe 2              0.565         0.58         0.551
                    DBpedia Spotlight    0.341         0.308        0.384
                    AGDISTIS             0.596         0.642        0.556
AQUAINT             TagMe 2              0.457         0.412        0.514
                    DBpedia Spotlight    0.26          0.178        0.48
                    AGDISTIS             0.547         0.777        0.422
IITB                TagMe 2              0.408         0.416        0.4
                    DBpedia Spotlight    0.46          0.434        0.489
                    AGDISTIS             0.31          0.646        0.204
MSNBC               TagMe 2              0.466         0.431        0.508
                    DBpedia Spotlight    0.331         0.317        0.347
                    AGDISTIS             0.761         0.796        0.729
Chinese AGDISTIS
Support a non-European language
Benchmark: QALD4 queries
200 questions in the training data
50 questions in the test data
F-measure between 65% (training data) and 70% (test data).
German AGDISTIS
news.de Dataset (N3 collection)
Collected from web news portal news.de
53 documents, 627 named entities
AGDISTIS: 0.87 F1-measure, Spotlight: 0.84 F1-measure
GERBIL
Figure: GERBIL architecture. Labels recoverable from the diagram:
Annotators (web service calls, NLP Interchange Format)
Datasets (NLP Interchange Format; pull via HTTP or HDD lookup)
Matching (BAT-Framework for Entity Annotators by Cornolti et al., WWW '13)
Experiment type
Online User Interface
Only results (measures) are transferred, not annotations
http://github.com/AKSW/GERBIL
Demo
http://agdistis.aksw.org/demo/
Conclusion
Presented AGDISTIS
Polynomial time complexity
Greedy and knowledge-base agnostic
Multilingual (English, German and Chinese, more to come)
High accuracy on diverse knowledge bases and test datasets
Future Work
Include graph summarization
Extension to more languages
Combination with other approaches
That’s all Folks!
Thank you!
Questions?
Axel Ngonga
AKSW Research Group
[email protected]
http://github.com/AKSW/AGDISTIS
http://agdistis.aksw.org/demo
Live Demo: Oct. 21st, Stand 79