Similarity - SAPO Labs

Connecting the dots between news
Research Team: Carla Abreu, Jorge Teixeira, Prof. Eugénio Oliveira
Domain: News
Research Keywords: Natural Language Processing, Information Extraction, Machine Learning.
Objective
"… larger and larger amounts of news content is published every day. With this much data, it is often easy to miss the big picture."
(Shahaf and Guestrin, 2010)
…
Objective: Automatically aggregate similar news and build news chains
(Shahaf and Guestrin, 2010): Connecting the Dots Between News Articles
How to do this?
Similarity
Keywords Extraction
News group
News group / Keywords
Arch
News chains
Similarity
Aim:
Clustering Similar News
Challenges:
Which news data are important for the similarity process? How can we use that data?
Which methods can we use in this process? How can we evaluate this process?
Similarity
Filter:
Press review: highlights from "O Jogo"
Newspapers of the day
Mourinho says his Brazilians played very well. They tried to embarrass him with the 6-2 thrashing suffered by Portugal.
Press review: highlights from "Jornal de Notícias"
Newspapers of the day
Government pressures school boards. Ministry considers evaluating executive councils under the public-sector system.
Normalization:
● remove punctuation marks;
● remove patterns;
● remove stop-words (snowball);
● word stemming (ptstemmer).
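The four normalization steps above can be sketched as follows. This is only an illustration: the stop-word set, the pattern list and `toy_stem` are hypothetical stand-ins, since the slides name the Snowball stop-word list and ptstemmer but do not show the patterns or the code.

```python
import re
import string

# Hypothetical mini stop-word list; the slides use the Snowball list.
STOP_WORDS = {"a", "o", "os", "as", "de", "do", "da", "dos", "das", "e", "que", "com", "por"}

# Hypothetical pattern; the slides do not say which patterns are removed.
PATTERNS = [re.compile(r"\blusa\b", re.IGNORECASE)]


def toy_stem(word):
    """Placeholder stemmer that strips a final 's'. It is not ptstemmer."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def normalize(text, stem=toy_stem):
    """Apply the four steps in slide order and return a list of tokens."""
    # 1. Remove punctuation marks (ASCII plus typographic quotes).
    text = text.translate(str.maketrans("", "", string.punctuation + "“”«»"))
    # 2. Remove patterns.
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    # 3. Remove stop-words.
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    # 4. Stem the remaining words.
    return [stem(t) for t in tokens]
```

A real stemmer can be passed in through the `stem` argument without changing the rest of the function.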
Similarity
News comparison: Title, Teaser, Content
Similarity:
● Title - ST*;
● Teaser (S) - STe*;
● Content - SC*.
Temporal Window
● T
* Values between 0 and 1
Similarity
First Approach
Similar Tree (manual threshold assignment; empirical values)
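A hand-tuned threshold tree of this kind might look like the sketch below. The three threshold values and the order of the tests are hypothetical; the slides only state that thresholds were assigned manually from empirical values.

```python
# Hypothetical thresholds; the real values are not given in the slides.
TITLE_MIN = 0.8
TEASER_MIN = 0.6
CONTENT_MIN = 0.5


def similar_tree(st, ste, sc):
    """Return True when a pair of news items is judged similar.

    st, ste and sc are the title, teaser and content similarities in [0, 1].
    """
    if st >= TITLE_MIN:
        return True  # near-identical titles settle the decision
    if ste >= TEASER_MIN:
        return sc >= CONTENT_MIN  # a teaser match must be backed by content
    return False
```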
Second Approach
Classification methods (provided by scikit-learn; automatic approach)
● Decision Tree;
● Support Vector Classifier (SVC)
● SVC Linear
● Random Forest
● Gaussian
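The second approach can be sketched with scikit-learn as below. The training data here is a made-up toy set of [title, teaser, content] similarities, and mapping "Gaussian" to `GaussianNB` is an assumption, since the slides do not name the exact class or any hyperparameters.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy pairs: [title, teaser, content] similarity; label 1 = similar (invented data).
X_train = [
    [0.95, 0.90, 0.85], [0.88, 0.80, 0.90], [0.92, 0.85, 0.80], [0.85, 0.95, 0.88],
    [0.10, 0.05, 0.15], [0.20, 0.10, 0.05], [0.05, 0.15, 0.10], [0.15, 0.20, 0.12],
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]

classifiers = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "SVC": SVC(),
    "SVC Linear": LinearSVC(),
    "RandomForest": RandomForestClassifier(n_estimators=10, random_state=0),
    "Gaussian": GaussianNB(),  # assumption: "Gaussian" means Gaussian naive Bayes
}


def predict_all(pairs):
    """Fit every classifier on the toy data and predict a label for each pair."""
    results = {}
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        results[name] = [int(label) for label in clf.predict(pairs)]
    return results
```

Calling `predict_all([[0.9, 0.9, 0.9], [0.05, 0.05, 0.05]])` returns one list of labels per classifier, which makes it easy to compare the methods on the same pairs.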
Similarity
Features
1. Title Similarity
2. Teaser Similarity
3. Content Similarity
Variables:
● S = 0.2
● T = 1
● Algorithm - Levenshtein
● Stemmer - Porter Stemmer
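A similarity value between 0 and 1 can be derived from the Levenshtein distance as sketched here. Dividing by the length of the longer string is an assumption; the slides do not say how the distance was normalized.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map the distance to [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```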
Similarity
Dataset
3 million Portuguese news articles published between 2008 and 2013
Training Set
● Select 100 news of each day (between 23 Dec 2012 and 22 Jan 2013)
○ Annotate randomly 371 comparisons
Test Set
1. TS1: Select 501 distinct news from 19 Nov 2012 - Annotate randomly 5101 comparisons
2. TS2: Select 210 distinct news from 19 Nov 2012 - Annotate randomly 1047 comparisons
Similarity
Annotation Interface
Similarity
Experimental Setup
Precision (P): P = TP / (TP + FP)
Recall (R): R = TP / (TP + FN)
Accuracy (A): A = (TP + TN) / (TP + TN + FP + FN)
F measure (F): F = 2 * P * R / (P + R)
True Positives (TP): number of similar news correctly identified;
False Positives (FP): number of non similar news identified as similar;
True Negatives (TN): number of non similar news correctly identified;
False Negatives (FN): number of similar news identified as non similar.
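The four measures can be computed from the confusion counts as in this small sketch. The function name and the zero-division guards are choices made here, not part of the slides.

```python
def evaluate(tp, fp, tn, fn):
    """Return precision, recall, accuracy and F measure from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, accuracy, f_measure
```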
Similarity
Results and Analyses
Method        P      R      A      F
DecisionTree  0.958  0.932  0.985  0.945
SVC           0.993  0.963  0.994  0.978
SVC Linear    0.991  0.963  0.994  0.977
RandomForest  0.987  0.960  0.993  0.974
Gaussian      0.701  0.964  0.956  0.812
Similar Tree  0.999  0.839  0.974  0.912
RandomForest: Random Behaviour
Gaussian: Worst Performance
SVCs results are better than Decision Tree in all metrics
SVCs have similar results
SVC: Better combination of evaluation metrics
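As a quick sanity check, the reported F values can be recomputed from the reported P and R. The sketch below only reuses the published figures; small gaps are expected because P and R are themselves rounded to three decimals.

```python
# (P, R, F) as reported in the results table.
reported = {
    "DecisionTree": (0.958, 0.932, 0.945),
    "SVC": (0.993, 0.963, 0.978),
    "SVC Linear": (0.991, 0.963, 0.977),
    "RandomForest": (0.987, 0.960, 0.974),
    "Gaussian": (0.701, 0.964, 0.812),
    "Similar Tree": (0.999, 0.839, 0.912),
}


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Gap between the recomputed and the reported F for each method.
gaps = {name: abs(f_measure(p, r) - f) for name, (p, r, f) in reported.items()}
worst = max(gaps, key=gaps.get)
```

Every gap stays below 0.001, so the F column is consistent with the P and R columns up to rounding.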
News Group
News Group
News 2014 (3 April to 20 June)
Number of news: 186 366
Cluster number: 23 047
Average amount of news per cluster: ~3.7
March 2014, 10-15
Number of news: 16 747
Number of news in news group: 8278
Keywords extraction
Aim:
Extract relevant terms from text.
Challenges:
Can any word be considered a keyword? Can a news item be described by a simple word, a compound word, or an entity? How can we extract useful keywords from the news?
Keywords extraction
Approach
Explicit Keywords
○ Simple (uni-grams)
○ Compound (n-grams)
Implicit Keywords
Entities
Example terms shown on the slide: Governo, rebeldes, busca, competição, atentado à bomba, avião da Malaysia Airlines, fase de grupos, Bagdade, Malásia, Rui Patrício, Tribunal Constitucional, Presidente República
Keywords extraction
Explicit Keywords
POS Tagger (Pablo Gamallo) [n-grams]
Normalization:
Remove Patterns
Stemmer [uni-grams]
Term frequency - Inverse document frequency (TF-IDF):
o(W, DOC): number of occurrences of WORD in DOCUMENT
npalavras(DOC): number of words in DOCUMENT
docs(ALL): number of documents in the document collection
docs(W, ALL): number of documents in the document collection which contain WORD
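With these four quantities, the usual TF-IDF weight is (o(W, DOC) / npalavras(DOC)) * log(docs(ALL) / docs(W, ALL)). The sketch below assumes that standard form and a natural logarithm; the slide's own formula did not survive, so both are assumptions.

```python
import math
from collections import Counter


def tf_idf(word, doc, collection):
    """TF-IDF of word in doc; doc and each item of collection are token lists.

    tf  = o(W, DOC) / npalavras(DOC)
    idf = log(docs(ALL) / docs(W, ALL))   # natural log is an assumption
    """
    occurrences = Counter(doc)[word]                       # o(W, DOC)
    n_words = len(doc)                                     # npalavras(DOC)
    n_docs = len(collection)                               # docs(ALL)
    n_docs_with_word = sum(word in d for d in collection)  # docs(W, ALL)
    if n_words == 0 or n_docs_with_word == 0:
        return 0.0
    return (occurrences / n_words) * math.log(n_docs / n_docs_with_word)
```

A word that appears in every document gets weight 0, which is why frequent but uninformative terms drop out of the keyword ranking.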
Keywords extraction
Implicit Keywords
Normalization
Relation between words (Ventura, Silva 2013)
Corr(A, B) is based on Pearson's correlation coefficient; ||D|| is the number of documents in corpus D; di is the i-th document in D; size(di) is its number of words and f(A, di) is the frequency of term A in di. Corr(A, B) ranges from -1 (non correlation) to +1 (strong correlation).
(Ventura, Silva 2013): Automatic Extraction of Explicit and Implicit Keywords to Build Document Descriptors
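A Pearson-style correlation between two terms can be sketched as below, assuming each term is represented by its relative frequency f(A, di) / size(di) in every document. That representation follows the quantities listed on the slide, but the exact formula is not shown there, so treat this as an approximation rather than the authors' definition.

```python
import math


def corr(term_a, term_b, docs):
    """Pearson correlation of the relative frequencies of two terms.

    docs is a list of non-empty token lists; p(A, di) = f(A, di) / size(di).
    """
    pa = [d.count(term_a) / len(d) for d in docs]
    pb = [d.count(term_b) / len(d) for d in docs]
    mean_a = sum(pa) / len(docs)
    mean_b = sum(pb) / len(docs)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(pa, pb))
    var_a = sum((x - mean_a) ** 2 for x in pa)
    var_b = sum((y - mean_b) ** 2 for y in pb)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined here; returning 0 is a choice
    return cov / math.sqrt(var_a * var_b)
```

Terms that tend to occur in the same documents score close to +1, which is how a term absent from a document can still be proposed as an implicit keyword for it.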
Keywords extraction
Entities
Find Entities
The average age of those interviewed was 11 at the start of the study, with boys making up three quarters of the total
Young people who play video games are more prone to think and act aggressively, indicates a study of more than 3,000 students in Singapore released today.
The study, published by the journal of the American Medical Association and based on three years of work with 3,034 young people, concluded, based on the students' answers, that there was a link between frequent use of video games and high rates of aggressive behaviour and thoughts.
Keywords extraction
Dataset
4789 news articles from January to December (2012)
Test set:
1. select one day from each month of 2012
2. select three hours of each day
3. extract keywords
4. select 10 news from each day
5. check manually the keywords
Keywords extraction
Experimental Setup
PalavrasChaveRepresentativas: number of words that represent the news
PalavrasChaveAtribuídas: number of words attributed to the news
||N||: number of news
Results
Keyword type         Evaluation
Explicit - Simple    0.732
Explicit - Compound  0.762
Implicit             ~0
Entity               0.804
News Group / Keywords
Aim: associate keywords with news groups according to their weight
Arch
Aim:
Connect groups of news
Challenges:
How can we aggregate news clusters? Which fields need to be considered?
Arch
Approach (explicit simple keywords, entities and personalities)
Normalization
● lowercase
● explicit simple keywords - reduce words to their stem
Find Personalities
● From entities and explicit compound keywords using Verbetes.
Distance:
|ka|: number of words in news group a; |kb|: number of words in news group b;
Wkja: weight of word j in news group a; Wkib: weight of word i in news group b; D1 and D2: range from 0 to 1
Arch
Approach (explicit compound keywords)
Normalization
● lowercase
● remove stop-words
All words have the same weight
Distance:
● Edit distance algorithm - qgrams - q=3
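One common q-gram distance counts how many q-grams the two strings do not share, as sketched below with q = 3. Which q-gram variant was actually used, and whether it was normalized, is not stated in the slides, so this is one plausible reading.

```python
from collections import Counter


def qgrams(text, q=3):
    """Multiset of all substrings of length q."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_distance(a, b, q=3):
    """Sum of absolute differences between the q-gram counts of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return sum(abs(ga[g] - gb[g]) for g in set(ga) | set(gb))
```

Inputs are expected to be lowercased and stripped of stop-words first, as listed under Normalization. Strings shorter than q produce no q-grams, so very short keywords would need separate handling.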
Arch
Goldstandard
1408 news (January 2012)
● 131 groups of news
Trainset:
5671 comparisons between groups of news
● 277 connections
● 5394 non connections
Testset:
300 comparisons between groups of news
● 26 connections
● 247 non connections
Arch
Experiments
1. 6 experiments: metrics to calculate distance (D1 and D2)
2. 11 experiments: constraints on comparisons
- number of entities
- number of personalities
- similarity between explicit simple keywords
Arch
Experimental Setup
Precision (P): P = TP / (TP + FP)
Recall (R): R = TP / (TP + FN)
True Positives (TP): number of connections correctly identified;
False Positives (FP): number of non connections identified as connections;
True Negatives (TN): number of non connections correctly identified;
False Negatives (FN): number of connections identified as non connections.
Arch
Results and Analyses
Experiments
1. Metrics:
a. Explicit simple keyword: D1
b. Personalities: D1
c. Entities: D2
2. Constraints:
a. Entities >= 3
b. Explicit simple keyword similarity >= 0.2
Best Result
Gaussian: Precision 0.941, Recall 0.308
News Chains
Thanks!
Carla Abreu
Acknowledgement: Bruno Tavares
Connecting the dots between news