Successful Data Mining Methods for NLP
Jiawei Han (UIUC), Heng Ji (RPI), Yizhou Sun (NEU)
7/26/2015
Slides: http://hanj.cs.illinois.edu/slides/dmnlp15.pptx
        http://nlp.cs.rpi.edu/paper/dmnlp15.pptx
Introduction: Where do NLP and DM Meet?
Slightly Different Research Philosophies

• NLP: Deep understanding of individual words, phrases, and sentences ("micro-level"); focus on unstructured text data
• Data Mining (DM): High-level (statistical) understanding, discovery, and synthesis of the most salient information ("macro-level"); historically focused more on structured and semi-structured data

[Figure: NewsNet (Tao et al., 2014), showing items related to "Health Care Bill"]
DM Solution: Data to Networks to Knowledge (D2N2K)
• Advantages of NLP
  - Construct graphs/networks with fine-grained semantics from unstructured texts
  - Use large-scale annotations for real-world data
• Advantages of DM: Deep understanding through structured/correlation inference
• Use a structured representation (e.g., graph, network) as a bridge to capture the interactions between NLP and DM
  - Example: Heterogeneous Information Networks [Han et al., 2010; Sun et al., 2012]

[Diagram: Data → Networks → Knowledge]
A Promising Direction: Integrating DM and NLP
• Major theme of this tutorial
  - Applying novel DM methods to solve traditional NLP problems
  - Integrating DM and NLP, transforming Data to Networks to Knowledge
• Road map of this tutorial
  - Effective network construction by leveraging information redundancy
    - Theme I: Phrase Mining and Topic Modeling from Large Corpora
    - Theme II: Entity Extraction and Linking by Relational Graph Construction
  - Mining knowledge from structured networks
    - Theme III: Search and Mining Structured Graphs and Heterogeneous Networks
  - Looking forward to the future
Theme I: Phrase Mining and Topic Modeling from Large Corpora
Why Phrase Mining?

• Phrase: Minimal, unambiguous semantic unit; basic building block for information network and knowledge base
• Unigrams vs. phrases
  - Unigrams (single words) are ambiguous
    - Example: "United": United States? United Airline? United Parcel Service?
  - Phrase: A natural, meaningful, unambiguous semantic unit
    - Example: "United States" vs. "United Airline"

• Mining semantically meaningful phrases
  - Transform text data from word granularity to phrase granularity
  - Enhance the power and efficiency of manipulating unstructured data using database technology
Mining Phrases: Why Not Just Use NLP Methods?

• Phrase mining originated in the NLP community as "chunking"
  - Modeled as a sequence labeling problem (B-NP, I-NP, O, …)
  - Needs annotation and training
    - Annotate hundreds of POS-tagged documents as training data
    - Train a supervised model based on part-of-speech features
  - Recent trend: use distributional features based on web n-grams (Bergsma et al., 2010)
  - State-of-the-art performance: ~95% accuracy, ~88% phrase-level F-score
• Limitations
  - High annotation cost; not scalable to a new language, domain, or genre
  - May not fit domain-specific, dynamic, or emerging applications
    - Scientific domains, query logs, or social media, e.g., Yelp, Twitter
  - Uses only local features; no ranking, no links to topics
Data Mining Approaches for Phrase Mining
• General principle: corpus-based; fully exploit information redundancy and data-driven criteria to determine phrase boundaries and salience; use local evidence to adjust corpus-level statistics
• Phrase mining and topic modeling from large corpora
  - Strategy 1: Simultaneously inferring phrases and topics
    - Bigram topic model [Wallach'06], topical n-gram model [Wang et al.'07], phrase-discovering topic model [Lindsey et al.'12]
  - Strategy 2: Post topic-modeling phrase construction
    - Topic labeling [Mei et al.'07], TurboTopics [Blei & Lafferty'09], KERT [Danilevsky et al.'14]
  - Strategy 3: First phrase mining, then topic modeling
    - ToPMine [El-Kishky et al., VLDB'15]
• Integration of phrase mining with document segmentation
  - SegPhrase [Liu et al., SIGMOD'15]
Strategy 1: Simultaneously Inferring Phrases and Topics
• Bigram Topic Model [Wallach'06]
  - Probabilistic generative model that conditions on the previous word and topic when drawing the next word
• Topical N-Grams (TNG) [Wang et al.'07]
  - Probabilistic model that generates words in textual order
  - Creates n-grams by concatenating successive bigrams (a generalization of the Bigram Topic Model)
• Phrase-Discovering LDA (PDLDA) [Lindsey et al.'12]
  - Viewing each sentence as a time series of words, PDLDA posits that the generative parameter (topic) changes periodically
  - Each word is drawn based on the previous m words (context) and the current phrase topic
• High model complexity: tends to overfit; high inference cost: slow
Strategy 2: Post Topic Modeling Phrase Construction

• TurboTopics [Blei & Lafferty'09] – Phrase construction as a post-processing step to Latent Dirichlet Allocation
  - Perform Latent Dirichlet Allocation on the corpus to assign each token a topic label
  - Merge adjacent unigrams with the same topic label by a distribution-free permutation test on an arbitrary-length back-off model
  - End recursive merging when all significant adjacent unigrams have been merged
• KERT [Danilevsky et al.'14] – Phrase construction as a post-processing step to Latent Dirichlet Allocation
  - Perform frequent pattern mining on each topic
  - Perform phrase ranking based on four different criteria
Example of TurboTopics


• Perform LDA on the corpus to assign each token a topic label
  - E.g., … phase_11 transition_11 … game_153 theory_127 …
• Then merge adjacent unigrams with the same topic label
  - E.g., … [phase transition]_11 … [game theory]_153 …
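A minimal sketch of the merging step, assuming topic labels are already assigned per token (the real TurboTopics merges only pairs that pass a distribution-free permutation test, which is omitted here):

```python
from itertools import groupby

def merge_same_topic(tokens, topics):
    """Merge runs of adjacent tokens that share a topic label."""
    pairs = zip(tokens, topics)
    merged = []
    for topic, run in groupby(pairs, key=lambda p: p[1]):
        words = [w for w, _ in run]
        merged.append(("_".join(words), topic))
    return merged

print(merge_same_topic(["phase", "transition", "game", "theory"],
                       [11, 11, 153, 127]))
# [('phase_transition', 11), ('game', 153), ('theory', 127)]
```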
Framework of KERT
1. Run bag-of-words model inference and assign a topic label to each token
2. Extract candidate keyphrases within each topic
   - Frequent pattern mining
3. Rank the keyphrases in each topic
   - Comparability property: directly compare phrases of mixed lengths

Example: top-ranked keyphrases for the machine learning topic:

  kpRel [Zhao et al. 11] | KERT (-popularity) | KERT (-discriminativeness) | KERT (-concordance) | KERT [Danilevsky et al. 14]
  learning               | effective          | support vector machines    | learning            | learning
  classification         | text               | feature selection          | classification      | support vector machines
  selection              | probabilistic      | reinforcement learning     | selection           | reinforcement learning
  models                 | identification     | conditional random fields  | feature             | feature selection
  algorithm              | mapping            | constraint satisfaction    | decision            | conditional random fields
  features               | task               | decision trees             | bayesian            | classification
  decision               | planning           | dimensionality reduction   | trees               | decision trees
  …                      | …                  | …                          | …                   | …
Strategy 3: First Phrase Mining then Topic Modeling

• ToPMine [El-Kishky et al., VLDB'15]
  - First phrase construction, then topic mining
  - Contrast with KERT: topic modeling, then phrase mining
• The ToPMine framework (a counting sketch follows below):
  - Perform frequent contiguous pattern mining to extract candidate phrases and their counts
  - Perform agglomerative merging of adjacent unigrams as guided by a significance score; this segments each document into a "bag of phrases"
  - The newly formed bags of phrases are passed as input to PhraseLDA, an extension of LDA that constrains all words in a phrase to share the same latent topic
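A minimal sketch of the first step, under simplifying assumptions (brute-force counting; `max_len` and `min_support` are illustrative values, and an Apriori-style downward-closure pruning would be used in practice for efficiency):

```python
from collections import Counter

def contiguous_ngram_counts(docs, max_len=6, min_support=2):
    """Count contiguous n-grams (candidate phrases) across tokenized docs."""
    counts = Counter()
    for tokens in docs:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    # Keep only frequent candidates
    return {ng: c for ng, c in counts.items() if c >= min_support}

docs = [["support", "vector", "machine"],
        ["support", "vector", "machine", "classifiers"]]
print(contiguous_ngram_counts(docs))
```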
Why First Phrase Mining, then Topic Modeling?
• With Strategy 2, tokens in the same phrase may be assigned to different topics
  - Ex.: knowledge discovery using least squares support vector machine classifiers …
  - "Knowledge discovery" and "support vector machine" should have coherent topic labels
• Solution: switch the order of phrase mining and topic model inference
  - [knowledge discovery] using [least squares] [support vector machine] [classifiers] …
• Techniques
  - Phrase mining and document segmentation
  - Topic model inference with phrase constraints
Phrase Mining: Frequent Pattern Mining + Statistical Analysis
• Quality phrases are scored by a significance score [Church et al.'91]:

  α(P1, P2) ≈ (f(P1●P2) − μ0(P1, P2)) / √f(P1●P2)

  where f(P1●P2) is the observed frequency of the concatenation P1●P2 and μ0(P1, P2) its expected frequency assuming independence (a numeric sketch follows below)
• Phrasal segmentation examples:
  - [Markov blanket] [feature selection] for [support vector machines]
  - [knowledge discovery] using [least squares] [support vector machine] [classifiers] …
  - … [support vector] for [machine learning] …
• Raw vs. rectified ("true") frequency:

  Phrase                   | Raw freq. | True freq.
  [support vector machine] | 90        | 80
  [vector machine]         | 95        | 0
  [support vector]         | 100       | 20
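A sketch of the score, assuming μ0 is the expected count of the concatenation under independence (the exact normalization used in ToPMine may differ in details):

```python
import math

def significance(f_joint, f1, f2, total_words):
    """alpha(P1, P2) ~ (f(P1.P2) - mu0(P1, P2)) / sqrt(f(P1.P2))."""
    mu0 = f1 * f2 / total_words        # expected count under independence
    return (f_joint - mu0) / math.sqrt(f_joint)

# Merge an adjacent pair during agglomerative merging only if the
# score exceeds a threshold (illustrative numbers):
alpha = significance(f_joint=80, f1=100, f2=95, total_words=10_000)
print(round(alpha, 2), alpha > 5.0)    # 8.84 True
```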
Collocation Mining





• Collocation: a sequence of words that occur together more frequently than expected by chance
• Often "interesting"; due to their non-compositionality, collocations often convey information not expressed by their constituent terms (e.g., "made an exception", "strong tea")
• Many different measures are used to extract collocations from a corpus [Dunning 93, Pederson 96]
  - E.g., mutual information, t-test, z-test, chi-squared test, likelihood ratio
• Many of these measures can be used to guide the agglomerative phrase-segmentation algorithm
ToPMine: PhraseLDA (Constrained Topic Modeling)
• The generative model for PhraseLDA is the same as for LDA
  - Difference: the model incorporates constraints obtained from the "bag-of-phrases" input
  - A chain graph shows that all words in a phrase are constrained to take on the same topic value
• Example: [knowledge discovery] using [least squares] [support vector machine] [classifiers] …
• Topic model inference with phrase constraints
Example Topical Phrases: A Comparison
PDLDA [Lindsey et al. 12], Strategy 1 (3.72 hours):

  Topic 1           | Topic 2
  social networks   | information retrieval
  web search        | text classification
  time series       | machine learning
  search engine     | support vector machines
  management system | information extraction
  real time         | neural networks
  decision trees    | text categorization
  …                 | …

ToPMine [El-Kishky et al. 14], Strategy 3 (67 seconds):

  Topic 1               | Topic 2
  information retrieval | feature selection
  social networks       | web search
  search engine         | information extraction
  machine learning      | semi supervised
  large scale           | support vector machines
  question answering    | active learning
  web pages             | face recognition
  …                     | …
ToPMine: Experiments on DBLP Abstracts
[Result tables not transcribed]

ToPMine: Topics on Associated Press News (1989)
[Result tables not transcribed]
ToPMine: Experiments on Yelp Reviews
[Result tables not transcribed]
Efficiency: Running Time of Different Strategies
• Running time: strategy 3 > strategy 2 > strategy 1 (">" means outperforms, i.e., runs faster)
  - Strategy 1: Generate bag-of-words → generate sequence of tokens
  - Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
  - Strategy 3: Prior to bag-of-words model inference, mine phrases and impose them on the bag-of-words model
Coherence of Topics: Comparison of Strategies
• Coherence measured by z-score: strategy 3 > strategy 2 > strategy 1
  - Strategy 1: Generate bag-of-words → generate sequence of tokens
  - Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
  - Strategy 3: Prior to bag-of-words model inference, mine phrases and impose them on the bag-of-words model
Phrase Intrusion: Comparison of Strategies
Phrase intrusion measured by average number of correct answers: strategy 3 > strategy 2 > strategy 1
Phrase Quality: Comparison of Strategies
Phrase quality measured by z‐score: strategy 3 > strategy 2 > strategy 1
Mining Phrases: Why Not Use Raw Frequency Based Methods?
• Traditional data-driven approaches
  - Frequent pattern mining: if AB is frequent, AB could likely be a phrase
• Raw frequency may NOT reflect the quality of phrases
  - E.g., freq(vector machine) ≥ freq(support vector machine)
  - Need to rectify the frequency based on segmentation results
• Phrasal segmentation will tell
  - Some words should be treated as a whole phrase whereas others remain unigrams
SegPhrase: From Raw Corpus to Quality Phrases and Segmented Corpus

[Pipeline: Raw Corpus → Phrase Mining → Quality Phrases → Phrasal Segmentation → Segmented Corpus]

• Input raw corpus, e.g.:
  - Document 1: Citation recommendation is an interesting but challenging research problem in data mining area.
  - Document 2: In this study, we investigate the problem in the context of heterogeneous information networks using data mining technique.
  - Document 3: Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications.
• Quality phrases
  - Build a candidate phrase set by frequent pattern mining
  - Mine frequent k-grams; k is typically small, e.g., 6 in our experiments
  - Popularity measured by the raw frequencies of words and phrases mined from the corpus
SegPhrase: The Overall Framework



• ClassPhrase: Frequent pattern mining, feature extraction, classification
• SegPhrase: Phrasal segmentation and phrase quality estimation
• SegPhrase+: One more round to enhance mined phrase quality
What Kind of Phrases Are of “High Quality”?
• Judging the quality of phrases
  - Popularity
    - "information retrieval" vs. "cross-language information retrieval"
  - Concordance
    - "powerful tea" vs. "strong tea"
    - "active learning" vs. "learning classification"
  - Informativeness
    - "this paper" (frequent but not discriminative, not informative)
  - Completeness
    - "vector machine" vs. "support vector machine"
Feature Extraction: Concordance
• Partition a phrase into two parts ⟨u_l, u_r⟩ to check whether their co-occurrence is significantly higher than pure random
  - E.g., ⟨support vector, machine⟩; ⟨this paper, demonstrates⟩
• Pointwise mutual information:
  - PMI(u_l, u_r) = log( p(v) / (p(u_l) p(u_r)) ), where v = ⟨u_l, u_r⟩
• Pointwise KL divergence:
  - PKL(v ‖ ⟨u_l, u_r⟩) = p(v) log( p(v) / (p(u_l) p(u_r)) )
  - The additional factor p(v) multiplied into pointwise mutual information leads to less bias towards rare phrases (a numeric sketch follows below)
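A small sketch of both features, with made-up probabilities (p_joint is the probability of the whole phrase, p_left/p_right of its two parts):

```python
import math

def pmi(p_joint, p_left, p_right):
    # Pointwise mutual information of a phrase vs. its two parts
    return math.log(p_joint / (p_left * p_right))

def pkl(p_joint, p_left, p_right):
    # Pointwise KL divergence: PMI weighted by p(v),
    # reducing the bias toward rare phrases
    return p_joint * pmi(p_joint, p_left, p_right)

# <support vector, machine> with illustrative probabilities:
print(round(pmi(2e-5, 4e-5, 1e-3), 2))   # high -> good concordance
print(pkl(2e-5, 4e-5, 1e-3))
```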
Feature Extraction: Informativeness
• Deriving informativeness
  - Quality phrases typically start and end with a non-stopword
    - "machine learning is" vs. "machine learning"
  - Use the average IDF over the words in the phrase to measure its semantics (a sketch follows below)
  - Usually, the probability of a quality phrase appearing in quotes, brackets, or connected by dashes should be higher (punctuation information)
    - "state-of-the-art"
• We can also incorporate features from NLP techniques, such as POS tagging, chunking, and semantic parsing
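A sketch of the average-IDF feature with hypothetical document frequencies:

```python
import math

def avg_idf(phrase_tokens, doc_freq, n_docs):
    """Average IDF over the words of a phrase (informativeness feature)."""
    idfs = [math.log(n_docs / doc_freq[w]) for w in phrase_tokens]
    return sum(idfs) / len(idfs)

df = {"machine": 300, "learning": 250, "is": 9500}   # hypothetical counts
print(avg_idf(["machine", "learning"], df, n_docs=10_000))        # higher
print(avg_idf(["machine", "learning", "is"], df, n_docs=10_000))  # lower
```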
Classifier
• Limited training
  - Labels: whether a phrase is a quality one or not
    - "support vector machine": 1
    - "the experiment shows": 0
  - For a ~1GB corpus, only 300 labels
• Random Forest as the classifier (a sketch follows below)
  - Predicted phrase quality scores lie in [0, 1]
  - Bootstrap many different datasets from the limited labels
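A sketch with scikit-learn, using placeholder features and labels; a Random Forest already bootstraps one dataset per tree (bagging), matching the "bootstrap many datasets from limited labels" idea:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X: one feature vector per candidate phrase (popularity, concordance,
# informativeness, completeness, ...); y: 300 binary quality labels.
rng = np.random.default_rng(0)
X = rng.random((300, 4))                    # placeholder features
y = (X[:, 1] + X[:, 2] > 1.0).astype(int)   # placeholder labels

clf = RandomForestClassifier(n_estimators=200, bootstrap=True, random_state=0)
clf.fit(X, y)

# Quality score in [0, 1]: fraction of trees voting "quality phrase"
scores = clf.predict_proba(X)[:, 1]
print(scores[:5])
```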
SegPhrase: Why Do We Need Phrasal Segmentation in the Corpus?
• Phrasal segmentation can tell which phrase is more appropriate
  - Ex: A standard [feature vector] [machine learning] setup is used to describe…
    - Here the boundary-crossing "vector machine" is not counted towards the rectified frequency
• Rectified phrase frequency (expected influence)
  - Example: [not transcribed]
SegPhrase: Segmentation of Phrases
• Partition a sequence of words by maximizing the likelihood (a dynamic-programming sketch follows below), considering:
  - Phrase quality score
    - ClassPhrase assigns a quality score to each phrase
  - Probability in the corpus
  - Length penalty α
    - When α < 1, it favors shorter phrases
• Filter out phrases with low rectified frequency
  - Bad phrases are expected to rarely occur in the segmentation results
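A dynamic-programming sketch of the segmentation, assuming hypothetical per-phrase quality and probability dictionaries; SegPhrase's exact objective differs in its details:

```python
def segment(tokens, quality, prob, alpha=0.8, max_len=6):
    """Viterbi-style segmentation maximizing the product of
    quality(phrase) * prob(phrase) * alpha per segment."""
    n = len(tokens)
    best = [1.0] + [0.0] * n          # best[i]: best score of tokens[:i]
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            ph = tuple(tokens[j:i])
            s = best[j] * quality.get(ph, 1e-6) * prob.get(ph, 1e-6) * alpha
            if s > best[i]:
                best[i], back[i] = s, j
    segs, i = [], n                   # recover segments by backtracking
    while i > 0:
        segs.append(tokens[back[i]:i])
        i = back[i]
    return segs[::-1]

tokens = "knowledge discovery using support vector machine".split()
quality = {("knowledge", "discovery"): 0.9,
           ("support", "vector", "machine"): 0.9}
prob = {("knowledge", "discovery"): 1e-3, ("using",): 1e-2,
        ("support", "vector", "machine"): 1e-3}
print(segment(tokens, quality, prob))
# [['knowledge', 'discovery'], ['using'], ['support', 'vector', 'machine']]
```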
SegPhrase+: Enhancing Phrasal Segmentation
• SegPhrase+: One more round for enhanced phrasal segmentation
• Feedback
  - Using rectified frequencies, re-compute the features previously computed from raw frequencies
• Process
  - Classification → Phrasal segmentation (SegPhrase)
  - Classification → Phrasal segmentation → Classification → Phrasal segmentation (SegPhrase+)
• Effects on computing quality scores, e.g.:
  - "np hard in the strong sense"
  - "np hard in the strong"
  - "data base management system"
Performance Study: Methods to Be Compared

• Other phrase mining methods to be compared
  - NLP chunking based methods
    - Chunks as candidates, sorted by TF-IDF and C-value (K. Frantzi et al., 2000)
  - Unsupervised raw-frequency based methods
    - ConExtr (A. Parameswaran et al., VLDB 2010)
    - ToPMine (A. El-Kishky et al., VLDB 2015)
  - Supervised method
    - KEA, designed for single-document keyphrases (O. Medelyan & I. H. Witten, 2006)
Performance Study: Experimental Setting



• Datasets

  Dataset | #docs | #words | #labels
  DBLP    | 2.77M | 91.6M  | 300
  Yelp    | 4.75M | 145.1M | 300

• Popular Wiki phrases
  - Based on internal links
  - ~7K high-quality phrases
• Pooling
  - Sampled 500 * 7 Wiki-uncovered phrases
  - Evaluated by 3 reviewers independently
Performance: Precision Recall Curves on DBLP
• Compare with other baselines: TF-IDF, C-Value, ConExtr, KEA, ToPMine, SegPhrase+
• Compare with our 3 variations: TF-IDF, ClassPhrase, SegPhrase, SegPhrase+
• [Precision-recall curves not transcribed]
Performance Study: Processing Efficiency

• SegPhrase+ is linear in the size of the corpus!
Experimental Results: Interesting Phrases Generated (From the Titles and Abstracts of SIGMOD)
• Query: SIGMOD

  Rank | SegPhrase+             | Chunking (TF-IDF & C-Value)
  1    | data base              | data base
  2    | database system        | database system
  3    | relational database    | query processing
  4    | query optimization     | query optimization
  5    | query processing       | relational database
  …    | …                      | …
  51   | sql server             | database technology
  52   | relational data        | database server
  53   | data structure         | large volume
  54   | join query             | performance study
  55   | web service            | web service
  …    | …                      | …
  201  | high dimensional data  | efficient implementation
  202  | location based service | sensor network
  203  | xml schema             | large collection
  204  | two phase locking      | important issue
  205  | deep web               | frequent itemset
  …    | …                      | …

(The original slide highlights phrases found only by SegPhrase+ or only by Chunking.)
Experimental Results: Interesting Phrases Generated (From the Titles and Abstracts of SIGKDD)
• Query: SIGKDD

  Rank | SegPhrase+              | Chunking (TF-IDF & C-Value)
  1    | data mining             | data mining
  2    | data set                | association rule
  3    | association rule        | knowledge discovery
  4    | knowledge discovery     | frequent itemset
  5    | time series             | decision tree
  …    | …                       | …
  51   | association rule mining | search space
  52   | rule set                | domain knowledge
  53   | concept drift           | important problem
  54   | knowledge acquisition   | concurrency control
  55   | gene expression data    | conceptual graph
  …    | …                       | …
  201  | web content             | optimal solution
  202  | frequent subgraph       | semantic relationship
  203  | intrusion detection     | effective way
  204  | categorical attribute   | space complexity
  205  | user preference         | small set
  …    | …                       | …

(The original slide highlights phrases found only by SegPhrase+ or only by Chunking.)
Experimental Results: Similarity Search
• Find high-quality similar phrases based on a user's phrase query
  - In response to a user's phrase query, SegPhrase+ generates high-quality, semantically similar phrases
  - In DBLP, query on "data mining" and "OLAP"
  - In Yelp, query on "blu-ray", "noodle", and "valet parking"
Recent Progress
• Distant training: no need for human labeling
  - Training using general knowledge bases, e.g., Freebase, Wikipedia
• Quality estimation for unigrams
  - Integration of phrases and unigrams in one uniform framework
  - Demo based on DBLP abstracts
• Multiple languages: beyond English corpora
  - Extensible to mining quality phrases in multiple languages
  - Recent progress: SegPhrase+ works on Chinese, Arabic, and Spanish
High Quality Phrases Generated from Chinese Wikipedia

  Rank | Phrase                     | In English
  …    | …                          | …
  62   | 首席_执行官                 | CEO
  63   | 中间_偏右                   | Middle-right
  …    | …                          | …
  84   | 百度_百科                   | Baidu Pedia
  85   | 热带_气旋                   | Tropical cyclone
  86   | 中国科学院_院士             | Fellow of Chinese Academy of Sciences
  …    | …                          | …
  1001 | 十大_中文_金曲              | Top-10 Chinese Songs
  1002 | 全球_资讯网                 | Global News Website
  1003 | 天一阁_藏_明代_科举_录_选刊 | A Chinese book name
  …    | …                          | …
  9934 | 国家_戏剧_院                | National Theater
  9935 | 谢谢_你                     | Thank you
  …    | …                          | …
Top-Ranked Phrases Generated from English Gigaword
• Northrop Grumman, Ashfaq Kayani, Sania Mirza, Pius Xii, Shakhtar Donetsk, Kyaw Zaw Lwin
• Ratko Mladic, Abdolmalek Rigi, Rubin Kazan, Rajon Rondo, Rubel Hossain, bluefin tuna
• Psv Eindhoven, Nicklas Bendtner, Ryo Ishikawa, Desmond Tutu, Landon Donovan, Jannie du Plessis
• Zinedine Zidane, Uttar Pradesh, Thor Hushovd, Andhra Pradesh, Jafar_Panahi, Marouane Chamakh
• Rahm Emanuel, Yakubu Aiyegbeni, Salva Kiir, Abdelhamid Abou Zeid, Blaise Compaore, Rickie Fowler
• Andry Rajoelina, Merck Kgaa, Js Kabylie, Arjun Atwal, Andal Ampatuan Jnr, Reggio Calabria, Ghani Baradar
• Mahela Jayawardene, Jemaah Islamiyah, quantitative easing, Nodar Kumaritashvili, Alviro Petersen
• Rumiana Jeleva, Helio Castroneves, Koumei Oda, Porfirio Lobo, Anastasia Pavlyuchenkova
• Thaksin Shinawatra, Evgeni_Malkin, Salvatore Sirigu, Edoardo Molinari, Yoshito Sengoku
• Otago Highlanders, Umar Akmal, Shuaibu Amodu, Nadia Petrova, Jerzy Buzek, Leonid Kuchma, Alona Bondarenko, Chosun Ilbo, Kei Nishikori, Nobunari Oda, Kumbh Mela, Santo_Domingo
• Nicolae Ceausescu, Yoann Gourcuff, Petr Cech, Mirlande Manigat, Sulieman Benn, Sekouba Konate
Summary and Future Work
• Strategy 1: Generate bag-of-words → generate sequence of tokens
  - Integrated complex model; phrase quality and topic inference rely on each other
  - Slow and prone to overfitting
• Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
  - Phrase quality relies on topic labels for unigrams
  - Can be fast; generally high-quality topics and phrases
• Strategy 3: Prior to bag-of-words model inference, mine phrases and impose them on the bag-of-words model
  - Topic inference relies on correct segmentation of documents, but is not sensitive to it
  - Can be fast; generally high-quality topics and phrases
Summary and Future Work (Cont’d)

• SegPhrase+: A new phrase mining framework
  - Integrates phrase mining with phrasal segmentation
  - Requires only limited training or distant training
  - Generates high-quality phrases, close to human judgment
  - Linearly scalable in time and space
• Looking forward: high-quality, scalable phrase mining
  - Facilitate entity recognition and typing in large corpora (see the next part of this tutorial)
  - Combine with linguistically rich patterns
  - Transform massive unstructured data into semi-structured knowledge networks
References for Theme 1: Phrase Mining for Concept Extraction

• D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi-Word Expressions. arXiv:0907.1013, 2009
• K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991
• M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM'14
• A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB'15
• K. Frantzi, S. Ananiadou, and H. Mima. Automatic Recognition of Multi-Word Terms: the C-value/NC-value Method. Int. Journal on Digital Libraries, 3(2), 2000
• R. V. Lindsey, W. P. Headden, III, M. J. Stipicevic. A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes. EMNLP-CoNLL'12
• J. Liu, J. Shang, C. Wang, X. Ren, J. Han. Mining Quality Phrases from Massive Text Corpora. SIGMOD'15
• O. Medelyan and I. H. Witten. Thesaurus Based Automatic Keyphrase Indexing. IJCDL'06
• Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models. KDD'07
• A. Parameswaran, H. Garcia-Molina, and A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from Large Datasets. VLDB'10
• X. Wang, A. McCallum, X. Wei. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. ICDM'07
Theme II: Entity Extraction and Linking by Relational Graph Construction and Propagation

Session 1. Task and Traditional NLP Approach
Entity Extraction and Linking
• Identifying token spans as entity mentions in documents and labeling their types
• Target types: FOOD, LOCATION, JOB_TITLE, EVENT, ORGANIZATION, …
• Plain text:
  - The best BBQ I've tasted in Phoenix! I had the pulled pork sandwich with coleslaw and baked beans for lunch. … The owner is very nice. …
• Text with typed entities:
  - The best BBQ:FOOD I've tasted in Phoenix:LOC! I had the [pulled pork sandwich]:FOOD with coleslaw:FOOD and [baked beans]:FOOD for lunch. … The owner:JOB_TITLE is very nice. …
• Enabling structured analysis of an unstructured text corpus
Traditional NLP Approach: Data Annotation
• Extracting and linking entities can be used in a variety of ways:
  - Serve as primitives for information extraction and knowledge base population
  - Assist question answering, …
• Traditional named entity recognition systems are designed for major types (e.g., PER, LOC, ORG) and general domains (e.g., news)
  - Require additional steps to adapt to new domains/types
  - Expensive human labor for annotation
    - 500 documents for entity extraction; 20,000 queries for entity linking
  - Unsatisfying agreement due to varying granularity levels and scopes of types
• Entities obtained by entity linking techniques have limited coverage and freshness
  - >50% unlinkable entity mentions in a Web corpus [Lin et al., EMNLP'12]
  - >90% in our experiment corpora: tweets, Yelp reviews, …
Traditional NLP Approach: Feature Engineering
• Typical entity extraction features (Li et al., 2012)
  - N-gram: Unigram, bigram, and trigram token sequences in the context window
  - Part-of-speech: POS tags of the context words
  - Gazetteers: person names, organizations, countries and cities, titles, idioms, etc.
  - Word clusters: word clusters / embeddings
  - Case and shape: Capitalization and morphology analysis based features
  - Chunking: NP and VP chunking tags
  - Global features: Sentence-level and document-level structure/position features
Traditional NLP Approach: Feature Engineering

• Typical entity linking features (Ji et al., 2011)

  Attribute        | Feature         | Description
  Name             | Spelling match  | Exact string match, acronym match, alias match, string matching, …
  Name             | KB link mining  | Name pairs mined from KB text, redirect, and disambiguation pages
  Name             | Name gazetteer  | Organization and geo-political entity abbreviation gazetteers
  Document surface | Lexical         | Words in KB facts, KB text, mention name, mention text; tf.idf of words and ngrams
  Document surface | Position        | Mention name appears early in KB text
  Document surface | Genre           | Genre of the mention text (newswire, blog, …)
  Document surface | Local context   | Lexical and part-of-speech tags of context words
  Entity context   | Type            | Mention concept type, subtype
  Entity context   | Relation/Event  | Concepts co-occurring, attributes/relations/events with the mention
  Entity context   | Coreference     | Co-reference links between the source document and the KB text
  Profiling        |                 | Slot fills of the mention, concept attributes stored in KB infobox
  Concept          |                 | Ontology extracted from KB text
  Topic            |                 | Topics (identity and lexical similarity) for the mention text and KB text
  KB link mining   |                 | Attributes extracted from hyperlink graphs of the KB text
  Popularity       | Web             | Top KB text ranked by search engine and its length
  Popularity       | Frequency       | Frequency in KB texts
Session 2. ClusType: Entity Recognition and Typing by Relation Phrase-Based Clustering
Entity Recognition and Typing: A Data Mining Solution

• Acquire labels for a small number of instances
• Construct a relational graph to connect labeled instances and unlabeled instances
  - Construct edges based on coarse-grained data-driven statistics instead of fine-grained linguistic similarity, e.g.:
    - Mention correlation
    - Text co-occurrence
    - Semantic relatedness based on knowledge graph embeddings
    - Social networks
• Propagate labels across the graph (a minimal sketch follows below)
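A minimal sketch of graph-based label propagation (Zhou et al.-style smoothing) on a homogeneous toy graph; ClusType's actual objective is a joint optimization over a heterogeneous graph, so this only illustrates the propagation idea:

```python
import numpy as np

def propagate_labels(S, Y0, alpha=0.8, iters=50):
    """S: (n x n) row-normalized adjacency; Y0: (n x k) seed labels,
    zero rows for unlabeled nodes. Returns smoothed label scores."""
    Y = Y0.copy()
    for _ in range(iters):
        Y = alpha * (S @ Y) + (1 - alpha) * Y0
    return Y

# Toy chain: node 0 seeded type A, node 3 seeded type B, 1-2 unlabeled
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], float)
S = A / A.sum(axis=1, keepdims=True)
Y0 = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], float)
print(propagate_labels(S, Y0).round(2))
```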
Case Study 1: Entity Extraction

• Goal: recognize entity mentions of target types with minimal/no human supervision and with no requirement that entities be found in a KB
• Two kinds of efforts towards this goal:
  - Weak supervision: relies on manually selected seed entities in applying pattern-based bootstrapping methods or label propagation methods to identify more entities
    - Assumes seeds are unambiguous and sufficiently frequent → requires careful seed selection by a human
  - Distant supervision: leverages entity information in KBs to reduce human supervision (cont.)
Typical Workflow of Distant Supervision



• 1. Detect entity mentions from text
• 2. Map candidate mentions to KB entities of target types
• 3. Use confidently mapped {mention, type} pairs to infer the types of the remaining candidate mentions
Problem Definition
• Distantly-supervised entity recognition in a domain-specific corpus
  - Given:
    - a corpus D
    - a knowledge base (e.g., Freebase)
    - a set of target types (T) from the KB
  - Detect candidate entity mentions from corpus D
  - Categorize each candidate mention by target types or Not-Of-Interest (NOI), with distant supervision
Challenge I: Domain Restriction
• Most existing work assumes entity mentions are already extracted by existing entity detection tools, e.g., noun phrase chunkers
  - Usually trained on general-domain corpora like news articles (clean, grammatical)
  - Make use of various linguistic features (e.g., semantic parsing structures)
  - Do not work well on specific, dynamic, or emerging domains (e.g., tweets, Yelp reviews)
    - E.g., "in-and-out" in a Yelp review may not be properly detected
Challenge II: Name Ambiguity

• Multiple entities may share the same surface name
  - "While Griffin is not the part of Washington's plan on Sunday's game, …" → sports team
  - "…has concern that Kabul is an ally of Washington." → U.S. government
  - "He has office in Washington, Boston and San Francisco." → U.S. capital city
• Previous methods simply output a single type/type distribution per surface name, instead of an exact type for each entity mention
  - "While Griffin is not the part of Washington's plan on Sunday's game, …"
  - "… news from Washington indicates that the congress is going to …"
  - "It is one of the best state parks in Washington."
  - → each mention of "Washington" should receive its own type (sports team / government / state), not one shared type distribution
Challenge III: Context Sparsity

• A variety of contextual clues are leveraged to find sources of shared semantics across different entities
  - Keywords, Wiki concepts, linguistic patterns, textual relations, …
• There are often many ways to describe even the same relation between two entities:

  ID | Sentence                                                                   | Freq
  1  | The magnitude 9.0 quake caused widespread devastation in [Kesennuma city]  | 12
  2  | … tsunami that ravaged [northeastern Japan] last Friday                    | 31
  3  | The resulting tsunami devastated [Japan]'s northeast                       | 244

• Previous methods have difficulty handling entity mentions with sparse (infrequent) context
Our Solution
• Domain-agnostic phrase mining algorithm: extracts candidate entity mentions with minimal linguistic assumptions → addresses domain restriction
  - E.g., part-of-speech (POS) tagging << semantic parsing
• Do not simply merge entity mentions with identical surface names
  - Model each mention based on its surface name and context, in a scalable way → addresses name ambiguity
• Mine relation phrases co-occurring with entity mentions; infer synonymous relation phrases
  - Helps form connecting bridges among entities that do not share identical context but share synonymous relation phrases → addresses context sparsity
A Relation Phrase-Based Entity Recognition Framework


• POS-constrained phrase segmentation for mining candidate entity mentions and relation phrases simultaneously
• Construct a heterogeneous graph to represent the available information in a unified form
  - Entity mentions are kept as individual objects to be disambiguated
  - Linked to entity surface names & relation phrases
A Relation Phrase-Based Entity Recognition Framework

• With the constructed graph, formulate graph-based semi-supervised learning of two tasks jointly:
  - Type propagation on the heterogeneous graph
  - Multi-view relation phrase clustering
• Derived entity argument types serve as good features for clustering relation phrases
• Type information propagates among entities bridged by synonymous relation phrases
• The two tasks mutually enhance each other, leading to quality recognition of unlinkable entity mentions
Framework Overview
• 1. Perform phrase mining on a POS-tagged corpus to extract candidate entity mentions and relation phrases
• 2. Construct a heterogeneous graph to encode our insights on modeling the type for each entity mention
• 3. Collect seed entity mentions as labels by linking extracted mentions to the KB
• 4. Estimate type indicators for unlinkable candidate mentions with the proposed type propagation integrated with relation phrase clustering on the constructed graph
Candidate Generation
• An efficient phrase mining algorithm incorporating both corpus-level statistics and syntactic constraints
  - Global significance score: filters low-quality candidates; generic POS tag patterns: remove phrases with improper syntactic structure
  - By extending ToPMine, the algorithm partitions the corpus into segments that meet both the significance threshold and the POS patterns → candidate entity mentions & relation phrases
• Relation phrase: a phrase that denotes a unary or binary relation in a sentence
• Algorithm workflow (a sketch of the entity-mention constraint follows below):
  - 1. Mine frequent contiguous patterns
  - 2. Perform greedy agglomerative merging while enforcing the syntactic constraints
    - Entity mention: consecutive nouns
    - Relation phrase: POS patterns shown in the table
  - 3. Terminate when the next highest-score merge does not meet a pre-defined significance threshold
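A minimal sketch of the entity-mention constraint (consecutive nouns), assuming Penn Treebank POS tags; the full algorithm combines this with significance-based merging and the relation-phrase POS patterns:

```python
import re

def noun_run_candidates(tagged):
    """Candidate entity mentions = maximal runs of consecutive nouns."""
    cands, run = [], []
    for tok, pos in tagged:
        if re.match(r"NN", pos):      # NN, NNS, NNP, NNPS
            run.append(tok)
        else:
            if run:
                cands.append(" ".join(run))
            run = []
    if run:
        cands.append(" ".join(run))
    return cands

tagged = [("The", "DT"), ("pulled", "VBN"), ("pork", "NN"),
          ("sandwich", "NN"), ("was", "VBD"), ("great", "JJ")]
print(noun_run_candidates(tagged))    # ['pork sandwich']
```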
Candidate Generation

• Example output of candidate generation on NYT news articles [table not transcribed]
• Entity detection performance comparison with an NP chunker [table not transcribed]
  - Recall is most critical for this step, since misses (false negatives) cannot be recovered later
Construction of Heterogeneous Graphs
• Three types of objects are extracted from the corpus: candidate entity mentions, entity surface names, and relation phrases
  - We construct a heterogeneous graph to enforce several hypotheses for modeling the type of each entity mention (introduced in the following slides)
• Basic idea for constructing the graph: the more likely two objects are to share the same label, the larger the weight associated with their connecting edge
• Three types of links:
  - 1. Mention-name links: (many-to-one) mappings between entity mentions and surface names
  - 2. Name-relation phrase links: corpus-level co-occurrence between surface names and relation phrases
  - 3. Mention correlation links: distributional similarity between entity mentions
Entity Mention-Surface Name Subgraph
• Directly modeling the type indicator of each entity mention in label propagation leads to an intractably large parameter space
• Both the entity name and the surrounding relation phrases provide strong cues on the type of a candidate entity mention
  - Model the type of each entity mention by (1) the type indicator of its surface name and (2) the type signatures of its surrounding relation phrases (more details in the following slides)
  - Example: "…has concerns whether Kabul is an ally of Washington" → Washington: GOVERNMENT (the surface name "Washington" could be Government or State; the relation phrase "is an ally of" disambiguates)
• With M candidate mentions and n surface names, use a bi-adjacency matrix to represent the mapping
Entity Name-Relation Phrase Subgraph

• Aggregate co-occurrences between entity surface names and relation phrases across the corpus → weight the importance of different relation phrases for surface names
  - Use connected edges as bridges to propagate type information
• Left/right entity argument of a relation phrase: each mention is assigned as the left (resp. right) argument of the closest relation phrase to its right (resp. left) in a sentence
• Type signature of a relation phrase: two type indicators, one for its left and one for its right arguments
• With l different relation phrases, the mappings between mentions and relation phrases form two bi-adjacency matrices for the subgraph
Mention Correlation Subgraph
• An entity mention may have an ambiguous name and ambiguous relation phrases
  - E.g., "White House" and "felt" in the first sentence of the figure
• Other co-occurring mentions may provide good hints to the type of an entity mention
  - E.g., "birth certificate" and "rose garden" in the figure
• Propagate type information between candidate mentions of each surface name, based on this hypothesis
  - Construct a kNN graph based on feature vectors built from the surface names of co-occurring entity mentions
Relation Phrase Clustering


• Observation: many relation phrases have very few occurrences in the corpus
  - ~37% of relation phrases have <3 unique entity surface names (in right or left arguments)
  - Hard to model their type signatures based on aggregated co-occurrences with entity surface names (i.e., Hypothesis 1)
• Solution: softly cluster synonymous relation phrases
  - The type signatures of frequent relation phrases can help infer the type signatures of infrequent (sparse) ones that have similar cluster memberships
Relation Phrase Clustering
• Existing work on relation phrase clustering utilizes strings, context words, and entity arguments to cluster synonymous relation phrases
  - String similarity and distributional similarity may be insufficient to resolve two relation phrases; type information is particularly helpful in such cases
• We propose to leverage the type signatures of relation phrases, in a general relation phrase clustering method that incorporates different features
  - Multi-view features: type signatures, string features, context features
  - Further integrated with graph-based type propagation in a mutually enhancing framework, based on the following hypothesis
Type Inference: A Joint Optimization Problem
• The joint objective combines:
  - Mention modeling & mention correlation (Hypothesis 2)
  - Type propagation between entity surface names and relation phrases (Hypothesis 1)
  - Multi-view relation phrase clustering (Hypotheses 3 & 4)
The ClusType Algorithm
• Can be efficiently solved by alternating minimization based on a block coordinate descent algorithm
• Algorithm complexity is linear in #entity mentions, #relation phrases, #clusters, #clustering features, and #target types
• The ClusType algorithm, repeated until the objective converges:
  - Update type indicators and type signatures
  - For each view, perform single-view NMF until convergence
  - Update the consensus matrix and the relative weights of the different views
Experiment Setting
• Datasets: 2013 New York Times news (~110k docs) [event, PER, LOC, ORG]; Yelp reviews (~230k) [food, job, …]; 2011 tweets (~300k) [event, product, PER, LOC, …]
• Seed mention sets: <7% of extracted mentions are mapped to Freebase entities
• Evaluation sets: manually annotated mentions of target types for subsets of the corpora
• Evaluation metrics: follows named entity recognition evaluation (Precision, Recall, F1)
• Compared methods
  - Pattern: Stanford pattern-based learning; SemTagger: bootstrapping method that trains a contextual classifier based on seed mentions; FIGER: distantly-supervised sequence labeling method trained on a Wiki corpus; NNPLB: label propagation using ReVerb assertions and seed mentions; APOLLO: mention-level label propagation using Wiki concepts and KB entities
  - ClusType-NoWm: ignores mention correlation; ClusType-NoClus: conducts only type propagation; ClusType-TwoStep: first performs hard clustering, then type propagation
Comparing ClusType with Other Methods and Its Variants
• 46.08% and 48.94% improvement in F1 score compared to the best baseline on the Tweet and Yelp datasets, respectively
• vs. FIGER: shows the effectiveness of our candidate generation and the proposed hypotheses on type propagation
• vs. NNPLB and APOLLO: ClusType not only utilizes semantically rich relation phrases as type cues, but also clusters synonymous relation phrases to tackle context sparsity
• vs. variants: (i) models mention correlation for name disambiguation; (ii) integrates clustering in a mutually enhancing way
Comparing on Different Entity Types
• Obtains larger gains on organization and person (more entities with ambiguous surface names)
  - Modeling types at the entity mention level is critical for name disambiguation
• Superior performance on product and food mainly comes from the domain independence of our method
  - Both NNPLB and SemTagger require sophisticated linguistic feature generation, which is hard to adapt to new types
Comparing with a Trained NER System
• Compare with Stanford NER, which is trained on general-domain corpora including the ACE corpus and the MUC corpus, on three types: PER, LOC, ORG
• ClusType and its variants outperform Stanford NER on both a dynamic corpus (NYT) and a domain-specific corpus (Yelp)
• ClusType has lower precision but higher recall and F1 score on tweets → the superior recall of ClusType mainly comes from domain-independent candidate generation
Example Output and Relation Phrase Clusters

• Extracts more mentions and predicts types with higher accuracy
• Not only synonymous relation phrases, but also sparse and frequent relation phrases, can be clustered together
  - → boosts sparse relation phrases with the type information of frequent relation phrases
Testing on Context Sparsity and Surface Name Popularity
• Surface name popularity:
  - Group A: high-frequency surface names
  - Group B: infrequent surface names
  - ClusType outperforms its variants on Group B
    - → handles well mentions with insufficient corpus statistics
• Context sparsity:
  - Group A: frequent relation phrases
  - Group B: sparse relation phrases
  - ClusType obtains superior performance over its variants on Group B
    - → clustering relation phrases is critical for sparse relation phrases
Conclusions and Future Work
• Studied distantly-supervised entity recognition for domain-specific corpora and proposed a novel relation phrase-based framework
  - A data-driven, domain-agnostic phrase mining algorithm for candidate entity mention and relation phrase generation
  - Integrates relation phrase clustering with type propagation on heterogeneous graphs, solved as a joint optimization problem
• Ongoing:
  - Extend to role discovery for scientific concepts → paper profiling (research/demo)
  - Study of relation phrase clustering, such as
    - joint entity/relation clustering
    - synonymous relation phrase canonicalization
  - Study of joint entity and relation phrase extraction with phrase mining
Session 3. Entity Linking Based on Semi-Supervised Graph Regularization

Entity Linking: A Relational Graph Approach
• Example tweet: "Stay up Hawk Fans. We are going through a slump, but we have to stay positive. Go Hawks!"
• [Relational graph figure not transcribed]
Relevant Mention Detection: Meta Path
• A meta-path is a path defined over a network, composed of a sequence of relations between different object types (Sun et al., 2011)
  - Each meta-path represents a semantic relation (more in the next theme)
• Meta-paths between two mentions (M: mention, T: tweet, U: user, H: hashtag):
  - M-T-M
  - M-T-U-T-M
  - M-T-H-T-M
  - M-T-U-T-M-T-H-T-M
  - M-T-H-T-M-T-U-T-M
• [Figure: schema of a heterogeneous information network in Twitter]
Relational Graph Construction


• Local compatibility
  - Mention features (e.g., idf, keyphraseness)
  - Concept features (e.g., # of incoming/outgoing links)
  - Mention + concept features (e.g., prior popularity, tf)
  - Context features (e.g., capitalization, tf-idf)
• Coreference
• Semantic relatedness (SR)
  - At least one meta-path exists between two similar mentions
  - SR between two mentions: meta-path based
  - SR between two concepts: link structure in Wikipedia (Milne and Witten, 2008)
• Linear combination of these three graphs
• [Figure: example relational graph with edge weights]
A Deep Semantic Relatedness Model (DSRM)
• Semantic knowledge graphs
  - [Figure: knowledge graph around Miami Heat: Erik Spoelstra (coach), Miami (location), Professional Sports Team (type), Dwyane Wade (member, roster), founded 1988, member of the National Basketball Association; contrasted with the unrelated concept Titanic]
• The DSRM architecture
  - [Figure: semantic relatedness SR(c_i, c_j) computed as the cosine similarity of two 300-dimensional semantic-layer vectors; each concept's feature vector (descriptions ~1M, connected concepts ~4M, relations ~3.2K, concept types ~1.6K) is word-hashed to ~105K dimensions and passed through multi-layer nonlinear projections]
Data and Scoring Metric

• Data
  - A Wikipedia dump from May 3, 2013
  - A portion of Freebase limited to the Wikipedia concepts
  - Wikification: a public data set including 502 messages from 28 users (Meij et al., 2012)
  - Semantic relatedness: a benchmark test set including 3,314 concepts as testing queries (Ceccarelli et al., 2013)
• Scoring metric
  - Wikification: standard precision, recall, and F1
  - Semantic relatedness: nDCG
Overall Performance



• TagMe: unsupervised model
• Meij: supervised model; uses 100% of the labeled data
• SSRegu: uses 50% of the labeled data
• 7.5% absolute F1 gain over the state-of-the-art supervised models
• [Bar chart comparing precision/recall/F1 of TagMe, Meij, SSRegu + M&W, and SSRegu + DSRM; SSRegu + DSRM performs best]
Semantic Relatedness: Examples
  Concept            | M&W  | DSRM
  New York City      | 0.92 | 0.22
  New York Knicks    | 0.78 | 0.79
  Washington, D.C.   | 0.80 | 0.30
  Washington Wizards | 0.60 | 0.85
  Atlanta            | 0.71 | 0.39
  Atlanta Hawks      | 0.53 | 0.83
  Houston            | 0.55 | 0.37
  Houston Rockets    | 0.49 | 0.80

Semantic relatedness scores between a sample of concepts and the concept "National Basketball Association" in the sports domain.
Summary on Relational Graph-Based Entity Linking











• Compared to traditional label propagation-based methods
  - Data-driven methods to construct relational graphs
  - Integrate multi-view clustering with graph-based label propagation
• Compared to traditional distantly-supervised methods
  - Domain-independent entity candidate generation: jointly extract entity mentions and their relation phrases in an unsupervised way
  - Resolve context (relation phrase) sparsity by integrating type propagation with relation clustering
  - Fully exploit entries and structures in knowledge bases
• Potential applications to other NLP tasks
  - Where data annotation is costly
  - Where a relational graph among labeled seeds and unlabeled instances can be constructed based on data-driven metrics
• Toward fully "liberal" IE: discover schema and extract/link entities simultaneously
References for Theme 2: Entity Recognition and Typing











• X. Ren, A. El-Kishky, C. Wang, F. Tao, C. R. Voss, H. Ji and J. Han. ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering. KDD'15
• H. Huang, Y. Cao, X. Huang, H. Ji and C. Lin. Collective Tweet Wikification based on Semi-supervised Graph Regularization. ACL'14
• T. Lin, O. Etzioni, et al. No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities. EMNLP'12
• N. Nakashole, T. Tylenda, and G. Weikum. Fine-grained Semantic Typing of Emerging Entities. ACL'13
• R. Huang and E. Riloff. Inducing Domain-specific Semantic Class Taggers from (Almost) Nothing. ACL'10
• X. Ling and D. S. Weld. Fine-grained Entity Recognition. AAAI'12
• W. Shen, J. Wang, P. Luo, and M. Wang. A Graph-based Approach for Ontology Population with Named Entities. CIKM'12
• S. Gupta and C. D. Manning. Improved Pattern Learning for Bootstrapped Entity Extraction. CoNLL'14
• P. P. Talukdar and F. Pereira. Experiments in Graph-based Semi-supervised Learning Methods for Class-instance Acquisition. ACL'10
• Z. Kozareva and E. Hovy. Not All Seeds Are Equal: Measuring the Quality of Text Mining Seeds. NAACL'10
• L. Galárraga, G. Heitz, K. Murphy, and F. M. Suchanek. Canonicalizing Open Knowledge Bases. CIKM'14
Theme III. Search and Mining Structured Graphs and Heterogeneous Networks for NLP
Outline

• Directly converting a large amount of data into knowledge is too ambitious and unrealistic
• Once we construct graphs or heterogeneous networks, we can perform many powerful search and mining tasks
• Many NLP problems are about graphs and networks – what is the DM point of view?
  - Search: Graph index creation and approximate structural search
    - NLP application: Schemaless queries
  - Classification: Graph pattern mining and pattern-based classification
    - NLP application: Distinguishing authors based on their writing styles
  - Meta-path-based similarity search in heterogeneous networks
    - NLP application: Entity morph decoding
  - Meta-path-based mining in heterogeneous networks: Prediction and recommendation
    - NLP application: Knowledge base completion; multi-hop question answering; causal event prediction
Section 1: Search Graphs and Networks
Graph Indexing

• Graph query: find all the graphs in a graph DB containing a given query graph (a toy check follows below)
• An index should be a powerful tool
  - A path-index may not work well
  - Solution: index directly on substructures (i.e., graphs)
• [Figure: a query graph Q and a graph DB with graphs (a), (b), (c); only graph (c) contains Q, yet the path-indices C, C-C, C-C-C, C-C-C-C cannot prune (a) & (b)]
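A toy containment check with networkx; a real system would consult the substructure index first and run this verification only on the surviving candidates:

```python
import networkx as nx
from networkx.algorithms import isomorphism

def contains(data_graph, query):
    """True if data_graph has a subgraph isomorphic to query,
    matching on node labels."""
    gm = isomorphism.GraphMatcher(
        data_graph, query,
        node_match=lambda a, b: a["label"] == b["label"])
    return gm.subgraph_is_isomorphic()

# Query Q: a carbon triangle; G: the triangle plus a pendant nitrogen
Q = nx.Graph()
Q.add_nodes_from([(0, {"label": "C"}), (1, {"label": "C"}),
                  (2, {"label": "C"})])
Q.add_edges_from([(0, 1), (1, 2), (2, 0)])
G = Q.copy()
G.add_node(3, label="N")
G.add_edge(2, 3)
print(contains(G, Q))   # True
```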
Many NLP Problems Can Be Converted into Graph Search
• Entity linking (Pan et al., 2015)
  - Query graph constructed from NL documents using Abstract Meaning Representation, matched against a knowledge graph
• Bursty information network decipherment (Tao et al., 2015 submission)
  - Query graph constructed from Chinese, matched against an English knowledge graph
gIndex: Indexing Frequent and Discriminative Substructures
• Why index frequent substructures?
  - Too many substructures to index (all structures: >10^6; frequent: ~10^5; discriminative: ~10^3)
  - Size-increasing support threshold
    - Large structures will likely be indexed well by their frequent substructures
    - [Figure: min-support threshold increasing with substructure size (support vs. size)]
• Why discriminative substructures?
  - Reduce the index size by an order of magnitude
  - Selection: given selected structures f1, f2, …, fn and a new structure x with fi ⊆ x, the extra indexing power is measured by Pr(x | f1, f2, …, fn); when this probability is small enough, x is a discriminative structure and should be included in the index
• Experiments show gIndex is small, effective, and stable
Structure Similarity Search Assisted with Graph Indices

• The necessity is clearly shown in searching for similar chemical compounds
  - [Figure: query graph vs. data graphs (a) caffeine, (b) diurobromine, (c) viagra, and their feature sets]
• How to conduct approximate search?
  - Build indices covering all the similar subgraphs? No!
  - Idea: (1) keep the index structure; (2) select features in the query space
• Query relaxation measure: features are more important than strict substructure matching
  - Only need to index a set of smaller features
Schemaless Queries over Knowledge Graphs



• Knowledge graphs are ubiquitous
• Queries on knowledge graphs are mostly schemaless
  - Natural language queries; ex: When was UIUC founded? And how?
• Challenges: ambiguities in query resolution
• [Figure: knowledge-graph fragment around "Bison" (class: Mammal, order: Even-toed ungulate, …) and a social graph of users who follow/join/like universities, businesses, videos, music; courtesy of Shengqi Yang, UCSB]
Major Challenges in Schemaless Query Answering


• Extracting entities and relationships from natural language queries
• Mismatch between knowledge graph and queries: figure out the best transformation

  Knowledge Graph                                   | Query
  "the University of Illinois at Urbana-Champaign"  | "UIUC"
  "neoplasm"                                        | "tumor"
  "Doctor"                                          | "Dr."
  "Barack Obama"                                    | "Obama"
  "Jeffrey Jacob Abrams"                            | "J. J. Abrams"
  "teacher"                                         | "educator"
  "1980"                                            | "~30"
  "3 mi"                                            | "4.8 km"
  "Hinton" - "DNNresearch" - "Google"               | "Hinton" - "Google"
  …                                                 | …
Processing of Schemaless Queries by Pattern Matching
• Users freely post queries in natural language, without knowledge of the data graphs
• The schemaless query (SLQ) system finds results through a set of transformations
• Ex.: query "Prof., ~70 yrs" - "UT" - "Google" matches "Geoffrey Hinton (Professor, 1947)" - "University of Toronto" / "DNNresearch" - "Google"
  - Acronym transformation: 'UT' → 'University of Toronto'
  - Abbreviation transformation: 'Prof.' → 'Professor'
  - Numeric transformation: '~70' → '1947'
  - Structural transformation: an edge → a path
Automatic Generation of Training Data
• 1. Sampling: a set of subgraphs is randomly extracted from the data graph
• 2. Query generation: queries are generated by randomly adding transformations to the extracted subgraphs
• 3. Searching: search the generated queries on the data graph
• 4. Labeling: results are labeled based on the original subgraphs
• 5. Training: queries + labeled results are used to estimate the weights of the different transformations, i.e., the ranking model P(φ(Q) | Q)
• [Figure: pipeline over a knowledge graph; e.g., the query "Tom" matches "Samuel Tom", "Tom Cruise", …]
Online Query Processing Using Knowledge Graphs
• Finding the results of a query → a configuration of latent random variables in CRFs
  - Inference in Conditional Random Fields (CRFs)
  - Top-1 result: computing the most likely assignment
    - Approximate inference: Loopy Belief Propagation
  - Top-k result generation
    - Finding the M most probable configurations using Loopy Belief Propagation [Yanover et al., NIPS'04]
• Experiments: queries generated on YAGO2
  - SLQ [Yang et al., VLDB'14] shows high performance
Section 2: Graph Pattern Mining and Pattern-Based Classification

Application to NLP: Authorship Classification
• Writing style is one of the best indicators of original authorship
• Substructures: k-embedded-edge subtree patterns hold more information than basic syntactic features: function words, POS (part-of-speech) tags, and rewrite rules
  - Ex: for a k-embedded-edge subtree t mined from NYT journalists Jack Healy and Eric Dash, on average 21.2% of Jack's sentences contained t while only 7.2% of Eric's did
Structural Pattern Tells More on Authorship Classification
• Binned information gain score distribution of various feature sets
  - FW: function words
  - POS: POS tags
  - BPOS: bigram POS tags
  - RR: rewrite rules
  - k-ee: k-embedded-edge subtrees
• [Figures: # features for the FW, POS, RR, and k-ee feature sets; accuracy comparison across # authors and various datasets]
Section 3: Similarity Computation in Heterogeneous Networks
Structures Facilitate Mining Heterogeneous Networks
• Network construction: generates structured networks from unstructured text data
  - Each node: an entity; each link: a relationship between entities
  - Each node/link may have attributes, labels, and weights
  - Heterogeneous, multi-typed networks, e.g., a medical network: patients, doctors, diseases, contacts, treatments
• [Figures: network schemas of the DBLP bibliographic network (Venue, Paper, Author), the IMDB movie network (Movie, Studio, Actor, Director), and the Facebook network]
Why Mining Typed Heterogeneous Networks?
• A homogeneous network can be derived from its "parent" heterogeneous network
  - Ex.: coauthor networks from the original author-paper-conference network
• Heterogeneous networks carry richer information than the projected homogeneous networks
  - Typed nodes & links imply more structure, leading to richer discovery
• Ex.: DBLP, a computer science bibliographic database (network)

  Knowledge hidden in the DBLP network                                          | Mining function
  Who are the leading researchers on Web search?                                | Ranking
  Who are the peer researchers of Jure Leskovec?                                | Similarity search
  Whom will Christos Faloutsos collaborate with?                                | Relationship prediction
  Which relationships are most influential for an author to decide her topics?  | Relation strength learning
  How did the field of Data Mining emerge and evolve?                           | Network evolution
  Which authors are rather different from their peers in IR?                    | Outlier/anomaly detection
Meta-Path and Similarity Search in Heterogeneous Networks
• Similarity measure/search is the basis for cluster analysis
  - Who is the most similar to Christos Faloutsos based on the DBLP network?
• Meta-path: meta-level description of a path between two objects
  - A path on the network schema
  - Denotes an existing or concatenated relation between two object types
  - Different meta-paths convey different semantics
    - Meta-path Author-Paper-Author (co-authorship, A-P-A): Christos' students or close collaborators
    - Meta-path Author-Paper-Venue-Paper-Author: authors working in similar fields with similar reputation
Existing Popular Similarity Measures for Networks
• Random walk (RW): the probability of a random walk starting at x and ending at y, following meta-path P:

  s(x, y) = Σ_{p ∈ P} prob(p)

  - Used in Personalized PageRank (P-PageRank) (Jeh and Widom 2003)
  - Favors highly visible objects (i.e., objects with large degrees)
• Pairwise random walk (PRW): the probability of a pairwise random walk starting at (x, y) and ending at a common object z, following meta-paths (P1, P2^-1):

  s(x, y) = Σ_{(p1, p2) ∈ (P1, P2)} prob(p1) · prob(p2)

  - Used in SimRank (Jeh and Widom 2002)
  - Favors pure objects (i.e., objects with highly skewed distributions of in-links or out-links)
• Note: P-PageRank and SimRank do not distinguish object types and relationship types
Which Similarity Measure Is Better for Finding Peers?
PathSim: Favors peers
 Peers: Objects with strong connectivity and similar visibility with a given meta‐path

x
Who is more similar to Mike?
y
Meta‐path: APCPA
Mike publishes similar # of papers as Bob and Mary
Other measures find Mike is closer to Jim



Comparison of Multiple Measures: A Toy Example

Author\Conf.     SIGMOD  VLDB  ICDM  KDD
Mike              2       1     0     0
Jim              50      20     0     0
Mary              2       0     1     0
Bob               2       1     0     0
Ann               0       0     1     1

Measure\Author     Jim     Mary    Bob     Ann
P-PageRank         0.376   0.013   0.016   0.005
SimRank            0.716   0.572   0.713   0.184
Random Walk        0.8983  0.0238  0.0390  0
Pairwise R.W.      0.5714  0.4440  0.5556  0
PathSim (APCPA)    0.083   0.8     1       0
117
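The PathSim row of the table can be reproduced directly from the author-venue counts above. A short sketch; the normalization 2·M[x,y] / (M[x,x] + M[y,y]) over the commuting matrix M follows the PathSim definition in Sun et al. (VLDB'11):

```python
import numpy as np

# Author-venue publication counts from the toy example above.
authors = ["Mike", "Jim", "Mary", "Bob", "Ann"]
W = np.array([[ 2,  1, 0, 0],   # columns: SIGMOD, VLDB, ICDM, KDD
              [50, 20, 0, 0],
              [ 2,  0, 1, 0],
              [ 2,  1, 0, 0],
              [ 0,  0, 1, 1]])

# Commuting matrix for meta-path APCPA (author-paper-conference-paper-author).
M = W @ W.T

def pathsim(x, y):
    # Peers need both strong connectivity (numerator) and similar
    # visibility (denominator balances the two self-loop counts).
    return 2 * M[x, y] / (M[x, x] + M[y, y])

mike = authors.index("Mike")
for i, name in enumerate(authors):
    if i != mike:
        print(f"PathSim(Mike, {name}) = {pathsim(mike, i):.3f}")
# Reproduces the table's last row: Jim 0.083, Mary 0.800, Bob 1.000, Ann 0.000
```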
Example with DBLP: Find Academic Peers by PathSim
 Meta-path: Author-Paper-Venue-Paper-Author
 Anhai Doan (CS, Wisconsin; Database area; PhD 2002)
 Amol Deshpande (CS, Maryland; Database area; PhD 2004)
 Jignesh Patel (CS, Wisconsin; Database area; PhD 1998)
 Jun Yang (CS, Duke; Database area; PhD 2001)
118
Application to NLP: Entity Morph Resolution
 "Instant Noodles" (方便面) = "Zhou Yongkang" (周永康)
 "Conquer West King" (平西王) = "Bo Xilai" (薄熙来)
 "Tender Beef Pentagon" (嫩牛五方) = "Yang Mi" (杨幂)
 "The Hutt" = Chris Christie
119
Target Candidate Identification
 Considering all entities will be too overwhelming
120
Further Narrow Down by Social Correlation
121
 A morph and its target are likely to be posted by two users with strong social correlation on Weibo and Twitter, respectively (test data: average social correlation = 0.923)
 Explicit social correlation between users from the re-tweet, mention, reply and follower networks (Wen and Lin, 2010)
 Compute the degree of separation in user interactions and the amount of interaction
 Infer implicit social correlation between users by topic modeling and opinion mining
 Users who share similar interests are likely to post similar information and opinions
 Measure the content similarity of the messages posted by two users
 Temporal + social correlation: narrows down the number of target candidates to 1% with 100% accuracy
122
Cross-genre Comparable Corpora
 Weibo (censored):
 Conquer West King from Chongqing fell from power; still need to sing red songs?
 There is no difference between that guy's plagiarism and Buhou's gang crackdown.
 Remember that Buhou said his family was not rich at the press conference a few days before he fell from power. His son Bo Guagua is supported by his scholarship.
 Twitter and Chinese News (uncensored):
 Bo Xilai: ten thousand letters of accusation have been received during the Chongqing gang crackdown.
 The webpage of "Tianze Economic Study Institute", owned by the liberal party, has been closed. This is the first affected website of the liberal party after Bo Xilai fell from power.
 Bo Xilai gave an explanation about the source of his son Bo Guagua's tuition.
 Bo Xilai led Chongqing city leaders and 40 district and county party and government leaders to sing red songs.
123
Constructing Heterogeneous Information Networks
[Figure: example network linking the morph "Conquer West King" to Bo Xilai (PER), with nodes such as Bo Guagua (PER), Wen Jiabao (PER), Chongqing (GPE), China (GPE) and CCP (ORG), connected by relations such as Affiliation, Member, Children, Located, Top_employee and Justice]
 Each node is an entity
 An edge: a semantic relation, event, sentiment, semantic role, dependency relation or co-occurrence, associated with confidence values
124
Meta-paths
 Network Schema:
 M: Morphs
 E: Entities
 EV: Events
 NP: Non-Entity Noun Phrases
 Meta-path Examples:
 M - E - E
 M - EV - E
 M - NP - E
125
Data and Scoring Metric
 Data
 Time frame: 05/01/2012-06/30/2012
 1,555K Chinese messages from Weibo
 66K formal web documents from embedded URLs
 25K Chinese messages from English Twitter for sensitive morphs
 Test on 107 morph entities in Weibo, 23 of them sensitive
 Scoring Metric
   Acc@k = C_k / T
 C_k: the number of correctly resolved morphs at top position k
 T: the total number of morphs in the ground truth
126
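A minimal sketch of the Acc@k metric defined above; the toy morphs and rankings are illustrative only, not the tutorial's real outputs:

```python
def acc_at_k(ranked_candidates, gold, k):
    """ranked_candidates: dict morph -> ranked list of candidate targets;
    gold: dict morph -> correct target. Returns C_k / T."""
    correct = sum(1 for morph, target in gold.items()
                  if target in ranked_candidates.get(morph, [])[:k])
    return correct / len(gold)

# Hypothetical toy data:
gold = {"Conquer West King": "Bo Xilai", "Instant Noodles": "Zhou Yongkang"}
ranked = {"Conquer West King": ["Bo Xilai", "Wen Jiabao"],
          "Instant Noodles": ["Zhou Xun", "Zhou Yongkang"]}
print(acc_at_k(ranked, gold, 1))  # 0.5
print(acc_at_k(ranked, gold, 5))  # 1.0
```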
Overall Performance
[Bar charts: Acc@k for k = 1, 5, 10, 20 using the homogeneous vs. the heterogeneous network; reported accuracies range from 23.4% to 70.1%]
127
Section 4: Mining Heterogeneous
Information Networks: Prediction
and Recommendation
Relationship Prediction vs. Link Prediction
 Link prediction in homogeneous networks [Liben-Nowell and Kleinberg, 2003; Hasan et al., 2006]
 E.g., friendship prediction
 Relationship prediction in heterogeneous networks
 Different types of relationships need different prediction models
 Different connection paths need to be treated separately!
 Meta-path-based approach to define topological features
129
PathPredict: Meta-Path Based Relationship Prediction
 Meta-path-guided prediction of links and relationships
 Philosophy: meta-path relationships among similarly typed links share similar semantics and are comparable and inferable
 Co-author prediction (A-P-A)
 Use topological features encoded by meta-paths, e.g., citation relations between authors (A-P→P-A)
[Table: meta-paths between authors of length ≤ 4 and their semantic meanings, e.g., A-P-A-P-A (co-author of co-author), A-P-V-P-A (publish in the same venue), A-P-T-P-A (work on the same topic), A-P→P-A (citation)]
130
The Power of PathPredict: Experiment on DBLP
 Explain the prediction power of each meta-path
 Wald test for logistic regression
 Higher prediction accuracy than using the projected homogeneous network
 11% higher in prediction accuracy
 Co-author prediction for Jian Pei: only 42 among 4,809 candidates are true first-time co-authors!
 (Features collected in [1996, 2002]; test period in [2003, 2009])
131
Author Citation Time Prediction in DBLP
 Top-4 meta-paths for author citation time prediction:
 Study the same topic
 Co-cited by the same paper
 Follow co-authors' citations
 Follow the citations of authors who study the same topic
 Social relations are less important in author citation prediction than in co-author prediction
[Figure: predict when Philip S. Yu will cite a new author, under a Weibull distribution assumption]
132
Enhancing the Power of Recommender Systems by Heterogeneous Information Network Analysis
 Collaborative filtering methods suffer from the data sparsity issue
 A small set of users & items have a large number of ratings; most users & items have a small number of ratings
[Figure: long-tail distribution of ratings; example network connecting movies (Avatar, Aliens, Titanic, Revolutionary Road) to actors (Zoe Saldana, Leonardo DiCaprio, Kate Winslet), director James Cameron, and genres (Adventure, Romance)]
 Personalized recommendation with heterogeneous networks [WSDM'14]
 Heterogeneous relationships complement each other
 Users and items with limited feedback can be connected to the network by different types of paths
 Connect new users or items into the information network
 Different users may require different models: relationship heterogeneity makes personalized recommendation models easier to define
133
Relationship Heterogeneity Based Personalized Recommendation Models
 Different users may have different behaviors or preferences
 Ex.: different users may be interested in the same movie for different reasons — a James Cameron fan, an 80s sci-fi fan, or a Sigourney Weaver fan may all watch Aliens
 Two levels of personalization
 Data level: most recommendation methods use one model for all users and rely on personal feedback to achieve personalization
 Model level: with different entity relationships, we can learn personalized models for different users to further distinguish their differences
134
Preference Propagation-Based Latent Features
[Figure: users (Alice, Bob, Charlie) connected to movies (King Kong, Titanic, Skyfall, Revolutionary Road) through actors (Naomi Watts, Ralph Fiennes, Kate Winslet), director Sam Mendes, the genre "drama" and the tag "Oscar Nomination"]
1. Generate L different meta-paths (path types) connecting users and items
2. Propagate user implicit feedback along each meta-path
3. Calculate latent features for users and items for each meta-path with an NMF-related method
135
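As a rough sketch of steps 2-3, one can diffuse the user-item feedback matrix along a meta-path and factorize the result with NMF. The matrices and dimensions below are made up, and the actual pipeline in the WSDM'14 paper differs in detail:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
R = (rng.random((20, 15)) > 0.9).astype(float)   # user-item implicit feedback
S = rng.random((15, 15)); S = (S + S.T) / 2      # item-item similarity along one meta-path

R_diffused = R @ S                               # propagate feedback along the meta-path
model = NMF(n_components=5, init="nndsvda", max_iter=500)
U = model.fit_transform(R_diffused)              # user latent features for this meta-path
V = model.components_.T                          # item latent features for this meta-path

score = U @ V.T                                  # ranking scores under this single meta-path
```

Repeating this for each of the L meta-paths yields the per-path latent features that the recommendation models below combine.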
Recommendation Models
 Observation 1: different meta-paths may have different importance
 Global Recommendation Model:
   r(u_i, e_j) = \sum_{q=1}^{L} \theta_q \cdot \hat{U}_i^{(q)} \hat{V}_j^{(q)T}    (1)
   (ranking score from the latent features of user i and item j under the q-th meta-path)
 Observation 2: different users may require different models
 Personalized Recommendation Model:
   r(u_i, e_j) = \sum_{k=1}^{c} sim(C_k, u_i) \sum_{q=1}^{L} \theta^k_q \hat{U}_i^{(q)} \hat{V}_j^{(q)T}    (2)
   (c total soft user clusters; sim(C_k, u_i): user-cluster similarity)
136
Parameter Estimation
 Bayesian personalized ranking (Rendle et al., UAI'09)
 Objective function:
   min_\Theta \sum -ln \sigma(r(u_i, e_a) - r(u_i, e_b))    (3)
   (σ: sigmoid function; the sum runs over each correctly ranked item pair (e_a, e_b), i.e., u_i gave feedback to e_a but not to e_b)
 Learning the personalized recommendation model:
 Soft-cluster users with NMF + k-means
 For each user cluster, learn one model with Eq. (3)
 Generate a personalized model for each user on the fly with Eq. (2)
137
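A stochastic-gradient sketch of one BPR-style update on the meta-path weights θ of Eq. (1); the per-meta-path scores are hypothetical, and the real system also updates the latent factors:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpr_step(theta, x_pos, x_neg, lr=0.01, reg=0.001):
    """One gradient step on -ln sigma(r_pos - r_neg) + reg * ||theta||^2 / 2,
    where x_pos/x_neg hold the per-meta-path scores U_i^(q) V_j^(q)T for an
    observed vs. an unobserved item of the same user."""
    diff = theta @ (x_pos - x_neg)                  # r(u, e_pos) - r(u, e_neg)
    grad = -sigmoid(-diff) * (x_pos - x_neg) + reg * theta
    return theta - lr * grad

theta = np.zeros(4)
x_pos = np.array([0.9, 0.2, 0.4, 0.1])              # hypothetical scores
x_neg = np.array([0.1, 0.3, 0.0, 0.2])
for _ in range(100):
    theta = bpr_step(theta, x_pos, x_neg)
print(theta)  # weights drift toward meta-paths that rank the pair correctly
```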
Experiments: Heterogeneous Network Modeling Enhances the Quality of Recommendation
 Datasets
 Comparison methods
 Popularity: recommend the most popular items to users
 Co-click: conditional probabilities between items
 NMF: non-negative matrix factorization on user feedback
 Hybrid-SVM: Rank-SVM with plain features (utilizes both user feedback and the information network)
 HeteRec personalized recommendation (HeteRec-p) leads to the best recommendation
138
Summary of Theme III
 Network representation provides more extensive analysis of text corpora
 Relatively homogeneous, clean networks: common substructures appear frequently
 Graph indexing + approximate structural search
 Graph pattern mining
 Heterogeneous, noisy networks with knowledge sparsity, but a clear meta-path schema
 Meta-path based similarity search
 Meta-path guided prediction, classification and recommendation in heterogeneous networks
139
References for Theme III: Graph Search and Mining
 C. Borgelt and M. R. Berthold, "Mining Molecular Fragments: Finding Relevant Substructures of Molecules", ICDM'02
 S. Kim, H. Kim, T. Weninger, J. Han, H. D. Kim, "Authorship Classification: A Discriminative Syntactic Tree Mining Approach", SIGIR'11
 M. Kuramochi and G. Karypis, "Frequent Subgraph Discovery", ICDM'01
 S. Nijssen and J. Kok, "A Quickstart in Frequent Structure Mining Can Make a Difference", KDD'04
 X. Yan and J. Han, "gSpan: Graph-Based Substructure Pattern Mining", ICDM'02
 X. Yan and J. Han, "CloseGraph: Mining Closed Frequent Graph Patterns", KDD'03
 X. Yan, P. S. Yu, and J. Han, "Graph Indexing: A Frequent Structure-based Approach", SIGMOD'04
 X. Yan, P. S. Yu, and J. Han, "Substructure Similarity Search in Graph Databases", SIGMOD'05
 S. Yang, Y. Wu, H. Sun, X. Yan, "Schemaless and Structureless Graph Querying", VLDB'14
 F. Zhu, Q. Qu, D. Lo, X. Yan, J. Han, and P. S. Yu, "Mining Top-K Large Structural Patterns in a Massive Network", VLDB'11
140
References for Theme III: Heterogeneous Network Search & Mining
 H. Huang, Z. Wen, D. Yu, H. Ji, Y. Sun, J. Han and H. Li, "Resolving Entity Morphs in Censored Data", ACL'13
 D. Yu, H. Huang, T. Cassidy, H. Ji, C. Wang, S. Zhi, J. Han, C. Voss and M. Magdon-Ismail, "The Wisdom of Minority: Unsupervised Slot Filling Validation based on Multi-dimensional Truth-Finding", COLING'14
 Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012
 Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, T. Wu, "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT'09
 Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, "PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks", VLDB'11
 Y. Sun, J. Han, C. C. Aggarwal, and N. Chawla, "When Will It Happen? Relationship Prediction in Heterogeneous Information Networks", WSDM'12
 X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han, "Personalized Entity Recommendation: A Heterogeneous Information Network Approach", WSDM'14
 Y. Sun, R. Barber, M. Gupta, C. Aggarwal and J. Han, "Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks", ASONAM'11
141
Conclusions and Future Research
Directions
142
Conclusions
 Data mining and NLP have been working toward a common goal — turning data into knowledge (D2K) — but with different methodologies
 Data mining explores massive amounts of data, with less in-depth analysis of individual documents
 The two fields will benefit each other by integrating their technologies and methodologies
 We have been proposing a D2N2K (data to network to knowledge) methodology
 Phrase mining in massive corpora
 Entity recognition and typing by correlation and cluster analysis
 Construction of massive typed heterogeneous information networks
 Mining actionable knowledge from such "semi-structured" information networks
 Lots to be done in the future!
143
Looking Forward
 Lots of unanswered questions and research issues
 What is the best framework for integration and joint inference?
 Is there an ideal common representation, or a layer in between? Can we even go beyond heterogeneous information networks?
 Apply the D2N2K framework to other NLP applications
 Network-search-based collective entity linking
 Cross-lingual information network alignment
 Semantic-graph-based machine translation
 Expert finding, research recommendation, …
 …
144
Tools
 Check our research package dissemination portal
 IlliMine: http://illimine.cs.uiuc.edu/
 Phrase Mining: https://github.com/shangjingbo1226/SegPhrase
 Entity Typing: http://shanzhenren.github.io/ClusType
 Entity Morph Decoding: http://nlp.cs.rpi.edu/software/morphdecoding.tar.gz
 Graph Mining Tools — gSpan: http://www.cs.ucsb.edu/~xyan/software/gSpan.htm
145
Acknowledgement
NETWORK SCIENCE CTA
146
BACKUP
Different Input Focus
 NLP: unstructured text data
 Data Mining (DM): more on structured and semi-structured data
148
NLP Challenge 1: Data and Knowledge Sparsity
 Data sparsity: need to obtain high-level statistics as global evidence
 Training a typical supervised model needs a large number of instances
 500 documents for phrase chunking and entity, relation, event extraction
 20,000 queries for entity linking
 Knowledge sparsity: requires global knowledge acquired from a wider context at low cost
 Source: "Translation out of hype-speak: some kook made threatening noises at Brownback and got arrested"
 KB: "Samuel Dale "Sam" Brownback (born September 12, 1956) is an American politician, the 46th and current Governor of Kansas."
 Who is Brownback? → background data/knowledge aggregation
149
DM Solution: Leverage Information Redundancy
 Acquire a large amount of related documents from multiple sources (genres, languages, data modalities)
 Learn high-level data-driven statistics to extract, type and link information
 Frequent pattern mining, ranking based on different criteria:
 Popularity: 'information retrieval' vs. 'cross-language information retrieval'
 Concordance: 'active learning' vs. 'learning classification'
 Completeness: 'vector machine' vs. 'support vector machine'
 Construct relational graphs for weakly-supervised label propagation
150
NLP Challenge 2: From Data to Knowledge (D2K)
Directly converting a large amount of unstructured data into knowledge is too ambitious and unrealistic
 Example 1
 Input: Millions of discussion forum posts under censorship
 Output: Resolve each implicit entity mention to its real target
 Example 2
 Input: 15 years of non‐parallel Chinese and English news articles
 Output: Entity translation pairs
 Example 3
 Input: Millions of multi‐source documents reporting conflicting information
 Output: Track each company’s top employees over time, complete knowledge bases
151
Session 1. Phrase Mining:
A Data Mining Approach
TNG: Experiments on Research Papers
153
Phrase Mining and Topic Modeling: An Example
 Motivation: unigrams (single words) can be difficult to interpret
 Ex.: the topic that represents the area of Machine Learning

Unigrams: learning, reinforcement, support, machine, vector, selection, feature, random, classification, …
versus
Phrases: learning, support vector machines, reinforcement learning, feature selection, conditional random fields, decision trees, …
154
KERT: Topical Keyphrase Extraction & Ranking
 Input: unigram topic assignment from a bag-of-words topic model (Topic 1 & Topic 2), over paper titles such as:
 knowledge discovery using least squares support vector machine classifiers
 support vectors for reinforcement learning
 a hybrid approach to feature selection
 pseudo conditional random fields
 automatic web page classification in a dynamic and hierarchical way
 inverse time dependency in convex regularized learning
 postprocessing decision trees to extract actionable knowledge
 variance minimization least squares support vector machines
 …
 Output: topical keyphrases, extracted & ranked: learning, support vector machines, reinforcement learning, feature selection, conditional random fields, decision trees, classification, …
155
Framework of KERT
1. Run bag-of-words model inference and assign a topic label to each token
2. Extract candidate keyphrases within each topic via frequent pattern mining
3. Rank the keyphrases in each topic by:
 Popularity: 'information retrieval' vs. 'cross-language information retrieval'
 Discriminativeness: only frequent in documents about topic t
 Concordance: 'active learning' vs. 'learning classification'
 Completeness: 'vector machine' vs. 'support vector machine'
 Comparability property: directly compare phrases of mixed lengths
156
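To make the concordance criterion concrete, here is a hedged sketch that scores a candidate phrase by how far its observed frequency exceeds what independent unigrams would predict (a PMI-style ratio); the counts are made up and KERT's actual statistic may differ:

```python
import math
from collections import Counter

total = 10_000  # hypothetical token count of the topic's documents
unigram_counts = Counter({"active": 60, "learning": 300, "classification": 120})
phrase_counts = Counter({("active", "learning"): 45,
                         ("learning", "classification"): 4})

def concordance(phrase):
    """log(observed phrase frequency / frequency expected under independence)."""
    expected = 1.0
    for w in phrase:
        expected *= unigram_counts[w] / total
    observed = phrase_counts[phrase] / total
    return math.log(observed / expected)

print(concordance(("active", "learning")))          # high: a true phrase
print(concordance(("learning", "classification")))  # near zero: a chance collocation
```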
Summary: Strategies on Topical Phrase Mining
 Strategy 1: generate bag-of-words → generate sequence of tokens
 Integrated complex model; phrase quality and topic inference rely on each other
 Slow and overfitting
 Strategy 2: post bag-of-words model inference, visualize topics with n-grams
 Phrase quality relies on topic labels for unigrams
 Can be fast; generally high-quality topics and phrases
 Strategy 3: prior to bag-of-words model inference, mine phrases and impose them on the bag-of-words model
 Topic inference relies on correct segmentation of documents, but is not sensitive to it
 Can be fast; generally high-quality topics and phrases
157
Session 2. Simultaneously
Inferring Phrases and
Topics
Session 3. Post Topic Modeling
Phrase Construction
Session 4. First Phrase
Mining then Topic Modeling
Session 5. SegPhrase:
Integration of Phrase Mining
with Document Segmentation
Mining Phrases: Why Not Use Raw Frequency Based Methods?
 Traditional data-driven approaches
 Frequent pattern mining
 If AB is frequent, AB could likely be a phrase
 Raw frequency could NOT reflect the quality of phrases
 E.g., freq(vector machine) ≥ freq(support vector machine)
 Need to rectify the frequency based on segmentation results
 Phrasal segmentation will tell: some words should be treated as a whole phrase, whereas others are still unigrams
162
ClassPhrase I: Pattern Mining for Candidate Set
 Build a candidate phrase set by frequent pattern mining
 Mine frequent k-grams; k is typically small, e.g., 6 in our experiments
 Popularity measured by the raw frequency of words and phrases mined from the corpus
163
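A minimal sketch of this candidate-generation step: count contiguous k-grams (k ≤ 6) and keep those above a support threshold. The threshold and tiny corpus are illustrative:

```python
from collections import Counter

def frequent_kgrams(docs, max_k=6, min_support=3):
    """Count contiguous k-grams (k <= max_k) and keep those with
    count >= min_support -- a sketch of ClassPhrase's candidate generation."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for k in range(1, max_k + 1):
            for i in range(len(tokens) - k + 1):
                counts[tuple(tokens[i:i + k])] += 1
    return {ngram: c for ngram, c in counts.items() if c >= min_support}

docs = ["we train a support vector machine",
        "support vector machine classifiers are effective",
        "a support vector machine solves classification"]
# ('support', 'vector', 'machine') survives with count 3; most others do not
print(frequent_kgrams(docs))
```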
Classifier with Limited Training
 Labels: whether a phrase is a quality one or not
 "support vector machine": 1
 "the experiment shows": 0
 For a ~1GB corpus, only 300 labels
 Random Forest as our classifier
 Predicted phrase quality scores lie in [0, 1]
 Bootstrap many different datasets from limited labels
164
SegPhrase: Why Do We Need Phrasal Segmentation in Corpus?
 Phrasal segmentation can tell which phrase is more appropriate
 Ex.: "A standard [feature vector] [machine learning] setup is used to describe…" — here "vector machine" is not counted towards the rectified frequency
 Rectified phrase frequency (expected influence)
 Example: [figure omitted]
165
SegPhrase: Segmentation of Phrases
 Partition a sequence of words by maximizing the likelihood, considering:
 Phrase quality score: ClassPhrase assigns a quality score to each phrase
 Probability in corpus
 Length penalty α: when α < 1, it favors shorter phrases
 Filter out phrases with low rectified frequency
 Bad phrases are expected to rarely occur in the segmentation results
166
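A Viterbi-style dynamic program illustrates the idea: choose the segmentation maximizing the sum of log quality-weighted segment scores, with a length penalty α < 1 rewarding partitions into more (hence shorter) segments unless phrase quality justifies merging. The quality scores below are hypothetical, not SegPhrase's learned values:

```python
import math

# Hypothetical quality scores; SegPhrase learns these with its classifier.
quality = {("support", "vector", "machine"): 0.95, ("vector", "machine"): 0.1}

def segment(tokens, max_len=6, alpha=0.8):
    """Viterbi-style search for the max-likelihood phrasal segmentation."""
    n = len(tokens)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            seg = tuple(tokens[i:j])
            q = quality.get(seg, 0.01 if len(seg) == 1 else 1e-6)
            # length penalty alpha^(len - 1): with alpha < 1, longer segments
            # pay more, so only high-quality phrases stay merged
            score = best[i][0] + math.log(q) + (len(seg) - 1) * math.log(alpha)
            if score > best[j][0]:
                best[j] = (score, i)
    segs, j = [], n                      # backtrack the best cut points
    while j > 0:
        i = best[j][1]
        segs.append(tokens[i:j])
        j = i
    return segs[::-1]

print(segment("a standard support vector machine setup".split()))
# [['a'], ['standard'], ['support', 'vector', 'machine'], ['setup']]
```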
SegPhrase+: Enhancing Phrasal Segmentation
 SegPhrase+: one more round for enhanced phrasal segmentation
 Feedback
 Using rectified frequency, re-compute the features previously computed from raw frequency
 Process
 Classification → phrasal segmentation // SegPhrase
 Classification → phrasal segmentation → feedback → classification → phrasal segmentation // SegPhrase+
 Effects on computing quality scores, e.g.:
 "np hard in the strong sense" vs. "np hard in the strong"
 "data base management system"
167
Linking Entities to Knowledge Base
 Example tweet: "Stay up Hawk Fans. We are going through a slump, but we have to stay positive. Go Hawks!"
168
Recognizing Typed Entities
 Identifying token spans as entity mentions in documents and labeling their types
 Target types: FOOD, LOCATION, JOB_TITLE, EVENT, ORGANIZATION, …
 Plain text: "The best BBQ I've tasted in Phoenix! I had the pulled pork sandwich with coleslaw and baked beans for lunch. … The owner is very nice. …"
 Text with typed entities: "The best BBQ:FOOD I've tasted in Phoenix:LOC! I had the [pulled pork sandwich]:FOOD with coleslaw:FOOD and [baked beans]:FOOD for lunch. … The owner:JOB_TITLE is very nice. …"
 Enables structured analysis of an unstructured text corpus
169
Entity Linking: A Relational Graph Approach
 Relational graph: each pair of mention m and concept c is a node
 Edges encode local compatibility, coreference, and semantic relatedness between nodes; seed nodes carry binary labels (0/1)
 The model (adapted from Zhu, 2003): semi-supervised graph regularization
 y_i: the label of node i
 W: weight matrix of the relational graph
170
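A generic Zhu (2003)-style propagation sketch conveys the model's flavor: seed (mention, concept) nodes keep their labels while the remaining labels are iteratively smoothed over the weighted relational graph. The 5-node weight matrix is made up:

```python
import numpy as np

W = np.array([[0, 1, 0, 1, 0],          # hypothetical relational-graph weights
              [1, 0, 1, 0, 1],
              [0, 1, 0, 0, 1],
              [1, 0, 0, 0, 1],
              [0, 1, 1, 1, 0]], dtype=float)
y = np.array([1.0, 0.0, 0.5, 0.5, 0.5])  # nodes 0 and 1 are labeled seeds
seeds = np.array([True, True, False, False, False])

for _ in range(100):
    y_new = W @ y / W.sum(axis=1)         # average the neighbors' labels
    y_new[seeds] = y[seeds]               # clamp the seed labels
    y = y_new
print(np.round(y, 3))                     # propagated (mention, concept) scores
```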
Relational Graph Construction (con't)
[Figure: relational graph with edge weights between (mention, concept) pairs]
 Semantic Relatedness (SR)
 SR between two mentions: meta-path
 SR between two concepts: link structure in Wikipedia (Milne and Witten, 2008)
 Linear combination of these three graphs
171
Models for Comparison
 TagMe: an unsupervised model based on prior popularity and semantic relatedness within a single message (Ferragina and Scaiella, 2010)
 Meij: the state-of-the-art supervised approach based on the random forest model (Meij et al., 2012)
 SSRegu: our proposed semi-supervised graph regularization model with all three types of relations
172
Quality of Semantic Relatedness
[Figure: DSRM vs. the standard relatedness method M&W (Milne and Witten, 2008)]
173
Impact of Semantic Relatedness on Concept Disambiguation
 News dataset: 4,485 mentions (Hoffart et al., 2011)
 AIDA: an unsupervised collective inference method (Hoffart et al., 2011)
 Our methods are completely unsupervised
[Figures: Tweet set — TagMe, Meij, SSRegu+M&W, SSRegu+DSRM; News dataset — AIDA, SSRegu+M&W, SSRegu+DSRM]
174
Remaining Challenges
 Mention detection is the performance bottleneck
 Mention disambiguation: city and country names that refer to sports teams (e.g., "Miami" → "Miami Heat")
 Incorporate user interests
 Non-linkable entity mention recognition and clustering
[Figure: error distribution]
175
Transformation-Based Graph Querying
 Transformation-based graph querying produces many results
 How to suggest the "best" results to users?
 Different transformations should be weighted differently
 Ex.: given a single-node query "Geoffrey Hinton", nodes with "G. Hinton" (abbreviation transformation) shall be ranked higher than nodes with "Hinton" (last-token transformation)
 How to determine the weights? They shall be learned
 Evaluate a candidate match φ(Q) with a ranking function over features:
 Node matching feature: F_V(v, φ(v)) = \sum_i \alpha_i f_i(v, φ(v))
 Edge matching feature: F_E(e, φ(e)) = \sum_j \beta_j g_j(e, φ(e))
 Matching score: P(φ(Q) | Q) ∝ exp( \sum_{v \in V_Q} F_V(v, φ(v)) + \sum_{e \in E_Q} F_E(e, φ(e)) )
176
Potential Challenges in Applying Graph Indices + Search to NLP
 The graphs automatically constructed from natural language texts might:
 Include many more diverse types and weights
 Include more ambiguity
 Suffer from knowledge sparsity; hard to generalize unique nodes/substructures
 Include more noise and errors
177
Mining Frequent (Sub)Graph Patterns
 Data mining research has developed many scalable graph pattern mining methods
 Given a labeled graph dataset D = {G_1, G_2, …, G_n}, the supporting graph set of a subgraph g is D_g = {G_i | g ⊆ G_i, G_i ∈ D}
 support(g) = |D_g| / |D|
 A (sub)graph g is frequent if support(g) ≥ min_sup
 Ex.: chemical structures
[Figure: a graph dataset of three chemical structures (A), (B), (C); with min_sup = 2, frequent graph patterns are found, e.g., one with support 67%]
 Alternative: mining frequent subgraph patterns from a single large graph or network
 Documents can be viewed as graph structures as well
178
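The support definition can be checked directly with a subgraph-isomorphism test; a sketch using networkx on toy labeled graphs (real frequent-pattern miners such as gSpan avoid this brute-force matching):

```python
import networkx as nx
from networkx.algorithms import isomorphism

def support(pattern, dataset):
    """support(g) = |D_g| / |D|: fraction of graphs containing `pattern`
    as a label-matching subgraph."""
    nm = isomorphism.categorical_node_match("label", None)
    hits = sum(
        1 for G in dataset
        if isomorphism.GraphMatcher(G, pattern, node_match=nm).subgraph_is_isomorphic()
    )
    return hits / len(dataset)

def make(edges):
    """Build a small labeled graph; node labels stand in for atom types."""
    G = nx.Graph()
    for (u, lu), (v, lv) in edges:
        G.add_node(u, label=lu); G.add_node(v, label=lv); G.add_edge(u, v)
    return G

D = [make([((0, "C"), (1, "O")), ((1, "O"), (2, "N"))]),   # toy "molecules"
     make([((0, "C"), (1, "O"))]),
     make([((0, "N"), (1, "S"))])]
g = make([((0, "C"), (1, "O"))])                           # candidate pattern
print(support(g, D))  # 2/3, i.e., ~67%
```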
Efficient Graph Pattern Mining Methods
 Effective graph pattern mining methods have been developed for different scenarios
 Mining graph "transaction" datasets: e.g., gSpan (Yan et al., ICDM'02), CloseGraph (Yan et al., KDD'03), …
 Mining frequent large subgraph structures in a single massive network: e.g., Spider-Mine (Zhu et al., VLDB'11)
 Graph pattern mining forms building blocks for graph classification, clustering, compression, comparison, and correlation analysis
 Graph indexing and graph similarity search
 gIndex (SIGMOD'04): graph indexing by graph pattern mining
 Helps precise as well as similarity-based graph query search and answering
179
Mining Collaboration Patterns in DBLP Networks
 Data description: 600 top conferences, 9 major CS areas, 15,071 authors in DB/DM
 Authors labeled by # of papers published in DB/DM: Prolific (P): ≥ 50; Senior (S): 20-49; Junior (J): 10-19; Beginner (B): 5-9
[Figures: collaboration patterns found and their instances in the data]
180
Applications of Graph Pattern Mining
 Bioinformatics: gene networks, protein interactions, metabolic pathways
 Chem-informatics: mining chemical compound structures
 Social networks, web communities, tweets, …
 Cell phone networks, computer networks, …
 Web graphs, XML structures, semantic Web, information networks
 Software engineering: program execution flow analysis
 Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
 Graph indexing and graph similarity search
181
Logistic Regression-Based Prediction Model
 Training and test pairs: <x_i, y_i> = <history feature list, future relationship label>

Pair          A-P-A-P-A  A-P-V-P-A  A-P-T-P-A  A-P→P-A   A-P-A (label)
<Mike, Ann>   4          5          100        3         Yes = 1
<Mike, Jim>   0          1          20         2         No = 0

 Logistic Regression Model
 Model the probability for each relationship as P(y_i = 1 | x_i) = e^{x_i \beta} / (1 + e^{x_i \beta})
 β: the coefficients for each feature (including a constant 1)
 MLE estimation: maximize the likelihood of observing all the relationships in the training data
182
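A short sketch of fitting this model with scikit-learn; the two labeled rows come from the toy table above, and the remaining rows are made-up pairs so both classes have a few examples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: A-P-A-P-A, A-P-V-P-A, A-P-T-P-A, A-P->P-A meta-path counts.
X = np.array([[4, 5, 100, 3],    # <Mike, Ann> from the toy table
              [0, 1,  20, 2],    # <Mike, Jim> from the toy table
              [2, 3,  60, 1],    # hypothetical extra pairs
              [0, 0,   5, 0],
              [5, 4,  90, 2],
              [1, 0,  10, 0]])
y = np.array([1, 0, 1, 0, 1, 0]) # future co-authorship (A-P-A): Yes = 1 / No = 0

clf = LogisticRegression()       # includes an intercept (the constant feature)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)
print(clf.predict_proba([[3, 2, 50, 1]])[:, 1])  # P(future co-authorship)
```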
When Will It Happen?
 From "whether" to "when"
 "Whether": will Jim rent the movie "Avatar" in Netflix? Output: P(X = 1) = ?
 "When": when will Jim rent the movie "Avatar"? Output: a distribution over time!
 What is the probability Jim will rent "Avatar" within 2 months?
 By when will Jim rent "Avatar" with 90% probability?
 What is the expected time it will take for Jim to rent "Avatar"?
 May provide useful information to supply chain management
183
The Relationship Building Time Prediction Model
 Solution
 Directly model the relationship building time: P(Y = t)
 Geometric distribution, exponential distribution, Weibull distribution
 Use a generalized linear model
 Deal with censoring (the relationship builds beyond the observed time T: right censoring)
[Figures: censoring; generalized linear model under the Weibull distribution assumption; training framework]
184
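Once the generalized linear model produces Weibull parameters for a pair, the three "when" questions above reduce to CDF, quantile, and mean queries; a sketch with hypothetical parameters:

```python
from scipy.stats import weibull_min

# Hypothetical Weibull parameters for one pair: shape k, scale lam (months).
k, lam = 1.5, 12.0
dist = weibull_min(c=k, scale=lam)

print(dist.cdf(2))     # P(relationship built within 2 months)
print(dist.ppf(0.9))   # time by which it happens with 90% probability
print(dist.mean())     # expected relationship building time
```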