Successful Data Mining Methods for NLP
Jiawei Han (UIUC), Heng Ji (RPI), Yizhou Sun (NEU)
Slides: http://hanj.cs.illinois.edu/slides/dmnlp15.pptx | http://nlp.cs.rpi.edu/paper/dmnlp15.pptx

Introduction: Where Do NLP and DM Meet?

Slightly Different Research Philosophies
- NLP: deep understanding of individual words, phrases and sentences ("micro-level"); focus on unstructured text data
- Data Mining (DM): high-level (statistical) understanding, discovery and synthesis of the most salient information ("macro-level"); historically focused more on structured and semi-structured data
- [Figure: NewsNet related to "Health Care Bill" (Tao et al., 2014)]

DM Solution: Data to Networks to Knowledge (D2N2K)
- Advantages of NLP: construct graphs/networks with fine-grained semantics from unstructured texts; use large-scale annotations for real-world data
- Advantages of DM: deep understanding through structured/correlation inference
- Use a structured representation (e.g., graph, network) as a bridge to capture interactions between NLP and DM
- Example: heterogeneous information networks [Han et al., 2010; Sun et al., 2012]

A Promising Direction: Integrating DM and NLP
- Major theme of this tutorial: applying novel DM methods to solve traditional NLP problems; integrating DM and NLP to transform Data to Networks to Knowledge
- Road map of this tutorial:
  - Effective network construction by leveraging information redundancy
    - Theme I: Phrase Mining and Topic Modeling from Large Corpora
    - Theme II: Entity Extraction and Linking by Relational Graph Construction
  - Mining knowledge from structured networks
    - Theme III: Search and Mining Structured Graphs and Heterogeneous Networks
  - Looking forward to the future

Theme I: Phrase Mining and Topic Modeling from Large Corpora

Why Phrase Mining?
- Phrase: minimal, unambiguous semantic unit; basic building block for information networks and knowledge bases
- Unigrams vs. phrases:
  - Unigrams (single words) are ambiguous. Example: "United": United States? United Airlines? United Parcel Service?
  - A phrase is a natural, meaningful, unambiguous semantic unit. Example: "United States" vs. "United Airlines"
- Mining semantically meaningful phrases:
  - Transform text data from word granularity to phrase granularity
  - Enhance the power and efficiency of manipulating unstructured data using database technology

Mining Phrases: Why Not Just Use NLP Methods?
- Phrase mining originated in the NLP community as "chunking"
  - Modeled as a sequence labeling problem (B-NP, I-NP, O, …); tags are decoded into phrases as in the sketch after this slide
  - Needs annotation and training: annotate hundreds of POS-tagged documents as training data, then train a supervised model on part-of-speech features
  - Recent trend: use distributional features based on web n-grams (Bergsma et al., 2010)
  - State-of-the-art performance: ~95% accuracy, ~88% phrase-level F-score
- Limitations:
  - High annotation cost; not scalable to a new language, domain or genre
  - May not fit domain-specific, dynamic, emerging applications: scientific domains, query logs, or social media (e.g., Yelp, Twitter)
  - Uses only local features; no ranking, no links to topics
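As a concrete illustration of the chunking formulation above, the sketch below decodes B-NP/I-NP/O tag sequences into phrase spans. It is a minimal, hypothetical example (the tokens and tags are made up); a real chunker would produce the tags with a trained sequence model.

```python
# Minimal sketch: decoding BIO chunk tags (B-NP, I-NP, O) into noun-phrase spans.
# The tokens/tags below are invented examples, not output of any real system.

def decode_bio(tokens, tags):
    """Collect maximal B-NP/I-NP runs as phrases."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-NP":                 # a new phrase starts here
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I-NP" and current:   # continue the open phrase
            current.append(token)
        else:                             # O (or stray I-NP) closes any open phrase
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

tokens = ["we", "train", "support", "vector", "machines", "on", "web", "data"]
tags   = ["O",  "O",     "B-NP",    "I-NP",   "I-NP",     "O",  "B-NP", "I-NP"]
print(decode_bio(tokens, tags))  # ['support vector machines', 'web data']
```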
Data Mining Approaches for Phrase Mining
- General principle: corpus-based; fully exploit information redundancy and data-driven criteria to determine phrase boundaries and salience; use local evidence to adjust corpus-level statistics
- Phrase mining and topic modeling from large corpora:
  - Strategy 1: simultaneously inferring phrases and topics
    - Bigram topic model [Wallach'06], topical n-gram model [Wang et al.'07], phrase-discovering topic model [Lindsey et al.'12]
  - Strategy 2: post topic modeling phrase construction
    - Topic labeling [Mei et al.'07], TurboTopics [Blei & Lafferty'09], KERT [Danilevsky et al.'14]
  - Strategy 3: first phrase mining, then topic modeling
    - ToPMine [El-Kishky et al., VLDB'15]
  - Integration of phrase mining with document segmentation
    - SegPhrase [Liu et al., SIGMOD'15]

Strategy 1: Simultaneously Inferring Phrases and Topics
- Bigram Topic Model [Wallach'06]: probabilistic generative model that conditions on the previous word and topic when drawing the next word
- Topical N-Grams (TNG) [Wang et al.'07]: probabilistic model that generates words in textual order; creates n-grams by concatenating successive bigrams (a generalization of the Bigram Topic Model)
- Phrase-Discovering LDA (PDLDA) [Lindsey et al.'12]: viewing each sentence as a time series of words, PDLDA posits that the generative parameter (topic) changes periodically; each word is drawn based on the previous m words (context) and the current phrase topic
- High model complexity: tends to overfit; high inference cost: slow

Strategy 2: Post Topic Modeling Phrase Construction
- TurboTopics [Blei & Lafferty'09]: phrase construction as a post-processing step to Latent Dirichlet Allocation
  - Perform LDA on the corpus to assign each token a topic label
  - Merge adjacent unigrams with the same topic label by a distribution-free permutation test on an arbitrary-length back-off model
  - End recursive merging when all significant adjacent unigrams have been merged
- KERT [Danilevsky et al.'14]: phrase construction as a post-processing step to LDA
  - Perform frequent pattern mining on each topic
  - Perform phrase ranking based on four different criteria

Example of TurboTopics
- Perform LDA on the corpus to assign each token a topic label
  - E.g., … phase_11 transition_11 … game_153 theory_127 …
- Then merge adjacent unigrams with the same topic label

Framework of KERT
1. Run bag-of-words model inference and assign a topic label to each token
2. Extract candidate keyphrases within each topic (frequent pattern mining)
3. Rank the keyphrases in each topic
   - Comparability property: directly compare phrases of mixed lengths
Comparison of top-ranked phrases for the topic "learning" (from a machine learning corpus):

    kpRel [Zhao et al.'11]:       learning, classification, selection, models, algorithm, features, decision, …
    KERT (-popularity):           effective, text, probabilistic, identification, mapping, task, planning, …
    KERT (-discriminativeness):   support vector machines, feature selection, reinforcement learning, conditional random fields, constraint satisfaction, decision trees, dimensionality reduction, …
    KERT (-concordance):          learning, classification, selection, feature, decision, bayesian, trees, …
    KERT [Danilevsky et al.'14]:  learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, …

Strategy 3: First Phrase Mining, then Topic Modeling
- ToPMine [El-Kishky et al., VLDB'15]: first phrase construction, then topic mining
  - Contrast with KERT: topic modeling first, then phrase mining
- The ToPMine framework:
  - Perform frequent contiguous pattern mining to extract candidate phrases and their counts
  - Perform agglomerative merging of adjacent unigrams as guided by a significance score; this segments each document into a "bag-of-phrases"
  - The newly formed bags-of-phrases are passed as input to PhraseLDA, an extension of LDA that constrains all words in a phrase to share the same latent topic

Why First Phrase Mining, then Topic Modeling?
- With Strategy 2, tokens in the same phrase may be assigned to different topics
  - Ex: "knowledge discovery using least squares support vector machine classifiers": "knowledge discovery" and "support vector machine" should each receive coherent topic labels
- Solution: switch the order of phrase mining and topic model inference
  - [knowledge discovery] using [least squares] [support vector machine] [classifiers] …
- Techniques: phrase mining and document segmentation; topic model inference with phrase constraints

Phrase Mining: Frequent Pattern Mining + Statistical Analysis
- Quality phrases are identified by a significance score [Church et al.'91]:
    α(P1, P2) ≈ (f(P1●P2) − µ0(P1, P2)) / √f(P1●P2)
  where f(P1●P2) is the observed frequency of the merged phrase and µ0(P1, P2) its expected frequency if P1 and P2 occurred independently (a merging sketch follows the table below)
- Example segmentations: [Markov blanket] [feature selection] for [support vector machines]; [knowledge discovery] using [least squares] [support vector machine] [classifiers]; … [support vector] for [machine learning] …
- Raw vs. true (rectified) frequency:
    Phrase                     Raw freq.   True freq.
    [support vector machine]   90          80
    [vector machine]           95          0
    [support vector]           100         20
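The agglomerative merging step can be sketched in a few lines. The code below is a simplified illustration of significance-guided merging over one tokenized document, assuming phrase counts and a corpus size are already available from the frequent-pattern-mining pass (the counts here are made up); the α score follows the formula above with µ0 estimated under an independence assumption, and the optimizations of the actual ToPMine implementation are omitted.

```python
import math

# Toy corpus statistics (assumed to come from a frequent contiguous
# pattern mining pass); counts here are invented for illustration.
counts = {("support",): 100, ("vector",): 120, ("machine",): 110,
          ("support", "vector"): 90, ("vector", "machine"): 85,
          ("support", "vector", "machine"): 80}
N = 10000  # total number of token positions in the corpus

def alpha(left, right):
    """Significance of merging two adjacent phrases, per Church et al.'91."""
    merged = left + right
    f_obs = counts.get(merged, 0)
    # Expected count if `left` and `right` occurred independently.
    mu0 = counts.get(left, 0) * counts.get(right, 0) / N
    return (f_obs - mu0) / math.sqrt(f_obs) if f_obs > 0 else float("-inf")

def segment(tokens, threshold=3.0):
    """Greedy agglomerative merging of the best-scoring adjacent pair."""
    phrases = [(t,) for t in tokens]
    while len(phrases) > 1:
        scores = [alpha(phrases[i], phrases[i + 1]) for i in range(len(phrases) - 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:   # stop when no merge is significant
            break
        phrases[best:best + 2] = [phrases[best] + phrases[best + 1]]
    return [" ".join(p) for p in phrases]

print(segment(["support", "vector", "machine"]))  # ['support vector machine']
```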
Collocation Mining
- Collocation: a sequence of words that occurs more frequently than expected
- Collocations are often "interesting" and, due to their non-compositionality, often convey information not carried by their constituent terms (e.g., "made an exception", "strong tea")
- Many different measures can extract collocations from a corpus [Dunning 93, Pederson 96]: e.g., mutual information, t-test, z-test, chi-squared test, likelihood ratio
- Many of these measures can be used to guide the agglomerative phrase-segmentation algorithm

ToPMine: PhraseLDA (Constrained Topic Modeling)
- The generative model for PhraseLDA is the same as LDA
- Difference: the model incorporates constraints obtained from the "bag-of-phrases" input
- A chain graph shows that all words in a phrase are constrained to take on the same topic values
  - [knowledge discovery] using [least squares] [support vector machine] [classifiers] …
- Topic model inference with phrase constraints

Example Topical Phrases: A Comparison
- PDLDA [Lindsey et al.'12], Strategy 1 (3.72 hours):
    Topic 1: social networks, web search, time series, search engine, management system, real time, decision trees, …
    Topic 2: information retrieval, text classification, machine learning, support vector machines, information extraction, neural networks, text categorization, …
- ToPMine [El-Kishky et al.'14], Strategy 3 (67 seconds):
    Topic 1: information retrieval, social networks, search engine, machine learning, large scale, question answering, web pages, …
    Topic 2: feature selection, web search, information extraction, semi supervised, support vector machines, active learning, face recognition, …

[Figures: ToPMine experiments on DBLP abstracts; ToPMine topics on Associated Press news (1989); ToPMine experiments on Yelp reviews]

Efficiency: Running Time of Different Strategies
- Running time: Strategy 3 > Strategy 2 > Strategy 1 (">" means outperforms)

Coherence of Topics: Comparison of Strategies
- Coherence measured by z-score: Strategy 3 > Strategy 2 > Strategy 1

Phrase Intrusion: Comparison of Strategies
- Phrase intrusion measured by average number of correct answers: Strategy 3 > Strategy 2 > Strategy 1

Phrase Quality: Comparison of Strategies
- Phrase quality measured by z-score: Strategy 3 > Strategy 2 > Strategy 1

(Recall the strategies: Strategy 1 generates a bag-of-words, then a sequence of tokens; Strategy 2 runs bag-of-words model inference first, then visualizes topics with n-grams; Strategy 3 mines phrases first and imposes them on the bag-of-words model.)

Mining Phrases: Why Not Use Raw Frequency-Based Methods?
- Traditional data-driven approaches use frequent pattern mining: if AB is frequent, AB could well be a phrase
- But raw frequency does NOT reflect phrase quality
  - E.g., freq(vector machine) ≥ freq(support vector machine)
- Need to rectify the frequency based on segmentation results (a toy illustration follows after the next two slides)
- Phrasal segmentation tells which words should be treated as a whole phrase and which remain unigrams

SegPhrase: From Raw Corpus to Quality Phrases and Segmented Corpus
- Pipeline: Raw Corpus → Phrase Mining → Quality Phrases → Phrasal Segmentation → Segmented Corpus
- Example input documents:
  - Document 1: Citation recommendation is an interesting but challenging research problem in data mining area.
  - Document 2: In this study, we investigate the problem in the context of heterogeneous information networks using data mining technique.
  - Document 3: Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications.
- Build a candidate phrase set by frequent pattern mining
  - Mine frequent k-grams; k is typically small, e.g., 6 in our experiments
  - Popularity measured by raw frequency of words and phrases mined from the corpus

SegPhrase: The Overall Framework
- ClassPhrase: frequent pattern mining, feature extraction, classification
- SegPhrase: phrasal segmentation and phrase quality estimation
- SegPhrase+: one more round to enhance mined phrase quality
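To make the rectification idea concrete, here is a toy sketch contrasting raw counts with counts rectified by a phrasal segmentation. The segmentation is given by hand here (in SegPhrase it comes from the segmentation model); the point is only that "vector machine" stops being counted once "support vector machine" is recognized as one unit.

```python
from collections import Counter

# Hand-segmented documents (in SegPhrase the segmentation is learned);
# each inner list is one document segmented into phrases/unigrams.
segmented = [
    ["knowledge discovery", "using", "least squares", "support vector machine", "classifiers"],
    ["a", "support vector machine", "is", "trained"],
]

def raw_counts(docs, phrase):
    """Count every occurrence of `phrase` as a token subsequence (no segmentation)."""
    words = phrase.split()
    n = 0
    for doc in docs:
        tokens = " ".join(doc).split()
        n += sum(tokens[i:i + len(words)] == words for i in range(len(tokens)))
    return n

rectified = Counter(p for doc in segmented for p in doc)

for phrase in ["support vector machine", "vector machine", "support vector"]:
    print(f"{phrase:25s} raw={raw_counts(segmented, phrase)} rectified={rectified[phrase]}")
# "vector machine" and "support vector" have nonzero raw counts but
# rectified count 0: they never appear as standalone segments.
```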
What Kind of Phrases Are of "High Quality"?
- Judging the quality of phrases:
  - Popularity: "information retrieval" vs. "cross-language information retrieval"
  - Concordance: "powerful tea" vs. "strong tea"; "active learning" vs. "learning classification"
  - Informativeness: "this paper" (frequent but not discriminative, not informative)
  - Completeness: "vector machine" vs. "support vector machine"

Feature Extraction: Concordance
- Partition a phrase into two parts <u | v> to check whether the co-occurrence is significantly higher than pure random
  - E.g., <support vector | machine>, <this paper | demonstrates>
- Pointwise mutual information: PMI(u, v) = log[ p(uv) / (p(u) p(v)) ]
- Pointwise KL divergence: PKL(u, v) = p(uv) · log[ p(uv) / (p(u) p(v)) ]
  - The additional p(uv) factor multiplied into the pointwise mutual information leads to less bias toward rarely occurring phrases (a toy computation follows)
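A small sketch of these two concordance features, using made-up probabilities. In SegPhrase the probabilities come from rectified corpus counts, and the phrase is scored by its weakest binary partition; both details are simplified here.

```python
import math

# Toy phrase probabilities (invented); in SegPhrase these would be
# estimated from rectified frequencies in the corpus.
p = {"support": 0.010, "vector": 0.012, "machine": 0.011,
     "support vector": 0.009, "vector machine": 0.0001,
     "support vector machine": 0.008}

def pmi(u, v, uv):
    return math.log(p[uv] / (p[u] * p[v]))

def pkl(u, v, uv):
    # Pointwise KL: PMI weighted by p(uv), down-weighting rare pairs.
    return p[uv] * pmi(u, v, uv)

def concordance(phrase):
    """Score the phrase by its weakest binary partition."""
    words = phrase.split()
    splits = [(" ".join(words[:i]), " ".join(words[i:])) for i in range(1, len(words))]
    return min(pkl(u, v, phrase) for u, v in splits if u in p and v in p)

print(concordance("support vector machine"))       # high: parts strongly co-occur
print(pkl("vector", "machine", "vector machine"))  # low: weak standalone pair
```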
Feature Extraction: Informativeness
- Quality phrases typically start and end with a non-stopword
  - "machine learning is" vs. "machine learning"
- Use average IDF over the words in the phrase to measure the semantics
- The probability of a quality phrase appearing in quotes, brackets, or connected by dashes should usually be higher (punctuation information)
  - E.g., "state-of-the-art"
- We can also incorporate features using NLP techniques, such as POS tagging, chunking, and semantic parsing

Classifier
- Limited training labels: whether a phrase is a quality one or not
  - "support vector machine": 1; "the experiment shows": 0
  - For a ~1GB corpus, only 300 labels
- Random forest as the classifier
  - Predicted phrase quality scores lie in [0, 1]
  - Bootstrap many different datasets from the limited labels

SegPhrase: Why Do We Need Phrasal Segmentation in the Corpus?
- Phrasal segmentation can tell which phrase is more appropriate
  - Ex: A standard [feature vector] [machine learning] setup is used to describe…
  - Here "vector machine" is not counted towards the rectified frequency
- Rectified phrase frequency (expected influence)

SegPhrase: Segmentation of Phrases
- Partition a sequence of words by maximizing the likelihood, considering:
  - Phrase quality score: ClassPhrase assigns a quality score to each phrase
  - Probability in the corpus
  - Length penalty: when the penalty is below 1, it favors shorter phrases
- Filter out phrases with low rectified frequency: bad phrases are expected to occur rarely in the segmentation results
- (A minimal dynamic-programming sketch of the partition follows the next slide)

SegPhrase+: Enhancing Phrasal Segmentation
- SegPhrase+: one more round for enhanced phrasal segmentation
- Feedback: using rectified frequency, re-compute the features previously computed from raw frequency
- Process: classification → phrasal segmentation (SegPhrase) → classification → phrasal segmentation (SegPhrase+)
- Effects on computing quality scores, e.g., "np hard in the strong sense" vs. "np hard in the strong"; "data base management system"
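The likelihood-maximizing partition can be found with a simple dynamic program. Below is a minimal sketch assuming a precomputed score(phrase) that already combines quality score, corpus probability, and length penalty (all values here are hypothetical); SegPhrase's actual objective and estimation details differ.

```python
import math

# Hypothetical per-segment log-scores combining quality, probability and
# length penalty; higher is better. Unknown segments get a floor score.
score = {"support vector machine": -1.0, "support vector": -3.5,
         "vector machine": -6.0, "support": -4.0, "vector": -4.0,
         "machine": -4.0, "learning": -3.5, "machine learning": -1.5}

def best_segmentation(tokens, max_len=6):
    """Viterbi-style DP: best[i] = best log-score of tokens[:i]."""
    best = [0.0] + [-math.inf] * len(tokens)
    back = [0] * (len(tokens) + 1)
    for i in range(1, len(tokens) + 1):
        for j in range(max(0, i - max_len), i):
            seg = " ".join(tokens[j:i])
            s = best[j] + score.get(seg, -10.0)
            if s > best[i]:
                best[i], back[i] = s, j
    # Recover the segmentation by walking the back-pointers.
    segs, i = [], len(tokens)
    while i > 0:
        segs.append(" ".join(tokens[back[i]:i]))
        i = back[i]
    return segs[::-1]

print(best_segmentation("support vector machine learning".split()))
# ['support vector machine', 'learning']
```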
Performance Study: Methods to Be Compared
- NLP chunking-based methods: chunks as candidates, sorted by TF-IDF and C-value (K. Frantzi et al., 2000)
- Unsupervised raw-frequency-based methods: ConExtr (A. Parameswaran et al., VLDB 2010); ToPMine (A. El-Kishky et al., VLDB 2015)
- Supervised method: KEA, designed for single-document keyphrases (O. Medelyan & I. H. Witten, 2006)

Performance Study: Experimental Setting
- Datasets:
    Dataset   #docs    #words    #labels
    DBLP      2.77M    91.6M     300
    Yelp      4.75M    145.1M    300
- Popular Wiki phrases: based on internal links; ~7K high-quality phrases
- Pooling: sampled 500 × 7 Wiki-uncovered phrases, evaluated by 3 reviewers independently

Performance: Precision-Recall Curves on DBLP
- Compared with other baselines: TF-IDF, C-Value, ConExtr, KEA, ToPMine, SegPhrase+
- Compared with our 3 variations: TF-IDF, ClassPhrase, SegPhrase, SegPhrase+

Performance Study: Processing Efficiency
- SegPhrase+ is linear in the size of the corpus!

Experimental Results: Interesting Phrases Generated (from the titles and abstracts of SIGMOD)
    Rank   SegPhrase+               Chunking (TF-IDF & C-Value)
    1      data base                data base
    2      database system          database system
    3      relational database      query processing
    4      query optimization       query optimization
    5      query processing         relational database
    …      …                        …
    51     sql server               database technology
    52     relational data          database server
    53     data structure           large volume
    54     join query               performance study
    55     web service              …
    …      …                        …
    201    high dimensional data    efficient implementation
    202    location based service   sensor network
    203    xml schema               large collection
    204    two phase locking        important issue
    205    deep web                 frequent itemset
  (The lower-ranked blocks are phrases found only by SegPhrase+ or only by Chunking, respectively.)

Experimental Results: Interesting Phrases Generated (from the titles and abstracts of SIGKDD)
    Rank   SegPhrase+                Chunking (TF-IDF & C-Value)
    1      data mining               data mining
    2      data set                  association rule
    3      association rule          knowledge discovery
    4      knowledge discovery       frequent itemset
    5      time series               decision tree
    …      …                         …
    51     association rule mining   search space
    52     rule set                  domain knowledge
    53     concept drift             important problem
    54     knowledge acquisition     concurrency control
    55     gene expression data      conceptual graph
    …      …                         …
    201    web content               optimal solution
    202    frequent subgraph         semantic relationship
    203    intrusion detection       effective way
    204    categorical attribute     space complexity
    205    user preference           small set
  (Again, the lower-ranked blocks appear only in SegPhrase+ or only in Chunking.)

Experimental Results: Similarity Search
- Find high-quality similar phrases based on a user's phrase query
- In response to a user's phrase query, SegPhrase+ generates high-quality, semantically similar phrases
- In DBLP, query on "data mining" and "OLAP"; in Yelp, query on "blu-ray", "noodle", and "valet parking"

Recent Progress
- Distant training: no need for human labeling; train using general knowledge bases, e.g., Freebase, Wikipedia
- Quality estimation for unigrams: integration of phrases and unigrams in one uniform framework; demo based on DBLP abstracts
- Multiple languages: beyond English corpora; extensible to mining quality phrases in multiple languages
  - Recent progress: SegPhrase+ works on Chinese, Arabic and Spanish

High-Quality Phrases Generated from Chinese Wikipedia
    Rank   Phrase                        In English
    62     首席_执行官                    CEO
    63     中间_偏右                      Middle-right
    84     百度_百科                      Baidu Pedia
    85     热带_气旋                      Tropical cyclone
    86     中国科学院_院士                Fellow of Chinese Academy of Sciences
    1001   十大_中文_金曲                 Top-10 Chinese Songs
    1002   全球_资讯网                    Global News Website
    1003   天一阁_藏_明代_科举_录_选刊    A Chinese book name
    9934   国家_戏剧_院                   National Theater
    9935   谢谢_你                        Thank you

Top-Ranked Phrases Generated from English Gigaword
  Northrop Grumman, Ashfaq Kayani, Sania Mirza, Pius Xii, Shakhtar Donetsk, Kyaw Zaw Lwin, Ratko Mladic, Abdolmalek Rigi, Rubin Kazan, Rajon Rondo, Rubel Hossain, bluefin tuna, Psv Eindhoven, Nicklas Bendtner, Ryo Ishikawa, Desmond Tutu, Landon Donovan, Jannie du Plessis, Zinedine Zidane, Uttar Pradesh, Thor Hushovd, Andhra Pradesh, Jafar Panahi, Marouane Chamakh, Rahm Emanuel, Yakubu Aiyegbeni, Salva Kiir, Abdelhamid Abou Zeid, Blaise Compaore, Rickie Fowler, Andry Rajoelina, Merck Kgaa, Js Kabylie, Arjun Atwal, Andal Ampatuan Jnr, Reggio Calabria, Ghani Baradar, Mahela Jayawardene, Jemaah Islamiyah, quantitative easing, Nodar Kumaritashvili, Alviro Petersen, Rumiana Jeleva, Helio Castroneves, Koumei Oda, Porfirio Lobo, Anastasia Pavlyuchenkova, Thaksin Shinawatra, Evgeni Malkin, Salvatore Sirigu, Edoardo Molinari, Yoshito Sengoku, Otago Highlanders, Umar Akmal, Shuaibu Amodu, Nadia Petrova, Jerzy Buzek, Leonid Kuchma, Alona Bondarenko, Chosun Ilbo, Kei Nishikori, Nobunari Oda, Kumbh Mela, Santo Domingo, Nicolae Ceausescu, Yoann Gourcuff, Petr Cech, Mirlande Manigat, Sulieman Benn, Sekouba Konate

Summary and Future Work
- Strategy 1 (generate bag-of-words, then generate sequence of tokens): integrated complex model; phrase quality and topic inference rely on each other; slow and prone to overfitting
- Strategy 2 (post bag-of-words model inference, visualize topics with n-grams): phrase quality relies on topic labels for unigrams; can be fast; generally high-quality topics and phrases
- Strategy 3 (prior bag-of-words model inference, mine phrases and impose them on the bag-of-words model): topic inference relies on correct segmentation of documents, but is not sensitive to it; can be fast; generally high-quality topics and phrases

Summary and Future Work (Cont'd)
- SegPhrase+: a new phrase mining framework
  - Integrates phrase mining with phrasal segmentation
  - Requires only limited training or distant training
  - Generates high-quality phrases, close to human judgment
  - Linearly scalable in time and space
- Looking forward: high-quality, scalable phrase mining
  - Facilitate entity recognition and typing in large corpora (see the next part of this tutorial)
  - Combine with linguistically rich patterns
  - Transform massive unstructured data into semi-structured knowledge networks

References for Theme 1: Phrase Mining for Concept Extraction
- D. M. Blei and J. D. Lafferty. Visualizing Topics with Multi-Word Expressions. arXiv:0907.1013, 2009
- K. Church, W. Gale, P. Hanks, D. Hindle. Using Statistics in Lexical Analysis. In U. Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum, 1991
- M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, J. Han. Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents. SDM'14
- A. El-Kishky, Y. Song, C. Wang, C. R. Voss, J. Han. Scalable Topical Phrase Mining from Text Corpora. VLDB'15
- K. Frantzi, S. Ananiadou, H. Mima. Automatic Recognition of Multi-Word Terms: the C-value/NC-value Method. Int. Journal on Digital Libraries, 3(2), 2000
- R. V. Lindsey, W. P. Headden III, M. J. Stipicevic. A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes. EMNLP-CoNLL'12
- J. Liu, J. Shang, C. Wang, X. Ren, J. Han. Mining Quality Phrases from Massive Text Corpora. SIGMOD'15
- O. Medelyan and I. H. Witten. Thesaurus-Based Automatic Keyphrase Indexing. IJCDL'06
- Q. Mei, X. Shen, C. Zhai. Automatic Labeling of Multinomial Topic Models. KDD'07
- A. Parameswaran, H. Garcia-Molina, A. Rajaraman. Towards the Web of Concepts: Extracting Concepts from Large Datasets. VLDB'10
- X. Wang, A. McCallum, X. Wei. Topical N-Grams: Phrase and Topic Discovery, with an Application to Information Retrieval. ICDM'07

Theme II: Entity Extraction and Linking by Relational Graph Construction and Propagation

Session 1: Task and Traditional NLP Approach

Entity Extraction and Linking
- Identify token spans as entity mentions in documents and label their types
- Target types: FOOD, LOCATION, JOB_TITLE, EVENT, ORGANIZATION, …
- Example (plain text → text with typed entities):
  - The best BBQ I've tasted in Phoenix! I had the pulled pork sandwich with coleslaw and baked beans for lunch. … The owner is very nice.
  - The best BBQ:FOOD I've tasted in Phoenix:LOC! I had the [pulled pork sandwich]:FOOD with coleslaw:FOOD and [baked beans]:FOOD for lunch. … The owner:JOB_TITLE is very nice.
- Enabling structured analysis of an unstructured text corpus

Traditional NLP Approach: Data Annotation
- Extracted and linked entities can be used in a variety of ways: as primitives for information extraction and knowledge base population, to assist question answering, …
- Traditional named entity recognition systems are designed for major types (e.g., PER, LOC, ORG) and general domains (e.g., news)
  - Require additional steps to adapt to new domains/types
  - Expensive human annotation: 500 documents for entity extraction; 20,000 queries for entity linking
  - Unsatisfying agreement due to various granularity levels and scopes of types
- Entities obtained by entity linking techniques have limited coverage and freshness
  - >50% unlinkable entity mentions in a Web corpus [Lin et al., EMNLP'12]
  - >90% in our experiment corpora: tweets, Yelp reviews, …

Traditional NLP Approach: Feature Engineering
Typical entity extraction features (Li et al., 2012):
- N-gram: unigram, bigram and trigram token sequences in the context window
- Part-of-speech: POS tags of the context words
- Gazetteers: person names, organizations, countries and cities, titles, idioms, etc.
- Word clusters: word clusters / embeddings
- Case and shape: capitalization and morphology-analysis-based features
- Chunking: NP and VP chunking tags
- Global features: sentence-level and document-level structure/position features

Typical entity linking features (Ji et al., 2011), by mention/concept attribute:
- Name
  - Spelling match: exact string match, acronym match, alias match, string matching, …
  - KB link mining: name pairs mined from KB text, redirect and disambiguation pages
  - Name gazetteer: organization and geo-political entity abbreviation gazetteers
- Document surface
  - Lexical: words in KB facts, KB text, mention name, mention text; tf.idf of words and n-grams
  - Position: mention name appears early in the KB text
  - Genre: genre of the mention text (newswire, blog, …)
- Entity context
  - Local context: lexical and part-of-speech tags of context words
  - Type: mention concept type, subtype
  - Relation/Event: co-occurring concepts, attributes/relations/events with the mention
  - Coreference: coreference links between the source document and the KB text
- Profiling: slot fills of the mention, concept attributes stored in the KB infobox
- Concept: ontology extracted from KB text
- Topic: topics (identity and lexical similarity) for the mention text and KB text
- KB link mining: attributes extracted from hyperlink graphs of the KB text
- Popularity
  - Web: top KB text ranked by search engine and its length
  - Frequency: frequency in KB texts

Session 2: ClusType, Entity Recognition and Typing by Relation Phrase-Based Clustering

Entity Recognition and Typing: A Data Mining Solution
- Acquire labels for a small number of instances
- Construct a relational graph to connect labeled and unlabeled instances
  - Build edges from coarse-grained, data-driven statistics instead of fine-grained linguistic similarity
  - E.g., mention correlation, text co-occurrence, semantic relatedness based on knowledge graph embeddings, social networks
- Propagate labels across the graph
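The propagation step can be illustrated with a generic graph label propagation routine. The sketch below is a standard iterative scheme (propagate, then clamp the labeled seeds), not ClusType's exact objective; the small graph and seed labels are invented for illustration.

```python
import numpy as np

# Toy symmetric adjacency over 5 nodes; nodes 0 and 4 are labeled seeds.
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
Y0 = np.array([[1, 0],   # node 0: type A seed
               [0, 0],
               [0, 0],
               [0, 0],
               [0, 1]])  # node 4: type B seed
labeled = [0, 4]

# Row-normalized transition matrix.
P = W / W.sum(axis=1, keepdims=True)

Y = Y0.astype(float)
for _ in range(50):
    Y = P @ Y                      # propagate type scores to neighbors
    Y[labeled] = Y0[labeled]       # clamp the seed labels each round

print(Y.round(2))  # unlabeled nodes acquire soft type distributions
```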
Case Study 1: Entity Extraction
- Goal: recognize entity mentions of target types with minimal/no human supervision and with no requirement that entities be found in a KB
- Two kinds of efforts towards this goal:
  - Weak supervision: relies on manually selected seed entities in pattern-based bootstrapping methods or label propagation methods to identify more entities
    - Both assume seeds are unambiguous and sufficiently frequent, which requires careful seed selection by humans
  - Distant supervision: leverages entity information in KBs to reduce human supervision

Typical Workflow of Distant Supervision
1. Detect entity mentions from text
2. Map candidate mentions to KB entities of target types
3. Use confidently mapped {mention, type} pairs to infer the types of the remaining candidate mentions

Problem Definition
- Distantly-supervised entity recognition in a domain-specific corpus. Given:
  - a corpus D
  - a knowledge base (e.g., Freebase)
  - a set of target types T from the KB
- Detect candidate entity mentions from corpus D
- Categorize each candidate mention by target type or Not-Of-Interest (NOI), with distant supervision

Challenge I: Domain Restriction
- Most existing work assumes entity mentions are already extracted by existing entity detection tools, e.g., noun phrase chunkers
  - Usually trained on general-domain corpora like news articles (clean, grammatical)
  - Make use of various linguistic features (e.g., semantic parsing structures)
- These do not work well on specific, dynamic or emerging domains (e.g., tweets, Yelp reviews)
  - E.g., "in-and-out" from a Yelp review may not be properly detected

Challenge II: Name Ambiguity
- Multiple entities may share the same surface name
  - "While Griffin is not part of Washington's plan on Sunday's game, …" → sports team
  - "…has concern that Kabul is an ally of Washington." → U.S. government
  - "He has offices in Washington, Boston and San Francisco." → U.S. capital city
- Previous methods simply output a single type (or type distribution) per surface name, instead of an exact type for each entity mention
  - "While Griffin is not part of Washington's plan on Sunday's game, …"
  - "… news from Washington indicates that the congress is going to …"
  - "It is one of the best state parks in Washington."
  (These three "Washington" mentions should be typed sports team, government, and Washington State, respectively.)

Challenge III: Context Sparsity
- A variety of contextual clues are leveraged to find sources of shared semantics across different entities: keywords, Wiki concepts, linguistic patterns, textual relations, …
- There are often many ways to describe even the same relation between two entities:
    ID   Sentence                                                                    Freq
    1    The magnitude 9.0 quake caused widespread devastation in [Kesennuma city]   12
    2    … tsunami that ravaged [northeastern Japan] last Friday                     31
    3    The resulting tsunami devastated [Japan]'s northeast                        244
- Previous methods have difficulty handling entity mentions with sparse (infrequent) context

Our Solution
- A domain-agnostic phrase mining algorithm: extracts candidate entity mentions with minimal linguistic assumptions → addresses domain restriction
  - E.g., part-of-speech (POS) tagging is far cheaper than semantic parsing
- Do not simply merge entity mentions with identical surface names: model each mention based on its surface name and context, in a scalable way → addresses name ambiguity
- Mine relation phrases co-occurring with entity mentions and infer synonymous relation phrases → addresses context sparsity
  - Helps form connecting bridges among entities that do not share identical context but do share synonymous relation phrases

A Relation Phrase-Based Entity Recognition Framework
- POS-constrained phrase segmentation for mining candidate entity mentions and relation phrases simultaneously
- Construct a heterogeneous graph to represent the available information in a unified form
  - Entity mentions are kept as individual objects to be disambiguated
  - Linked to entity surface names and relation phrases
- With the constructed graph, formulate a graph-based semi-supervised joint learning of two tasks:
  - Type propagation on the heterogeneous graph
  - Multi-view relation phrase clustering
- Derived entity argument types serve as good features for clustering relation phrases; type information propagates among entities bridged via synonymous relation phrases
- The two tasks mutually enhance each other, leading to quality recognition of unlinkable entity mentions

Framework Overview
1. Perform phrase mining on a POS-tagged corpus to extract candidate entity mentions and relation phrases
2. Construct a heterogeneous graph to encode our insights on modeling the type of each entity mention
3. Collect seed entity mentions as labels by linking extracted mentions to the KB
4. Estimate type indicators for unlinkable candidate mentions with the proposed type propagation, integrated with relation phrase clustering on the constructed graph
Candidate Generation
- An efficient phrase mining algorithm incorporating both corpus-level statistics and syntactic constraints
  - Global significance score: filter low-quality candidates; generic POS tag patterns: remove phrases with improper syntactic structure
  - Extending ToPMine, the algorithm partitions the corpus into segments that meet both the significance threshold and the POS patterns → candidate entity mentions and relation phrases
- Relation phrase: a phrase that denotes a unary or binary relation in a sentence
- Algorithm workflow:
  1. Mine frequent contiguous patterns
  2. Perform greedy agglomerative merging while enforcing the syntactic constraints (entity mentions: consecutive nouns; relation phrases: POS patterns shown in the table)
  3. Terminate when the next highest-score merge does not meet a pre-defined significance threshold
- Example output of candidate generation on NYT news articles; entity detection performance compared against an NP chunker
- Recall is most critical for this step, since later stages cannot recover the misses (i.e., false negatives)

Construction of Heterogeneous Graphs
- Three types of objects are extracted from the corpus: candidate entity mentions, entity surface names, and relation phrases
- We construct a heterogeneous graph to enforce several hypotheses for modeling the type of each entity mention (introduced in the following slides)
- Basic idea: the more likely two objects are to share the same label, the larger the weight on their connecting edge
- Three types of links:
  1. Mention-name links: (many-to-one) mappings between entity mentions and surface names
  2. Name-relation phrase links: corpus-level co-occurrence between surface names and relation phrases
  3. Mention correlation links: distributional similarity between entity mentions

Entity Mention-Surface Name Subgraph
- Directly modeling the type indicator of each entity mention in label propagation leads to an intractably large parameter space
- Both the entity name and the surrounding relation phrases provide strong cues for the type of a candidate entity mention
- Model the type of each entity mention by (1) the type indicator of its surface name and (2) the type signatures of its surrounding relation phrases (more details in the following slides)
  - Ex: "…has concerns whether Kabul is an ally of Washington" → Washington: GOVERNMENT
- With M candidate mentions and n surface names, use a bi-adjacency matrix to represent the mapping
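A bi-adjacency matrix for the mention-to-name mapping is straightforward to build as a sparse matrix. The sketch below uses scipy with a handful of invented mentions; ClusType's actual matrices also carry weights for the name-relation-phrase and mention-correlation links.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Invented candidate mentions (surface@document) and their surface names.
mentions = ["Washington@doc1", "Washington@doc2", "Kabul@doc2", "Griffin@doc3"]
names = ["washington", "kabul", "griffin"]
name_of = {"Washington@doc1": "washington", "Washington@doc2": "washington",
           "Kabul@doc2": "kabul", "Griffin@doc3": "griffin"}

name_idx = {n: j for j, n in enumerate(names)}
rows = list(range(len(mentions)))
cols = [name_idx[name_of[m]] for m in mentions]

# Pi[i, j] = 1 iff mention i has surface name j (many-to-one mapping).
Pi = csr_matrix((np.ones(len(mentions)), (rows, cols)),
                shape=(len(mentions), len(names)))
print(Pi.toarray())
# Pi @ C then propagates a per-name type indicator matrix C down to mentions.
```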
Entity Name-Relation Phrase Subgraph
- Aggregate co-occurrences between entity surface names and relation phrases across the corpus
  - Weight the importance of different relation phrases for surface names; use the connecting edges as bridges to propagate type information
- Left/right entity argument of a relation phrase: for each mention, assign it as the left (resp. right) argument of the closest relation phrase to its right (resp. left) in the sentence
- Type signature of a relation phrase: two type indicators, one each for its left and right arguments
- With l different relation phrases, the mapping between mentions and relation phrases is represented by two bi-adjacency matrices for the subgraph

Mention Correlation Subgraph
- An entity mention may have an ambiguous name and ambiguous relation phrases
  - E.g., "White House" and "felt" in the first sentence of the figure
- Other co-occurring mentions may provide good hints to the type of an entity mention
  - E.g., "birth certificate" and "rose garden" in the figure
- Propagate type information between the candidate mentions of each surface name
- Construct a KNN graph based on the feature vectors of surface names of co-occurring entity mentions

Relation Phrase Clustering
- Observation: many relation phrases have very few occurrences in the corpus
  - ~37% of relation phrases have <3 unique entity surface names in their right or left arguments
  - It is hard to model their type signatures from aggregated co-occurrences with entity surface names alone (i.e., Hypothesis 1)
- Softly cluster synonymous relation phrases: the type signatures of frequent relation phrases help infer the type signatures of infrequent (sparse) ones that have similar cluster memberships
- Existing work clusters synonymous relation phrases using strings, context words, and entity arguments
  - String similarity and distributional similarity may be insufficient to resolve two relation phrases; type information is particularly helpful in such cases
- We leverage the type signature of a relation phrase and propose a general relation phrase clustering method incorporating different features (type signatures, string features, context features)
  - Further integrated with the graph-based type propagation in a mutually enhancing framework

Type Inference: A Joint Optimization Problem
- Mention modeling and mention correlation (Hypothesis 2)
- Type propagation between entity surface names and relation phrases (Hypothesis 1)
- Multi-view relation phrase clustering (Hypotheses 3 & 4)

The ClusType Algorithm
- Alternating minimization based on a block coordinate descent algorithm:
  - Update type indicators and type signatures
  - For each view, perform single-view NMF until convergence
  - Update the consensus matrix and the relative weights of the different views
  - Repeat until the objective converges
- Complexity is linear in #entity mentions, #relation phrases, #clusters, #clustering features and #target types
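As a reminder of the single-view NMF building block used inside the alternating scheme, here is a bare-bones NMF with multiplicative updates on a random nonnegative matrix. ClusType's actual subproblem has graph-regularization and consensus terms that this sketch leaves out.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 12))          # nonnegative data matrix (toy)
k = 3                             # number of clusters / latent factors
W = rng.random((20, k)) + 0.1
H = rng.random((k, 12)) + 0.1

for _ in range(200):
    # Lee & Seung multiplicative updates for min ||X - WH||_F^2, W, H >= 0
    H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
    W *= (X @ H.T) / (W @ H @ H.T + 1e-9)

print(np.linalg.norm(X - W @ H))  # reconstruction error after the updates
```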
Experiment Setting
- Datasets: 2013 New York Times news (~110k docs) [EVENT, PER, LOC, ORG]; Yelp reviews (~230k) [FOOD, JOB, …]; 2011 tweets (~300k) [EVENT, PRODUCT, PER, LOC, …]
- Seed mention sets: <7% of extracted mentions are mapped to Freebase entities
- Evaluation sets: manually annotated mentions of target types for subsets of the corpora
- Evaluation metrics: follow named entity recognition evaluation (precision, recall, F1)
- Compared methods:
  - Pattern: Stanford pattern-based learning
  - SemTagger: bootstrapping method that trains contextual classifiers based on seed mentions
  - FIGER: distantly-supervised sequence labeling method trained on a Wiki corpus
  - NNPLB: label propagation using ReVerb assertions and seed mentions
  - APOLLO: mention-level label propagation using Wiki concepts and KB entities
  - ClusType-NoWm: ignores mention correlation; ClusType-NoClus: conducts only type propagation; ClusType-TwoStep: first performs hard clustering, then type propagation

Comparing ClusType with Other Methods and Its Variants
- vs. FIGER: demonstrates the effectiveness of our candidate generation and the proposed hypotheses on type propagation
  - 46.08% and 48.94% improvement in F1 score over the best baseline on the tweet and Yelp datasets, respectively
- vs. NNPLB and APOLLO: ClusType not only utilizes semantically rich relation phrases as type cues, but also clusters synonymous relation phrases to tackle context sparsity
- vs. variants: (i) modeling mention correlation helps name disambiguation; (ii) integrating clustering in a mutually enhancing way pays off

Comparing on Different Entity Types
- Larger gains on organization and person (more entities with ambiguous surface names): modeling types at the entity mention level is critical for name disambiguation
- Superior performance on product and food comes mainly from the domain independence of our method
  - Both NNPLB and SemTagger require sophisticated linguistic feature generation that is hard to adapt to new types

Comparing with a Trained NER System
- Compared with Stanford NER, trained on general-domain corpora including the ACE and MUC corpora, on three types: PER, LOC, ORG
- ClusType and its variants outperform Stanford NER on both a dynamic corpus (NYT) and a domain-specific corpus (Yelp)
- ClusType has lower precision but higher recall and F1 on tweets
  - The superior recall of ClusType mainly comes from the domain-independent candidate generation

Example Output and Relation Phrase Clusters
- Extracts more mentions and predicts types with higher accuracy
- Not only synonymous relation phrases, but also sparse and frequent relation phrases can be clustered together
  - Boosts sparse relation phrases with the type information of frequent relation phrases

Testing on Context Sparsity and Surface Name Popularity
- Surface name popularity: Group A = high-frequency surface names; Group B = infrequent surface names
  - ClusType outperforms its variants on Group B: it handles mentions with insufficient corpus statistics well
- Context sparsity: Group A = frequent relation phrases; Group B = sparse relation phrases
  - ClusType obtains superior performance over its variants on Group B: clustering relation phrases is critical for sparse relation phrases

Conclusions and Future Work
- Studied distantly-supervised entity recognition for domain-specific corpora and proposed a novel relation phrase-based framework
  - A data-driven, domain-agnostic phrase mining algorithm for generating candidate entity mentions and relation phrases
  - Integrates relation phrase clustering with type propagation on heterogeneous graphs, solved as a joint optimization problem
- Ongoing: extend to role discovery for scientific concepts → paper profiling (research/demo)
- Study of relation phrase clustering, such as joint entity/relation clustering → synonymous relation phrase canonicalization
- Study of joint entity and relation phrase extraction with phrase mining

Session 3: Entity Linking Based on Semi-Supervised Graph Regularization

Entity Linking: A Relational Graph Approach
- Example tweet: "Stay up Hawk Fans. We are going through a slump, but we have to stay positive. Go Hawks!"
Relevant Mention Detection: Meta Path
- A meta-path is a path defined over a network, composed of a sequence of relations between different object types (Sun et al., 2011)
- Each meta-path represents a semantic relation (more in the next theme)
- Meta-paths between two mentions in the Twitter network schema (M: mention, T: tweet, U: user, H: hashtag):
  - M-T-M
  - M-T-U-T-M
  - M-T-H-T-M
  - M-T-U-T-M-T-H-T-M
  - M-T-H-T-M-T-U-T-M

Relational Graph Construction
- Local compatibility
  - Mention features (e.g., idf, keyphraseness)
  - Concept features (e.g., number of incoming/outgoing links)
  - Mention + concept features (e.g., prior popularity, tf)
  - Context features (e.g., capitalization, tf-idf)
- Coreference
- Semantic relatedness (SR)
  - At least one meta-path exists between two similar mentions
  - SR between two mentions: meta-paths; SR between two concepts: link structure in Wikipedia (Milne and Witten, 2008)
- Linear combination of these three graphs

A Deep Semantic Relatedness Model (DSRM)
- Built on semantic knowledge graphs; e.g., Miami Heat: coach Erik Spoelstra, located in Miami, type Professional Sports Team, member of the National Basketball Association, founded 1988, roster including Dwyane Wade
- Architecture: input features over KB descriptions, connected concepts, relations and concept types are hashed into a ~105k-dimensional word-hashing layer (50k + 50k + 3.2k + 1.6k) and passed through multi-layer nonlinear projections into a 300-dimensional semantic layer; semantic relatedness SR(ci, cj) is the cosine similarity of the two semantic vectors

Data and Scoring Metric
- Data:
  - A Wikipedia dump from May 3, 2013
  - A portion of Freebase limited to the Wikipedia concepts
  - Wikification: a public data set of 502 messages from 28 users (Meij et al., 2012)
  - Semantic relatedness: a benchmark test set of 3,314 concepts as testing queries (Ceccarelli et al., 2013)
- Scoring metrics: standard precision, recall and F1 for wikification; nDCG for semantic relatedness

Overall Performance
- TagMe: unsupervised model; Meij: supervised model using 100% of the labeled data; SSRegu: uses 50% of the labeled data
- [Chart: F1 of TagMe, Meij, SSRegu + M&W, SSRegu + DSRM; SSRegu + DSRM performs best]
- 7.5% absolute F1 gain over the state-of-the-art supervised models

Semantic Relatedness: Examples
Semantic relatedness scores between a sample of concepts and the concept "National Basketball Association" in the sports domain:
    Concept              M&W     DSRM
    New York City        0.92    0.22
    New York Knicks      0.78    0.79
    Washington, D.C.     0.80    0.30
    Washington Wizards   0.60    0.85
    Atlanta              0.71    0.39
    Atlanta Hawks        0.53    0.83
    Houston              0.55    0.37
    Houston Rockets      0.49    0.80
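The final relatedness computation is just cosine similarity between the semantic-layer vectors, as in the sketch below (with random vectors standing in for DSRM outputs).

```python
import numpy as np

def semantic_relatedness(y_i, y_j):
    """Cosine similarity between two semantic-layer vectors."""
    return float(y_i @ y_j / (np.linalg.norm(y_i) * np.linalg.norm(y_j)))

rng = np.random.default_rng(1)
y_nba, y_knicks = rng.standard_normal(300), rng.standard_normal(300)
print(semantic_relatedness(y_nba, y_knicks))  # SR(c_i, c_j) in [-1, 1]
```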
Summary on Relational Graph-Based Entity Linking
- Compared to traditional label propagation-based methods:
  - Data-driven methods to construct relational graphs
  - Integrate multi-view clustering with graph-based label propagation
- Compared to traditional distantly-supervised methods:
  - Domain-independent entity candidate generation: jointly extract entity mentions and their relation phrases in an unsupervised way
  - Resolve context (relation phrase) sparsity by integrating type propagation with relation clustering
  - Fully exploit entries and structures in knowledge bases
- Potential applications to other NLP tasks:
  - Where data annotation is costly
  - Where a relational graph over labeled seeds and unlabeled instances can be constructed with data-driven metrics
- Toward fully "liberal" IE: discover the schema and extract/link entities simultaneously

References for Theme 2: Entity Recognition and Typing
- X. Ren, A. El-Kishky, C. Wang, F. Tao, C. R. Voss, H. Ji, J. Han. ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering. KDD'15
- H. Huang, Y. Cao, X. Huang, H. Ji, C. Lin. Collective Tweet Wikification Based on Semi-supervised Graph Regularization. ACL'14
- T. Lin, O. Etzioni, et al. No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities. EMNLP'12
- N. Nakashole, T. Tylenda, G. Weikum. Fine-Grained Semantic Typing of Emerging Entities. ACL'13
- R. Huang and E. Riloff. Inducing Domain-Specific Semantic Class Taggers from (Almost) Nothing. ACL'10
- X. Ling and D. S. Weld. Fine-Grained Entity Recognition. AAAI'12
- W. Shen, J. Wang, P. Luo, M. Wang. A Graph-Based Approach for Ontology Population with Named Entities. CIKM'12
- S. Gupta and C. D. Manning. Improved Pattern Learning for Bootstrapped Entity Extraction. CoNLL'14
- P. P. Talukdar and F. Pereira. Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition. ACL'10
- Z. Kozareva and E. Hovy. Not All Seeds Are Equal: Measuring the Quality of Text Mining Seeds. NAACL'10
- L. Galárraga, G. Heitz, K. Murphy, F. M. Suchanek. Canonicalizing Open Knowledge Bases. CIKM'14

Theme III: Search and Mining Structured Graphs and Heterogeneous Networks for NLP

Outline
- Directly converting a large amount of data into knowledge is too ambitious and unrealistic
- Once we construct graphs or heterogeneous networks, we can perform many powerful search and mining tasks
- Many NLP problems are about graphs and networks: what is the DM point of view?
- Search: graph index creation and approximate structural search
  - NLP application: schemaless queries
- Classification: graph pattern mining and pattern-based classification
  - NLP application: distinguishing authors by their writing styles
- Meta-path-based similarity search in heterogeneous networks
  - NLP application: entity morph decoding
- Meta-path-based mining in heterogeneous networks: prediction and recommendation
  - NLP application: knowledge base completion; multi-hop question answering; causal event prediction

Section 1: Search Graphs and Networks

Graph Indexing
- Graph query: find all graphs in a graph DB that contain a given query graph
- An index should be a powerful tool here, but a path index may not work well
- Solution: index directly on substructures (i.e., graphs)
  - Example: for a query graph Q with path indices C, C-C, C-C-C, C-C-C-C, the path index cannot prune data graphs (a) and (b), even though only graph (c) contains Q

Many NLP Problems Can Be Converted into Graph Search
- Entity linking (Pan et al., 2015): a query graph constructed from natural-language documents using Abstract Meaning Representation is matched against a knowledge graph
- Bursty information network decipherment (Tao et al., 2015 submission): a query graph constructed from Chinese is matched against an English knowledge graph

gIndex: Indexing Frequent and Discriminative Substructures
- Why index frequent substructures?
  - There are too many substructures to index: all structures (>10^6) → frequent substructures (~10^5) → discriminative substructures (~10^3)
  - Use a size-increasing support threshold: large structures will likely be indexed well by their frequent substructures
- Why discriminative substructures? They reduce the index size by an order of magnitude
- Selection: given a set of selected structures f1, f2, …, fn and a new structure x with each fi ⊆ x, the extra indexing power is measured by Pr(x | f1, f2, …, fn); when Pr(x | f1, …, fn) is small enough, x is a discriminative structure and should be included in the index
- Experiments show the gIndex is small, effective and stable

Structure Similarity Search Assisted with Graph Indices
- The necessity is clearly shown in searching for similar chemical compounds, e.g., data graphs (a) caffeine, (b) diurobromine, (c) viagra matched against a query graph and its feature set
- How to conduct approximate search? Build indices covering all similar subgraphs? No!
- Idea: (1) keep the index structure; (2) select features in the query space
- Query relaxation measure: features are more important than strict substructure matching; only a set of smaller features needs to be indexed

Schemaless Queries over Knowledge Graphs
- Knowledge graphs are ubiquitous; queries on knowledge graphs are mostly schemaless natural-language queries
- Challenge: ambiguities in query resolution
  - Ex: a "Bison" node (class: Mammal; phylum: Chordate; order: even-toed ungulate; comment: "Bison are large even-toed ungulates within the subfamily …") connected via follow/join/like edges to Yellowstone NP, University, Business, Video, music, Mammal, Football League, Photo, Country (courtesy of Shengqi Yang, UCSB)
  - Ex: When was UIUC founded? And how do we answer it?

Major Challenges in Schemaless Query Answering
- Extracting entities and relationships from natural-language queries
- Mismatch between the knowledge graph and queries: figure out the best transformation
    Knowledge Graph                                      Query
    "the University of Illinois at Urbana-Champaign"     "UIUC"
    "neoplasm"                                           "tumor"
    "Doctor"                                             "Dr."
    "Barack Obama"                                       "Obama"
    "Jeffrey Jacob Abrams"                               "J. J. Abrams"
    "teacher"                                            "educator"
    "1980"                                               "~30"
    "3 mi"                                               "4.8 km"
    "Hinton" - "DNNresearch" - "Google"                  "Hinton" - "Google"

Processing of Schemaless Queries by Pattern Matching
- Users freely post queries in natural language, without knowledge of the data graph
- The schemaless query (SLQ) system finds results through a set of transformations
- Example: the query [Prof., ~70 yrs] - [UT] - [Google] - [DNNresearch] matches Geoffrey Hinton (Professor, 1947) - University of Toronto - Google - DNNresearch
  - Acronym transformation: 'UT' → 'University of Toronto'
  - Abbreviation transformation: 'Prof.' → 'Professor'
  - Numeric transformation: '~70' → '1947'
  - Structural transformation: an edge → a path

Automatic Generation of Training Data
1. Sampling: a set of subgraphs is randomly extracted from the data graph
2. Query generation: queries are generated by randomly adding transformations to the extracted subgraphs
3. Searching: the generated queries are searched against the data graph
4. Labeling: results are labeled based on the original subgraphs
5. Training: the queries plus labeled results are used to estimate the weights of different transformations in the ranking model P(φ(Q) | Q)

Online Query Processing Using Knowledge Graphs
- Finding the results for a query = finding a configuration of latent random variables, i.e., inference in conditional random fields (CRFs)
- Top-1 result: compute the most likely assignment; approximate inference via loopy belief propagation
- Top-k result generation: finding the M most probable configurations using loopy belief propagation [Yanover et al., NIPS'04]
- Experiments: on queries generated from YAGO2, SLQ [Yang et al., VLDB'14] shows high performance

Section 2: Graph Pattern Mining and Pattern-Based Classification

Application to NLP: Authorship Classification
- Writing style is one of the best indicators of original authorship
- Substructures: k-embedded-edge subtree patterns hold more information than basic syntactic features such as function words, POS (part-of-speech) tags, and rewrite rules
- Ex: for a k-embedded-edge subtree t mined from NYT journalists Jack Healy and Eric Dash, on average 21.2% of Jack's sentences contained t while only 7.2% of Eric's did

Structural Patterns Tell More about Authorship
- [Figure: binned information gain score distribution of various feature sets: FW (function words), POS (POS tags), BPOS (bigram POS tags), RR (rewrite rules), k-ee subtrees]
- [Tables: number of features for the FW, POS, RR and k-ee feature sets; accuracy comparison across numbers of authors and various datasets]

Section 3: Similarity Computation in Heterogeneous Networks

Structures Facilitate Mining Heterogeneous Networks
- Network construction generates structured networks from unstructured text data
  - Each node is an entity; each link is a relationship between entities
  - Each node/link may have attributes, labels, and weights
- Heterogeneous, multi-typed networks, e.g., a medical network of patients, doctors, diseases, contacts, treatments
- Examples: the DBLP bibliographic network (venue, paper, author), the IMDB movie network (studio, movie, actor, director), the Facebook network
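A typed network like the ones above can be represented directly with networkx by tagging nodes and edges with their types; the mini DBLP-style example below is invented for illustration.

```python
import networkx as nx

G = nx.MultiDiGraph()

# Typed nodes: authors, papers, venues (toy DBLP-style examples).
G.add_node("C. Faloutsos", ntype="author")
G.add_node("Graph Mining Paper", ntype="paper")
G.add_node("KDD", ntype="venue")

# Typed, weighted links between entities.
G.add_edge("C. Faloutsos", "Graph Mining Paper", etype="writes", weight=1.0)
G.add_edge("Graph Mining Paper", "KDD", etype="published_in", weight=1.0)

# A meta-path like Author-Paper-Venue is then a walk over node types.
for n, d in G.nodes(data=True):
    print(n, d["ntype"])
```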
Why Mine Typed Heterogeneous Networks?
- A homogeneous network can be derived from its "parent" heterogeneous network
  - Ex: coauthor networks derived from the original author-paper-conference network
- Heterogeneous networks carry richer information than the projected homogeneous networks
  - Typed nodes and links imply more structure, leading to richer discovery
- Ex: DBLP, a computer science bibliographic database (network); knowledge hidden in the DBLP network, and the mining function that surfaces it:
    Who are the leading researchers on Web search?                              → Ranking
    Who are the peer researchers of Jure Leskovec?                              → Similarity search
    Whom will Christos Faloutsos collaborate with?                              → Relationship prediction
    Which relationships most influence an author's choice of topics?            → Relation strength learning
    How did the field of data mining emerge, and how is it evolving?            → Network evolution
    Which authors are rather different from their peers in IR?                  → Outlier/anomaly detection

Meta-Path and Similarity Search in Heterogeneous Networks
- Similarity measure/search is the basis for cluster analysis
  - Who is most similar to Christos Faloutsos in the DBLP network?
- Meta-path: a meta-level description of a path between two objects
  - A path on the network schema, denoting an existing or concatenated relation between two object types
- Different meta-paths convey different semantics:
  - Author-Paper-Author (A-P-A): co-authorship → Christos's students or close collaborators
  - Author-Paper-Venue-Paper-Author (A-P-C-P-A): working in similar fields with similar reputation

Existing Popular Similarity Measures for Networks
- Random walk (RW): the probability of a random walk starting at x and ending at y, following meta-path P:
    s(x, y) = Σ_{p ∈ P} prob(p)
  - Used in Personalized PageRank (P-PageRank) (Jeh and Widom 2003)
  - Favors highly visible objects (i.e., objects with large degrees)
- Pairwise random walk (PRW): the probability of a pairwise random walk starting at (x, y) and ending at a common object z, following meta-paths (P1, P2⁻¹):
    s(x, y) = Σ_{(p1, p2) ∈ (P1, P2)} prob(p1) · prob(p2)
  - Used in SimRank (Jeh and Widom 2002)
  - Favors pure objects (i.e., objects with highly skewed in-link or out-link distributions)
- Note: P-PageRank and SimRank do not distinguish object types and relationship types

Which Similarity Measure Is Better for Finding Peers?
- PathSim favors peers: objects with strong connectivity and similar visibility under a given meta-path
- Who is more similar to Mike (meta-path APCPA)? Mike publishes a similar number of papers as Bob and Mary, yet other measures find Mike closer to Jim (a small computation reproducing the PathSim row follows the tables)
- A toy example of publication counts:
    Author\Conf.   SIGMOD   VLDB   ICDM   KDD
    Mike           2        1      0      0
    Jim            50       20     0      0
    Mary           2        0      1      0
    Bob            2        1      0      0
    Ann            0        0      1      1
- Comparison of multiple measures (similarity to Mike):
    Measure\Author    Jim      Mary     Bob      Ann
    P-PageRank        0.376    0.013    0.016    0.005
    SimRank           0.716    0.572    0.713    0.184
    Random walk       0.8983   0.0238   0.0390   0
    Pairwise R.W.     0.5714   0.4440   0.5556   0
    PathSim (APCPA)   0.083    0.8      1        0
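PathSim for a symmetric meta-path P is defined (Sun et al., 2011) as s(x, y) = 2·|paths x→y| / (|paths x→x| + |paths y→y|). The sketch below computes it on the toy APCPA matrix above and reproduces the PathSim row of the table.

```python
import numpy as np

# Author-conference counts (A-P-C path instances) from the toy table.
authors = ["Mike", "Jim", "Mary", "Bob", "Ann"]
M = np.array([[ 2,  1, 0, 0],
              [50, 20, 0, 0],
              [ 2,  0, 1, 0],
              [ 2,  1, 0, 0],
              [ 0,  0, 1, 1]])

C = M @ M.T  # commuting matrix: APCPA path-instance counts between authors

def pathsim(i, j):
    return 2 * C[i, j] / (C[i, i] + C[j, j])

for j, name in enumerate(authors[1:], start=1):
    print(f"PathSim(Mike, {name}) = {pathsim(0, j):.3f}")
# Jim 0.083, Mary 0.800, Bob 1.000, Ann 0.000, matching the table above.
```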
Example with DBLP: Find Academic Peers by PathSim
- Meta-path: Author-Paper-Venue-Paper-Author
- Peers of Anhai Doan (CS, Wisconsin; database area; PhD 2002): Amol Deshpande (CS, Maryland; database area; PhD 2004), Jignesh Patel (CS, Wisconsin; database area; PhD 1998), Jun Yang (CS, Duke; database area; PhD 2001)

Application to NLP: Entity Morph Resolution
- Morphs are coded aliases used for entities, often to evade censorship. Examples:
  - "Conquer West King" (平西王) → Bo Xilai (薄熙来)
  - "instant noodles" (方便面) → Zhou Yongkang (周永康)
  - Other slide examples pair morphs such as "Tender Beef Pentagon" (嫩牛五方), "Yang Mi" (杨幂), "The Hutt", and Chris Christie with their targets

Target Candidate Identification
- Considering all entities would be too overwhelming; first narrow the candidate set

Further Narrowing by Social Correlation
- A morph and its target are likely to be posted by two users with strong social correlation on Weibo and Twitter, respectively (test data: average social correlation = 0.923)
- Explicit social correlation between users from re-tweet, mentioning, reply and follower networks (Wen and Lin, 2010)
  - Compute the degree of separation in user interactions and the amount of interaction
- Infer implicit social correlation between users by topic modeling and opinion mining
  - Users who share similar interests are likely to post similar information and opinions
  - Measure the content similarity of the messages posted by two users
- Temporal + social correlation: narrows the number of target candidates to 1% with 100% accuracy

Cross-Genre Comparable Corpora
- One side of the comparable corpora uses morphs:
  - "Conquer West King from Chongqing fell from power, still need to sing red songs?"
  - "There is no difference between that guy's plagiarism and Buhou's gang crackdown."
  - "Remember that Buhou said that his family was not rich at the press conference a few days before he fell from power. His son Bo Guagua is supported by his scholarship."
- The other side uses the real names:
  - "Bo Xilai: ten thousand letters of accusation have been received during the Chongqing gang crackdown."
  - "The webpage of 'Tianze Economic Study Institute' owned by the liberal party has been closed. This is the first affected website of the liberal party after Bo Xilai fell from power."
  - "Bo Xilai gave an explanation about the source of his son Bo Guagua's tuition."
  - "Bo Xilai led Chongqing city leaders and 40 district and county party and government leaders to sing red songs."
- Genres: Twitter and Chinese news (uncensored); Weibo (censored)
Cross-genre Comparable Corpora
Weibo (censored) vs. Twitter and Chinese news (uncensored)
Example posts:
"Conquer West King from Chongqing fell from power; still need to sing red songs?"
"There is no difference between that guy's plagiarism and Buhou's gang crackdown."
"Remember that Buhou said that his family was not rich at the press conference a few days before he fell from power. His son Bo Guagua is supported by his scholarship."
"Bo Xilai: ten thousand letters of accusation have been received during the Chongqing gang crackdown."
"The webpage of the 'Tianze Economic Study Institute' owned by the liberal party has been closed. This is the first affected website of the liberal party after Bo Xilai fell from power."
"Bo Xilai gave an explanation about the source of his son Bo Guagua's tuition."
"Bo Xilai led Chongqing city leaders and 40 district and county party and government leaders to sing red songs."
123

Constructing Heterogeneous Information Networks
[Figure: an example network around Bo Xilai, with nodes such as Bo Xilai (PER), Bo Guagua (PER), Wen Jiabao (PER), Chongqing (GPE), China (GPE), CCP (ORG), and the morph "Conquer West King", connected by edges such as Children, Affiliation, Member, Top-employee, Located, and Justice]
Each node is an entity
An edge is a semantic relation, event, sentiment, semantic role, dependency relation, or co-occurrence, associated with confidence values
124

Meta-paths
Network schema node types: M (morphs), E (entities), EV (events), NP (non-entity noun phrases)
Meta-path examples: M-E-E, M-EV-E, M-NP-E
125

Data and Scoring Metric
Data:
Time frame: 05/01/2012-06/30/2012
1,555K Chinese messages from Weibo
66K formal web documents from embedded URLs
25K Chinese messages from English Twitter for sensitive morphs
Test on 107 morph entities in Weibo, 23 of them sensitive
Scoring metric: Acc@k = C_k / T, where C_k is the number of correctly resolved morphs within the top k positions and T is the total number of morphs in the ground truth
126

Overall Performance
[Figure: Acc@k for k = 1, 5, 10, 20 on the homogeneous network vs. the heterogeneous network; the heterogeneous network scores consistently higher, with reported values of 23.4%, 37.9%, 41.6%, 47.7%, 51.9%, 59.4%, 65.9%, and 70.1% across the two settings]
127

Section 4: Mining Heterogeneous Information Networks: Prediction and Recommendation

Relationship Prediction vs. Link Prediction
Link prediction in homogeneous networks [Liben-Nowell and Kleinberg, 2003; Hasan et al., 2006], e.g., friendship prediction
Relationship prediction in heterogeneous networks:
Different types of relationships need different prediction models
Different connection paths need to be treated separately
Meta-path-based approach to define topological features
129

PathPredict: Meta-Path Based Relationship Prediction
Meta-path-guided prediction of links and relationships
Philosophy: meta-path relationships among similarly typed links share similar semantics and are comparable and inferable
Co-author prediction (A-P-A): use topological features encoded by meta-paths between authors of length up to 4, each with its own semantic meaning, e.g., citation relations between authors (A-P→P-A)
130
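The topological features PathPredict feeds to its model are counts of meta-path instances, which reduce to products of incidence matrices. A minimal sketch on a hypothetical three-author network (all matrices invented for illustration):

```python
import numpy as np

# Toy bipartite incidence matrices of a small bibliographic network.
A_P = np.array([[1, 1, 0],    # author 0 wrote papers 0, 1
                [0, 1, 1],    # author 1 wrote papers 1, 2
                [0, 0, 1]])   # author 2 wrote paper 2
P_V = np.array([[1, 0],       # paper 0 published in venue 0
                [1, 0],       # paper 1 in venue 0
                [0, 1]])      # paper 2 in venue 1

# Path-instance counts along a meta-path are products of the
# incidence matrices of its component relations:
APA   = A_P @ A_P.T                     # A-P-A (co-authorship)
APVPA = (A_P @ P_V) @ (A_P @ P_V).T    # A-P-V-P-A (shared venues)

print("A-P-A counts:\n", APA)
print("A-P-V-P-A counts:\n", APVPA)
# Entry [i, j] counts path instances between authors i and j;
# PathPredict uses such counts (and normalized variants) as
# topological features for relationship prediction.
```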
The Power of PathPredict: Experiment on DBLP
Explains the prediction power of each meta-path (Wald test for logistic regression)
Higher prediction accuracy than using the projected homogeneous network: 11% higher
Co-author prediction for Jian Pei: only 42 among 4,809 candidates are true first-time co-authors! (Features collected in [1996, 2002]; test period [2003, 2009])
131

Author Citation Time Prediction in DBLP
Predict when Philip S. Yu will cite a new author, under a Weibull distribution assumption
Top-4 meta-paths for author citation time prediction: studying the same topic; being co-cited by the same paper; following co-authors' citations; following the citations of authors who study the same topic
Social relations are less important in author citation prediction than in co-author prediction
132

Enhancing the Power of Recommender Systems by Heterogeneous Information Network Analysis
Collaborative filtering methods suffer from the data sparsity issue: a small set of users and items have a large number of ratings, while most users and items have few
[Figure: long-tail distribution of # of ratings over users/items, and a movie information network linking Avatar, Aliens, Titanic, and Revolutionary Road through Zoe Saldana, James Cameron, Leonardo DiCaprio, Kate Winslet, and the genres Adventure and Romance]
Personalized recommendation with heterogeneous networks [WSDM'14]
Heterogeneous relationships complement each other: users and items with limited feedback can be connected to the network by different types of paths, connecting new users or items into the information network
Different users may require different models: relationship heterogeneity makes personalized recommendation models easier to define
133

Relationship Heterogeneity Based Personalized Recommendation Models
Different users may have different behaviors or preferences, and may be interested in the same movie for different reasons (e.g., a James Cameron fan, an 80s sci-fi fan, and a Sigourney Weaver fan may all watch Aliens)
Two levels of personalization:
Data level: most recommendation methods use one model for all users and rely on personal feedback to achieve personalization
Model level: with different entity relationships, we can learn personalized models for different users to further distinguish their differences
134

Preference Propagation-Based Latent Features
[Figure: a user-movie network linking Bob, Charlie, and Alice to King Kong, Titanic, Skyfall, and Revolutionary Road through Naomi Watts, Ralph Fiennes, Kate Winslet, Sam Mendes, the genre "drama", and the tag "Oscar Nomination"]
Generate L different meta-paths (path types) connecting users and items
Propagate user implicit feedback along each meta-path
Calculate latent features for users and items for each meta-path with an NMF-related method
135

Recommendation Models
Observation 1: different meta-paths may have different importance
Global recommendation model (Eq. 1): r(u_i, e_j) = Σ_{q=1..L} θ_q · û_i^(q) · v̂_j^(q)T, where û_i^(q) and v̂_j^(q) are the latent features of user i and item j under the q-th meta-path and θ_q is that meta-path's weight
Observation 2: different users may require different models
Personalized recommendation model (Eq. 2): r(u_i, e_j) = Σ_{k=1..c} sim(C_k, u_i) · Σ_{q=1..L} θ_q^(k) · û_i^(q) · v̂_j^(q)T, where c is the total number of soft user clusters and sim(C_k, u_i) is the user-cluster similarity
136

Parameter Estimation
Bayesian personalized ranking (Rendle et al., UAI'09)
Objective function (Eq. 3): min_θ Σ -ln σ(r(u_i, e_a) - r(u_i, e_b)), summed over each correctly ranked item pair (e_a, e_b), i.e., user u_i gave feedback on e_a but not on e_b; σ is the sigmoid function
Learning the personalized recommendation model: soft-cluster users with NMF + k-means; for each user cluster, learn one model with Eq. (3); generate a personalized model for each user on the fly with Eq. (2)
137
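A minimal sketch of BPR-style estimation of the global model's meta-path weights (Eqs. 1 and 3); the latent features, feedback data, and hyperparameters below are all invented for illustration, and the per-cluster personalized step is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical precomputed meta-path latent features: for each of L
# meta-paths, user factors U[q] (n_users x d) and item factors V[q].
n_users, n_items, d, L = 20, 30, 4, 3
U = [rng.normal(size=(n_users, d)) for _ in range(L)]
V = [rng.normal(size=(n_items, d)) for _ in range(L)]
theta = np.ones(L)                       # per-meta-path weights to learn
feedback = {u: set(rng.choice(n_items, 5, replace=False))
            for u in range(n_users)}

def score(u, i):
    # Global model (Eq. 1): weighted sum over meta-path latent features.
    return sum(theta[q] * U[q][u] @ V[q][i] for q in range(L))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# BPR: for sampled (user, positive item, negative item) triples, push the
# positive item's score above the negative item's (minimize -ln sigma(diff)).
lr, reg = 0.05, 0.01
for step in range(5000):
    u = rng.integers(n_users)
    pos = rng.choice(list(feedback[u]))
    neg = rng.integers(n_items)
    if neg in feedback[u]:
        continue
    diff = score(u, pos) - score(u, neg)
    grad_coeff = sigmoid(-diff)          # derivative of -ln sigma(diff)
    for q in range(L):
        g = U[q][u] @ V[q][pos] - U[q][u] @ V[q][neg]
        theta[q] += lr * (grad_coeff * g - reg * theta[q])

print("learned meta-path weights:", np.round(theta, 3))
```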
Experiments: Heterogeneous Network Modeling Enhances the Quality of Recommendation
Comparison methods:
Popularity: recommend the most popular items to users
Co-click: conditional probabilities between items
NMF: non-negative matrix factorization on user feedback
Hybrid-SVM: Rank-SVM with plain features (utilizes both user feedback and the information network)
HeteRec personalized recommendation (HeteRec-p) leads to the best recommendation
138

Summary of Theme III
Network representation enables more extensive analysis of text corpora
Relatively homogeneous, clean networks, where common substructures appear frequently: graph indexing + approximate structural search; graph pattern mining
Heterogeneous, noisy networks with knowledge sparsity but a clear meta-path schema: meta-path based similarity search; meta-path guided prediction, classification, and recommendation in heterogeneous networks
139

References for Theme III: Graph Search and Mining
C. Borgelt and M. R. Berthold, "Mining Molecular Fragments: Finding Relevant Substructures of Molecules", ICDM'02
S. Kim, H. Kim, T. Weninger, J. Han, and H. D. Kim, "Authorship Classification: A Discriminative Syntactic Tree Mining Approach", SIGIR'11
M. Kuramochi and G. Karypis, "Frequent Subgraph Discovery", ICDM'01
S. Nijssen and J. Kok, "A Quickstart in Frequent Structure Mining Can Make a Difference", KDD'04
X. Yan and J. Han, "gSpan: Graph-Based Substructure Pattern Mining", ICDM'02
X. Yan and J. Han, "CloseGraph: Mining Closed Frequent Graph Patterns", KDD'03
X. Yan, P. S. Yu, and J. Han, "Graph Indexing: A Frequent Structure-based Approach", SIGMOD'04
X. Yan, P. S. Yu, and J. Han, "Substructure Similarity Search in Graph Databases", SIGMOD'05
S. Yang, Y. Wu, H. Sun, and X. Yan, "Schemaless and Structureless Graph Querying", VLDB'14
F. Zhu, Q. Qu, D. Lo, X. Yan, J. Han, and P. S. Yu, "Mining Top-K Large Structural Patterns in a Massive Network", VLDB'11
140

References for Theme III: Heterogeneous Network Search & Mining
H. Huang, Z. Wen, D. Yu, H. Ji, Y. Sun, J. Han, and H. Li, "Resolving Entity Morphs in Censored Data", ACL'13
D. Yu, H. Huang, T. Cassidy, H. Ji, C. Wang, S. Zhi, J. Han, C. Voss, and M. Magdon-Ismail, "The Wisdom of Minority: Unsupervised Slot Filling Validation based on Multi-dimensional Truth-Finding", COLING'14
Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012
Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu, "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT'09
Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, "PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks", VLDB'11
Y. Sun, J. Han, C. C. Aggarwal, and N. Chawla, "When Will It Happen? Relationship Prediction in Heterogeneous Information Networks", WSDM'12
X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han, "Personalized Entity Recommendation: A Heterogeneous Information Network Approach", WSDM'14
Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, "Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks", ASONAM'11
141

Conclusions and Future Research Directions
142

Conclusions
Data mining and NLP have been working toward a common goal, turning data into knowledge (D2K), but with different methodologies
Data mining explores massive amounts of data, but without in-depth analysis of individual documents
The two fields will benefit each other by integrating their technologies and methodologies
We have been proposing a D2N2K (data to network to knowledge) methodology:
Phrase mining in massive corpora
Entity recognition and typing by correlation and cluster analysis
Construction of massive typed heterogeneous information networks
Mining actionable knowledge from such "semi-structured" information networks
Lots to be done in the future!
143

Looking Forward
Lots of unanswered questions and research issues:
What is the best framework for integration and joint inference?
Is there an ideal common representation, or a layer in between?
Can we even go beyond heterogeneous information networks?
Apply the D2N2K framework to other NLP applications:
Network-search-based collective entity linking
Cross-lingual information network alignment
Semantic-graph-based machine translation
Expert finding, research recommendation, ...
144

Tools
Check our research package dissemination portal IlliMine: http://illimine.cs.uiuc.edu/
Phrase mining: https://github.com/shangjingbo1226/SegPhrase
Entity typing: http://shanzhenren.github.io/ClusType
Entity morph decoding: http://nlp.cs.rpi.edu/software/morphdecoding.tar.gz
Graph mining tools, e.g., gSpan: http://www.cs.ucsb.edu/~xyan/software/gSpan.htm
145

Acknowledgement
NETWORK SCIENCE CTA
146

BACKUP

Different Input Focus
NLP: unstructured text data
Data Mining (DM): more on structured and semi-structured data
148

NLP Challenge 1: Data and Knowledge Sparsity
Data sparsity: need to obtain high-level statistics as global evidence
Training a typical supervised model needs a large number of instances: 500 documents for phrase chunking and for entity, relation, and event extraction; 20,000 queries for entity linking
Knowledge sparsity: requires global knowledge acquired from a wider context at low cost
Source: "Translation out of hype-speak: some kook made threatening noises at Brownback and got arrested"
KB: Samuel Dale "Sam" Brownback (born September 12, 1956) is an American politician, the 46th and current Governor of Kansas
Who is Brownback? Answered by background data/knowledge aggregation
149

DM Solution: Leverage Information Redundancy
Acquire a large amount of related documents from multiple sources (genres, languages, data modalities)
Learn high-level data-driven statistics to extract, type, and link information
Frequent pattern mining, and ranking based on different criteria:
Popularity: 'information retrieval' vs. 'cross-language information retrieval'
Concordance: 'active learning' vs. 'learning classification'
Completeness: 'vector machine' vs. 'support vector machine'
Construct relational graphs for weakly-supervised label propagation
150

NLP Challenge 2: From Data to Knowledge (D2K)
Directly converting a large amount of unstructured data into knowledge is too ambitious and unrealistic
Example 1. Input: millions of discussion forum posts under censorship; Output: resolve each implicit entity mention to its real target
Example 2. Input: 15 years of non-parallel Chinese and English news articles; Output: entity translation pairs
Example 3. Input: millions of multi-source documents reporting conflicting information; Output: track each company's top employees over time, complete knowledge bases
151
Session 1. Phrase Mining: A Data Mining Approach

TNG: Experiments on Research Papers
[Figure: topical n-gram samples mined from research-paper corpora]
153

Phrase Mining and Topic Modeling: An Example
Motivation: unigrams (single words) can be difficult to interpret
Ex.: the topic that represents the area of Machine Learning:

Unigrams         versus  Phrases
learning                 learning
reinforcement            support vector machines
support                  reinforcement learning
machine                  feature selection
vector                   conditional random fields
selection                classification
feature                  decision trees
random                   ...
classification
decision
trees
...
154

KERT: Topical Keyphrase Extraction & Ranking
Topical keyphrase extraction and ranking, starting from unigram topic assignment (Topic 1 & Topic 2)
Example paper titles: knowledge discovery using least squares support vector machine classifiers; learning support vectors for reinforcement learning; a hybrid approach to feature selection; pseudo conditional random fields; automatic web page classification in a dynamic and hierarchical way; inverse time dependency in convex regularized learning; postprocessing decision trees to extract actionable knowledge; variance minimization least squares support vector machines; ...
Extracted topical keyphrases: support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, ...
155

Framework of KERT
1. Run bag-of-words model inference and assign a topic label to each token
2. Extract candidate keyphrases within each topic via frequent pattern mining
3. Rank the keyphrases in each topic by four criteria:
Popularity: 'information retrieval' vs. 'cross-language information retrieval'
Discriminativeness: only frequent in documents about topic t
Concordance: 'active learning' vs. 'learning classification'
Completeness: 'vector machine' vs. 'support vector machine'
Comparability property: directly compare phrases of mixed lengths
156

Summary: Strategies on Topical Phrase Mining
Strategy 1: generate bag-of-words → generate sequence of tokens; an integrated, complex model in which phrase quality and topic inference rely on each other; slow and prone to overfitting
Strategy 2: after bag-of-words model inference, visualize topics with n-grams; phrase quality relies on the topic labels of unigrams; can be fast; generally high-quality topics and phrases
Strategy 3: before bag-of-words model inference, mine phrases and impose them on the bag-of-words model; topic inference relies on correct segmentation of documents, but is not sensitive to it; can be fast; generally high-quality topics and phrases
157

Session 2. Simultaneously Inferring Phrases and Topics

Session 3. Post Topic Modeling Phrase Construction

Session 4. First Phrase Mining then Topic Modeling

Session 5. SegPhrase: Integration of Phrase Mining with Document Segmentation

Mining Phrases: Why Not Use Raw Frequency Based Methods?
Traditional data-driven approaches use frequent pattern mining: if AB is frequent, AB could likely be a phrase
But raw frequency does NOT reflect the quality of phrases, e.g., freq(vector machine) ≥ freq(support vector machine)
Need to rectify the frequency based on segmentation results
Phrasal segmentation will tell which words should be treated as a whole phrase and which remain unigrams
162
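A minimal sketch of frequent k-gram counting, and of why raw frequency misleads: every occurrence of "support vector machine" also counts toward its substring "vector machine". The corpus, threshold, and names below are hypothetical:

```python
from collections import Counter

def mine_frequent_ngrams(docs, max_k=6, min_sup=2):
    """Collect all k-grams (k <= max_k) with raw frequency >= min_sup.
    A minimal sketch of frequent-pattern-based candidate generation."""
    counts = Counter()
    for doc in docs:
        toks = doc.lower().split()
        for k in range(1, max_k + 1):
            for i in range(len(toks) - k + 1):
                counts[" ".join(toks[i:i + k])] += 1
    return {g: c for g, c in counts.items() if c >= min_sup}

docs = [
    "a support vector machine is a supervised model",
    "we train a support vector machine on labeled data",
    "the vector machine approach",   # hypothetical noisy text
]
freq = mine_frequent_ngrams(docs)
# Prints 2 3: the lower-quality substring is the more frequent one,
# which is exactly what rectification via segmentation corrects.
print(freq.get("support vector machine"), freq.get("vector machine"))
```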
ClassPhrase I: Pattern Mining for Candidate Set
Build a candidate phrase set by frequent pattern mining: mine frequent k-grams, where k is typically small (e.g., 6 in our experiments)
Popularity measured by the raw frequency of words and phrases mined from the corpus
163

ClassPhrase II: Classifier
Limited training labels on whether a phrase is a quality one or not, e.g., "support vector machine": 1; "the experiment shows": 0
For a ~1GB corpus, only 300 labels
Random forest as the classifier; bootstrap many different datasets from the limited labels
Predicted phrase quality scores lie in [0, 1]
164

SegPhrase: Why Do We Need Phrasal Segmentation in Corpus?
Phrasal segmentation can tell which phrase is more appropriate
Ex.: "A standard [feature vector] [machine learning] setup is used to describe...": this occurrence of "vector machine" is not counted toward the rectified frequency
Rectified phrase frequency (expected influence)
165

SegPhrase: Segmentation of Phrases
Partition a sequence of words by maximizing the likelihood, considering:
Phrase quality score: ClassPhrase assigns a quality score to each phrase
Probability in the corpus
Length penalty: tunes the preference between shorter and longer phrases
Filter out phrases with low rectified frequency: bad phrases are expected to occur rarely in the segmentation results
166
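The segmentation step above can be sketched as a Viterbi-style dynamic program over candidate segments; the quality scores, the exact form of the length penalty, and all names below are invented for illustration and are not SegPhrase's actual scoring function:

```python
import math

# Hypothetical phrase quality scores, e.g., produced by a
# ClassPhrase-style classifier (all numbers invented).
quality = {
    "support vector machine": 0.95,
    "feature vector": 0.90,
    "machine learning": 0.92,
    "vector machine": 0.05,
    "support": 0.5, "feature": 0.5, "vector": 0.5,
    "machine": 0.5, "learning": 0.5, "a": 0.5, "standard": 0.5, "setup": 0.5,
}

def segment(tokens, max_len=3, alpha=0.8):
    """Viterbi-style segmentation maximizing the sum of log scores;
    alpha is a toy per-segment length-penalty factor."""
    n = len(tokens)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            seg = " ".join(tokens[i:j])
            q = quality.get(seg, 1e-6)      # unseen segments get a tiny score
            score = best[i][0] + math.log(q) + math.log(alpha)
            if score > best[j][0]:
                best[j] = (score, i)
    segs, j = [], n                         # backtrack over boundaries
    while j > 0:
        i = best[j][1]
        segs.append(" ".join(tokens[i:j]))
        j = i
    return segs[::-1]

print(segment("a standard feature vector machine learning setup".split()))
# -> ['a', 'standard', 'feature vector', 'machine learning', 'setup']
# The spurious "vector machine" never survives segmentation, so it
# accrues no rectified frequency from this sentence.
```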
SegPhrase+: Enhancing Phrasal Segmentation
SegPhrase+: one more round of enhanced phrasal segmentation
Feedback: using the rectified frequency, re-compute the features previously computed from raw frequency
Process: classification → phrasal segmentation (SegPhrase) → classification → phrasal segmentation (SegPhrase+)
Effect on quality scores, e.g., "np hard in the strong sense" vs. "np hard in the strong"; "data base management system"
167

Linking Entities to Knowledge Base
"Stay up Hawk Fans. We are going through a slump, but we have to stay positive. Go Hawks!"
168

Recognizing Typed Entities
Identify token spans as entity mentions in documents and label their types
Target types: FOOD, LOCATION, JOB_TITLE, EVENT, ORGANIZATION
Plain text: "The best BBQ I've tasted in Phoenix! I had the pulled pork sandwich with coleslaw and baked beans for lunch. ... The owner is very nice."
Text with typed entities: "The best BBQ:Food I've tasted in Phoenix:LOC! I had the [pulled pork sandwich]:Food with coleslaw:Food and [baked beans]:Food for lunch. ... The owner:JOB_TITLE is very nice."
Enabling structured analysis of an unstructured text corpus
169

Entity Linking: A Relational Graph Approach
Relational graph: each pair of a mention m and a concept c is a node; edges capture local compatibility, coreference, and semantic relatedness
The model (adapted from Zhu et al., 2003): semi-supervised graph regularization, where y_i is the label of node i and W is the weight matrix of the relational graph
170

Relational Graph Construction (cont'd)
[Figure: a relational graph over mention-concept pairs with edge weights such as 0.44, 0.62, 0.91]
Semantic relatedness (SR):
SR between two mentions: meta-path
SR between two concepts: link structure in Wikipedia (Milne and Witten, 2008)
Linear combination of the three graphs
171

Models for Comparison
TagMe: an unsupervised model based on prior popularity and semantic relatedness within a single message (Ferragina and Scaiella, 2010)
Meij: the state-of-the-art supervised approach based on a random forest model (Meij et al., 2012)
SSRegu: our proposed semi-supervised graph regularization model with all three types of relations
172
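A minimal sketch of the graph-regularization idea behind SSRegu, in the Zhu et al. (2003) label-propagation form: scores of labeled mention-concept pairs are clamped and propagated over the weighted relational graph (the weights and seeds below are hypothetical):

```python
import numpy as np

# Toy relational graph over 5 mention-concept pair nodes: W[i, j] is the
# combined edge weight (the compatibility, coreference, and semantic
# relatedness graphs are assumed to be linearly combined beforehand).
W = np.array([
    [0.0, 0.9, 0.1, 0.0, 0.0],
    [0.9, 0.0, 0.2, 0.0, 0.0],
    [0.1, 0.2, 0.0, 0.8, 0.0],
    [0.0, 0.0, 0.8, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.7, 0.0],
])
y = np.array([1.0, 0.0, 0.0, 0.0, 0.0])          # node 0: known correct pair
labeled = np.array([True, False, False, False, True])  # node 4: known incorrect

# Propagation minimizes sum_ij W_ij (y_i - y_j)^2: repeatedly replace each
# unlabeled node's score with the weighted average of its neighbors.
f = y.copy()
D_inv = 1.0 / W.sum(axis=1)
for _ in range(100):
    f = D_inv * (W @ f)      # one propagation step
    f[labeled] = y[labeled]  # clamp the seed labels
print(np.round(f, 3))        # scores in [0, 1] for the unlabeled pairs
```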
Quality of Semantic Relatedness
[Figure: DSRM compared against the standard relatedness method M&W (Milne and Witten, 2008)]
173

Impact of Semantic Relatedness on Concept Disambiguation
News dataset: 4,485 mentions (Hoffart et al., 2011)
AIDA: an unsupervised collective inference method (Hoffart et al., 2011); our methods are completely unsupervised
[Figure: comparison of TagMe, Meij, SSRegu+M&W, and SSRegu+DSRM on the tweet set, and of AIDA, SSRegu+M&W, and SSRegu+DSRM on the news dataset]
174

Remaining Challenges
Mention detection is the performance bottleneck
Mention disambiguation: city and country names that refer to sports teams (e.g., "Miami" -> "Miami Heat")
Incorporate user interests
Non-linkable entity mention recognition and clustering
[Figure: error distribution]
175

Transformation-Based Graph Querying
Transformation-based graph querying produces many results; how to suggest the "best" results to users?
Different transformations should be weighted differently
Ex.: given a single-node query "Geoffrey Hinton", nodes with "G. Hinton" (abbreviation transformation) should be ranked higher than nodes with "Hinton" (last-token transformation)
How to determine the weights? They must be learned
Evaluating a candidate match φ(Q) with a ranking function over features:
Node matching feature: F_V(v, φ(v)) = Σ_i α_i · f_i(v, φ(v))
Edge matching feature: F_E(e, φ(e)) = Σ_j β_j · g_j(e, φ(e))
Matching score: P(φ(Q) | Q) ∝ exp( Σ_{v∈V_Q} F_V(v, φ(v)) + Σ_{e∈E_Q} F_E(e, φ(e)) )
176

Potential Challenges in Applying Graph Indices + Search to NLP
Graphs automatically constructed from natural language texts might:
Include many more diverse types and weights
Include more ambiguity
Suffer from knowledge sparsity: unique nodes/substructures are hard to generalize
Include more noise and errors
177

Mining Frequent (Sub)Graph Patterns
Data mining research has developed many scalable graph pattern mining methods
Given a labeled graph dataset D = {G1, G2, ..., Gn}, the supporting graph set of a subgraph g is D_g = {G_i | g ⊆ G_i, G_i ∈ D}, and support(g) = |D_g| / |D|
A (sub)graph g is frequent if support(g) ≥ min_sup
[Figure: a toy chemical-structure dataset of three molecules (A), (B), (C); with min_sup = 2, two frequent subgraph patterns are found, each with support 67%]
Alternative: mining frequent subgraph patterns from a single large graph or network
Documents can be viewed as graph structures as well
178

Efficient Graph Pattern Mining Methods
Effective graph pattern mining methods have been developed for different scenarios:
Mining graph "transaction" datasets, e.g., gSpan (Yan et al., ICDM'02), CloseGraph (Yan et al., KDD'03), ...
Mining frequent large subgraph structures in a single massive network, e.g., Spider-Mine (Zhu et al., VLDB'11)
Graph pattern mining forms the building blocks for graph classification, clustering, compression, comparison, and correlation analysis
Graph indexing and graph similarity search: gIndex (SIGMOD'05) performs graph indexing by graph pattern mining, supporting precise as well as similarity-based graph query answering
179

Mining Collaboration Patterns in DBLP Networks
Data description: 600 top conferences, 9 major CS areas, 15,071 authors in DB/DM
Authors labeled by the number of papers published in DB/DM: Prolific (P): >= 50; Senior (S): 20-49; Junior (J): 10-19; Beginner (B): 5-9
[Figure: collaboration patterns found, each shown with matching instances in the data]
180

Applications of Graph Pattern Mining
Bioinformatics: gene networks, protein interactions, metabolic pathways
Chem-informatics: mining chemical compound structures
Social networks, web communities, tweets, ...
Cell phone networks, computer networks, ...
Web graphs, XML structures, semantic Web, information networks
Software engineering: program execution flow analysis
Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
Graph indexing and graph similarity search
181

Logistic Regression-Based Prediction Model
Training and test pairs: <x_i, y_i> = <history feature list, future relationship label> (a sketch of this model follows at the end of this section)

Author pair    A-P-A-P-A  A-P-V-P-A  A-P-T-P-A  A-P→P-A   A-P-A (label)
<Mike, Ann>        4          5         100        3        Yes = 1
<Mike, Jim>        0          1          20        2        No = 0

Logistic regression model: model the probability of each relationship as P(y = 1 | x) = e^{βᵀx} / (1 + e^{βᵀx}), where β is the coefficient vector over the features (including a constant 1)
MLE estimation: maximize the likelihood of observing all the relationships in the training data
182

When Will It Happen? From "Whether" to "When"
"Whether": Will Jim rent the movie "Avatar" on Netflix? Output: P(X = 1)
"When": When will Jim rent the movie "Avatar"? Output: a distribution over time!
What is the probability Jim will rent "Avatar" within 2 months? P(Y ≤ 2)
By when will Jim rent "Avatar" with 90% probability? The t such that P(Y ≤ t) = 0.9
What is the expected time it will take for Jim to rent "Avatar"? E[Y]
May provide useful information for supply chain management
183

The Relationship Building Time Prediction Model
Solution: directly model the relationship building time P(Y = t)
Candidate distributions: geometric, exponential, Weibull
Use a generalized linear model
Deal with censoring (the relationship builds beyond the observed time T: right censoring)
Generalized linear model under the Weibull distribution assumption; training framework
184
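As referenced under slide 182, a minimal sketch of the logistic-regression relationship prediction model with meta-path count features; the training rows extend the slide's toy table with invented examples, and the fitting loop is plain gradient ascent on the log-likelihood:

```python
import numpy as np

# Toy training data in the spirit of the table in slide 182: meta-path
# count features (A-P-A-P-A, A-P-V-P-A, A-P-T-P-A, A-P->P-A) for author
# pairs, and whether they later became co-authors. Rows 3-4 are invented.
feats = np.array([
    [4, 5, 100, 3],   # <Mike, Ann>  -> co-authored (1)
    [0, 1,  20, 2],   # <Mike, Jim>  -> did not (0)
    [6, 9,  80, 5],   # hypothetical positive pair
    [1, 0,  10, 0],   # hypothetical negative pair
], dtype=float)
y = np.array([1, 0, 1, 0], dtype=float)

feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)  # standardize
X = np.hstack([np.ones((len(feats), 1)), feats])          # constant-1 feature
beta = np.zeros(X.shape[1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

# MLE by gradient ascent on the log-likelihood
# sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ],  p_i = sigmoid(beta^T x_i).
lr = 0.1
for _ in range(2000):
    p = sigmoid(X @ beta)
    beta += lr * X.T @ (y - p)

print("coefficients beta:", np.round(beta, 3))
print("P(future co-author):", np.round(sigmoid(X @ beta), 3))
```

Moving from "whether" to "when" would replace the Bernoulli likelihood above with a Weibull likelihood over relationship building times, with right-censored pairs contributing the survival probability P(Y > T) instead of a density term.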