Semantic Annotation of Microblog Topics Using Wikipedia Temporal Information
Tuan Tran
2nd International Alexandria Workshop
Hannover, Germany, 2-3 November 2015

Research Goal
Understanding the meanings of past trending hashtags
• Example: searching "#sochi lang:en since:2014-02-01 until:2014-02-28" — what was #sochi about in February 2014?
[Figure: screenshots of #Sochi tweets from 27 Feb 2014 — Olympic medals, food poisoning, KHL hockey, event photography]
• Annotation is a crucial step!
• Goal: Mapping hashtags to Wikipedia pages

Research Goal
Semantic annotation for hashtags is not easy:
• Hashtags are peculiar:
  #sochi vs. 2014_Winter_Olympics
  #jan25 vs. Egypt_Revolution_of_2011
  #mh370 vs. Malaysia_Airlines_Flight_370
  #crimea vs. Ukrainian_crisis
• Time matters — #germany:
  July 2014: 2014_FIFA_World_Cup
  July 2015: Greek_government-debt_crisis
  Sep 2015: European_migrant_crisis

Semantic Annotations in Twitter
Mostly at the tweet level:
• Tweak the similarities (TAGME, CIKM'10; Liu, ACL'13)
• Employ Twitter-specific features with human supervision (Meij, WSDM'12; Guo, NAACL'13)
• Expand the context to users (Cassidy, COLING'12; KAURI, KDD'13) or time (Fang, TACL'14; Hua, SIGMOD'15)
Annotating hashtags has been limited to general topic models (Ma, CIKM'14; Bansal, ECIR'15)

Main Idea
Align contexts from both sides:
• Twitter: all constituent tweets of the hashtag (e.g. "#Sochi Sea Port. What a beautiful site! #Russia" → Port_of_Sochi; "Hard to believe anyone can do worse than Russia in #Sochi..." → 2014_Winter_Olympics)
• Wikipedia: temporal signals from page views and edit history

Framework Outline
Problem: Given a trending hashtag and its burst time period T, identify the top-k
most prominent entities that describe the hashtag in T.
Three steps:
1. Candidate Entities Identification
2. Entity – Hashtag Similarities
3. Entity Prominence Ranking

Candidate Entities Identification
Mine from tweet contents via lexical matching.
• Twitter side: extract n-grams from tweets (n ≤ 5)
• Wikipedia side: build a lexicon for each entity from anchors of incoming links, redirects, titles, and disambiguation pages
• Start with sample text, then expand via links to increase recall

Entity – Hashtag Similarities: Link-based
• Build upon phrase – entity similarities P(t|m), the "commonness":
  Commonness(m → t) = count(m → t) / Σ_{t' ∈ W} count(m → t')
• Aggregate the commonness scores linearly into a hashtag – entity similarity
(Commonness illustration, e.g. P(Title | "Chicago"), from Dan Roth: "Wikification and Beyond: The Challenges of Entity and Concept Grounding". Tutorial at ACL 2014)

Entity – Hashtag Similarities: Text-based
• Compare the word distributions of the hashtag's tweets and of the entity's texts
• Consider both the entity's static text and its temporal edits during T (plus one day, to acknowledge possible time lags): compute the difference between consecutive revisions, accumulate the added snippets into the temporal context C_T(e), and estimate the probability of generating a word w from a mixture of the temporal context and the general context C(e):
  P̂(w|e) = λ · P̂(w|M_{C_T(e)}) + (1 − λ) · P̂(w|M_{C(e)})
  where M_{C_T(e)} and M_{C(e)} are the language models of e's edited text during T and of e's latest text, respectively.
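The mixture above can be sketched with maximum-likelihood unigram models; this is a minimal illustration, and the function names and the default λ are assumptions, not the authors' implementation:

```python
from collections import Counter

def unigram_lm(tokens):
    """Maximum-likelihood unigram language model from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def mixture_prob(w, temporal_tokens, static_tokens, lam=0.8):
    """P(w|e) = lam * P(w | M_CT(e)) + (1 - lam) * P(w | M_C(e))."""
    p_temporal = unigram_lm(temporal_tokens).get(w, 0.0)
    p_static = unigram_lm(static_tokens).get(w, 0.0)
    return lam * p_temporal + (1 - lam) * p_static
```

With λ close to 1 the score follows the entity's recent edits, which is what makes the measure sensitive to the burst period.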
P̂(w|M_{C(e)}) can be regarded as the background model.

Entity – Hashtag Similarities: Collective Attention-based
• Compare the temporal correlation of collective attention between the hashtag and the entity
• Use an effective metric that is agnostic to the scaling and time lag of time series (Yang and Leskovec, 2011): find the optimal shifting and scaling parameters to match the shapes of the two time series:
  f_t(e, h) = min_{q,λ} ‖TS_h − λ · d_q(TS_e)‖ / ‖TS_h‖     (4)
  where d_q(TS_e) is the time series derived from TS_e by shifting q time units, and ‖·‖ is the L2 norm. Equation 4 has a closed-form solution for any fixed q.
[1] Jaewon Yang, Jure Leskovec. "Patterns of Temporal Variation in Online Media". WSDM 2011

Entity Prominence Ranking
Each similarity measure captures one evidence of entity prominence; we unify all scores into a global ranking function.
• Rank by the unified score, a linear combination of the individual similarities:
  f(e, h) = α·f_m(e, h) + β·f_c(e, h) + γ·f_t(e, h)     (1)
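The shift-and-scale distance of Eq. 4 can be sketched as follows. This is a sketch under stated assumptions: zero-padded shifting, a small search range for q, and the least-squares closed form for the scale λ at each fixed shift; all names are illustrative:

```python
import math

def shift(series, q):
    """Shift a time series right by q units, padding with zeros (an assumption)."""
    n = len(series)
    out = [0.0] * n
    for i, v in enumerate(series):
        if 0 <= i + q < n:
            out[i + q] = v
    return out

def ts_distance(ts_h, ts_e, max_shift=3):
    """min over shift q and scale lam of ||ts_h - lam * d_q(ts_e)|| / ||ts_h||."""
    norm_h = math.sqrt(sum(v * v for v in ts_h))
    best = float("inf")
    for q in range(-max_shift, max_shift + 1):
        e_q = shift(ts_e, q)
        dot_he = sum(a * b for a, b in zip(ts_h, e_q))
        dot_ee = sum(b * b for b in e_q)
        lam = dot_he / dot_ee if dot_ee > 0 else 0.0  # closed-form scale for fixed q
        dist = math.sqrt(sum((a - lam * b) ** 2 for a, b in zip(ts_h, e_q))) / norm_h
        best = min(best, dist)
    return best
```

A distance of 0 means the entity's attention curve is a shifted, scaled copy of the hashtag's, which is exactly the lag-and-scale invariance the slide motivates.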
where f_m, f_c, f_t are the similarity measures based on mentions, context text, and temporal information, respectively, and α, β, γ are model weights. We further constrain α + β + γ = 1, so that the ranking scores of entities are normalized between 0 and 1 and learning becomes more tractable.
• To learn the model ω = (α, β, γ), the algorithm needs a premise of what makes an entity prominent! The algorithm learns the parameters automatically, without human-labeled data.
• The coherence premise [2] is not applicable at the topic level.
[2] Ratinov et al. "Local and Global Algorithms for Disambiguation to Wikipedia". ACL 2012

Influence Maximization
Observation: Prominent pages are first created / updated with texts, then linked to other pages.
• This reflects the shifting attention of users in social events [3]
[Figure: excerpt of tweets about ice hockey results in the 2014 Winter Olympics, linked to a graph of candidate entities — Vladimir_Putin, Sochi, Russia, 2014_Winter_Olympics, Russia_men's_national_ice_hockey_team, Ice_hockey_at_the_2014_Winter_Olympics, Finland, United_States, Ice_hockey]
[3] Keegan et al. "Hot off the wiki: Dynamics, Practices, and Structures in Wikipedia..." WikiSym 2011

Influence Maximization
Premise: Rank the top-k entities so as to maximize the spreading of influence to all other candidates.
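As a minimal sketch of the unified scoring of Eq. 1 (weights constrained to sum to 1), with all function names and default weights hypothetical:

```python
def combined_score(f_m, f_c, f_t, alpha=0.4, beta=0.3, gamma=0.3):
    """f(e, h) = alpha*f_m + beta*f_c + gamma*f_t, with alpha + beta + gamma = 1."""
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    return alpha * f_m + beta * f_c + gamma * f_t

def rank_entities(scores, k):
    """scores: dict entity -> (f_m, f_c, f_t); return the top-k by combined score."""
    ranked = sorted(scores, key=lambda e: combined_score(*scores[e]), reverse=True)
    return ranked[:k]
```

If each f lies in [0, 1], the constraint keeps the combined score in [0, 1] as well, which is the normalization the slide mentions.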
[Figure: the same tweet–entity influence graph for the #sochi ice hockey tweets]

Experiments
Data:
• Collected 500 million tweets over 4 months (Jan–Apr 2014) via the Streaming API
• Processed and sampled distinct trending hashtags:
  o Several heuristics + clustering methods [4] → 2,444 trending hashtags
  o 3 inspectors chose 30 meaningful trending hashtags
[4] Lehmann et al. "Dynamical Classes of Collective Attention in Twitter". WWW 2012

Experiments
Baselines:
• Wikiminer (Milne & Witten, CIKM 2008)
• Tagme (Ferragina et al., IEEE Software 2012)
• KAURI (Shen et al., KDD 2013)
• Meij method (Meij et al., WSDM 2012)
• Individual similarities: M (link), C (text), T (temporal)
Evaluation: 6,965 entity–hashtag pairs rated on a 0–2 scale (5 evaluators, inter-agreement 0.6)

Experiments
Table 2: Experimental results on the sampled trending hashtags.
  Method     P@5    P@15   MAP
  Tagme      0.284  0.253  0.148
  Wikiminer  0.253  0.147  0.096
  Meij       0.500  0.670  0.375
  Kauri      0.305  0.319  0.162
  M          0.453  0.312  0.211
  C          0.263  0.245  0.140
  T          0.474  0.378  0.291
  IPL        0.642  0.495  0.439
We compare IPL with other entity annotation methods.
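The ranking metrics used in Table 2 (precision at k and average precision, whose mean over hashtags gives MAP) can be sketched as follows; function names are illustrative:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked entities that are relevant."""
    return sum(1 for e in ranked[:k] if e in relevant) / k

def average_precision(ranked, relevant):
    """Average of P@i over the ranks i at which a relevant entity appears."""
    hits, total = 0, 0.0
    for i, e in enumerate(ranked, start=1):
        if e in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0
```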
Our first group of baselines covers entity linking systems for general text (Wikiminer; Milne and Witten, 2008) and short text (Tagme; Ferragina and Scaiella). For each method, we use the default parameters.

Results and Discussion
• In general, all baselines perform worse than reported in the literature, confirming the higher complexity of annotating trending hashtags.
• IPL is better at the top ranks (P@5) and better when including low-ranked entities (MAP).
• The non-verbal (temporal) signal T alone is a good indicator.

Experiments
Manually classified events (hashtags' peaks) as endogenous / exogenous by checking the tweets' contents.
[Figure 3: MAP of each method (Tagme, WM, Meij, Kauri, M, C, T, IPL) for exogenous vs. endogenous hashtags; the method is stable and outperforms the baselines by a large margin, finding prominent entities even when the hashtag is noisy]

Experiments
Influence of the size of the burst time period:
• Larger window → more noise introduced

Conclusions
• Semantic annotation at the topic level is
difficult.
• Lesson learnt: it can be improved by exploiting temporal information and contexts from both sides (non-verbal evidence is promising)
• Future direction: improve efficiency and the text-based similarities

Thank you! Questions?

Terminology
Trending hashtag: there are peaks [1] in the daily tweet count time series:
• Variance score > 900
• Highest peak > 15 × the median of a 2-month window sample
Burst time period: a w-window around one peak
Entities: Wikipedia pages, excluding redirects, disambiguation pages, and lists
• Per entity: text, view count per day, and edits during T
[1] Lehmann et al. "Dynamical Classes of Collective Attention in Twitter". WWW 2012

Candidate Entities Identification
Mine from tweet contents via lexical matching.
• Twitter side: extract n-grams from tweets (n ≤ 5)
  o Parse POS tags for the tokens, filter patterns using rules
• Wikipedia side: build a lexicon (anchors, redirects, titles, disambiguation pages)
• Practical issues:
  o Start from sample tweets
  o Expand to incoming / outgoing linked entities
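The lexical-matching step above can be sketched as follows; a minimal version without the POS filtering, and with an assumed lexicon shape (surface form → list of entities):

```python
import re

def ngrams(text, max_n=5):
    """All word n-grams (n <= max_n) from a tweet, lowercased."""
    tokens = re.findall(r"\w+", text.lower())
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams

def candidate_entities(tweets, lexicon):
    """Match tweet n-grams against an entity lexicon (anchors, redirects, titles)."""
    candidates = set()
    for tweet in tweets:
        for gram in ngrams(tweet):
            candidates.update(lexicon.get(gram, ()))
    return candidates
```

In practice the lexicon lookup dominates recall, which is why the slides add POS-based filtering and link expansion on top of this core loop.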
entities E(h,The k) objective Now weSort detail the P if the L(f!(e,= h),(↵, r(e, h)) ✏ then where the nodes e2E(h,k) is to learn model , )<of the global Stop entitiesMeasure (cf. Sec- the observed activities function (Equationspreading 1). The general idea is thatvia we entities end eVLearning (IPL) P that scores h indicates influence : ! = ! µ (e, h), r(e, h)) with find an optimal ! such that rL(f the average error e2E(h,k) m (Kempe et al., dia article to ei ’s. end to the top influencing entities is minimized respect an approximation !, E(h, k)the loss w.r.t. influence score r: • inversed Learn ω toreturn minimize raph are X earn influence a, asthe such a link ! = arg min L(f (e, h), r(e, h)) ntity prominence ” from the destiE(h,k) ed IPL) contains baseline method suggested by Liu et al. (2014): score is estimated via random walks: ng of • twoInfluence steps: where r(e, h) is the influence score of e and h, k, we assume that compute the entity E(h, k) :=the rh is ⌧ Br (6) h+ h set of(1top-k⌧ )sentities with highest nfluence score to e influence scores, r(e, h), and L is the squared error loss function, wer related ones. (x the y)2 influence transition matrix, sh are where B is the sequel we de-and • r(e,h) f(e,h) is jointly learnt via gradient descent method L(x, y) = . ess measure sug2 the initial thatinare based on1.theWe enThe maininfluence steps arescores depicted Algorithm 8): tity prominence model (Step IPL), and ⌧ isthe the start with an initial guess for 1!,ofand compute I2 |) log(|I1 \I2 |))) damping factor. similarities for the candidate entities. Here fm , fc , og(min(|I1 |,|I2 |)) the entity influ- f , and f represent the similarity Hannover, Germany, 2-3 November 2015 25 score vectors. We t ! tively. ink-based Mention abilistic priorSimilarity in mapping the mention to the engarded where M an C tity via the lexicon. 
Entity – Hashtag Similarities: Link-based (details)
Built upon direct similarities of tweets – entities (Meij, WSDM'12; Fang, TACL'14):
• The similarity of an entity with one individual mention in a tweet can be interpreted as the probabilistic prior of mapping the mention to the entity via the lexicon. A common estimate exploits the anchor statistics from Wikipedia links, which has been proven to work well across different domains of text. We follow this approach and define the link prior
  LP(e|m) = |l_m(e)| / Σ_{m'} |l_{m'}(e)|
  where l_m(e) is the set of links with anchor m that point to e.
• The mention similarity f_m aggregates the link priors of the entity e over all mentions in all tweets with the hashtag h, weighted by frequency:
  f_m(e, h) = Σ_m (LP(e|m) · q(m))     (2)
  where q(m) is the frequency of the mention m in h's tweets over all mentions of e in all tweets of h.
• (For the text-based similarity, word probabilities are estimated from term frequencies in the respective contexts, and the distributions of the hashtag and the entity are compared via Kullback–Leibler divergence.)
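The two formulas above can be sketched directly; the data shapes (anchor counts per entity, mention frequencies per hashtag) and the function names are assumptions for illustration:

```python
def link_prior(anchor_counts, e, m):
    """LP(e|m) = |l_m(e)| / sum over m' of |l_m'(e)|, from anchor statistics of entity e."""
    total = sum(anchor_counts[e].values())
    return anchor_counts[e].get(m, 0) / total if total else 0.0

def mention_similarity(anchor_counts, e, mention_freq):
    """f_m(e, h) = sum over mentions m of LP(e|m) * q(m)."""
    return sum(link_prior(anchor_counts, e, m) * q for m, q in mention_freq.items())
```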
Influence Graph
Nodes are the candidate entities (cf. Section 3.1); an edge (e_i, e_j) ∈ V_h indicates a link from e_j's Wikipedia article to e_i's.
• Note that the edges of the influence graph are inverted with respect to the link direction in Wikipedia: a link gives an "influence endorsement" from the destination entity to the source entity.
Entity Relatedness: we assume an entity endorses more of its influence to highly related entities than to less related ones.
• We use the popular entity relatedness measure suggested by Milne and Witten (2008):
  MW(e_1, e_2) = 1 − [log(max(|I_1|, |I_2|)) − log(|I_1 ∩ I_2|)] / [log(|E|) − log(min(|I_1|, |I_2|))]
  where I_1 and I_2 are the sets of entities having links to e_1 and e_2, respectively, and E is the set of all entities in Wikipedia.
• The level of endorsement is proportional to the relation weight; the influence transition from e_i to e_j is the normalized value:
  b_{i,j} = MW(e_i, e_j) / Σ_{(e_i, e_k) ∈ V} MW(e_i, e_k)     (5)
Influence Score: let r_h be the influence score vector of the entities in G_h; it can be estimated efficiently via the random walk of Eq. 6.
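The relatedness measure and the row normalization of Eq. 5 can be sketched as follows; the zero-overlap fallback and the function names are assumptions:

```python
import math

def milne_witten(in_links_1, in_links_2, num_entities):
    """Milne-Witten relatedness from shared in-links (1.0 = identical in-link sets)."""
    i1, i2 = set(in_links_1), set(in_links_2)
    overlap = len(i1 & i2)
    if overlap == 0:
        return 0.0  # assumption: treat disjoint in-link sets as unrelated
    num = math.log(max(len(i1), len(i2))) - math.log(overlap)
    den = math.log(num_entities) - math.log(min(len(i1), len(i2)))
    return 1 - num / den

def transition_row(relatedness):
    """Normalize one entity's relatedness weights into transition probabilities (Eq. 5)."""
    total = sum(relatedness.values())
    return {e: w / total for e, w in relatedness.items()} if total else relatedness
```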
Iterative Influence-Propagation Learning
Algorithm 1 (Entity Influence-Prominence Learning), as listed earlier: normalize f_ω, set s_h := f̂_ω, propagate the influence via Eq. 6, take the top-k entities, stop when the loss falls below ε, and otherwise update ω by gradient descent. The random walk can be performed efficiently with matrix multiplication and resembles a context-sensitive ranking.

Hashtag Sampling
• Calculate, for each peak, the vector (f_a, f_b, f_c) of the portions of tweets before, during, and after the peak time.
• Cluster with EM; choose the 4 most plausible clusters.
• Sample separately from each cluster.