FS-NER: A Lightweight Filter-Stream Approach to Named Entity
Transcription
FS-NER: A Lightweight Filter-Stream Approach to Named Entity
FS-NER: A Lightweight Filter-Stream Approach to Named Entity Recognition on Twitter Data Diego Marinho de OliveiraƩ, Alberto H. F. LaenderƩ, Adriano VelosoƩ, Altigran S. da SilvaƱ ƩDepartamento de Ciência da Computação Universidade Federal de Minas Gerais {dmoliveira, laender, adrianov}@dcc.ufmg.br ƱInstituto de Computação Universidade Federal do Amazonas [email protected] Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Introduction • Microblogging activity is reshaping the way people communicate • Dominant NER approaches are either based on linguistic grammar-based techniques or on statistical models • They are ill suited when it comes to Twitter data [Gimpel et al., 2011; Ritter et al., 2011] 2 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Introduction • We propose a novel NER approach, called FS-NER (Filter Stream Named Entity Recognition) • FS-NER is characterized by the use of filters that process Twitter messages • We show that FS-NER performs 3% better than a CRF-based baseline, besides being orders of magnitude faster and much more practical 3 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Named Entity Recognition • Aims to recognize and extract determined terms known as entity (e.g., person, organization, location) Gates was born in Seattle, Washington and was the founder of Microsoft 4 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Named Entity Recognition • Aims to recognize and extract determined terms known as entity (e.g., person, organization, location) [PER Gates] was born in [LOC Seattle], [LOC Washington] and was the founder of [ORG Microsoft] 5 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Twitter and it challenges It is an online social networking service that allow users to write and read text-based messages up to 140 characters • • • • • • Large volume of data Lack of formalism Language diversity Real-time nature Lack of contextualization Data stream orientation 6 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Related Work • Ritter et al., 2011 Considered techniques usually applied to traditional NER and adapted them to Twitter • Liu et al., 2011 Use the k-Nearest Neighbors algorithm and a CRF-based approach for composing a semi-supervised system • Li et al., 2012 Have proposed a two-step, unsupervised NER approach targeted to Twitter data, called TwiNER Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 7 Proposed Approach Filter Stream Named Entity Recognition (FS-NER) 8 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Filter-Stream Named Entity Recogntion • Let S = <m1,m2,…> be a stream of messages (i.e., tweets) each mi in S is expressed by a pair (X,Y) X =[x1, x2, …, xn] and Y = [y1, y2, …, yn], yi assumes one value of {Beginning, Inside, Last, Outside, UnitToken} While X is known in advance for all messages in S, the values for the labels in Y are unknown and must be predicted For example “I love NEW YORK” → ([x1=I, x2=love, x3=NEW, x4=YORK], [y1=Outside, y2=Outside, y3=Beginning, y4=Last]). 9 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Filter-Stream Named Entity Recogntion • Filter Likelihood 10 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Filter-Stream Named Entity Recogntion • Recognition – The set {l, ϴl} is used to choose the most likely label for the term xi – In isolation, filters may not capture specific patterns that can be used for recognition 11 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Filter-Stream Named Entity Recogntion • Recognition 12 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Filter Engineering • Term The term filter estimates the probability of a certain term being an entity Distribution P(Yi|X=“new”^FT) Distribution P(Yi|X=“Nashville”^FT) 0,9 0,7 0,8 0,6 0,7 0,5 0,6 0,5 0,4 0,4 0,3 0,3 0,2 0,2 0,1 0,1 0 0 B I L O U B I L O U 13 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Filter Engineering • Context The context filter is able to capture unknown entities Apple → It is an entity or not? 14 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Filter Engineering • Context The context filter is able to capture unknown entities Apple fail should be a lesson for Microsoft [ORG] 15 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Filter Engineering • Affix The affix filter uses the fragments of an observation xi to infer if it is an entity Examples of affixes -ville: Nashville, Knoxville, Jacksonville -an (-ian): Italian, urban, African -ese: Chinese, Congolese, Vietnamese 16 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Filter Engineering • Dictionary The dictionary filter uses a list of names of correlated entities to infer whether the observed term is an entity ? TRAINING SET TEST SET DICTIONARY 17 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Filter Engineering • Noun The noun filter only considers terms that have just the first letter capitalized to infer if the observed term is an entity Capitalized John Apple Nintendo Not capitalized bill new york microsoft 18 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 FS-NER in Action: Example 19 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 FS-NER in Action: Example • Recognition model 20 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 FS-NER in Action: Example • Training Term filter Context filter Considering only Xi where ϴI > 0,00 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Noun filter 21 FS-NER in Action: Example • Test Precision 0.88 Recall 1.00 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 F1 0.93 22 Experimental Evaluation • Baseline CRF-based approach from Sarawagi1 • Metrics NRC NRC 2 PR P ; R ; F1 NAA NES PR 23 1. Avaible at http://crf.sourceforge.net/ Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Experimental Evaluation • Collections OW Tweets are related to soccer teams playing in the Brazilian National League Entities of interest: Player, Venue and Teams 2.000 Tweets; language Portuguese 24 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Experimental Evaluation • Collections ETZ Tweets were randomly crawled and are all in English Entities of interest: Company, Place and Person 2.400 Tweets; language English 25 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Experimental Evaluation • Collections WT Tweets semi-manually labeled and supplied by the WePS3 task 2 Entity of interest: Organization ~44.000 Tweets; language English, Spanish, Japanese, ... 26 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Performance Analysis • Analysis Individual Filters 27 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Performance Analysis • Analysis Individual Filters The best precise filters Term, context and dictionary The best recall filters Affix and noun 28 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Performance Analysis • Analysis of Specific Filter Combinations Recognition centered on the term filter (TRM) 29 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Performance Analysis • Analysis of Specific Filter Combinations Recognition centered on the term filter with generalization (GTRM) 30 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Performance Analysis • Analysis of Specific Filter Combinations Recognition centered on the noun filter (NON) 31 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Performance Analysis • Analysis of Specific Filter Combinations Recognition centered on the context filter (CTX) 32 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Performance Analysis • Analysis of Specific Filter Combinations 33 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Performance Analysis • Comparison with CRF-based approaches 34 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Performance Analysis • Runtime Comparison 35 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Conclusion • We have introduced FS-NER (Filter-Stream Named Entity Recognition) • The proposed approach is based on a efficient structure composed of lightweight filters • We show that FS-NER performs 3% better than a CRF-based baseline, besides being orders of magnitude faster and much more practical 36 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 Future Work • Alleviate the dependence on manually annotated data • Automate the process of filter combination • Identify other application environments in which FS-NER is suitable 37 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 ThAnks! This work was partially funded by InWeb - The Brazilian National Institute of Science and Technology for the Web (grant MCT/CNPq 573871/2008-6), and by the authors’ individual grants from CNPq, FAPEMIG and FAPEAM. 38 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013 References • K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments. In Proc. of ACL (Short Papers), pages 42-47, 2011 • A. Ritter, S. Clark, Mausam, and O. Etzioni. Named Entity Recognition in Tweets: An Experimental Study. In Proc. of EMNLP, pages 1524-1534, 2011 • C. Li, J. Weng, Q. He, Y. Yao, A. Datta, A. Sun, and B.-S. Lee. TwiNER: named entity recognition in targeted twitter stream. In Proc. of SIGIR, pages 721730, 2012. • X. Liu, S. Zhang, F. Wei, and M. Zhou. Recognizing Named Entities in Tweets. In Proc. of ACL, pages 359-367, 2011. 39 Diego Marinho de Oliveira ([email protected]) – MSM@WWW - 2013