Topic Modeling on Historical Newspapers
Tze-I Yang*, Andrew J. Torget**, and Rada Mihalcea*
*Dept. of Computer Science & Engineering
**Dept. of History
University of North Texas
Friday, June 24, 2011
LaTeCH 2011

Introduction
We have been working with Stanford's Bill Lane Center for the American West
– “Mapping Historical Texts: Combining Text-Mining and Geo-Visualization to Unlock the Research Potential of Historical Newspapers”
– UNT: Text mining newspaper
– Stanford: Visualization & spatial mapping

Motivation
How can we find information in a large text corpus?
– search terms
  ● too many results!
  ● restrict by location
  ● restrict by time
– need prior knowledge
– might overlook unexpected/interesting information

Pointwise Mutual Information & Associative Rule Learning
● Generates many values and rules
  – Too unstable
  – Have to pick cutoff points for values/rules
  – Hard for our history expert to evaluate
  – Hard to defend choice of cutoff points
● Abandoned these types of techniques

Topic Modeling
● Assumes document is a mixture of topics
● Gives a ranked list of words per topic
● Spelling agnostic

Topic Modeling
● Compared different topic modeling techniques - Boyd-Graber et al. (2009)
  – latent Dirichlet allocation (LDA) - best
  – correlated topic model (CTM)
  – probabilistic latent semantic indexing (pLSI)
● MAchine Learning for LanguagE Toolkit (MALLET) - UMass Amherst
  – parallel threaded SparseLDA

Previous Works
● Mining the Dispatch - Nelson (2010)
● Martha Ballard's Diary - Blevins (2010)
● Our corpus
  – has not been cleaned
    ● contains multiple sources
    ● contains OCR errors (inconsistent)
    ● is not segmented into articles
    ● missing punctuations
  – will topic modeling techniques still work?

Corpus
[figure-only slide]

Corpus
Baggarly, Herbert Milton, editor. The Tulia Herald (Tulia, Tex), Vol. 48, No. 19, Ed. 1, Thursday, May 12, 1955, Newspaper, May 12, 1955; digital images, University of North Texas Libraries, The Portal to Texas History, http://texashistory.unt.edu; crediting Swisher County Public Library, Tulia, Texas.

955 u wHE MAN who trims himself to T suit everybody will soon whit tie himself away This bit of wisdom comes from the bulletin of the Happy First Baptist church Its true not only of ministers but also of newspaper 3 TOWN TOPICS ij ARM COOPERATIVES are in the spotlight during these days of high taxes For almost a generation now many eyebrows have been raised by advocates of private enterprise as they saw what J they believed to be an unfair tax [...]

Corpus
● Texas newspapers from 1829-2008
● Texas Digital Newspaper Program (UNT)
● Scanned and converted to text through optical character recognition (OCR)
  – clear guidelines from the Library of Congress
● Metadata accompanies each issue

Corpus
[three figure-only slides]

Work Flow
● Dictionary - Aspell
● Named entity tagger - Stanford Named Entity Recognizer
● Stemmer - Snowball

Evaluations
Does topic modeling give history experts good information about the content?
– Our history expert
  ● looked at three types of outputs
  ● evaluated relevancy of results on the basic type of output over years that are important to the cotton industry

Evaluations
[two figure-only slides]

Evaluations
● 3 out of 4 of the irrelevant topic groups contain mainly misspelled stop words
● 1 out of 4 of the irrelevant topic groups is uninteresting

Evaluations
● 1865-1901
● San Jacinto (Texas revolution)
● 125 out of 220 snippets pertain to the memorialization
● Interesting for a historian!

Challenges
● Misspelled stop words become topics
  – Augment the stop list until no misspelling topic groups appear
  – (Adaptive removal using topic groups?)
● Choosing the number of topics
  – We could go with hierarchical LDA
  – But lose the parallel threading advantage
  – Currently run LDA separately for different number of topic groups

Conclusions
● Topic modeling delivers useful keywords despite noise in corpus
● Keywords can spawn interesting questions
● Usefulness ultimately depends on historian
● Mapping Texts (http://mappingtexts.org/)
  – Interactive visualization tools
  – Final datasets should be available when finished

Acknowledgment
● We would like to thank our partners at Stanford's Bill Lane Center for the American West
● We have been supported by the NEH under Digital Humanities Start-Up Grant (HD-51188-10). Any views, findings, conclusions or recommendations expressed in this publication do not necessarily represent those of the National Endowment for the Humanities.
● Thank you for listening! Questions or Comments?
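
Supplementary sketch (not part of the original slides): the Challenges slide proposes augmenting the stop list until no misspelling topic groups appear, but does not say how such groups are detected. The Python below is one possible reading, under stated assumptions: a topic group counts as a "misspelling group" when more than half of its words are within one character edit of a word on the original stop list. The names `within_one_edit`, `augment_stop_list`, the `fit_topics` callback, the 0.5 threshold, and the toy topics are all hypothetical; in practice `fit_topics` would wrap a real topic-model run (for example MALLET) with the given stop list.

```python
from typing import Callable, List, Set


def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one insertion, deletion, or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = j = edits = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
            continue
        edits += 1
        if edits > 1:
            return False
        if len(a) == len(b):
            i += 1  # substitution: skip one character in both strings
        j += 1      # otherwise: skip the extra character in the longer string
    # A trailing unmatched character in b counts as one more edit.
    return edits + (len(b) - j) <= 1


def augment_stop_list(
    fit_topics: Callable[[Set[str]], List[List[str]]],
    base_stop_words: Set[str],
    threshold: float = 0.5,   # assumed cutoff, not from the slides
    max_rounds: int = 10,     # safety bound on the number of model re-runs
) -> Set[str]:
    """Re-run the topic model, growing the stop list until no topic group
    is dominated by near-misspellings of the original stop words."""
    current = set(base_stop_words)
    for _ in range(max_rounds):
        added = False
        for topic in fit_topics(current):
            # Compare against the ORIGINAL stop list only, so that added
            # misspellings do not themselves attract further words.
            suspects = [
                w for w in topic
                if w not in current
                and any(within_one_edit(w, s) for s in base_stop_words)
            ]
            if topic and len(suspects) / len(topic) > threshold:
                current.update(suspects)
                added = True
        if not added:
            break  # no misspelling topic groups remain
    return current


if __name__ == "__main__":
    # Toy stand-in for a topic model: fixed topic groups, filtered by the stop list.
    toy_topics = [["tbe", "thc", "aud", "cotton"],
                  ["cotton", "crop", "bales", "market"]]

    def toy_fit(stop: Set[str]) -> List[List[str]]:
        return [[w for w in t if w not in stop] for t in toy_topics]

    print(sorted(augment_stop_list(toy_fit, {"the", "and", "of"})))
```

A one-edit test is a crude heuristic: real words such as "tie" are also one edit from "the", which is why whole topic groups are flagged by majority rather than individual words. A historian would still need to review what gets added.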