Topic Modeling on Historical Newspapers
Tze-I Yang*, Andrew J. Torget**, and Rada Mihalcea*
*Dept. of Computer Science & Engineering
**Dept. of History
University of North Texas
Friday, June 24, 2011
LaTeCH 2011
Introduction
We have been working with Stanford's Bill Lane Center for the American West
  – “Mapping Historical Texts: Combining Text-Mining and Geo-Visualization to Unlock the Research Potential of Historical Newspapers”
  – UNT: Text mining of newspapers
  – Stanford: Visualization & spatial mapping
Motivation
How can we find information in a large text corpus?
  – search terms
    ● too many results!
    ● restrict by location
    ● restrict by time
  – need prior knowledge
  – might overlook unexpected/interesting information
Pointwise Mutual Information & Associative Rule Learning
● Generates many values and rules
  – Too unstable
  – Have to pick cutoff points for values/rules
  – Hard for our history expert to evaluate
  – Hard to defend choice of cutoff points
● Abandoned these types of techniques
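
To make the cutoff problem concrete, here is a minimal sketch of pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))). It assumes document-level co-occurrence counts and an invented four-document corpus. The slides do not say how PMI was actually computed, so the counting scheme, the `pmi_scores` name, and the cutoff value are all illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(documents):
    """Return {(w1, w2): PMI} for word pairs that co-occur in at least one document.

    Probabilities are estimated from document frequencies (an assumption
    made for this sketch, not a detail taken from the slides).
    """
    n_docs = len(documents)
    word_df = Counter()   # number of documents containing each word
    pair_df = Counter()   # number of documents containing each word pair
    for doc in documents:
        vocab = sorted(set(doc.lower().split()))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    scores = {}
    for (w1, w2), joint in pair_df.items():
        p_joint = joint / n_docs
        p1 = word_df[w1] / n_docs
        p2 = word_df[w2] / n_docs
        scores[(w1, w2)] = math.log2(p_joint / (p1 * p2))
    return scores

# Invented toy corpus, for illustration only.
docs = [
    "cotton crop price",
    "cotton crop rain",
    "church sermon sunday",
    "cotton price market",
]
scores = pmi_scores(docs)

# Every co-occurring pair gets a value, so some cutoff has to be chosen;
# 1.0 here is arbitrary, which is exactly the difficulty noted above.
kept = {pair: s for pair, s in scores.items() if s >= 1.0}
```

Even this toy corpus yields a score for every co-occurring pair, and which pairs survive depends entirely on the chosen cutoff.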
Topic Modeling
● Assumes document is a mixture of topics
● Gives a ranked list of words per topic
● Spelling agnostic
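
The three properties above can be illustrated with a small collapsed Gibbs sampler for LDA. This is a teaching sketch in plain Python, not the MALLET implementation used in the project. The function name, hyperparameters (`alpha`, `beta`), iteration count, and toy documents are assumptions made for illustration.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs (lists of words).

    Returns (theta, top_words): per-document topic mixtures and a ranked
    word list per topic. Words are treated as opaque tokens, so nothing
    here depends on how a word is spelled.
    """
    rng = random.Random(seed)
    n_vocab = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens in doc d assigned to topic k
    topic_word = [Counter() for _ in range(n_topics)]   # times word w is assigned to topic k
    topic_total = [0] * n_topics                        # total tokens assigned to topic k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * n_vocab)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Each document is a mixture of topics (each row of theta sums to 1).
    theta = [
        [(c + alpha) / (len(doc) + alpha * n_topics) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    # Each topic is a ranked list of its most frequently assigned words.
    top_words = [
        [w for w, c in topic_word[k].most_common(5) if c > 0]
        for k in range(n_topics)
    ]
    return theta, top_words

# Invented toy documents, for illustration only.
toy_docs = [
    "cotton crop price market cotton".split(),
    "cotton gin crop bale".split(),
    "church sermon sunday pastor".split(),
    "pastor church sunday choir".split(),
]
theta, top_words = lda_gibbs(toy_docs, n_topics=2)
```

Because the sampler only counts token identities, a consistently misspelled word is simply one more vocabulary item; this is one reading of "spelling agnostic" on the slide.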
Topic Modeling
● Compared different topic modeling techniques - Boyd-Graber et al. (2009)
  – latent Dirichlet allocation (LDA) - best
  – correlated topic model (CTM)
  – probabilistic latent semantic indexing (pLSI)
● MAchine Learning for LanguagE Toolkit (MALLET) - UMass Amherst
  – parallel threaded SparseLDA
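
As a rough illustration of how MALLET's multi-threaded topic trainer is typically invoked, the sketch below only builds the command lines for one import step and one training run per topic count. The slides do not list the options the authors used; the helper name, file names, thread count, and topic counts are hypothetical.

```python
def mallet_commands(corpus_dir, topic_counts, threads=4, mallet_bin="bin/mallet"):
    """Build (but do not run) MALLET command lines.

    All paths and output file names are hypothetical placeholders.
    """
    # Convert a directory of text files into MALLET's binary format.
    import_cmd = [
        mallet_bin, "import-dir",
        "--input", corpus_dir,
        "--output", "corpus.mallet",
        "--keep-sequence",
        "--remove-stopwords",
    ]
    # One independent training run per requested number of topics.
    train_cmds = []
    for k in topic_counts:
        train_cmds.append([
            mallet_bin, "train-topics",
            "--input", "corpus.mallet",
            "--num-topics", str(k),
            "--num-threads", str(threads),
            "--output-topic-keys", f"keys_{k}.txt",
            "--output-doc-topics", f"doc_topics_{k}.txt",
        ])
    return import_cmd, train_cmds

import_cmd, train_cmds = mallet_commands("newspapers/", [10, 20, 40])
```

Each list could be passed to `subprocess.run(cmd, check=True)` on a machine where MALLET is installed.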
Previous Works
● Mining the Dispatch - Nelson (2010)
● Martha Ballard's Diary - Blevins (2010)
● Our corpus
  – has not been cleaned
    ● contains multiple sources
    ● contains OCR errors (inconsistent)
    ● is not segmented into articles
    ● is missing punctuation
  – will topic modeling techniques still work?
Corpus
Baggarly, Herbert Milton, editor. The Tulia Herald (Tulia, Tex), Vol. 48, No. 19, Ed. 1, Thursday, May 12, 1955, Newspaper, May 12, 1955; digital images, University of North Texas Libraries, The Portal to Texas History, http://texashistory.unt.edu; crediting Swisher County Public Library, Tulia, Texas.
955
u
wHE MAN who trims himself to
T
suit everybody will soon whit
tie himself away
This bit of wisdom comes from
the bulletin of the Happy First Baptist
church Its true not only of
ministers but also of newspaper
3 TOWN TOPICS
ij ARM COOPERATIVES are in
the spotlight during these
days of high taxes For almost a
generation now many eyebrows
have been raised by advocates of
private enterprise as they saw what
J they believed to be an unfair tax
[...]
Corpus
● Texas newspapers from 1829-2008
● Texas Digital Newspaper Program (UNT)
● Scanned and converted to text through optical character recognition (OCR)
  – clear guidelines from the Library of Congress
● Metadata accompanies each issue
Work Flow
● Dictionary - Aspell
● Named entity tagger - Stanford Named Entity Recognizer
● Stemmer - Snowball
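
A minimal sketch of a pre-processing pass of this kind is given below, using only the Python standard library. Every component is a labeled stand-in: a tiny hard-coded word set replaces the Aspell dictionary, a crude suffix stripper replaces the Snowball stemmer, and named entity tagging is omitted. The ordering of the steps, the stop-word removal step, and the choice to drop (rather than correct) out-of-dictionary words are assumptions; the slide does not specify them.

```python
import re

# Stand-in for an Aspell dictionary lookup (hypothetical word list).
DICTIONARY = {
    "cotton", "crops", "prices", "farmers", "raised", "taxes",
    "the", "of", "and", "were",
}
# Hypothetical stop list.
STOP_WORDS = {"the", "of", "and", "were"}

def toy_stem(word):
    """Crude suffix stripper; a stand-in for Snowball, not equivalent to it."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop out-of-dictionary tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t in DICTIONARY]      # removes OCR debris such as "tbe"
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens]

cleaned = preprocess("The farmers raised cotton crops and tbe prices of taxes")
```

In a real pipeline the stand-ins would be swapped for the actual tools named above; the sketch only shows how such stages compose.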
Evaluations
Does topic modeling give history experts good information about the content?
  – Our history expert
    ● looked at three types of outputs
    ● evaluated relevancy of results on the basic type of output over years that are important to the cotton industry
Evaluations
● 3 out of 4 of the irrelevant topic groups contain mainly misspelled stop words
● 1 out of 4 of the irrelevant topic groups is uninteresting
Evaluations
● 1865-1901
● San Jacinto (Texas Revolution)
● 125 out of 220 snippets pertain to the memorialization
● Interesting for a historian!
Challenges
● Misspelled stop words become topics
  – Augment the stop list until no misspelling topic groups appear
  – (Adaptive removal using topic groups?)
● Choosing the number of topics
  – We could go with hierarchical LDA
  – But lose the parallel threading advantage
  – Currently run LDA separately for different numbers of topic groups
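
The "augment the stop list until no misspelling topic groups appear" loop could be automated along the following lines. This is a speculative sketch of the adaptive idea, not the procedure the authors followed (the slides suggest the stop list was extended by inspection). Detecting misspellings with `difflib` string similarity, the 0.6 cutoff, the majority rule, and the `train_topics` callback are all assumptions.

```python
import difflib

def misspelled_stop_words(topic_words, seed_stop_words, cutoff=0.6):
    """Return topic words that closely resemble, but are not, a seed stop word."""
    candidates = sorted(seed_stop_words)
    found = set()
    for word in topic_words:
        if word in seed_stop_words:
            continue
        if difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff):
            found.add(word)
    return found

def augment_stop_list(train_topics, stop_words, max_rounds=10, majority=0.5):
    """Retrain until no topic group is dominated by misspelled stop words.

    train_topics is a hypothetical callback: it takes the current stop list
    and returns a list of topics, each a list of top words.
    """
    seed = frozenset(stop_words)   # always compare against the original list to avoid drift
    stop_words = set(stop_words)
    for _ in range(max_rounds):
        additions = set()
        for topic_words in train_topics(stop_words):
            suspects = misspelled_stop_words(topic_words, seed)
            # Treat a topic as a "misspelling topic" if most of its words are suspects.
            if len(suspects) > majority * len(topic_words):
                additions |= suspects
        if not additions:
            break
        stop_words |= additions
    return stop_words

# Fake trainer with fixed, invented topics, for illustration only.
def fake_train(stop_words):
    topics = [["tbe", "aud", "thc", "cotton"], ["cotton", "crop", "price", "market"]]
    return [[w for w in t if w not in stop_words] for t in topics]

final_stop_list = augment_stop_list(fake_train, {"the", "and", "of"})
```

String similarity also flags legitimate words that happen to resemble stop words (for example "hand" and "and"), so a per-topic majority rule like the one above, or a human check, is still needed.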
Conclusions
● Topic modeling delivers useful keywords despite noise in corpus
● Keywords can spawn interesting questions
● Usefulness ultimately depends on historian
● Mapping Texts (http://mappingtexts.org/)
  – Interactive visualization tools
  – Final datasets should be available when finished
Acknowledgment
● We would like to thank our partners at Stanford's Bill Lane Center for the American West
● We have been supported by the NEH under Digital Humanities Start-Up Grant (HD-51188-10). Any views, findings, conclusions or recommendations expressed in this publication do not necessarily represent those of the National Endowment for the Humanities.
● Thank you for listening! Questions or Comments?