Today - UCLA Statistics

Fifth meeting: Text as data
Today
• We are going to consider the statistical properties of text;
simply, the use of text as data
• Our readings brought up basic “laws” of text; somewhat stable
patterns in word frequencies, for example
• We also read about mathematical models for text and how they
were applied to text compression and authorship attribution
• In the slides that follow, we’ll consider information visualizations
(and here the line between info viz and media art is at its
blurriest) and then have a discussion about the readings
• To frame the discussion a little...
Triggerhappy, Thompson & Craighead, 1998
“Triggerhappy” is a simple re-working of the classic
arcade game, “Space Invaders.” Rather than
defending against wave after wave of pixelated
aliens, players must shoot up a series of text extracts
taken from Michel Foucault’s essay, “What is an
Author?” The game has nine levels each with their
own soundtrack taken from anonymous shortwave
radio broadcasts sometimes referred to as Numbers
Stations.
"In the web environment, as in that of Trigger Happy,
the reader's focus on text seems constantly and
thoroughly aborted, perpetually distracted by the
prospect of more specialised, more scintillating, more
apropos information. Thus, in the midst of this play on
hits and clicks, Trigger Happy is gesturing towards
the basis of a future information economy, where
attention, precisely because of its scarcity, may
become a central commodity."
Jamie King, IF/THEN Published by The Netherlands
Design Institute
[Slides: screens of scrolling text. The first screens show isolated words and screen names (grooves, heng, khanikaaa, eyebay, helloho, stopping, fnork, surprising, ...); later screens mix in short first-person statements, for example:]
I am 47
i am the light heavy wheight champion of the world
I'm unable to begin to make sense of your reply
i'm not 'buff' or anything, but i'm doing okay.
i'm taking a more quiet stance lately
I am 18 years old
i am from Maryland
I am a bit slow though
I'm from ancient Babylon...
I am 35/male from Sweden, and you?
I'm a teacher
I'm quite tempted to buy their piano back
I'm seeing a pattern here. It's called inflation.
1. Words or, rather, databases of words
A small point
• This is just one of an untold number of examples in which text
is escaping the confines of the page; our urban spaces are rich
with text, with words
• In computer-mediated settings, we have more and more
opportunities to contribute text; even our actions are described
in text
• The next few slides are a biased collection of works; there are
some noticeable omissions (Holzer, Acconci) that I can provide
references for after class or over coffee
http://www.wordcount.org/main.php
http://www.wordcount.org/querycount.php
WordCount Conspiracy.
What follows are WordCount sequences that people have emailed me, discovered in the
data rankings.
Conspiracists unite!
7964-7967: homosexual loses papal schooling
8562-8565: conspicuous brutal snake tomatoes
992-995: america ensure oil opportunity
53425-53426: backstreet leotards
30523-30525: despotism clinching internet
4304-4307: microsoft aquire salary tremendous
17244-17246: neon porn convict
5283-5285: angel seeks supper
3046-3051: iraq winner, fucking smooth, nick votes (GWB's election strategy?)
78963-78964: toucan tonsillectomy
4136-4139: temple plot establishing courage
3474-3476: apple formula: imagination
23134-23138: manipulative fruity adolf waived munitions (WW2 Story?)
17032-17037: unwitting fashion crimes in glasgow: hushed caledonian jock embraces innocently polyester
1443-1445: conservative reduce vote
1941-1945: faith establish facts requires membership
2629-2634: bush admit specifically agents smell denied
30591-30594: halloween pastimes rebuffed tranquillizers
30613-30615: wealthiest redefine stalwarts
20652-20654: angelica howled orgasm
38599-38606: Hijackers underpaid, incurs ministration oakes legato, jeopardized NYSE
9515-9523: sexy stalin thee lethal limb registrar manages monuments indoors
1224-1226: environmental damage proposed
728-729: Cheney's master plan.. Dark talking... thinking success
6456-6459: Problems for the Catholic Church in the Boston.. Legally, priests lacked financing
13915-13918: The future of reproduction? Defiant clone stung coupling
372-405: Hidden message from God on the role of women??...
20414-20416: brando lbs predominate
19643-19646: surfing martyrs tearful stockbrokers
16047-16048: arafat unhealthy
Ranges shown without their words: 1088-1090, 2629-2634, 1941-1945, 992-996, 12608-12610, 4670-4673, 1442-1445
President Bush's recent assertion that North
Korea, Iraq and Iran form an "Axis of Evil"[2]
was more than a calculated political act -- it
was also an imaginatively formal, geometric
one, which had the effect of erecting a
monumental, virtual, globe-spanning triangle.
Axis is an online tool intended to broaden
opportunities for similar kinds of Axis creation.
It allows its participant to connect any three
points in space [countries] into a new Axis of
his or her own design. With the help of
multidimensional statistical metrics culled
from international public databases[3], the
commonalities amongst the user's choices
are revealed. In this manner, Axis presents an
inversion of Bush's praxis, obtaining lexico-political meaning from the formal act of spatial
selection.
AxisApplet, Golan Levin, 2002
"The Baby Name Wizard", Martin Wattenberg
The authors conducted an exhaustive empirical study, with the aid of custom software, public search engines and
powerful statistical techniques, in order to determine the relative popularity of every integer between 0 and one million.
The resulting information exhibits an extraordinary variety of patterns which reflect and refract our culture, our minds,
and our bodies...
For example, certain numbers, such as 212, 486, 911, 1040, 1492, 1776, 68040, or 90210, occur more frequently than
their neighbors because they are used to denominate the phone numbers, tax forms, computer chips, famous dates, or
television programs that figure prominently in our culture. Regular periodicities in the data, located at multiples and
powers of ten, mirror our cognitive preference for round numbers in our biologically-driven base-10 numbering system.
Certain numbers, such as 12345 or 8888, appear to be more popular simply because they are easier to remember.
Golan Levin et al, The Secret Lives of Numbers, 2002
The Google AdWords Happening
Christophe Bruno, April 2002
How to lose money with your art ?
At the beginning of April, a debate took place on rhizome.org mailing list, about
how to earn money with net art. It suggested to me an answer to an easier
problem : how to spend money with my art (if you understand everything on how to
spend money, you should in principle understand also how to earn money,
because of conservation laws...)
I decided to launch a happening on the web, consisting in a poetry advertisement
campaign on Google AdWords . I opened an account for $5 and began to buy
some keywords. For each keyword you can write a little ad and, instead of the
usual ad, I decided to write little "poems", non-sensical or funny or a bit
provocative.
I began with the keyword "symptom". The first ad I wrote was :
Words aren't free anymore
bicornuate-bicervical uterus
one-eyed hemi-vagina
www.unbehagen.com
As soon as the campaign was launched, I was able to see the results. Every time
somebody was looking for the word "symptom" in Google, they could see my ad in
the top right corner of the page.
http://www.iterature.com/adwords/
During the fourth campaign, I kept receiving these emails
from Google :
" We believe that the content of your ad does not accurately reflect the
content of your website. We suggest that you edit your ad text to
precisely indicate the nature of the products you offer. This will help to
create a more effective campaign and to increase your conversion
rate. We also recommend that you insert your specific keywords into
the first line of your ad, as this tends to attract viewers to your website"
Then I got a last email :
"Hello.
I am the automated performance monitor for Google AdWords Select.
My job is to keep average clickthrough rates at a high level, so that
users can consistently count on AdWords ads to help them find
products and services.
The last 1,000 ad impressions I served to your campaign(s) received
fewer than five clicks. When I see results like this, I significantly reduce
the rate at which I show the ads so you can make changes to improve
performance.
( ... )
Sincerely,
The Google AdWords Automated Performance Monitor"
The price of words : towards a generalized
semantic capitalism
One of the most interesting fact is that we have reached a
situation in which any word of any language has its price,
fluctuating according to the laws of the market.
Words already had some kind of exchange value, but we
hardly realized it : if I insult somebody, I will get something in
return, such as a punch in my face for instance. But now
there is no doubt anymore. The word "sex" is worth $3,837,
the word "art" $410, "net art" is only $0.05 (prices on the 11
of April 2002). And the most expensive word is "free"! Prices
are determined according to the number of search requests
and an average Cost-Per-Click.
At first sight, there may be something healthy in the fact that
words may have a price. If you know you have to pay, you
are more careful when you have something to say. And if you
see that every person who clicks on your link, makes you
lose 0.05$ (as it is the case in the Google system), you think
twice before writing your sentence.
But of course there is another side to this story and I have
the intuition that this could be a big event in the history of
mankind. There aren't many events like that : the invention of
writing is one of them for instance. Right now, we may not
realize the importance of this fact because the web is not
such a big part of our existence. But imagine the day when a
search engine will rule the whole textual content of the web,
in which the memory of mankind will be stored. Think of the
power in their hands.
2. Text, or rather texts as collections of words
TextArc, Brad Paley, 2002
[Figure: word frequencies plotted on log-log axes; axis labels "log rank" and "log freq"]
Most frequent words (48016 words total)
word        count
the         2487
to          1602
i           1408
that        1320
of          1267
and         1232
you         913
a           823
in          764
was         500
gonzales    490
is          476
this        452
have        433
it          400
with        396
senator     395
about       353
not         351
as          335
be          332
on          314
we          305
attorney    299
what        297
for         294
mr          280
but         279
us          262
or          243
would       235
were        227
think       225
there       218
Word frequencies
• Tokens v. types
• Zipf distributions: the frequency f of a word and its position in the
frequency-ordered list (rank) r are related by f proportional to 1/r
• Upshot: a few common words, a middling number of medium-frequency
words and many low-frequency words
• Frequency of frequencies; at the right we have
computed the frequency of each frequency as
well as a running total
• This is the standard Zipf structure...
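The quantities on this slide can be computed in a few lines. What follows is a minimal sketch, not the code used to produce the slides: the tokenizer (lowercase, split on anything that is not a letter, digit or apostrophe) and the toy sentence are assumptions made for illustration, and the running total is taken to be the cumulative share of distinct words.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def zipf_table(tokens):
    """Return (rank, word, count) rows, most frequent word first."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]

def frequency_of_frequencies(tokens):
    """Return (frequency, number of types, running share of types) rows."""
    counts = Counter(tokens)
    freq_counts = Counter(counts.values())
    n_types = len(counts)
    rows, running = [], 0
    for freq in sorted(freq_counts):
        running += freq_counts[freq]
        rows.append((freq, freq_counts[freq], round(running / n_types, 3)))
    return rows

# Toy input; in class this would be a full transcript.
sample = "the cat and the dog and the bird saw a cat"
tokens = tokenize(sample)
print(len(tokens), "tokens,", len(set(tokens)), "types")
for row in zipf_table(tokens)[:3]:
    print(row)
for row in frequency_of_frequencies(tokens):
    print(row)
```

Plotting the log of the count against the log of the rank from `zipf_table` should give a roughly straight line with slope near -1 if f is proportional to 1/r.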
Frequency of frequencies
frequency   words with that frequency   running total (proportion)
1           1487                        0.428
2           506                         0.574
3           284                         0.655
4           183                         0.708
5           117                         0.742
6           97                          0.769
7           69                          0.789
8           65                          0.808
9           60                          0.825
10          44                          0.838
11          42                          0.850
12          33                          0.860
13          28                          0.868
14          24                          0.875
15          27                          0.882
16          23                          0.889
17          19                          0.894
18          17                          0.899
19          14                          0.903
20          12                          0.907
21          11                          0.910
22          15                          0.914
23          6                           0.916
24          11                          0.919
25          8                           0.921
26          11                          0.925
27          9                           0.927
28          5                           0.929
29          7                           0.931
30          2                           0.931
31          3                           0.932
32          8                           0.934
33          7                           0.936
34          10                          0.939
35          4                           0.940
36          5                           0.942
37          3                           0.943
38          4                           0.944
39          3                           0.945
40          5                           0.946
41          3                           0.947
Zipf (take 2)
• We can also consider the distance between word pairs;
distances again have a power law
• Compare this finding to the kind of layout that Brad Paley used
for TextArc
3. Collections of texts
Adding structure
• A collocation is an expression of two or more
words that corresponds to some conventional
way of saying something (bigrams, trigrams)
• At the right we have frequent bigrams; often we
employ some kind of filtering to clean these up
and highlight contentful expressions; what
would you suggest?
• More structure can be examined via regular
expressions and NLP software...
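One possible answer to the filtering question is to drop any bigram that contains a very common function word. The sketch below assumes a tiny, hypothetical stop list (`STOPWORDS`) and a made-up token sequence; it is one filtering choice among many, not the one used for the slide.

```python
from collections import Counter

# Hypothetical stop list for illustration; a real one would be much longer.
STOPWORDS = {"of", "the", "in", "i", "to", "and", "that", "a", "it", "was"}

def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

def frequent_bigrams(tokens, top=5, drop_stopwords=True):
    """Count bigrams, optionally dropping those that contain a stop word."""
    counts = Counter(bigrams(tokens))
    if drop_stopwords:
        counts = Counter({bg: n for bg, n in counts.items()
                          if bg[0] not in STOPWORDS and bg[1] not in STOPWORDS})
    return counts.most_common(top)

# Made-up token sequence in the spirit of the hearing transcript.
tokens = ("the attorney general of the united states and the attorney general "
          "of the department of justice").split()
print(frequent_bigrams(tokens, drop_stopwords=False))
print(frequent_bigrams(tokens))
```

With the filter on, pairs such as "of the" disappear and "attorney general" and "united states" are what remain, which is the kind of contentful expression the slide asks for.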
Frequent bigrams
count   bigram
353     of the
191     in the
190     i think
179     attorney general
177     gonzales senator
157     the department
150     and i
145     us attorneys
142     to the
141     i dont
123     united states
117     to be
102     senator i
101     at the
99      that i
95      with the
88      that you
87      of justice
84      thank you
84      and the
81      it was
80      what i
79      department of
78      white house
75      that the
74      want to
71      the president
71      have been
69      on the
68      i have
68      going to
65      the attorney
64      with respect
64      us attorney
63      this is
62      respect to
62      gonzales i
61      would be
61      to me
61      for the
Trends
• Jon Kleinberg at Cornell has developed models to express the
“burstiness” in text streams
• He applies these tools to his own (email) Inbox and identifies
bursts of activity around different topics
• Rather than respond to the data that’s there, we can also filter...
JJ (Network Surveillance Tool / Empathic Data Visualization)
A Carnivore Client by Golan Levin, May 2002
Synopsis: The Radical Software Group's Carnivore project is a surveillance tool that listens to the Internet traffic (email, web surfing, etc.) on a
given local network, and serves this datastream over the net to a variety of interfaces called "clients." These clients -- created by a number of
computational artists and designers from around the world -- are each designed to animate, diagnose, or interpret the network traffic in various
ways. JJ is one such client: a software agent which uses facial expressions to visualize the emotional content of network traffic.
While many visualizations rely on charts or graphs to convey numeric data, other visualization research has leveraged certain affordances of
human cognition in order to represent information in a more qualitatively readable way. One important example of this is the work of Hermann
Chernoff, who pioneered the use of cartoon faces as a tool for portraying high-dimensional multivariate data. Chernoff's research demonstrated
that our intuitive and highly sensitive ability to interpret facial expressions could be incorporated into unusually legible visualizations of complex
information.
JJ is an autonomous software agent who displays facial expressions appropriate to the emotional content of the words that are presented to him.
Implemented as a Carnivore Client, JJ literally "puts a face" on the information transmitted through his host network, in order to provide a data
visualization of the network's "emotional content." JJ operates according to a mapping established between two well-known psychological
databases: (A) Ekman and Friesen's set of "universal facial expressions" — the set of face photographs which have been shown to embody basic
cross-cultural human emotions (namely: anger, fear, surprise, disgust, sadness and pleasure) — and (B) the Linguistic Inquiry and Word Count
(LIWC) dictionary by Pennebaker, Francis, & Booth, which categorizes the "emotional associations" of several thousand common English words, and
provides an efficient and effective method for evaluating the various affective components present in verbal and written speech samples.
JJ scans his host network for text packets, reading each packet one word at a time. When JJ finds a word that matches a term in the LIWC
dictionary, his emotional state (represented as an array of affective activation levels) is updated in response to that word's emotional associations.
JJ then displays a (morphed) mixture of facial expressions, weighted according to the current intensity of his different emotions. Considered
cumulatively, JJ's expressions reflect the overall "mood" of his information environment in an extremely simple, yet direct and unmistakeable way.
At present, JJ's emotional responses conform to those of the statistical "everyman": for example, if JJ sees a word commonly associated with
disgust, then he will present a "disgust" face. An alternate version of JJ could permit his user to modify these associations, and thus modify JJ's
apparent personality (so, for example, a "perverted" JJ might appear happy when he hears a disgusting word, while a "repressed" JJ might appear
angry).
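The word-by-word update described above can be sketched as follows. This is only an illustration of the idea: the six-word `LEXICON` is a hypothetical stand-in for the LIWC dictionary, and the `DECAY` factor that lets older words fade is an assumption that the description does not mention.

```python
# Hypothetical mini-lexicon standing in for the LIWC dictionary (not its real contents).
LEXICON = {
    "hate": {"anger": 1.0},
    "afraid": {"fear": 1.0},
    "rotten": {"disgust": 1.0},
    "happy": {"pleasure": 1.0},
    "cry": {"sadness": 1.0},
    "wow": {"surprise": 1.0},
}
EMOTIONS = ["anger", "fear", "surprise", "disgust", "sadness", "pleasure"]
DECAY = 0.9  # assumed: older words fade so the face tracks recent traffic

def update_state(state, word):
    """Decay every activation level, then add the word's associations, if any."""
    new_state = {e: state[e] * DECAY for e in EMOTIONS}
    for emotion, weight in LEXICON.get(word.lower(), {}).items():
        new_state[emotion] += weight
    return new_state

def blend_weights(state):
    """Normalize activations into mixing weights for the facial expressions."""
    total = sum(state.values())
    if total == 0:
        return {e: 0.0 for e in EMOTIONS}
    return {e: state[e] / total for e in EMOTIONS}

def read_packet(text, state=None):
    """Read a packet one word at a time, updating the emotional state."""
    if state is None:
        state = {e: 0.0 for e in EMOTIONS}
    for word in text.split():
        state = update_state(state, word.strip(".,!?"))
    return state

state = read_packet("I hate this rotten rotten day")
weights = blend_weights(state)
print(max(weights, key=weights.get))
```

Swapping the entries of `LEXICON` is all it would take to get the "perverted" or "repressed" variants mentioned at the end of the description.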
Items for Monday, April 30, 2007
CALLER #1 INFLATABLE BOAT COMES WITH TROLLING MOTORS, OARS, ETC $275 331-8101
CALLER #2 WASHER AND DRYER $100 331-1774
CALLER #3 FRONT AND REAR DIFF. 70S TO 80`S MODEL CHEVY AND 330 CU IN MOTOR FOR FORD 331-0087
CALLER #4 2004 CHEVY CAVALIER 761-3913
CALLER #5 79 CHEVY ONE TON DUALLY NO MOTOR SELL FOR PARTS//12HP MURRAY MOWER 837-2933
CALLER #6 WHIRLPOOL REFRIGERATOR AND PLASTIC BODY PARTS FOR KAWASASKI 331-2865
CALLER #7 2 SPOT ATV TRAILER 331-2582
CALLER #8 LOOKING FOR SCRAP IRON AND OLD CARS 837-0184
CALLER #9 1990 FORD F-350 7.3 DIESEL WITH FLAT BED WITH GOOSENECK HITCH BUILT IN 322-5172
CALLER #10 ELECTRIC BATTERY CHARGER $20 322-2238
CALLER #11 12FT ALUM. V-HAUL BOAT AND COLEMAN FORCED AIR FURNACE AND PROPANE FURNACE 331-2596
CALLER #12 LOOKING FOR LITTLE CHINA TEA CUPS 331-8904
CALLER #13 1980 CHEVY SUBURBAN NO AIR $1000 331-0446
CALLER #14 LOOKING FOR BOOKSELVES AND RABBIT CAGES 322-1867
CALLER #15 15 FT CAMPER TRAILER SLEEPS 4 TO 6 AND 36 INCH SCREEN AND EXTERIOR DOOR 331-8596
http://www.kycn-kzew.com/
http://www.radiomontana.net/fri.html
2006-07-31
2006-08-28
Rachel
JadziaDax
2006-07-31
2006-08-30
Has called several times according
to our Caller ID, at least once each
day for the past few days, I've never
picked up the call.
Didn't leave a message, must not
have been too important. And we
have all of our phone numbers on
the national DNC registry.
Pam
This call came in at 8:41pm one
night. Didn't answer.
Dobber
This is the 2nd day this number
called. Same time of day also, 6p.
EST.
2006-08-10
Ben
Taylor Nelson Sofres (TNS)
Intersearch is a market research
company.
Don't know who and didn't pick up.
2006-09-07
dd
Received called @ 7:20pm
I did not pick up because I did not
recognize the number.
Caller did not leave a message.
2006-09-07
J
Don't know who they are. Call come
in while I was on another call,
number showed on caller ID. They
left no message.
2006-09-08
2006-08-31
dd
Jen
Received called @ 5:15pm on
9/7/06.
Called twice, did not pick up.
2006-09-13
Bob
They have been calling every night
for the past week. We do not
answer. They leave no message
when the answering machine picks
up.
2006-09-14
MH
At 8:08 pm, on 9/13/06, my home
phone rang and the called ID said
"Intersearch Cor" and gave the
number 215 442 7094. I didn't pick
up and they left no message. I AM
on the "National Do Not Call List",
and I'm going to report the phone
number to those people now.
2006-08-17
Sharon
Called at 4:59pm on 17 August
2006. No idea who or what this
number belongs to. Cell Phone is
already on the National Do Not Call
Registry.
2006-08-22
Chris
Got this call on my mobile on 22
Aug 2006. It was a woman calling
on behalf of Verizon Wireless
asking for someone that I do not
know. They said they had a record
of someone calling the VZW service
center from my number on the 18th.
No one had. I'm assuming it was a
independent survey company.
2006-08-25
Otis Chance
These freak call and even if you
answer they hang up!
2006-08-31
Alias Smith
Received call from 215-442-7094.
Caller ID said it was Intersearch
Cor. Location Alabama. Time of call
was 7:57pm Central. I picked up but
no one said anything. Probably and
auto message if I would have
waited longer.
2006-08-31
Alias Smith
Was not clear...I'm in Alabama..the
caller is
(215) 442-7094 is a land line based
in Philadelphia, PA
The registered service provider is
Verizon**.
Detailed listing information is not
available.
I did not pick up because I did not
recognize the number/caller.
Caller did not leave a message.
2006-09-10
notyour bz
2006-09-23
Pamela
Called at 10:24 am PST on
September 23, 2006. Didn't answer
and didn't leave a message.
2006-09-25
The # showed up on my cell.
provider is SPRINT. on 9/10/06 at
1730pm EAStern time. no
message. Lots of scams out there
people, do not answer it!!!
KC
2006-09-11
Called at 6:02 pm PST on Sept 25,
2006
Sam
Called every few days in the past
couple of weeks. When I pick up
there is nobody there.
2006-09-13
Julie
Received call yesterday aournd
suppertime..don't pick up on
someone who I don't recognize!
Just got a call from this one. I didn't
answer, they didn't leave voice mail.
2006-09-25
james
2006-09-26
Skip69346
Registered as PHILA SUBRB, PA
on caller id: no message
Lost - Blue rope bag, Justin brand, name "Cotton
Moore" on bag, lost out of pickup between Miles City
and Rosebud on Sunday.
W - duck.
Call 853-1224.
FS - Four 235-75-17 tires, $50 each.
Call 234-4517.
FS - 12 steel fence posts, $1 each.
1973 Winnebago motor home, 21', 440 engine, runs
good, good tires, $3000 or best offer.
Call 951-1000.
FS - 1976 F250.
2005 KX 250F dirt bike.
1997 Dodge Ram 2500 Cummins pickup.
Call 853-0356 or 234-5088.
FS - Cedar chest.
Louie L'Amour books.
Washer and dryer.
Call 232-0044 or 852-0044.
FS - Antique 48" round oak table w/6 chairs, $900.
China cabinet, $400.
48" round oak table, $300.
Walnut settee and rocker, $700.
35' of scraped knotty hickory flooring, $120 or best
offer.
Call 234-1711.
FS - 1999 Isuzu Trooper, $5800.
Call 351-1721.
FS - Windows XP upgrades, $90 each.
17" monitor.
16' flatbed for truck.
Riding mowers.
Call 232-3030.
FS - 1 year old purebred black lab.
Call 853-3037 or 234-8882.
FS - Women's 3-speed bicycle, $15.
Call 234-4532.
FS - Four 235-75-17 tires, $50 each.
Call 234-4517.
Lost - Men's wedding ring in Miles City.
Call 951-2567.
FS - Bum heifer calf, black angus cross.
FS - 1994 16' boat, deep V, $7500.
GA - Male pug.
Call 778-3454.
FS - 2007 Enclosed cargo trailer, $5000.
Call 270-2370.
FS - Pickup topper, will trade, $150 or best offer.
Call 232-0692 or 951-1801.
“Iraq” on Wikipedia - spaced out by time
http://www.research.ibm.com/visual/projects/history_flow/explanation.htm
“Evolution” on Wikipedia
concept
Newsmap is an application that visually
reflects the constantly changing landscape of
the Google News news aggregator. A treemap
visualization algorithm helps display the
enormous amount of information gathered by
the aggregator. Treemaps are traditionally
space-constrained visualizations of
information. Newsmap's objective takes that
goal a step further and provides a tool to
divide information into quickly recognizable
bands which, when presented together, reveal
underlying patterns in news reporting across
cultures and within news segments in constant
change around the globe.
Newsmap does not pretend to replace the
googlenews aggregator. Its objective is to
simply demonstrate visually the relationships
between data and the unseen patterns in news
media. It is not thought to display an unbiased
view of the news; on the contrary, it is
thought to ironically accentuate the bias of it.
credits
concept, design & frontend coding:
Marcos Weskamp
backend coding
Marcos Weskamp
Dan Albritton
http://www.marumushi.com/apps/newsmap/index.cfm
Treemap
Project description
Treemap is a space-constrained visualization of hierarchical structures. It is very effective in showing attributes of leaf
nodes using size and color coding. Treemap enables users to compare nodes and sub-trees even at varying depth in the tree,
and help them spot patterns and exceptions.
Treemap was first designed by Ben Shneiderman during the 1990s. For more information, read the historical summary of
treemaps, their growing set of applications, and the many other implementations. Treemaps are a continuing topic of
research and application at the HCIL.
http://www.cs.umd.edu/hcil/treemap/index.shtml
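To make "space-constrained visualization of hierarchical structures" concrete, here is a minimal sketch of the simplest treemap layout, usually called slice-and-dice: each rectangle is split among its children in proportion to their sizes, and the split direction alternates with depth. The node format and the story counts are made up, and nothing here is meant to describe how Newsmap itself lays out its rectangles.

```python
def size(node):
    """A leaf is (name, number); an internal node is (name, [children])."""
    name, payload = node
    if isinstance(payload, list):
        return sum(size(child) for child in payload)
    return payload

def layout(node, x, y, w, h, horizontal=True, out=None):
    """Slice-and-dice layout: divide the rectangle among the children in
    proportion to their sizes, alternating the split direction at each depth.
    Returns a list of (name, x, y, width, height) for the leaves."""
    if out is None:
        out = []
    name, payload = node
    if not isinstance(payload, list):
        out.append((name, x, y, w, h))
        return out
    total = size(node)  # assumed to be positive
    offset = 0.0
    for child in payload:
        share = size(child) / total
        if horizontal:
            layout(child, x + offset * w, y, share * w, h, False, out)
        else:
            layout(child, x, y + offset * h, w, share * h, True, out)
        offset += share
    return out

# Made-up hierarchy: news sections containing stories, sized by article count.
news = ("news", [("world", [("story A", 6), ("story B", 2)]),
                 ("sports", [("story C", 4)]),
                 ("tech", [("story D", 4)])])
rects = layout(news, 0, 0, 100, 50)
for rect in rects:
    print(rect)
```

Each leaf's area is proportional to its size, so the areas of the four stories add up to the full 100 by 50 canvas.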
Documents as data
• Given a collection of documents, can you
Group them according to content?
Identify common phrases, uses of language?
Create rules to classify documents into different types?
Conduct searches to identify documents relevant to a query?
For example, the NY Times web site offers an “analysis” of the State
of the Union addresses delivered by President Bush so far
http://www.nytimes.com
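As one illustration of the search question above, documents and a query can both be reduced to weighted word counts and compared by cosine similarity. This is a sketch under stated assumptions: the tf-idf weighting (with a +1 smoothing term) is one common choice and is not named in the slides, and the three short documents are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf_vectors(docs):
    """Return one {word: weight} vector per document, plus the idf table."""
    tokenized = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(word for counts in tokenized for word in counts)
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}  # assumed smoothing variant
    vectors = [{w: c * idf[w] for w, c in counts.items()} for counts in tokenized]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def search(query, docs):
    """Return (score, document index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q = {w: c * idf.get(w, 0.0) for w, c in Counter(tokenize(query)).items()}
    return sorted(((cosine(q, vec), i) for i, vec in enumerate(vectors)), reverse=True)

# Made-up documents for illustration.
docs = ["the senate held confirmation hearings for the supreme court",
        "the president delivered the state of the union address",
        "for sale washer and dryer call after six"]
results = search("supreme court", docs)
print(results[0])
```

The same vectors could feed a clustering or classification routine, which covers the first and third questions on the slide.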
State of the Union
• As expressed in your readings, these displays also
reduce documents (in this case, the transcripts from the
President’s speeches) to words
• Here, they also introduce some semantic information and
group words dealing with “domestic affairs”
Another example...
http://www.nytimes.com
Another example
• Some time back, the Senate Judiciary Committee held
confirmation hearings to decide if Judge John Roberts
was a suitable candidate for Chief Justice of the
Supreme Court
• Let’s consider the transcripts from these sessions as
data; what might the dialog between the senators and
Judge Roberts reveal?
• Here is what the transcript for the first day of the hearings
looks like...
SPECTER: Good afternoon, ladies and gentlemen.
We begin these hearings on the confirmation of Judge John Roberts to be chief justice of the United States
with first the introduction by Judge Roberts of his beautiful family, and then a few administrative
housekeeping details before we begin the opening statements, which will be 10 minutes in length, by each
senator.
At the conclusion of the opening statements, we will then turn to the introductions by Judge Lugar, Judge
Warner -- actually, Senator Lugar, Senator Warner and Senator Bayh, and then the administration of the
oath to Judge Roberts and his opening statement.
So, Judge Roberts, if you would at this time introduce your family we would appreciate it.
ROBERTS: (OFF-MIKE) Peggy Roberts and Barbara Burke. Barbara's husband Tim Burke is also here.
My uncle, Richard Podrasky (ph). Representing the cousins, my cousin, Jeannie Podrasky (ph).
My wife, Jane is right here, front and center, with our daughter, Josephine and our son, Jack. You'll see she
has a very tight grasp on Jack.
(LAUGHTER)
SPECTER: Thank you very much, Judge Roberts.
Judge Roberts had expressed his appreciation to have the introductions early. He said the maximum time
of the children's staying power was five minutes. And that is certainly understandable.
Thank you for doing that, Judge Roberts.
And now before beginning the opening statements, let me yield to my distinguished ranking member,
Senator Leahy.
LEAHY: Well, Mr. Chairman, I want to thank you for all the consultations. I think we have had each other's
home phones on speed dial, we've talked to each other so often. And I have every confidence our chairman
will conduct a fair and thorough hearing. You know, less than a quarter of those of us currently serving in
the Senate have exercised the Senate's advice-and-consent responsibility in connection with a nomination
to be chief justice of the United States. I think only 23 senators have actually been involved in that.
Now let’s look at what was said
• As with your lab or the NY Times’ analysis of the State of
the Union address, the focus on documents often comes
down to words
• Let’s look at the frequency distribution of words used by
each participant in the hearings; what do you notice?
Number of “turns” at the mic
• For each day of the hearings, we have counted the
number of times the different players spoke
• What pattern do we see? What role does each of these
people play in the proceedings?
30 SPECTER
4 ROBERTS
2 LUGAR
2 LEAHY
2 FEINSTEIN
1 WARNER
1 SESSIONS
1 SCHUMER
1 KYL
1 KOHL
1 KENNEDY
1 HATCH
1 GRASSLEY
1 GRAHAM
1 FEINGOLD
1 DURBIN
1 DEWINE
1 CORNYN
1 COBURN
1 BROWNBACK
1 BIDEN
1 BAYH
423 ROBERTS
84 SPECTER
47 BIDEN
44 GRAHAM
40 SESSIONS
34 SCHUMER
34 FEINGOLD
33 LEAHY
33 KENNEDY
32 FEINSTEIN
27 KOHL
25 DURBIN
21 DEWINE
20 HATCH
19 CORNYN
17 KYL
14 GRASSLEY
392 ROBERTS
78 SPECTER
77 SCHUMER
47 FEINSTEIN
34 FEINGOLD
34 BROWNBACK
34 BIDEN
28 LEAHY
27 GRAHAM
25 KOHL
23 COBURN
22 KENNEDY
21 CORNYN
18 SESSIONS
17 DURBIN
15 GRASSLEY
10 HATCH
9 KYL
6 DEWINE
90 ROBERTS
31 SPECTER
24 LEAHY
21 FEINSTEIN
19 FEINGOLD
17 KENNEDY
12 SCHUMER
6 DURBIN
4 CORNYN
3 SESSIONS
2 GRAHAM
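A minimal sketch of how turn counts like these could be produced. It assumes that a turn begins with an all-caps surname followed by a colon, as in the "ROBERTS:" and "LEAHY:" lines of the excerpt; the function name `count_turns` and the shortened sample lines are illustrative only, not the procedure actually used for the slides.

```python
import re
from collections import Counter

# Assumption: a turn starts with an all-caps surname followed by a colon,
# as in the "ROBERTS:" and "LEAHY:" lines of the transcript excerpt.
SPEAKER = re.compile(r"^([A-Z]+):")

def count_turns(lines):
    """Count how many turns each speaker takes in a list of transcript lines."""
    turns = Counter()
    for line in lines:
        match = SPEAKER.match(line)
        if match:
            turns[match.group(1)] += 1
    return turns

# Illustrative lines, shortened from the excerpt; speaker labels are
# added here only to exercise the pattern.
sample = [
    "SPECTER: So, Judge Roberts, if you would introduce your family.",
    "ROBERTS: (OFF-MIKE) Peggy Roberts and Barbara Burke.",
    "(LAUGHTER)",
    "SPECTER: Thank you very much, Judge Roberts.",
    "LEAHY: Well, Mr. Chairman, I want to thank you.",
]

turns = count_turns(sample)
for speaker, n in turns.most_common():
    print(n, speaker)
```

Lines without a speaker label, such as "(LAUGHTER)" or the continuation of a long answer, are treated as part of the current turn and are not counted.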
Biden       Kennedy     Feinstein   Hatch       Grassley    Graham      Roberts
404 the     627 the     369 the     298 the     298 the     391 the     5268 the
264 to      283 to      227 to      170 to      159 to      235 to      2694 that
225 you     272 of      204 of      167 that    143 of      205 that    2126 to
206 and     271 and     201 you     146 and     129 you     180 and     2086 of
195 that    267 that    184 and     140 of      118 that    170 you     2045 and
179 a       198 in      172 that    117 a       116 and     163 of      1754 i
165 i       145 you     154 a       108 you     99 in       137 a       1469 in
141 of      135 a       141 in      107 in      84 a        127 i       1467 a
138 in      89 we       127 i       105 i       60 i        107 in      954 was
99 it       80 was      72 this     62 is       55 on       91 is       918 it
92 is       75 i        69 is       59 as       48 be       68 it       918 is
73 not      74 it       63 it       46 not      44 your     60 on       762 you
71 this     73 is       55 for      43 it       43 court    60 be       672 not
68 as       68 have     51 on       43 have     41 is       59 not      644 court
64 said     67 on       47 be       42 this     37 not      56 we       591 on
...
...
...
...
...
...
...
1 1982
1 1977
1 1971
1 1967
1 1965
1 1937
1 1925
1 1900s
1 1896
1 1873
1 1819
1 12
1 11th
1 10th
1 10-year-olds
1 5th
1 5
1 35
1 333-85
1 22
1 2001
1 1991
1 1988
1 1984
1 1981
1 1980
1 1965
1 1954
1 1950s
1 17
1 1960
1 193
1 1920
1 1918
1 1915
1 1913
1 1876
1 1846
1 1839
1 16-year
1 12
1 11th
1 100
1 1,00
1 1
1 8
1 78
1 69-11
1 29
1 2001
1 1997
1 1994
1 1987
1 1980s
1 1967
1 1944
1 1922
1 1792
1 11th
1 100
1 8
1 78
1 58
1 37.30.a
1 37.29.c
1 3
1 24
1 200
1 1st
1 1986
1 1982
1 1962
1 1925
1 15
1 11th
1 98
1 94
1 9-0
1 8/30/05
1 50/50
1 5-4
1 49
1 30s
1 2000
1 200
1 185
1 18
1 150
1 10
1 1
1 22
1 218-year
1 21
1 200-year
1 20-some
1 20-plus
1 1983
1 1939
1 1787
1 15th
1 150
1 14th
1 12
1 100,000
1 10
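Listings like these can be sketched in a few lines. The tokenization rule below (lowercase runs of letters and digits, optionally joined by apostrophes or hyphens) is an assumption; the slides do not say how the text was actually split into words. The tie order (reverse alphabetical within a count) simply mimics what the listings appear to show.

```python
import re
from collections import Counter

# Assumption: a "word" is a lowercase run of letters or digits, optionally
# joined by apostrophes or hyphens ("i'm", "16-year"). The real tokenizer
# behind the slides is not documented.
TOKEN = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")

def word_counts(text):
    """Return a Counter mapping each word in the text to its frequency."""
    return Counter(TOKEN.findall(text.lower()))

def ranked(counts):
    """Sort by count, highest first; ties in reverse alphabetical order."""
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)

# Made-up sentence, used only to exercise the functions.
text = "I think the court was right, and I think the court knew it."
counts = word_counts(text)
for word, n in ranked(counts):
    print(n, word)
```

Running `word_counts` separately on the text of each speaker gives one frequency distribution per participant, which is what the columns above compare.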
627 the
283 to
272 of
271 and
267 that
198 in
145 you
135 a
89 we
80 was
75 i
74 it
73 is
68 have
67 on
62 our
57 this
51 not
50 be
50 as
47 rights
45 for
43 court
38 by
38 are
37 with
36 at
35 but
35 all
35 act
34 about
33 your
32 case
31 voting
30 or
29 had
28 they
27 think
26 do
25 time
24 were
24 so
24 civil
23 what
23 an
22 today
22 law
22 country
21 i'm
21 equal
21 believe
21 been
20 whether
20 well
20 there
20 right
20 out
20 legislation
20 judge
20 if
20 because
19 will
19 their
19 test
19 roberts
19 people
19 justice
19 issue
18 would
18 up
18 know
18 house
18 from
18 discrimination
17 who
17 its
17 going
16 supreme
16 senate
16 effects
16 action
15 other
15 many
15 laws
15 federal
15 education
15 decision
15 american
14 which
14 want
14 then
14 society
14 public
14 most
14 more
14 important
14 has
14 did
14 affirmative
13 could
13 after
12 you're
12 these
12 national
12 like
12 congress
12 chairman
12 any
11 very
11 too
11 those
11 should
11 said
11 quote
11 passed
11 over
11 now
11 mr
11 let
11 just
11 his
11 constitutional
11 come
11 bill
11 also
10 us
10 record
10 question
10 progress
10 power
10 opportunity
10 one
10 nation
10 must
10 make
10 legal
10 impact
10 every
10 don't
10 before
10 americans
10 administration
9 year
9 view
9 under
9 suspect
9 program
9 position
9 here
9 great
9 government
9 good
9 go
9 extraordinary
9 brown
9 basis
8 women
8 where
8 when
8 v
8 thank
8 section
8 race
8 president
8 no
8 my
8 memoranda
8 me
8 made
8 land
8 it's
8 included
8 housing
8 he
8 give
8 find
8 discriminate
8 denied
8 constitutionally
8 committee
8 citizens
8 can
8 back
8 ask
7 zimmer
7 years
7 whole
7 we've
7 university
7 through
7 them
7 than
7 students
7 still
7 some
7 signed
7 racial
7 only
7 new
7 lives
Skimming the fat
Vector space representation
• Spatial proximity implies semantic proximity; documents (and in
IR, queries) are represented as vectors in a high-dimensional
space, each dimension corresponding to a different word in the
collection
• Semantically related words are “close” in this space; we often
focus not on magnitude but just on the angle between vectors,
which leads to the familiar cosine distance
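To make the angle idea concrete, here is a small sketch that treats each document as a sparse vector of word counts (a dictionary) and computes the cosine of the angle between two such vectors. Taking the distance to be one minus the cosine similarity is one common convention, assumed here; the toy documents are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors stored as dicts."""
    dot = sum(a.get(w, 0) * b.get(w, 0) for w in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """One common convention: distance = 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

# Toy documents: doc2 uses the same words as doc1 in the same proportions,
# only twice as often; doc3 shares no words with doc1.
doc1 = {"court": 2, "rights": 1}
doc2 = {"court": 4, "rights": 2}
doc3 = {"voting": 3}

print(cosine_distance(doc1, doc2))
print(cosine_distance(doc1, doc3))
```

Because only the angle matters, a long speech and a short one with the same mix of words end up at (nearly) zero distance, while documents with no words in common are at distance one.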
Kennedy’s top 200 words
• A lot of the words in our lists don’t contain much
content; that is, they don’t help us understand what
topics were being discussed
• Articles like “the” and “a” or conjunctions like “but” and
“or” are important grammatically, but are often not
helpful when comparing two speeches
• We put words like these in a “stop list” and remove
them
a
an
as
at
by
he
his
or
thou
us
who
against
amid
amidst
among
amongst
and
anybody
anyone
because
beside
despite
during
everybody
everyone
for
from
her
hers
herself
him
himself
hisself
if
into
it
its
itself
myself
nor
of
oneself
onto
our
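Applying a stop list amounts to filtering the word counts. The sketch below uses a handful of entries from the stop list excerpt above and a few counts taken from Kennedy's listing; the name `remove_stop_words` is hypothetical, and a real stop list would be much longer.

```python
# A few entries from the stop list excerpt; the full list is longer.
STOP_WORDS = {
    "a", "an", "as", "at", "by", "he", "his", "or", "us", "who",
    "and", "because", "for", "from", "if", "it", "its", "of", "our",
}

def remove_stop_words(counts, stop_words=STOP_WORDS):
    """Keep only the words that are not on the stop list."""
    return {word: n for word, n in counts.items() if word not in stop_words}

# A few counts copied from Kennedy's listing.
kennedy = {
    "of": 272, "and": 271, "a": 135, "it": 74, "as": 50,
    "rights": 47, "court": 43, "by": 38, "voting": 31, "civil": 24,
}

content = remove_stop_words(kennedy)
for word, n in sorted(content.items(), key=lambda kv: kv[1], reverse=True):
    print(n, word)
```

What survives ("rights", "court", "voting", "civil") says far more about the topics a speaker raised than the function words that dominate the raw counts.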
Where we’re headed
• Once the ignorable words are removed, we can compute
“distances” between the senators based on the frequency
distribution of the words that remain
• Senators using the same language will be close to each
other and those using different language will be far apart
• We can then use these distances to create a “map” of the
senators; for example, multi-dimensional scaling lays out
points in the plane so that their interpoint distances best
match the word-usage distances
• Here’s what you get...
[Figure: multi-dimensional scaling map of the senators by word usage. Axes: dimension 1 (ticks from -0.15 to 0.15) and dimension 2 (ticks from -0.15 to 0.10). Labeled points: KYL, HATCH, CORNYN, GRASSLEY, SESSIONS, KOHL, COBURN, BROWNBACK, GRAHAM, DEWINE, SCHUMER, FEINGOLD, BIDEN, DURBIN, LEAHY, FEINSTEIN, KENNEDY.]
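For readers who want to see how such a map can be computed, here is a sketch of classical (Torgerson) multi-dimensional scaling in plain Python. This is one variant of MDS, chosen for illustration; the slides do not say which variant produced the map. The eigenvectors are found by simple power iteration, which assumes the distances are close to Euclidean, and the four-point distance matrix is a toy example, not the senators' data.

```python
import math
import random

def double_center(dist):
    """Convert a symmetric distance matrix D into B = -1/2 * J * D^2 * J."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]  # equals column means (symmetry)
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

def top_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration: dominant eigenvalue and eigenvector (symmetric matrix)."""
    n = len(matrix)
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v  # zero matrix: nothing left to extract
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * mv[i] for i in range(n)), v

def classical_mds(dist, dims=2):
    """Place points in `dims` dimensions so distances approximate `dist`."""
    b = double_center(dist)
    n = len(b)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        eigenvalue, v = top_eigenpair(b)
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate: remove the component just found before the next pass.
        b = [[b[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    return coords

# Toy input: distances between the corners of a 4-by-2 rectangle.
diag = math.sqrt(20.0)
dist = [
    [0.0, 4.0, diag, 2.0],
    [4.0, 0.0, 2.0, diag],
    [diag, 2.0, 0.0, 4.0],
    [2.0, diag, 4.0, 0.0],
]
layout = classical_mds(dist)
for point in layout:
    print(point)
```

With real data, `dist` would hold the pairwise word-usage distances between senators. The recovered layout is only determined up to rotation and reflection, so the axes ("dimension 1", "dimension 2") have no built-in meaning.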
Documents as data
• There are plenty of examples in which one wants to identify
important structures shared by groups of documents
Web search
FAA Pilot Narratives
(Part of an Incident Report)
Leaving gate 3 at HOU Ground Control told us to taxi
to Runway 12R even though at push back we were
told to expect Runway 4. He evidently told us to
proceed via Taxiway Y,H,M. I missed these
instructions. I was distracted by the thick fog (800
RVR) and the change to Runway 12R. I was caught
up in reviewing the takeoff mins for 12R versus
Runway 4. As I started around the terminal to join
Taxiway Y Ground advised us to watch for a Cessna
Conquest coming the opposite direction on Taxiway
B off of Runway 4. The Conquest came into sight in
the fog and the controller told him to follow us on
Yankee Taxiway. Passing the Conquest I again
became focused on the fog, Runway 12R takeoff
mins and the change of runways...