Today - UCLA Statistics
Fifth meeting: Text as data

Today
• We are going to consider the statistical properties of text; put simply, the use of text as data
• Our readings brought up basic “laws” of text; somewhat stable patterns in word frequencies, for example
• We also read about mathematical models for text and how they were applied to text compression and authorship attribution
• In the slides that follow, we’ll consider information visualizations (and here the line between info viz and media art is at its blurriest) and then have a discussion about the readings
• To frame the discussion a little...

Triggerhappy, Thomson & Craighead, 1998

“Triggerhappy” is a simple re-working of the classic arcade game, “Space Invaders.” Rather than defending against wave after wave of pixelated aliens, players must shoot up a series of text extracts taken from Michel Foucault’s essay, “What is an Author?” The game has nine levels, each with its own soundtrack taken from anonymous shortwave radio broadcasts sometimes referred to as Numbers Stations.

"In the web environment, as in that of Trigger Happy, the reader's focus on text seems constantly and thoroughly aborted, perpetually distracted by the prospect of more specialised, more scintillating, more apropos information. Thus, in the midst of this play on hits and clicks, Trigger Happy is gesturing towards the basis of a future information economy, where attention, precisely because of its scarcity, may become a central commodity."
Jamie King, IF/THEN Published by The Netherlands Design Institute grooves heng khanikaaa eyebay helloho stopping fnork surprising fydaan docs liggy indicates musicman originally elroy limp nightmare yoru eer peein emm zebra libs onesweetangel snuggle offense hadd curte worm girlieee barbie coast wage naj bitt naj bitt christianne challenge kusura polpot, planting anlmyo netwerk flavor vodkas tiamaria stark straw 14w efforts nitroglycerin beadyeyed religiontm hugggss chitter unsuccessful exwife givin melting urinals probation whiz imbeciles sitto sascham_152 ismple crazyyet nelbula titles lettuce fruit tink fc5 vania_7800 winded goood d2ydx balloon cupids eey asylum 2131 posionous ausser wagon hellz hte pwns sau mtu avoiding rooted ginger gist imprison unauthorized mujer tek mjwell, cooties lameness everychance lindsi heidi everybudy 000 bikdik skipped alor mounted lo abe360 plbbbt aller adys relatives charged membrane springs me2 dumbass keinen derail births rocketman? unisex fizzy koolchick futt struggling ought 4eva!! bakym november bir sonic 18000 manipulating boarding scribbled gomes hyperness ta77an lionin fella nvm eveninng _514 fetish 4010 bipolar... wuzup schonweg drank chickies 8021q stgirl stealth 6mo staring espnola jamale accomplish oki tardis expectations titles lettuce tzu tink fc5 interupt winded goood somebidy christ balloon cupids slang asylum 2131 shelf ausser wagon hillbillies hte pwns usawx i am i'm ok except mtu i have to go to class on memorial riverrr day avoiding I am 47 i am the light heavy wheight champion of the world ginger gist cocoa i'm man I'm unableunauthorized to begin to makemujer sense of your reply islamists i'm cool i'm not 'buff' or anything,cooties but i'm doing okay. mjwell, shakalka I'm tiny. I'm perfect, but I don't have bad intentions. 
everychance lindsi giess i'm gay too i am 28 i am hot i am male looking for chat everybudy 000 forgettable i'm at work i'm taking a morealor quiet stance lately skipped shabby i am a angel I'm happy as a pig in sunshine Nakie lo. abe360 undercover aller adys clothings I am 18 years old I am in my boxer briefs and a shirt charged pood i am from Maryland I am glad ofmembrane his and also proud. me2 asks i am working there I'm gonnadumbass go back to bed soon derail births aufeinmal i am ok emma honest I am being serenaded by mail unisex fizzy I am too, bi I mean i'm doing php fulltime theoretical now futt struggling mid1980s I am the anti-christ. i m from illinois melissa 4eva!! bakym railroad I am a bit slow though I am reporting you fool bir sonic australian I am a 35 year veteran i'm a girl, you stupid manipulating boarding downhypo?? I'm horny all the time I am in St. Catherines gomes hyperness geeeze I'm from ancient Babylon... I'm ok what bout u??? lionin fella marooon i am getting worse as a talker i am a capricorn eveninng _514 sumfin i'm just repeatin the hearsay. I m not yelling 4010 bipolar... smellyrepulsive,and I am happy with my 512 at home i'm off too bed schonweg drank zinc I am 35/male from Sweden, and you? I M FROM TURKEY 8021q stgirl orny I am curious if anyone else6mo saw it staringi m from india regulations I'm good too, just struggling a bit I'm a teacher jamale accomplish lieben i'm too much young to have tardis children I'm tepid. expectations sonny I am a Christian, I love homosexuals. i'm alone lettuce tzu germnay fc5chat with interupt i am looking for some men to i am mitch not goood i'm happy to have you on top MightyMidget somebidy i amroomany 14 cupids slang forum I'm quite tempted to buy their piano back i'm 19 2131 shelf volunteers I'm jealous. He got it on with Eleanor Mondale. i am hillbillies flagstaff I'm seeing a pattern here. wagon It's called inflation. 
i m pwns usawx 514 avoiding riverrr transfer gist cocoa kelly22 mujer islamists arghhhhhhh cooties shakalka completeing lindsi giess nightrider nelbula ianne fruit challenge vania_7800 kusura d2ydx polpot, eey planting posionous anlmyo hellz netwerk sau flavor rooted vodkas imprison tiamaria tek stark lameness straw heidi 14w bikdik efforts mounted nitroglycerin plbbbt beadyeyed relatives religiontm) springs hugggss keinen chitter rocketman? unsuccessful koolchick exwife ought

1. Words or, rather, databases of words

A small point
• This is just one of an untold number of examples in which text is escaping the confines of the page; our urban spaces are rich with text, with words
• In computer-mediated settings, we have more and more opportunities to contribute text; even our actions are described in text
• The next few slides are a biased collection of works; there are some noticeable omissions (Holzer, Acconci) that I can provide references for after class or over coffee

http://www.wordcount.org/main.php
http://www.wordcount.org/querycount.php

WordCount Conspiracy. What follows are WordCount sequences that people have emailed me, discovered in the data rankings.
Conspiracists unite!

7964-7967 homosexual loses papal schooling
8562-8565 conspicuous brutal snake tomatoes
992-995 america ensure oil opportunity
53425-53426 backstreet leotards
30523-30525 despotism clinching internet
4304-4307 microsoft aquire salary tremendous
17244-17246 neon porn convict
5283-5285 angel seeks supper
3046-3051 iraq winner, fucking smooth, nick votes (GWB's election strategy?)
78963-78964 toucan tonsillectomy
4136-4139 temple plot establishing courage
3474-3476 apple formula: imagination
23134-23138 manipulative fruity adolf waived munitions (WW2 Story?)
17032-17037 unwitting fashion crimes in glasgow: hushed caledonian jock embraces innocently polyester
1443-1445 conservative reduce vote
1941-1945 faith establish facts requires membership
2629-2634 bush admit specifically agents smell denied
30591-30594 halloween pastimes rebuffed tranquillizers
30613-30615 wealthiest redefine stalwarts
20652-20654 angelica howled orgasm
38599-38606 Hijackers underpaid, incurs ministration oakes legato, jeopardized NYSE
9515-9523 sexy stalin thee lethal limb registrar manages monuments indoors
1224-1226 environmental damage proposed
728-729 Cheney's master plan.. Dark talking... thinking success
6456-6459 Problems for the Catholic Church in the Boston.. Legally, priests lacked financing
13915-13918 The future of reproduction? Defiant clone stung coupling
372-405 Hidden message from God on the role of women??...
20414-20416 brando lbs predominate
19643-19646 surfing martyrs tearful stockbrokers
16047-16048 arafat unhealthy
1088-1090 2629-2634 1941-1945 992-996 12608-12610 4670-4673 1442-1445

President Bush's recent assertion that North Korea, Iraq and Iran form an "Axis of Evil"[2] was more than a calculated political act -- it was also an imaginatively formal, geometric one, which had the effect of erecting a monumental, virtual, globe-spanning triangle. Axis is an online tool intended to broaden opportunities for similar kinds of Axis creation. It allows its participant to connect any three points in space [countries] into a new Axis of his or her own design. With the help of multidimensional statistical metrics culled from international public databases[3], the commonalities amongst the user's choices are revealed. In this manner, Axis presents an inversion of Bush's praxis, obtaining lexicopolitical meaning from the formal act of spatial selection.

AxisApplet, Golan Levin, 2002

"The Baby Name Wizard", Martin Wattenberg

The authors conducted an exhaustive empirical study, with the aid of custom software, public search engines and powerful statistical techniques, in order to determine the relative popularity of every integer between 0 and one million. The resulting information exhibits an extraordinary variety of patterns which reflect and refract our culture, our minds, and our bodies... For example, certain numbers, such as 212, 486, 911, 1040, 1492, 1776, 68040, or 90210, occur more frequently than their neighbors because they are used to denominate the phone numbers, tax forms, computer chips, famous dates, or television programs that figure prominently in our culture.
Regular periodicities in the data, located at multiples and powers of ten, mirror our cognitive preference for round numbers in our biologically-driven base-10 numbering system. Certain numbers, such as 12345 or 8888, appear to be more popular simply because they are easier to remember. Golan Levin et al, The Secret Lives of Numbers, 2002 The Google AdWords Happening Christophe Bruno, April 2002 How to lose money with your art ? At the beginning of April, a debate took place on rhizome.org mailing list, about how to earn money with net art. It suggested to me an answer to an easier problem : how to spend money with my art (if you understand everything on how to spend money, you should in principle understand also how to earn money, because of conservation laws...) I decided to launch a happening on the web, consisting in a poetry advertisement campaign on Google AdWords . I opened an account for $5 and began to buy some keywords. For each keyword you can write a little ad and, instead of the usual ad, I decided to write little "poems", non-sensical or funny or a bit provocative. I began with the keyword "symptom". The first ad I wrote was : Words aren't free anymore bicornuate-bicervical uterus one-eyed hemi-vagina www.unbehagen.com As soon as the campaign was launched, I was able to see the results. Every time somebody was looking for the word "symptom" in Google, they could see my ad in the top right corner of the page. http://www.iterature.com/adwords/ During the fourth campaign, I kept receiving these emails from Google : " We believe that the content of your ad does not accurately reflect the content of your website. We suggest that you edit your ad text to precisely indicate the nature of the products you offer. This will help to create a more effective campaign and to increase your conversion rate. 
We also recommend that you insert your specific keywords into the first line of your ad, as this tends to attract viewers to your website"

Then I got a last email :

"Hello. I am the automated performance monitor for Google AdWords Select. My job is to keep average clickthrough rates at a high level, so that users can consistently count on AdWords ads to help them find products and services. The last 1,000 ad impressions I served to your campaign(s) received fewer than five clicks. When I see results like this, I significantly reduce the rate at which I show the ads so you can make changes to improve performance. ( ... ) Sincerely, The Google AdWords Automated Performance Monitor"

The price of words : towards a generalized semantic capitalism

One of the most interesting fact is that we have reached a situation in which any word of any language has its price, fluctuating according to the laws of the market. Words already had some kind of exchange value, but we hardly realized it : if I insult somebody, I will get something in return, such as a punch in my face for instance. But now there is no doubt anymore. The word "sex" is worth $3,837, the word "art" $410, "net art" is only $0.05 (prices on the 11 of April 2002). And the most expensive word is "free"! Prices are determined according to the number of search requests and an average Cost-Per-Click.

At first sight, there may be something healthy in the fact that words may have a price. If you know you have to pay, you are more careful when you have something to say. And if you see that every person who clicks on your link, makes you lose 0.05$ (as it is the case in the Google system), you think twice before writing your sentence. But of course there is another side to this story and I have the intuition that this could be a big event in the history of mankind. There aren't many events like that : the invention of writing is one of them for instance.

Right now, we may not realize the importance of this fact because the web is not such a big part of our existence. But imagine the day when a search engine will rule the whole textual content of the web, in which the memory of mankind will be stored. Think of the power in their hands.

2. Text, or rather texts as collections of words

TextArc, Brad Paley, 2002

[Figure: log freq plotted against log rank for the words of the transcript; 48016 words total]

Word frequencies

word        count
the         2487
to          1602
i           1408
that        1320
of          1267
and         1232
you         913
a           823
in          764
was         500
gonzales    490
is          476
this        452
have        433
it          400
with        396
senator     395
about       353
not         351
as          335
be          332
on          314
we          305
attorney    299
what        297
for         294
mr          280
but         279
us          262
or          243
would       235
were        227
think       225
there       218

• Tokens v. types
• Zipf distributions: the frequency f of a word and its position in the frequency-ordered list (rank) r are related by f proportional to 1/r
• Upshot: A few common words, a middling number of medium-frequency words and many low-frequency words
• Frequency of frequencies; at the right we have computed the frequency of each frequency as well as a running total
• This is the standard Zipf structure...

frequency   number of words   running total
1           1487              0.428
2           506               0.574
3           284               0.655
4           183               0.708
5           117               0.742
6           97                0.769
7           69                0.789
8           65                0.808
9           60                0.825
10          44                0.838
11          42                0.850
12          33                0.860
13          28                0.868
14          24                0.875
15          27                0.882
16          23                0.889
17          19                0.894
18          17                0.899
19          14                0.903
20          12                0.907
21          11                0.910
22          15                0.914
23          6                 0.916
24          11                0.919
25          8                 0.921
26          11                0.925
27          9                 0.927
28          5                 0.929
29          7                 0.931
30          2                 0.931
31          3                 0.932
32          8                 0.934
33          7                 0.936
34          10                0.939
35          4                 0.940
36          5                 0.942
37          3                 0.943
38          4                 0.944
39          3                 0.945
40          5                 0.946
41          3                 0.947

Zipf (take 2)
• We can also consider the distance between word pairs; distances again follow a power law
• Compare this finding to the kind of layout that Brad Paley used for TextArc

3. Collections of texts

Adding structure
• A collocation is an expression of two or more words that corresponds to some conventional way of saying something (bigrams, trigrams)
• At the right we have frequent bigrams; often we employ some kind of filtering to clean these up and highlight contentful expressions; what would you suggest?
• More structure can be examined via regular expressions and NLP software...

count   bigram
353     of the
191     in the
190     i think
179     attorney general
177     gonzales senator
157     the department
150     and i
145     us attorneys
142     to the
141     i dont
123     united states
117     to be
102     senator i
101     at the
99      that i
95      with the
88      that you
87      of justice
84      thank you
84      and the
81      it was
80      what i
79      department of
78      white house
75      that the
74      want to
71      the president
71      have been
69      on the
68      i have
68      going to
65      the attorney
64      with respect
64      us attorney
63      this is
62      respect to
62      gonzales i
61      would be
61      to me
61      for the

Trends
• Jon Kleinberg at Cornell has developed models to express the “burstiness” in text streams
• He applies these tools to his own (email) Inbox and identifies bursts of activity around different topics
• Rather than respond to the data that’s there, we can also filter...
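The word-frequency, frequency-of-frequencies and bigram counts from the preceding slides can be sketched in a few lines of Python. This is only a minimal illustration, not the code used to prepare the slides: the function names, the regular-expression tokenizer and the toy sentence are all assumptions made for the example. If f is proportional to 1/r, then log f = log C - log r, so a least-squares line through the (log rank, log count) points should have a slope near -1 on a large corpus; `zipf_slope` estimates that slope.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent type first."""
    ranked = Counter(tokens).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]


def frequency_of_frequencies(tokens):
    """Return (f, number of types occurring f times, running fraction of all types)."""
    type_counts = Counter(tokens)
    fof = Counter(type_counts.values())
    rows, running = [], 0
    for f in sorted(fof):
        running += fof[f]
        rows.append((f, fof[f], running / len(type_counts)))
    return rows


def bigram_counts(tokens):
    """Count adjacent word pairs (no filtering of function words)."""
    return Counter(zip(tokens, tokens[1:]))


def zipf_slope(tokens):
    """Least-squares slope of log(count) against log(rank); needs at least two types."""
    points = [(math.log(r), math.log(c)) for r, _, c in rank_frequency(tokens)]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Toy input, far too small to show a real Zipf pattern; swap in a full transcript.
SAMPLE = "the cat and the dog and the bird"
sample_tokens = tokenize(SAMPLE)
print(len(sample_tokens), "tokens,", len(set(sample_tokens)), "types")
print(rank_frequency(sample_tokens)[:3])
print(frequency_of_frequencies(sample_tokens))
print(bigram_counts(sample_tokens).most_common(2))
print(zipf_slope(sample_tokens))
```

One way to act on the filtering question raised above would be to drop bigrams in which both words come from a short list of function words before ranking; that step is left out here to keep the sketch short.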
JJ (Network Surveillance Tool / Empathic Data Visualization) A Carnivore Client by Golan Levin, May 2002 Synopsis: The Radical Software Group's Carnivore project is a surveillance tool that listens to the Internet traffic (email, web surfing, etc.) on a given local network, and serves this datastream over the net to a variety of interfaces called "clients." These clients -- created by a number of computational artists and designers from around the world -- are each designed to animate, diagnose, or interpret the network traffic in various ways. JJ is one such client: a software agent which uses facial expressions to visualize the emotional content of network traffic. While many visualizations rely on charts or graphs to convey numeric data, other visualization research has leveraged certain affordances of human cognition in order to represent information in a more qualitatively readable way. One important example of this is the work of Hermann Chernoff, who pioneered the use of cartoon faces as a tool for portraying high-dimensional multivariate data. Chernoff's research demonstrated that our intuitive and highly sensitive ability to interpret facial expressions could be incorporated into unusually legible visualizations of complex information. JJ is an autonomous software agent who displays facial expressions appropriate to the emotional content of the words that are presented to him. Implemented as a Carnivore Client, JJ literally "puts a face" on the information transmitted through his host network, in order to provide a data visualization of the network's "emotional content." 
JJ operates according to a mapping established between two well-known psychological databases: (A) Ekman and Friesen's set of "universal facial expressions" — the set of face photographs which have been shown to embody basic cross-cultural human emotions (namely: anger, fear, surprise, disgust, sadness and pleasure) — and (B) the Linguistic Inquiry and Word Count (LIWC) dictionary by Pennebaker, Francis, & Booth, which categorizes the "emotional associations" of several thousand common English words, and provides an efficient and effective method for evaluating the various affective components present in verbal and written speech samples. JJ scans his host network for text packets, reading each packet one word at a time. When JJ finds a word that matches a term in the LIWC dictionary, his emotional state (represented as an array of affective activation levels) is updated in response to that word's emotional associations. JJ then displays a (morphed) mixture of facial expressions, weighted according to the current intensity of his different emotions. Considered cumulatively, JJ's expressions reflect the overall "mood" of his information environment in an extremely simple, yet direct and unmistakeable way. At present, JJ's emotional responses conform to those of the statistical "everyman": for example, if JJ sees a word commonly associated with disgust, then he will present a "disgust" face. An alternate version of JJ could permit his user to modify these associations, and thus modify JJ's apparent personality (so, for example, a "perverted" JJ might appear happy when he hears a disgusting word, while a "repressed" JJ might appear angry). Items for Monday, April 30, 2007 CALLER #1 INFLATABLE BOAT COMES WITH TROLLING MOTORS, OARS, ETC $275 331-8101 CALLER #2 WASHER AND DRYER $100 331-1774 CALLER #3 FRONT AND REAR DIFF. 
70S TO 80`S MODEL CHEVY AND 330 CU IN MOTOR FOR FORD 331-0087 CALLER #4 2004 CHEVY CAVALIER 761-3913 CALLER #5 79 CHEVY ONE TON DUALLY NO MOTOR SELL FOR PARTS//12HP MURRAY MOWER 837-2933 CALLER #6 WHIRLPOOL REFRIGERATOR AND PLASTIC BODY PARTS FOR KAWASASKI 331-2865 CALLER #7 2 SPOT ATV TRAILER 331-2582 CALLER #8 LOOKING FOR SCRAP IRON AND OLD CARS 837-0184 CALLER #9 1990 FORD F-350 7.3 DIESEL WITH FLAT BED WITH GOOSENECK HITCH BUILT IN 322-5172 CALLER #10 ELECTRIC BATTERY CHARGER $20 322-2238 CALLER #11 12FT ALUM. V-HAUL BOAT AND COLEMAN FORCED AIR FURNACE AND PROPANE FURNACE 331-2596 CALLER #12 LOOKING FOR LITTLE CHINA TEA CUPS 331-8904 CALLER #13 1980 CHEVY SUBURBAN NO AIR $1000 331-0446 CALLER #14 LOOKING FOR BOOKSELVES AND RABBIT CAGES 322-1867 CALLER #15 15 FT CAMPER TRAILER SLEEPS 4 TO 6 AND 36 INCH SCREEN AND EXTERIOR DOOR 331-8596 http://www.kycn-kzew.com/!!http://www.radiomontana.net/fri.html 2006-07-31 2006-08-28 Rachel JadziaDax 2006-07-31 2006-08-30 Has called several times according Didn't leave a message, must not to our Caller ID, at least once each have been too important. And we day for the past few days, I've never have all of our phone numbers on the national DNC registry. picked up the call. Pam This call came in at 8:41pm one night. Didn't answer. Dobber This is the 2nd day this number called. Same time of day also, 6p. EST. 2006-08-10 Ben Taylor Nelson Sofres (TNS) Intersearch is a market research company. Don't know who and didn't pick up. 2006-09-07 dd Received called @ 7:20pm I did not pick up because I did not recognize the number. Caller did not leave a message. 2006-09-07 J Don't know who they are. Call come in while I was on another call, number showed on caller ID. They left no message. 2006-09-08 2006-08-31 dd Jen Received called @ 5:15pm on 9/7/06. Called twice, did not pick up. 2006-09-13 Bob They have been calling every night for the past week. We do not answer. They leave no message when the answering machine picks up. 
2006-09-14 MH At 8:08 pm, on 9/13/06, my home phone rang and the called ID said "Intersearch Cor" and gave the number 215 442 7094. I didn't pick up and they left no message. I AM on the "National Do Not Call List", and I'm going to report the phone number to those people now. 2006-08-17 Sharon Called at 4:59pm on 17 August 2006. No idea who or what this number belongs to. Cell Phone is already on the National Do Not Call Registry. 2006-08-22 Chris Got this call on my mobile on 22 Aug 2006. It was a woman calling on behalf of Verizon Wireless asking for someone that I do not know. They said they had a record of someone calling the VZW service center from my number on the 18th. No one had. I'm assuming it was a independent survey company. 2006-08-25 Otis Chance These freak call and even if you answer they hang up! 2006-08-31 Alias Smith Received call from 215-442-7094. Caller ID said it was Intersearch Cor. Location Alabama. Time of call was 7:57pm Central. I picked up but no one said anything. Probably and auto message if I would have waited longer. 2006-08-31 Alias Smith Was not clear...I'm in Alabama..the caller is (215) 442-7094 is a land line based in Philadelphia, PA The registered service provider is Verizon**. Detailed listing information is not available. I did not pick up because I did not recognize the number/caller. Caller did not leave a message. 2006-09-10 notyour bz 2006-09-23 Pamela Called at 10:24 am PST on September 23, 2006. Didn't answer and didn't leave a message. 2006-09-25 The # showed up on my cell. provider is SPRINT. on 9/10/06 at 1730pm EAStern time. no message. Lots of scams out there people, do not answer it!!! KC 2006-09-11 Called at 6:02 pm PST on Sept 25, 2006 Sam Called every few days in the past couple of weeks. When I pick up there is nobody there. 2006-09-13 Julie Received call yesterday aournd suppertime..don't pick up on someone who I don't recognize! Just got a call from this one. I didn't answer, they didn't leave voice mail. 
2006-09-25 james 2006-09-26 Skip69346 Registered as PHILA SUBRB, PA on caller id: no message Lost - Blue rope bag, Justin brand, name "Cotton Moore" on bag, lost out of pickup between Miles City and Rosebud on Sunday. W - duck. Call 853-1224. FS - Four 235-75-17 tires, $50 each. Call 234-4517. FS - 12 steel fence posts, $1 each. 1973 Winnebago motor home, 21', 440 engine, runs good, good tires, $3000 or best offer. Call 951-1000. FS - 1976 F250. 2005 KX 250F dirt bike. 1997 Dodge Ram 2500 Cummins pickup. Call 853-0356 or 234-5088. FS - Cedar chest. Louie L'Amour books. Washer and dryer. Call 232-0044 or 852-0044. FS - Antique 48" round oak table w/6 chairs, $900. China cabinet, $400. 48" round oak table, $300. Walnut settee and rocker, $700. 35' of scraped knotty hickory flooring, $120 or best offer. Call 234-1711. FS - 1999 Isuzu Trooper, $5800. Call 351-1721. FS - Windows XP upgrades, $90 each. 17" monitor. 16' flatbed for truck. Riding mowers. Call 232-3030. FS - 1 year old purebred black lab. Call 853-3037 or 234-8882. FS - Women's 3-speed bicycle, $15. Call 234-4532. FS - Four 235-75-17 tires, $50 each. Call 234-4517. Lost - Men's wedding ring in Miles City. Call 951-2567. FS - Bum heifer calf, black angus cross. FS - 1994 16' boat, deep V, $7500. GA - Male pug. Call 778-3454. FS - 2007 Enclosed cargo trailer, $5000. Call 270-2370. FS - Pickup topper, will trade, $150 or best offer. Call 232-0692 or 951-1801. “Iraq" on Wikipedia - spaced out by time http://www.research.ibm.com/visual/projects/history_flow/explanation.htm “Evolution" on Wikipedia Treemap concept Newsmap is an application that visually reflects the constantly changing landscape of the Google News news aggregator. A treemap visualization algorithm helps display the enormous amount of information gathered by the aggregator. Treemaps are traditionally space-constrained visualizations of information. 
Newsmap's objective takes that goal a step further and provides a tool to divide information into quickly recognizable bands which, when presented together, reveal underlying patterns in news reporting across cultures and within news segments in constant change around the globe. Newsmap does not pretend to replace the googlenews aggregator. Its objective is to simply demonstrate visually the relationships between data and the unseen patterns in news media. It is not thought to display an unbiased view of the news; on the contrary, it is thought to ironically accentuate the bias of it.

credits
concept, design & frontend coding: Marcos Weskamp
backend coding: Marcos Weskamp, Dan Albritton
http://www.marumushi.com/apps/newsmap/index.cfm

Project description
Treemap is a space-constrained visualization of hierarchical structures. It is very effective in showing attributes of leaf nodes using size and color coding. Treemap enables users to compare nodes and sub-trees even at varying depth in the tree, and help them spot patterns and exceptions. Treemap was first designed by Ben Shneiderman during the 1990s. For more information, read the historical summary of treemaps, their growing set of applications, and the many other implementations. Treemaps are a continuing topic of research and application at the HCIL.
http://www.cs.umd.edu/hcil/treemap/index.shtml

Documents as data
• Given a collection of documents, can you
  Group them according to content?
  Identify common phrases, uses of language?
  Create rules to classify documents into different types?
  Conduct searches to identify documents relevant to a query?
For example, the NY Times web site offers an “analysis” of the State of the Union addresses delivered by President Bush so far
http://www.nytimes.com

State of the Union
• As expressed in your readings, these displays also reduce documents (in this case, the transcripts from the President’s speeches) to words
• Here, they also introduce some semantic information and group words dealing with “domestic affairs”

Another example...
http://www.nytimes.com

Another example
• Some time back, the Senate Judiciary Committee held confirmation hearings to decide if Judge John Roberts was a suitable candidate for Chief Justice of the Supreme Court
• Let’s consider the transcripts from these sessions as data; what might the dialog between the senators and Judge Roberts reveal?
• Here is what the transcript for the first day of the hearings looks like...

SPECTER: Good afternoon, ladies and gentlemen. We begin these hearings on the confirmation of Judge John Roberts to be chief justice of the United States with first the introduction by Judge Roberts of his beautiful family, and then a few administrative housekeeping details before we begin the opening statements, which will be 10 minutes in length, by each senator. At the conclusion of the opening statements, we will then turn to the introductions by Judge Lugar, Judge Warner -- actually, Senator Lugar, Senator Warner and Senator Bayh, and then the administration of the oath to Judge Roberts and his opening statement. So, Judge Roberts, if you would at this time introduce your family we would appreciate it.

ROBERTS: (OFF-MIKE) Peggy Roberts and Barbara Burke. Barbara's husband Tim Burke is also here. My uncle, Richard Podrasky (ph). Representing the cousins, my cousin, Jeannie Podrasky (ph). My wife, Jane is right here, front and center, with our daughter, Josephine and our son, Jack. You'll see she has a very tight grasp on Jack. (LAUGHTER)

SPECTER: Thank you very much, Judge Roberts. Judge Roberts had expressed his appreciation to have the introductions early. He said the maximum time of the children's staying power was five minutes. And that is certainly understandable. Thank you for doing that, Judge Roberts. And now before beginning the opening statements, let me yield to my distinguished ranking member, Senator Leahy.

LEAHY: Well, Mr. Chairman, I want to thank you for all the consultations. I think we have had each other's home phones on speed dial, we've talked to each other so often. And I have every confidence our chairman will conduct a fair and thorough hearing. You know, less than a quarter of those of us currently serving in the Senate have exercised the Senate's advice-and-consent responsibility in connection with a nomination to be chief justice of the United States. I think only 23 senators have actually been involved in that.

Now let’s look at what was said
• As with your lab or the NY Times’ analysis of the State of the Union address, the focus on documents often comes down to words
• Let’s look at the frequency distribution of words used by each participant in the hearings; what do you notice?

Number of “turns” at the mic
• For each day of the hearings, we have counted the number of times the different players spoke
• What pattern do we see? What role does each of these people play in the proceedings?
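Counts like these can be computed directly from the transcript format shown in the excerpt, where each turn opens with the speaker's surname in capitals followed by a colon. The sketch below is one possible approach and not the code behind the slides: the regular expressions, the function names and the shortened sample are assumptions for illustration. Stage directions in capitals such as (OFF-MIKE) and (LAUGHTER) are dropped before words are counted.

```python
import re
from collections import Counter, defaultdict

# Assumed convention: a turn starts with a capitalised surname and a colon, e.g. "SPECTER: ".
TURN_MARKER = re.compile(r"\b([A-Z]{2,}):\s*")
# Assumed convention: stage directions are fully capitalised parentheticals.
STAGE_DIRECTION = re.compile(r"\([A-Z][A-Z -]*\)")


def split_turns(transcript):
    """Return (speaker, speech) pairs in the order they occur."""
    parts = TURN_MARKER.split(transcript)
    # re.split with one capture group yields [preamble, speaker, speech, speaker, speech, ...]
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


def turn_counts(transcript):
    """Number of turns at the mic for each speaker."""
    return Counter(speaker for speaker, _ in split_turns(transcript))


def words_by_speaker(transcript):
    """Word frequencies for each speaker, with stage directions removed."""
    counts = defaultdict(Counter)
    for speaker, speech in split_turns(transcript):
        cleaned = STAGE_DIRECTION.sub(" ", speech)
        counts[speaker].update(re.findall(r"[a-z']+", cleaned.lower()))
    return counts


# Shortened from the first-day excerpt above.
SAMPLE = (
    "SPECTER: Good afternoon, ladies and gentlemen. "
    "ROBERTS: (OFF-MIKE) Peggy Roberts and Barbara Burke. "
    "SPECTER: Thank you very much, Judge Roberts. "
    "LEAHY: Well, Mr. Chairman, I want to thank you for all the consultations."
)

for speaker, n in turn_counts(SAMPLE).most_common():
    print(n, speaker)
print(words_by_speaker(SAMPLE)["LEAHY"].most_common(3))
```

Running `turn_counts` once per day of transcript gives one column of counts per day; the capitalised-surname rule would need adjusting for transcripts that label speakers differently.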
Turns at the mic, one list per day of the hearings (count, speaker):

30 SPECTER, 4 ROBERTS, 2 LUGAR, 2 LEAHY, 2 FEINSTEIN, 1 WARNER, 1 SESSIONS, 1 SCHUMER, 1 KYL, 1 KOHL, 1 KENNEDY, 1 HATCH, 1 GRASSLEY, 1 GRAHAM, 1 FEINGOLD, 1 DURBIN, 1 DEWINE, 1 CORNYN, 1 COBURN, 1 BROWNBACK, 1 BIDEN, 1 BAYH

423 ROBERTS, 84 SPECTER, 47 BIDEN, 44 GRAHAM, 40 SESSIONS, 34 SCHUMER, 34 FEINGOLD, 33 LEAHY, 33 KENNEDY, 32 FEINSTEIN, 27 KOHL, 25 DURBIN, 21 DEWINE, 20 HATCH, 19 CORNYN, 17 KYL, 14 GRASSLEY

392 ROBERTS, 78 SPECTER, 77 SCHUMER, 47 FEINSTEIN, 34 FEINGOLD, 34 BROWNBACK, 34 BIDEN, 28 LEAHY, 27 GRAHAM, 25 KOHL, 23 COBURN, 22 KENNEDY, 21 CORNYN, 18 SESSIONS, 17 DURBIN, 15 GRASSLEY, 10 HATCH, 9 KYL, 6 DEWINE

90 ROBERTS, 31 SPECTER, 24 LEAHY, 21 FEINSTEIN, 19 FEINGOLD, 17 KENNEDY, 12 SCHUMER, 6 DURBIN, 4 CORNYN, 3 SESSIONS, 2 GRAHAM

Most frequent words by speaker (count, word):

Biden: 404 the, 264 to, 225 you, 206 and, 195 that, 179 a, 165 i, 141 of, 138 in, 99 it, 92 is, 73 not, 71 this, 68 as, 64 said, ...
Kennedy: 627 the, 283 to, 272 of, 271 and, 267 that, 198 in, 145 you, 135 a, 89 we, 80 was, 75 i, 74 it, 73 is, 68 have, 67 on, ...
Feinstein: 369 the, 227 to, 204 of, 201 you, 184 and, 172 that, 154 a, 141 in, 127 i, 72 this, 69 is, 63 it, 55 for, 51 on, 47 be, ...
Hatch: 298 the, 170 to, 167 that, 146 and, 140 of, 117 a, 108 you, 107 in, 105 i, 62 is, 59 as, 46 not, 43 it, 43 have, 42 this, ...
Grassley: 298 the, 159 to, 143 of, 129 you, 118 that, 116 and, 99 in, 84 a, 60 i, 55 on, 48 be, 44 your, 43 court, 41 is, 37 not, ...
Graham: 391 the, 235 to, 205 that, 180 and, 170 you, 163 of, 137 a, 127 i, 107 in, 91 is, 68 it, 60 on, 60 be, 59 not, 56 we, ...
Roberts: 5268 the, 2694 that, 2126 to, 2086 of, 2045 and, 1754 i, 1469 in, 1467 a, 954 was, 918 it, 918 is, 762 you, 672 not, 644 court, 591 on, ...
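Per-speaker word lists like these can be built by splitting the transcript at the speaker labels and tallying lower-cased tokens. The sketch below is a rough illustration under stated assumptions: the label pattern, the tokenizer (letters and digits, with internal apostrophes or hyphens kept, so "i'm" and "10-year-olds" stay whole), and the names `TURN`, `WORD` and `words_by_speaker` are all choices made for this example, not the procedure behind the slides.

```python
import re
from collections import Counter, defaultdict

TURN = re.compile(r"\b([A-Z]{2,}):")               # assumed speaker-label format
WORD = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")  # assumed tokenizer


def words_by_speaker(transcript: str) -> dict:
    """Return one word-frequency Counter per speaker."""
    # Splitting on a pattern with one capture group yields
    # [preamble, name1, text1, name2, text2, ...].
    parts = TURN.split(transcript)
    counts = defaultdict(Counter)
    for name, text in zip(parts[1::2], parts[2::2]):
        # Note: stage directions such as "(LAUGHTER)" are not filtered out here.
        counts[name].update(WORD.findall(text.lower()))
    return counts


sample = (
    "SPECTER: Thank you very much, Judge Roberts. "
    "Thank you for doing that, Judge Roberts. "
    "LEAHY: Well, Mr. Chairman, I want to thank you for all the consultations."
)

counts = words_by_speaker(sample)
for name, freq in counts.items():
    print(name, " ".join(f"{n} {w}" for w, n in freq.most_common(4)))
```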
Tail of each speaker’s list, in the same speaker order (every entry occurs once):

Biden: 1982; 1977; 1971; 1967; 1965; 1937; 1925; 1900s; 1896; 1873; 1819; 12; 11th; 10th; 10-year-olds
Kennedy: 5th; 5; 35; 333-85; 22; 2001; 1991; 1988; 1984; 1981; 1980; 1965; 1954; 1950s; 17
Feinstein: 1960; 193; 1920; 1918; 1915; 1913; 1876; 1846; 1839; 16-year; 12; 11th; 100; 1,00; 1
Hatch: 8; 78; 69-11; 29; 2001; 1997; 1994; 1987; 1980s; 1967; 1944; 1922; 1792; 11th; 100
Grassley: 8; 78; 58; 37.30.a; 37.29.c; 3; 24; 200; 1st; 1986; 1982; 1962; 1925; 15; 11th
Graham: 98; 94; 9-0; 8/30/05; 50/50; 5-4; 49; 30s; 2000; 200; 185; 18; 150; 10; 1
Roberts: 22; 218-year; 21; 200-year; 20-some; 20-plus; 1983; 1939; 1787; 15th; 150; 14th; 12; 100,000; 10

Word counts for Kennedy, most frequent first (count, word):

627 the, 283 to, 272 of, 271 and, 267 that, 198 in, 145 you, 135 a, 89 we, 80 was, 75 i, 74 it, 73 is, 68 have, 67 on, 62 our, 57 this, 51 not, 50 be, 50 as,
47 rights, 45 for, 43 court, 38 by, 38 are, 37 with, 36 at, 35 but, 35 all, 35 act, 34 about, 33 your, 32 case, 31 voting, 30 or, 29 had, 28 they, 27 think, 26 do, 25 time, 24 were, 24 so, 24 civil, 23 what, 23 an, 22 today, 22 law, 22 country, 21 i'm, 21 equal, 21 believe, 21 been,
20 whether, 20 well, 20 there, 20 right, 20 out, 20 legislation, 20 judge, 20 if, 20 because, 19 will, 19 their, 19 test, 19 roberts, 19 people, 19 justice, 19 issue, 18 would, 18 up, 18 know, 18 house, 18 from, 18 discrimination, 17 who, 17 its, 17 going, 16 supreme, 16 senate, 16 effects, 16 action,
15 other, 15 many, 15 laws, 15 federal, 15 education, 15 decision, 15 american, 14 which, 14 want, 14 then, 14 society, 14 public, 14 most, 14 more, 14 important, 14 has, 14 did, 14 affirmative, 13 could, 13 after, 12 you're, 12 these, 12 national, 12 like, 12 congress, 12 chairman, 12 any,
11 very, 11 too, 11 those, 11 should, 11 said, 11 quote, 11 passed, 11 over, 11 now, 11 mr, 11 let, 11 just, 11 his, 11 constitutional, 11 come, 11 bill, 11 also, 10 us, 10 record, 10 question, 10 progress, 10 power, 10 opportunity, 10 one, 10 nation, 10 must, 10 make, 10 legal, 10 impact, 10 every, 10 don't, 10 before, 10 americans, 10 administration,
9 year, 9 view, 9 under, 9 suspect, 9 program, 9 position, 9 here, 9 great, 9 government, 9 good, 9 go, 9 extraordinary, 9 brown, 9 basis, 8 women, 8 where, 8 when, 8 v, 8 thank, 8 section, 8 race, 8 president, 8 no, 8 my, 8 memoranda, 8 me, 8 made, 8 land, 8 it's, 8 included, 8 housing, 8 he, 8 give, 8 find, 8 discriminate, 8 denied, 8 constitutionally, 8 committee, 8 citizens, 8 can, 8 back, 8 ask,
7 zimmer, 7 years, 7 whole, 7 we've, 7 university, 7 through, 7 them, 7 than, 7 students, 7 still, 7 some, 7 signed, 7 racial, 7 only, 7 new, 7 lives

Skimming the fat
Vector space representation
• Spatial proximity implies semantic proximity; documents (and in IR, queries) are represented as vectors in a high-dimensional space, each dimension corresponding to a different word in the collection
• Semantically-related words are “close” in this space; we often focus not on the magnitude of the vectors but only on the angle between them, which leads to the famous cosine distance

Kennedy’s top 200 words
• A lot of the words in our lists don’t carry much content; that is, they don’t help us understand what topics were being discussed
• Articles like “the” and “a” or conjunctions like “but” and “or” are important grammatically, but are often not helpful when comparing two speeches
• We put words like these in a “stop list” and remove them

a an as at by he his or thou us who against amid amidst among amongst and anybody anyone because beside despite during everybody everyone for from her hers herself him himself hisself if into it its itself myself nor of oneself onto our

Where we’re headed
• Given lists of words that are not ignorable, we can compute “distances” between the senators based on the frequency distribution of words
• Senators using the same language will be close to each other and those using different language will be far apart
• We can then use these distances to create a “map” of the senators; for example, multi-dimensional scaling is used to lay out points in the plane so that their interpoint distances best match the word-usage distances
• Here’s what you get...
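The stop-list and distance steps can be sketched in a few lines. Everything here is an assumption made for illustration: `STOP` is a tiny subset of a stop list (with “the” and “to” added by hand), the three word-count vectors belong to made-up speakers A, B and C rather than real senators, and cosine distance is taken as one minus the cosine of the angle between two count vectors.

```python
import math
from collections import Counter

# Illustrative stop list only; "the" and "to" are additions for this example.
STOP = set("a an and as at by for from he his if it its of or our the to us who".split())


def content_counts(counts: Counter, stop=STOP) -> Counter:
    """Drop stop-list words from a word-frequency Counter."""
    return Counter({w: n for w, n in counts.items() if w not in stop})


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cos(angle) between two count vectors; depends on direction, not length."""
    dot = sum(n * b[w] for w, n in a.items())  # a Counter returns 0 for absent words
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for an empty vector
    return 1.0 - dot / (norm_a * norm_b)


# Made-up counts for three hypothetical speakers.
a = content_counts(Counter({"the": 10, "court": 4, "rights": 3}))
b = content_counts(Counter({"the": 12, "court": 2, "rights": 6}))
c = content_counts(Counter({"the": 9, "of": 5, "taxes": 7}))

d_ab = cosine_distance(a, b)
d_ac = cosine_distance(a, c)
print(round(d_ab, 3), round(d_ac, 3))
```

Computing this distance for every pair of speakers gives the matrix of word-usage distances that a multi-dimensional scaling routine would then try to reproduce in the plane.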
[Figure: multi-dimensional scaling map of the senators; horizontal axis “dimension 1”, vertical axis “dimension 2”. Point labels, in extraction order: KYL, HATCH, CORNYN, GRASSLEY, SESSIONS, KOHL, COBURN, BROWNBACK, GRAHAM, DEWINE, SCHUMER, FEINGOLD, BIDEN, DURBIN, LEAHY, FEINSTEIN, KENNEDY]

Documents as data
• There are plenty of examples in which one wants to identify important structures shared by groups of documents

Web search

FAA Pilot Narratives (Part of an Incident Report)
Leaving gate 3 at HOU Ground Control told us to taxi to Runway 12R even though at push back we were told to expect Runway 4. He evidently told us to proceed via Taxiway Y,H,M. I missed these instructions. I was distracted by the thick fog (800 RVR) and the change to Runway 12R. I was caught up in reviewing the takeoff mins for 12R versus Runway 4. As I started around the terminal to join Taxiway Y Ground advised us to watch for a Cessna Conquest coming the opposite direction on Taxiway B off of Runway 4. The Conquest came into sight in the fog and the controller told him to follow us on Yankee Taxiway. Passing the Conquest I again became focused on the fog, Runway 12R takeoff mins and the change of runways...