Learning from Web Activity
Jake Hofman
Joint work with Irmak Sirer & Sharad Goel
Yahoo! Research
June 27, 2011

Jake Hofman ([email protected]) Learning from Web Activity ICSA, 2011.06.27

“... applies social sciences techniques, including algorithmic game theory, economics, network analysis, psychology, ethnography, and mechanism design, to online situations.”
http://research.yahoo.com

Computational Social Science
A field is emerging that leverages the capacity to collect and analyze data at a scale that may reveal patterns of individual and group behaviors.
David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-László Barabási, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, Marshall Van Alstyne

“... a computational social science is emerging that leverages the capacity to collect and analyze data with an unprecedented breadth and depth and scale ...”
http://sciencemag.org/content/323/5915/721
“... shares with other nascent interdisciplinary fields (e.g., sustainability science) the need to develop a paradigm for training new scholars ...”
http://sciencemag.org/content/323/5915/721

The clean real story
“We have a habit in writing articles published in scientific journals to make the work as finished as possible, to cover all the tracks, to not worry about the blind alleys or to describe how you had the wrong idea first, and so on. So there isn’t any place to publish, in a dignified manner, what you actually did in order to get to do the work ...”
-Richard Feynman, Nobel Lecture [1], 1965
[1] http://bit.ly/feynmannobel

Outline
• The clean story
• The real story
• Lessons learned

Demographic diversity on the Web
with Irmak Sirer and Sharad Goel
The clean story (covering our tracks)

Motivation
Previous work is largely survey-based and focuses on group-level differences in online access
“As of January 1997, we estimate that 5.2 million African Americans and 40.8 million whites have ever used the Web, and that 1.4 million African Americans and 20.3 million whites used the Web in the past week.”
-Hoffman & Novak (1998)
Chang et al., ICWSM (2010)

Motivation
Focus on activity instead of access
How diverse is the Web?
To what extent do online experiences vary across demographic groups?

Data
• Representative sample of 265,000 individuals in the US, paid via the Nielsen MegaPanel [2]
• Log of anonymized, complete browsing activity from June 2009 through May 2010 (URLs viewed, timestamps, etc.)
• Detailed individual and household demographic information (age, education, income, race, sex, etc.)
[2] Special thanks to Mainak Mazumdar

Data
• Transform all demographic attributes to binary variables, e.g., Age → Over/Under 25, Race → White/Non-White, Sex → Female/Male
• Normalize pageviews to at most three domain levels, sans www, e.g., www.yahoo.com → yahoo.com, us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
• Restrict to top 100k (out of 9M+ total) most popular sites (by unique visitors)
• Aggregate activity at the site, group, and user levels

Site-level skew
How diverse are site audiences?
• For each site and attribute, calculate the skew in visitors (e.g., 93% of pageviews on foxnews.com are by White users)
• For each attribute, plot the distribution of visitor skew across all sites
[Figure: density of sites by Proportion White Visitors, 0.0 to 1.0]

Site-level skew
[Figure: density of sites by Proportion Female Visitors, Proportion White Visitors, Proportion College Educated Visitors, Proportion Adult Visitors, and Proportion of Visitors With Household Incomes Under $50,000]

Site-level skew
Many sites have skew close to the overall mean, but there are also popular, highly-skewed sites
• Female: greater than 90%: youravon.com, collectionsetc.com; less than 10%: coveritlive.com, needlive.com
• White: greater than 90%: foxnews.com, wunderground.com; less than 10%: blackplanet.com, mediatakeout.com
• College Educated: greater than 90%: news.google.com, nytimes.com; less than 10%: slumz.boxden.com, sythe.com
• Over 25 Years Old: greater than 90%: mail.yahoo.com, apps.facebook.com; less than 10%: nanowrimo.org, cbox.ws
• Household Income Under $50,000: greater than 90%: scarleteen.com, boards.adultswim.com; less than 10%: opentable.com, marketwatch.com
Table 1: A selection of popular sites that are homogeneous along various demographic dimensions.

Site-level skew
This skew persists even when we restrict attention to the top 10k or 1k sites
[Figure: proportion of site visitors who are Female, White, College Educated, Adult, and With Household Incomes Under $50,000, plotted against Site Rank (10^1 to 10^5)]

Sites vs. ZIPs
How does the diversity of the online and offline worlds compare?
[Figure: density of Sites and ZIPs by Proportion Female and by Proportion White]
As expected, neighborhoods are more gender-balanced than sites
But sites typically have more racially diverse audiences than neighborhoods have residents

Group-level activity
How does browsing activity vary at the group level?
[Figure: Daily Per-Capita Pageviews by group: Race (White, Non-White), Education (College, No College), Sex (Female, Male), Age (Over 25, Under 25)]
Large differences exist even at the aggregate level (e.g. women on average generate 40% more pageviews than men)

Group-level activity
All groups spend more than a third of their time on a handful of email, search, and social networking sites
[Figure: Percent of Total Time Spent on Site (0.1% to 10%) for the most popular sites, female vs. male]

Group-level activity
But different groups distribute their time differently, both on universally popular and on more niche sites
[Figure: Percent of Total Time Spent on Site (0.1% to 10%) for the most popular sites, white vs. non-white]
[Figure: Percent of Total Time Spent on Site (0.1% to 10%) for the most popular sites, one panel each for Age, Sex, Race, Education, and Income]

Group-level activity
There is both reasonable overlap and variation amongst the most popular sites within groups.
[Figure: Jaccard Similarity (0.0 to 1.0) for Over/Under 25 Years Old, Female/Male, Under/Over $50,000 Household Income, White/Non-White, College/No College]

Individual-level prediction
How well can one predict an individual’s demographics from their browsing activity?
• Represent each user by the set of sites visited
• Fit linear models [3] to predict majority/minority for each attribute on 80% of users
• Tune model parameters using a 10% validation set
• Evaluate final performance on held-out 10% test set
[3] Using SVM-perf

Individual-level prediction
Reasonable (∼70-85%) accuracy and AUC across all attributes
[Figure: AUC and Accuracy (0.5 to 1) for Over/Under 25 Years Old, Female/Male, White/Non-White, Under/Over $50,000 Household Income, College/No College]

Individual-level prediction
Highly-weighted sites under the fitted models
• Female: large positive weight: winster.com, lancome-usa.com; large negative weight: sports.yahoo.com, espn.go.com
• White: large positive weight: marlboro.com, cmt.com; large negative weight: mediatakeout.com, bet.com
• College Educated: large positive weight: news.yahoo.com, linkedin.com; large negative weight: youtube.com, myspace.com
• Over 25 Years Old: large positive weight: evite.com, classmates.com; large negative weight: addictinggames.com, youtube.com
• Household Income Under $50,000: large positive weight: eharmony.com, tracfone.com; large negative weight: rownine.com, matrixdirect.com
Table 2: A selection of the most predictive (i.e., most highly weighted) sites for each classification task.
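The prediction recipe on these slides (users as sets of visited sites, a linear model, an 80/10/10 split, accuracy and AUC) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the talk used SVM-perf, whereas the solver here is a simple full-batch subgradient descent on a hinge loss with an L2 penalty, and the function names, toy users, and hyperparameters (C, lr, epochs) are all made up for the example.

```python
import random


def split_users(users, seed=0):
    """Shuffle users and split them 80% / 10% / 10% (train / validation / test)."""
    users = list(users)
    random.Random(seed).shuffle(users)
    n_train = len(users) * 8 // 10
    n_val = len(users) // 10
    return users[:n_train], users[n_train:n_train + n_val], users[n_train + n_val:]


def score(weights, bias, sites):
    """Linear score for a user represented as a set of visited sites."""
    return bias + sum(weights.get(site, 0.0) for site in sites)


def fit_linear_svm(examples, C=1.0, lr=0.1, epochs=50):
    """Minimize C * sum(hinge) + ||w||^2 by full-batch subgradient descent.

    examples: list of (set_of_sites, label) pairs with label in {+1, -1}.
    This is a stand-in solver (assumption), not SVM-perf.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        grad_w, grad_b = {}, 0.0
        for sites, label in examples:
            if label * score(weights, bias, sites) < 1:  # margin violated
                for site in sites:
                    grad_w[site] = grad_w.get(site, 0.0) - C * label
                grad_b -= C * label
        for site in set(weights) | set(grad_w):
            w = weights.get(site, 0.0)
            weights[site] = w - lr * (2 * w + grad_w.get(site, 0.0))
        bias -= lr * grad_b
    return weights, bias


def accuracy(scores, labels):
    """Fraction of users whose score sign matches their +1/-1 label."""
    return sum((s > 0) == (y > 0) for s, y in zip(scores, labels)) / len(labels)


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y > 0]
    neg = [s for s, y in zip(scores, labels) if y < 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up users: label +1 for one group, -1 for the other.
train = [
    ({"a.com", "shared.com"}, +1),
    ({"a.com"}, +1),
    ({"b.com", "shared.com"}, -1),
    ({"b.com"}, -1),
]
weights, bias = fit_linear_svm(train)
test_users = [({"a.com", "new.com"}, +1), ({"b.com"}, -1)]
test_scores = [score(weights, bias, sites) for sites, _ in test_users]
test_labels = [label for _, label in test_users]
```

On real data the validation split would be used to choose C before touching the test split; the toy run above skips that step. Storing weights in a dictionary keyed by site keeps the model sparse, which matters when there are on the order of 100,000 site features.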
AUC is the probability that a model scores a randomly selected positive example higher than a randomly selected negative one (e.g., the probability that the model correctly distinguishes between a randomly selected female and male).

Individual-level prediction
Similar performance even when restricted to top 1k sites
[Figure: AUC and Accuracy vs. Number of Domains (10^2 to 10^5) for Age, Sex, Race, Education, Income]

Individual-level prediction
Substantially better performance when restricted to “stereotypical” users (∼80-90%)
[Figure: AUC and Accuracy vs. Fraction of Users (0.0 to 1.0) for Age, Sex, Race, Education, Income]

Individual-level prediction
Proof of concept browser demo
http://bit.ly/surfpreds

Demographic diversity on the Web
with Irmak Sirer and Sharad Goel
The real story (what we actually did)

The real story
0. Got several hundred GBs of MegaPanel data from Nielsen [4], looked at a small sample

    # ls -alh nielsen_megapanel.tar
    -rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar

[4] Special thanks to Mainak Mazumdar

The real story
1. Discussed (many) possible projects
• Infer the number of individuals using the same browser or behind the same IP?
• Determine number of actual uniques advertisers are receiving?
• Predict user demographics from a few minutes of browsing activity for ad-targeting?

The real story
2. Modeled real-valued age as a function of site visits
• Worked on this for an embarrassingly long time
• Tried various options for cleaning and normalizing data (100GB → ∼5GB)
• Investigated several methods for feature selection (e.g., naive Bayes, mutual information, popularity)

Hadoop + Pig (+ awk)

The real story
3. Settled for classification of binary outcomes (e.g. adult/non-adult)
• 265,000 users (examples) and 100,000 sites (features)
• Logistic regression in R, e.g.

    model <- glm(is.adult ~ ., data=nielsen, family=binomial)

???

$\hat{y}(x_i) = w \cdot x_i + b$
Support Vector Machine: $L(y, \hat{y}) = C \sum_i [1 - y_i \hat{y}(x_i)]_+ + \|w\|^2$

The real story
4. Investigated why classification worked reasonably well
• Generated descriptive statistics across all attributes at the site and group levels
• Compared site statistics to ZIP code data from the US Census
• Compared time distribution across groups

The real story
5. Realized that we now had one of the most comprehensive studies available on demographic diversity of the web

Conclusion (lessons learned)

Lessons learned
• Data jeopardy: Regardless of scale, it’s difficult to find the right questions to ask of the data
• Rapid iteration: The ability to iterate quickly, asking and answering many questions, is crucial
• Data cleaning: Cleaning and normalizing data is a substantial amount of the work
• Modeling: Simple methods (e.g., linear models) work surprisingly well, especially with lots of data
• It’s easy to cover your tracks; things are often much more complicated than they appear

Thanks. Questions?
http://messymatters.com/webdemo
http://jakehofman.com
[email protected]
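As a concrete instance of the data-cleaning lesson, the normalization rule from the Data slide (at most three domain levels, sans www) and the restriction to the most popular sites by unique visitors might be sketched as follows. This is an illustration only: `normalize_site` and `top_sites` are hypothetical names, the sketch assumes a leading `www` is stripped before truncating to three levels, and it makes no attempt to handle multi-part public suffixes such as `.co.uk`.

```python
from collections import defaultdict


def normalize_site(url, max_levels=3):
    """Reduce a URL to at most `max_levels` trailing domain labels, without a leading www."""
    host = url.split("://", 1)[-1]          # drop a scheme such as http://, if any
    host = host.split("/", 1)[0]            # drop the path
    host = host.split(":", 1)[0].lower()    # drop a port, if any
    labels = [label for label in host.split(".") if label]
    if labels and labels[0] == "www":       # assumption: only a leading www is removed
        labels = labels[1:]
    return ".".join(labels[-max_levels:])


def top_sites(pageviews, k):
    """Return the k normalized sites with the most unique visitors.

    pageviews: iterable of (user_id, url) pairs.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    visitors = defaultdict(set)
    for user_id, url in pageviews:
        visitors[normalize_site(url)].add(user_id)
    ranked = sorted(visitors, key=lambda site: (-len(visitors[site]), site))
    return ranked[:k]


# Made-up browsing log for illustration.
log = [
    ("u1", "www.yahoo.com"),
    ("u1", "us.mg2.mail.yahoo.com/neo/launch"),
    ("u2", "http://www.yahoo.com/news"),
    ("u2", "us.mg2.mail.yahoo.com/neo/launch"),
    ("u3", "www.yahoo.com"),
]
popular = top_sites(log, 2)
```

Counting unique visitors with a set per site mirrors the "by unique visitors" ranking on the slide; at the scale described in the talk this aggregation would presumably run in a distributed job rather than in memory.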