Learning from Web Activity

Transcription

Learning from Web Activity
Learning from Web Activity
Jake Hofman
Joint work with Irmak Sirer & Sharad Goel
Yahoo! Research
June 27, 2011
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
1 / 39
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
2 / 39
“... applies social sciences techniques, including
algorithmic game theory, economics, network analysis,
psychology, ethnography, and mechanism design, to
online situations.”
http://research.yahoo.com
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
3 / 39
possible to estimate accurately the age of origin of almost all extant genera. It is then possible to plot a backward survivorship curve (8)
for each of the 27 global bivalve provinces (9).
On the basis of these curves, Krug et al. find
that origination rates of marine bivalves in-
biodiversification event in the Paleogene (65 to
23 million years ago) that is perhaps not yet
captured in Alroy et al.’s database (5, 7). The
jury is still out on what may have caused this
event. But we should not lose sight of the fact
that the steep rise to prominence of many mod-
8. M. Foote, in Evolutionary Patterns, J. B. C. Jackson et al.,
Eds. (Univ. of Chicago Press, Chicago, IL, 2001), vol. 245,
pp. 245–295.
9. M. D. Spalding et al., Bioscience 57, 573 (2007).
10. S. M. Stanley, Paleobiology 33, 1 (2007).
11. M. J. Benton, B. C. Emerson, Palaeontology 50, 23 (2007).
10.1126/science.1169410
SOCIAL SCIENCE
Computational Social Science
A field is emerging that leverages the
capacity to collect and analyze data at a
scale that may reveal patterns of individual
and group behaviors.
David Lazer,1 Alex Pentland,2 Lada Adamic,3 Sinan Aral,2,4 Albert-László Barabási,5
Devon Brewer,6 Nicholas Christakis,1 Noshir Contractor,7 James Fowler,8 Myron Gutmann,3
Tony Jebara,9 Gary King,1 Michael Macy,10 Deb Roy,2 Marshall Van Alstyne2,11
W“... a computational social science is emerging that
e live life in the network. We check
our e-mails regularly, make mobile
phone calls from almost any location, swipe transit cards to use public transportation, and make purchases with credit
cards. Our movements in public places may be
captured by video cameras, and our medical
records stored as digital files. We may post blog
entries accessible to anyone, or maintain friendships through online social networks. Each of
these transactions leaves digital traces that can
be compiled into comprehensive pictures of
both individual and group behavior, with the
potential to transform our understanding of our
lives, organizations, and societies.
The capacity to collect and analyze massive
amounts of data has transformed such fields as
biology and physics. But the emergence of a
data-driven “computational social science” has
been much slower. Leading journals in economics, sociology, and political science show
little evidence of this field. But computational
social science is occurring—in Internet companies such as Google and Yahoo, and in govern-
ment agencies such as the U.S. National Security Agency. Computational social science could
become the exclusive domain of private companies and government agencies. Alternatively,
there might emerge a privileged set of academic researchers presiding over private data
from which they produce papers that cannot be
critiqued or replicated. Neither scenario will
serve the long-term public interest of accumulating, verifying, and disseminating knowledge.
What value might a computational social
science—based in an open academic environment—offer society, by enhancing understanding of individuals and collectives? What are the
leverages the capacity to collect and analyze data with an
unprecedented breadth and depth and scale ...”
http://sciencemag.org/content/323/5915/721
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
4 / 39
possible to estimate accurately the age of origin of almost all extant genera. It is then possible to plot a backward survivorship curve (8)
for each of the 27 global bivalve provinces (9).
On the basis of these curves, Krug et al. find
that origination rates of marine bivalves in-
biodiversification event in the Paleogene (65 to
23 million years ago) that is perhaps not yet
captured in Alroy et al.’s database (5, 7). The
jury is still out on what may have caused this
event. But we should not lose sight of the fact
that the steep rise to prominence of many mod-
8. M. Foote, in Evolutionary Patterns, J. B. C. Jackson et al.,
Eds. (Univ. of Chicago Press, Chicago, IL, 2001), vol. 245,
pp. 245–295.
9. M. D. Spalding et al., Bioscience 57, 573 (2007).
10. S. M. Stanley, Paleobiology 33, 1 (2007).
11. M. J. Benton, B. C. Emerson, Palaeontology 50, 23 (2007).
10.1126/science.1169410
SOCIAL SCIENCE
Computational Social Science
A field is emerging that leverages the
capacity to collect and analyze data at a
scale that may reveal patterns of individual
and group behaviors.
David Lazer,1 Alex Pentland,2 Lada Adamic,3 Sinan Aral,2,4 Albert-László Barabási,5
Devon Brewer,6 Nicholas Christakis,1 Noshir Contractor,7 James Fowler,8 Myron Gutmann,3
Tony Jebara,9 Gary King,1 Michael Macy,10 Deb Roy,2 Marshall Van Alstyne2,11
W“... shares with other nascent interdisciplinary fields
e live life in the network. We check
our e-mails regularly, make mobile
phone calls from almost any location, swipe transit cards to use public transportation, and make purchases with credit
cards. Our movements in public places may be
captured by video cameras, and our medical
records stored as digital files. We may post blog
entries accessible to anyone, or maintain friendships through online social networks. Each of
these transactions leaves digital traces that can
be compiled into comprehensive pictures of
both individual and group behavior, with the
potential to transform our understanding of our
lives, organizations, and societies.
The capacity to collect and analyze massive
amounts of data has transformed such fields as
biology and physics. But the emergence of a
data-driven “computational social science” has
been much slower. Leading journals in economics, sociology, and political science show
little evidence of this field. But computational
social science is occurring—in Internet companies such as Google and Yahoo, and in govern-
ment agencies such as the U.S. National Security Agency. Computational social science could
become the exclusive domain of private companies and government agencies. Alternatively,
there might emerge a privileged set of academic researchers presiding over private data
from which they produce papers that cannot be
critiqued or replicated. Neither scenario will
serve the long-term public interest of accumulating, verifying, and disseminating knowledge.
What value might a computational social
science—based in an open academic environment—offer society, by enhancing understanding of individuals and collectives? What are the
(e.g., sustainability science) the need to develop a
paradigm for training new scholars ...”
http://sciencemag.org/content/323/5915/721
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
4 / 39
The clean real story
“ We have a habit in writing articles published in
scientific journals to make the work as finished as
possible, to cover all the tracks, to not worry about the
blind alleys or to describe how you had the wrong idea
first, and so on. So there isn’t any place to publish, in
a dignified manner, what you actually did in order to
get to do the work ...”
-Richard Feynman
Nobel Lecture 1 , 1965
1
Jake Hofman
http://bit.ly/feynmannobel
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
5 / 39
Outline
• The clean story
• The real story
• Lessons learned
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
6 / 39
Demographic diversity on the Web
with Irmak Sirer and Sharad Goel
The clean story
(covering our tracks)
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
7 / 39
Motivation
Previous work is largely survey-based and focuses and group-level
differences in online access
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
8 / 39
Motivation
“As of January 1997, we estimate that 5.2 million
African Americans and 40.8 million whites have ever used
the Web, and that 1.4 million African Americans and
20.3 million whites used the Web in the past week.”
-Hoffman & Novak (1998)
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
8 / 39
Motivation
Chang, et. al., ICWSM (2010)
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
9 / 39
Motivation
Focus on activity instead of access
How diverse is the Web?
To what extent do online experiences vary across demographic
groups?
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
10 / 39
Data
• Representative sample of 265,000 individuals in the US, paid
via the Nielsen MegaPanel2
• Log of anonymized, complete browsing activity from June
2009 through May 2010 (URLs viewed, timestamps, etc.)
• Detailed individual and household demographic information
(age, education, income, race, sex, etc.)
2
Jake Hofman
Special thanks to Mainak Mazumdar
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
11 / 39
Data
• Transform all demographic attributes to binary variables
e.g., Age → Over/Under 25, Race → White/Non-White,
Sex → Female/Male
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
12 / 39
Data
• Transform all demographic attributes to binary variables
e.g., Age → Over/Under 25, Race → White/Non-White,
Sex → Female/Male
• Normalize pageviews to at most three domain levels, sans www
e.g. www.yahoo.com → yahoo.com,
us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
12 / 39
Data
• Transform all demographic attributes to binary variables
e.g., Age → Over/Under 25, Race → White/Non-White,
Sex → Female/Male
• Normalize pageviews to at most three domain levels, sans www
e.g. www.yahoo.com → yahoo.com,
us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
• Restrict to top 100k (out of 9M+ total) most popular sites
(by unique visitors)
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
12 / 39
Data
• Transform all demographic attributes to binary variables
e.g., Age → Over/Under 25, Race → White/Non-White,
Sex → Female/Male
• Normalize pageviews to at most three domain levels, sans www
e.g. www.yahoo.com → yahoo.com,
us.mg2.mail.yahoo.com/neo/launch → mail.yahoo.com
• Restrict to top 100k (out of 9M+ total) most popular sites
(by unique visitors)
• Aggregate activity at the site, group, and user levels
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
12 / 39
Site-level skew
How diverse are site audiences?
calculate the skew in visitors
(e.g., 93% of pageviews on
foxnews.com are by White
users)
Density
• For each site and attribute,
• For each attribute, plot the
distribution of visitor skew
across all sites
0.0
0.2
0.4
0.6
0.8
1.0
Proportion White Visitors
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
13 / 39
0.0
0.2
0.4
0.6
0.8
Density
Density
Density
Site-level skew
1.0
0.0
0.2
0.4
0.6
0.8
Density
0.0
0.0
0.2
0.4
0.6
0.8
Proportion Adult Visitors
Jake Hofman
1.0
0.0
Proportion White Visitors
0.2
0.4
0.6
0.8
1.0
Proportion College Educated Visitors
Density
Proportion Female Visitors
([email protected])
1.0
0.2
0.4
0.6
0.8
1.0
Proportion of Visitors With
Household Incomes Under $50,000
Learning from Web Activity
ICSA, 2011.06.27
14 / 39
Site-level skew
Many sites have skew close the overall mean, but there also
popular, highly-skewed sites
Female
White
College Educated
Over 25 Years Old
Household Income
Under $50,000
Greater Than 90%
Less Than 10%
youravon.com
collectionsetc.com
foxnews.com
wunderground.com
news.google.com
nytimes.com
mail.yahoo.com
apps.facebook.com
scarleteen.com
boards.adultswim.com
coveritlive.com
needlive.com
blackplanet.com
mediatakeout.com
slumz.boxden.com
sythe.com
nanowrimo.org
cbox.ws
opentable.com
marketwatch.com
Table 1: A selection of popular sites that are homogeneous along various demographic dimensions.
College
Jake Hofman
r−Capita Pageviews
70
60
50
Non−White
!
Female
Over 25
!
!
!
White
No College
40
30
([email protected])
Male
Under 25
visually apparent from Figure 5, there are significant differences in how groups distribute their time on the web. These
differences—which, as mentioned above, hold for highly frequented sites such as Facebook and YouTube—are in some
cases even more pronounced for lower traffic sites. For instance, the gaming site pogo.com accounts for less than 1%
of pageviews among both low and high income users, but
Learning from Web
ICSA,
low Activity
income users spend almost twice as
much2011.06.27
of their time 15 / 39
Site-level skew
This skew persists even when we restrict attention to the top 10k
or 1k sites
0.8
0.6
0.4
0.2
0.0
1.0
Proportion College Educated Visitors
1.0
Proportion White Site Visitors
Proportion Female Site Visitors
1.0
0.8
0.6
0.4
0.2
0.0
101
102
103
104
105
101
Site Rank
102
103
0.4
0.2
0.0
105
101
102
103
104
105
Site Rank
Proportion of Visitors With
Household Incomes Under $50,000
1.0
0.8
0.6
0.4
0.2
0.0
101
102
103
104
105
0.8
0.6
0.4
0.2
0.0
101
Site Rank
Jake Hofman
0.6
Site Rank
1.0
Proportion of Adult Site Visitors
104
0.8
([email protected])
102
103
104
105
Site Rank
Learning from Web Activity
ICSA, 2011.06.27
16 / 39
Sites vs. ZIPs
How do diversity of the online and offline worlds compare?
ZIPs
Density
Sites
ZIPs
Density
Sites
0.0
0.2
0.4
0.6
Proportion Female
Jake Hofman
([email protected])
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
Proportion White
Learning from Web Activity
ICSA, 2011.06.27
17 / 39
Sites vs. ZIPs
How do diversity of the online and offline worlds compare?
ZIPs
Density
Sites
ZIPs
Density
Sites
0.0
0.2
0.4
0.6
Proportion Female
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
Proportion White
As expected, neighborhoods are more gender-balanced than sites
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
17 / 39
Sites vs. ZIPs
How do diversity of the online and offline worlds compare?
ZIPs
Density
Sites
ZIPs
Density
Sites
0.0
0.2
0.4
0.6
Proportion Female
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
Proportion White
But sites typically have more racially diverse audiences than
neighborhoods have residents
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
17 / 39
Group-level activity
How does browsing activity vary at the group level?
College
Daily Per−Capita Pageviews
70
60
50
Non−White
●
Female
Over 25
●
●
●
White
No College
Male
40
30
Under 25
20
10
0
Race
Education
Sex
Age
Large differences exist even at the aggregate level
(e.g. women on average generate 40% more pageviews than men)
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
18 / 39
fa
m ce
ai bo
l.y o
k
ps g aho .co
.fa oo o. m
m ce gl co
ai b e. m
l.g oo co
m
o
m og k.co
a le m
y il. .c
vi c web ou live om
ew h m m tu .c
m an w ai be om
or ne fb l.a .co
ep l.f .zy ol m
ic ac ng .co
se s.m ebo a.c m
ar ys ok om
ch pa .c
. c o
m yah e.c m
ys o o
pa o.c m
ce om
a m .c
sh ma sn om
op zo .co
im
.e n. m
ho age y bay com
m s. ah .c
e. go oo om
m my og .co
ai sp le m
l.c a .c
om ce om
c .c
w
w b as om
w in t.n
m
.y g e
es
cg ah .co t
sa
i.e oo m
gi
ng es ba .co
.m pn y. m
ys .g com
p o.
ci twace com
m i .c
.m tte o
e r m
en my eb .co
fa
ce
. .e o m
bo lo wik ba .co
ok gin ipe y.c m
.m .y di om
af ah a.
m ia oo or
fri ga y. wa .co g
en m ya rs m
ds e3 ho .co
.m .p o. m
ys og co
o m
w ta pac .co
or g e m
ld ge .co
w d m
in .c
m ne o
lo ee r.c m
g b o
m my in.li o.c m
ap p ve om
s. oin .c
go ts om
og .c
le om
m
a .c
w
m p ol. om
ne s.z og co
w yn o.c m
s. g o
fa
ya a m
nt
as
w ho .com
ys
in o
po n ste .co
rts et r.c m
.
se ya flix om
ar ho .co
ch o m
.
al co .ao com
ot m l.c
m ca o
et s m
ric t.n
s. et
co
m
ap
Percent of Total Time Spent on Site
Group-level activity
All groups spend more than a third of their time on a handful of
email, search, and social networking sites
10%
Jake Hofman
female
male
1%
0.1%
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
19 / 39
fa
m ce
ai bo
l.y o
k
ps g aho .co
.fa oo o. m
m ce gl co
ai b e. m
l.g oo co
m
o
m og k.co
a le m
y il. .c
vi c web ou live om
ew h m m tu .c
m an w ai be om
or ne fb l.a .co
ep l.f .zy ol m
ic ac ng .co
se s.m ebo a.c m
ar ys ok om
ch pa .c
. c o
m yah e.c m
ys o o
pa o.c m
ce om
a m .c
sh ma sn om
op zo .co
im
.e n. m
ho age y bay com
m s. ah .c
e. go oo om
m my og .co
ai sp le m
l.c a .c
om ce om
c .c
w
w b as om
w in t.n
m
.y g e
es
cg ah .co t
sa
i.e oo m
gi
ng es ba .co
.m pn y. m
ys .g com
p o.
ci twace com
m i .c
.m tte o
e r m
en my eb .co
fa
ce
. .e o m
bo lo wik ba .co
ok gin ipe y.c m
.m .y di om
af ah a.
m ia oo or
fri ga y. wa .co g
en m ya rs m
ds e3 ho .co
.m .p o. m
ys og co
o m
w ta pac .co
or g e m
ld ge .co
w d m
in .c
m ne o
lo ee r.c m
g b o
m my in.li o.c m
ap p ve om
s. oin .c
go ts om
og .c
le om
m
a .c
w
m p ol. om
ne s.z og co
w yn o.c m
s. g o
fa
ya a m
nt
as
w ho .com
ys
in o
po n ste .co
rts et r.c m
.
se ya flix om
ar ho .co
ch o m
.
al co .ao com
ot m l.c
m ca o
et s m
ric t.n
s. et
co
m
ap
Percent of Total Time Spent on Site
Group-level activity
But different groups distribute their time differently, both on
universally popular and on more niche sites
10%
Jake Hofman
white
non.white
1%
0.1%
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
19 / 39
1%
Education
1%
Income
Jake Hofman
1%
Race
fa
c
m eb
ail oo
.y k.
ah co
ps go oo m
.fa og .co
m ceb le. m
ail o co
.g ok m
o .c
m ogle om
ail .
.li co
w you ve. m
e t c
vie ch m bm ube om
w an wf ail. .co
m n b ao m
or el. .zy l.
ep fa n co
ic ce ga m
s b .c
se .my oo om
ar sp k.c
ch ac o
.y e m
m aho .com
ys o
pa .co
ce m
.c
am msn om
sh az .c
op on om
.e .c
im
b o
ag y ay m
ho es ah .co
m .g oo m
e. oo .c
m
o
m ys gle m
ail pa .co
.c ce m
om .c
ca om
w
s
w b t.n
w in e
.y g. t
a
m
cg ho com
es
i. o.
sa
gin e eba com
g. spn y.c
m .g om
ys o
pa .co
c m
cim tw e.c
.m itter om
e .
m e co
en y.e bo. m
c
fa
ce lo .wik bay om
bo g ip .c
ok in. ed om
.m ya ia.
af ho org
ia o
m wa .co
y
frie gam .ya rs.c m
nd e3 ho om
s. .p o.c
m og om
ys o
pa .co
w ta ce. m
or gg c
ld e om
w d.
in co
m ner m
.
lo eeb com
gin o
.c
m my .live om
ap po .c
s. int om
go s.
og co
le m
.
a co
m
w p ol.c m
m o o
ne s.z go. m
w yn co
s. ga m
ya .c
fa
h o
nt
w oo m
as
in .c
s o
ys
po n ter. m
rts etf co
li m
.
y
se ah x.c
ar o om
ch o.c
.a om
alo com ol.c
tm ca om
et st.
ric ne
s. t
co
m
1%
Sex
ap
1%
Age
Percent of Total Time Spent on Site
10%
0.1%
10%
0.1%
10%
0.1%
10%
0.1%
10%
0.1%
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
20 / 39
Group-level activity
There is both reasonable overlap and variation amongst the most
popular sites within groups.
Over/Under 25
Years Old
●
Female/Male
●
Under/Over $50,000
Household Income
●
White/Non−White
●
College/No College
●
0.0
0.2
0.4
0.6
0.8
1.0
Jaccard Similarity
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
21 / 39
Individual-level prediction
How well can one predict an individual’s demographics from their
browsing activity?
• Represent each user by the set of sites visited
• Fit linear models3 to predict majority/minority for each
attribute on 80% of users
• Tune model parameters using a 10% validation set
• Evaluate final performance on held-out 10% test set
3
Jake Hofman
Using SVM-perf
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
22 / 39
Individual-level prediction
Reasonable (∼70-85%) accuracy and AUC across all attributes
AUC
Accuracy
Over/Under 25
Years Old
●
Female/Male
●
White/Non−White
●
●
●
Under/Over $50,000
Household Income
●
●
College/No College
●
●
.5
Jake Hofman
●
([email protected])
.6
.7
.8
.9
1 .5
Learning from Web Activity
.6
.7
.8
.9
1
ICSA, 2011.06.27
23 / 39
Individual-level prediction
Highly-weighted sites under the fitted models
Large positive weight
Large negative weight
winster.com
lancome-usa.com
marlboro.com
cmt.com
news.yahoo.com
linkedin.com
evite.com
classmates.com
eharmony.com
tracfone.com
sports.yahoo.com
espn.go.com
mediatakeout.com
bet.com
youtube.com
myspace.com
addictinggames.com
youtube.com
rownine.com
matrixdirect.com
Female
White
College Educated
Over 25 Years Old
Household Income
Under $50,000
Table 2: A selection of the most predictive (i.e., most highly weighted) sites for each classification task.
AUC
Accuracy
Over/Under 25
Years Old
!
Female/Male
White/Non−White
!
College/No College
!
Jake Hofman
!
!
Under/Over $50,000
Household Income
([email protected])
.5 .6 .7 .8 .9
Figure 7, a measure that effectively re-normalizes the majority and minority classes to have equal size. Intuitively,
AUC is the probability that a model scores a randomly selected positive example higher than a randomly selected neg!
ative one (e.g., the probability that the model correctly distinguishes between a randomly selected female and male).
Though an uninformative rule would correctly discriminate
between
such pairs 50% of the time, ICSA,
predictions
based on 24 / 39
Learning
from
Web
Activity
2011.06.27
.8 .9
1
!
!
!
!
1 .5
.6
.7
Individual-level prediction
Similar performance even when restricted to top 1k sites
0.9
0.9
●
●
●
0.8
●
0.8
●
●
●
Accuracy
AUC
●
0.7
●
Age
Sex
0.6
0.7
●
Race
Race
Education
Income
0.5
102
102.5
103
103.5
104
104.5
105
Education
Income
0.5
102
Number of Domains
Jake Hofman
([email protected])
Age
Sex
0.6
102.5
103
103.5
104
104.5
105
Number of Domains
Learning from Web Activity
ICSA, 2011.06.27
25 / 39
Individual-level prediction
Substantially better performance when restricted to “stereotypical”
users (∼80-90%)
0.95
0.95
●
●
0.90
0.90
●
●
●
●●
●
●
0.80
●
0.85
Accuracy
AUC
0.85
Age
●
●
●
0.80
●
Sex
0.75
Sex
0.75
Race
Race
Education
Education
Income
0.70
0.0
0.2
Income
0.70
0.4
0.6
0.8
1.0
0.0
Fraction of Users
Jake Hofman
●●
●
●
Age
([email protected])
0.2
0.4
0.6
0.8
1.0
Fraction of Users
Learning from Web Activity
ICSA, 2011.06.27
26 / 39
Individual-level prediction
Proof of concept browser demo
http://bit.ly/surfpreds
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
27 / 39
Demographics diversity on the Web
with Irmak Sirer and Sharad Goel
The real story
(what we actually did)
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
28 / 39
The real story
0. Got several hundred GBs of MegaPanel data from Nielsen4 ,
looked at a small sample
# ls -alh nielsen_megapanel.tar
-rw-r--r-- 100G Jul 17 13:00 nielsen_megapanel.tar
4
Jake Hofman
Special thanks to Mainak Mazumdar
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
29 / 39
The real story
1. Discussed (many) possible projects
• Infer the number of individuals using the same browser or
behind the same ip?
• Determine number of actual uniques advertisers are receiving?
• Predict user demographics from a few minutes of browsing
activity for ad-targeting?
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
30 / 39
The real story
2. Modeled real-valued age as a function of site visits
• Worked on this for an embarassingly long time
• Tried various options for cleaning and normalizing data
(100GB → ∼5GB)
• Investigated several methods for feature selection
(e.g., naive Bayes, mutual information, popularity )
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
31 / 39
Hadoop + Pig (+ awk)
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
32 / 39
The real story
3. Settled for classification of binary outcomes (e.g.
adult/non-adult)
• 265,000 users (examples) and 100,000 sites (features)
• Logistic regression in R, e.g.
model <- glm(is.adult ~ ., data=nielsen, family=binomial)
???
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
33 / 39
The real story
3. Settled for classification of binary outcomes (e.g.
adult/non-adult)
ŷ (xi ) = w · xi + b
Support Vector Machine : L(y , ŷ ) = C
Jake Hofman
([email protected])
P
Learning from Web Activity
i
[1 − yi ŷ (xi )]+ + ||w ||2
ICSA, 2011.06.27
34 / 39
The real story
4. Investigated why classification worked reasonably well
• Generated descriptive statistics across all attributes at the site
and group levels
• Compared site statistics to ZIP code data from the US Census
• Compared time distribution across groups
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
35 / 39
The real story
5. Realized that we now had one of the most comprehensive
studies available on demographic diversity of the web
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
36 / 39
Conclusion
(lessons learned)
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
37 / 39
Lessons learned
Data jeopardy
Regardless of scale, it’s difficult to find the right questions to ask
of the data
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
38 / 39
Lessons learned
Rapid iteration
The ability to iterate quickly, asking and answering many
questions, is crucial
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
38 / 39
Lessons learned
Data cleaning
Cleaning and normalizing data is a substantial amount of the work
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
38 / 39
Lessons learned
Modeling
Simple methods (e.g., linear models) work surprisingly well,
especially with lots of data
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
38 / 39
Lessons learned
It’s easy to cover your tracks—things are often much more
complicated than they appear
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
38 / 39
Thanks. Questions?
http://messymatters.com/webdemo
http://jakehofman.com
[email protected]
Jake Hofman
([email protected])
Learning from Web Activity
ICSA, 2011.06.27
39 / 39