Releasing Search Queries and Clicks Privately

Transcription

Aleksandra Korolova, Stanford University
Krishnaram Kenthapadi, Microsoft Research – Search Labs
Nina Mishra, Microsoft Research – Search Labs
Alexandros Ntoulas, Microsoft Research – Search Labs
All examples are fictitious
Why Release Search Logs
Online Ad Campaign
Query Suggestions
Mining Search Data
Social Science
Why Search Logs are Private
Previous Approaches
Anonymize Usernames/Omit IP Addresses
AOL data release, 2006
• CTO resigned, 2 employees fired
• Class-action lawsuit pending
• CNN Money: "101 dumbest moments in business"
Searches by user 4417749:

3/6/2006  18:37:26  landscapers in lilburn ga.
3/7/2006  19:17:19  effects of nicotine
3/23/2006 21:48:01  jarrett t. arnold eugene oregon
3/28/2006 15:04:23  plastic surgeons in gwinnett county
3/29/2006 20:11:52  60 single men
4/19/2006 12:44:03  clothes for 60 plus age
4/21/2006 20:53:51  lactose intolerant
4/28/2006 13:24:07  dog who urinate on everything

Thelma Arnold, 62, from Lilburn, Georgia
Ad-hoc Techniques Do Not Work

- Remove names, dates, numbers, locations
- Token-based hashing fails [Kumar, Novak, Pang, Tomkins WWW'07]
- Release only frequent queries
  - "MIT math major with multiple sclerosis"
  - What's sufficiently frequent?
- Combining data from multiple sources
  - Previous/future releases are useful to break privacy
Our Goal

Can we release search logs with:
- provable privacy guarantees, while
- preserving usefulness?

Rigorous Privacy Definition

Desired Features of Privacy Definition
- No assumptions on the attacker's:
  - prior knowledge
  - computational powers
  - access to other datasets
- No assumptions on the user's:
  - search patterns
  - what constitutes private information
Differential Privacy [Dwork et al, 2006]

[Diagram: a search log containing a given user's data and the same log without it are each run through randomized release algorithm A; the knowledge about that user obtainable from the two released datasets is (ε,δ)-close.]

The risk to one's privacy increases only slightly from using the search engine.
Our Approach
Query-Click Graph
Data Release Algorithm
Privacy Guarantees
Query-Click Graph

[Figure: bipartite query-click graph. Queries with the number of times they were posed (Shoes: 1200, Running shoes: 300, Women's shoes: 900) are connected to URLs (www.shoes.com, en.wikipedia.org/wiki/Shoes, www.zappos.com, www.shoebuy.com, www.payless.com) by edges labeled with click counts (700, 100, 100, 20, 70, 10, 333, 250).]

Useful for many applications:
- Related searches
- Spell corrections
- Expanding acronyms
- Estimating CTRs
- Computations on the query-click graph
Releasing Queries Privately

- Add random noise from the Laplace distribution to each query's count
- Release the query if its noisy count exceeds a specified threshold (the threshold is determined by the desired privacy guarantees)

Query                      | Count | Noisy Count
Weather in Madrid          | 1150  | 1159
WWW 2009                   | 900   | 903
Data-mining                | 710   | 698
Report a stolen passport   | 20    | 19
Aleksandra (650) 796-4536  | 2     | 7

Released? Only the queries whose noisy counts exceed the threshold.
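The noisy-threshold test can be sketched in a few lines; the counts mirror the slide's example table, and the function names are illustrative rather than from the paper's implementation.

```python
import math
import random

def laplace_noise(scale, rng=random):
    # Inverse-CDF sample from Laplace(0, scale)
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

# Query counts from the slide's table
counts = {
    "Weather in Madrid": 1150,
    "WWW 2009": 900,
    "Data-mining": 710,
    "Report a stolen passport": 20,
    "Aleksandra (650) 796-4536": 2,
}

def noisy_release(counts, threshold, scale):
    """Release only the queries whose noisy count exceeds the threshold."""
    released = {}
    for query, count in counts.items():
        noisy = count + laplace_noise(scale)
        if noisy > threshold:
            released[query] = noisy
    return released
```

With Threshold = 100 and Laplace(5) noise, the frequent queries survive while the phone-number query is withheld with overwhelming probability: its count of 2 would need a noise draw above 98.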
Understanding Private Query Release

- Why add random noise? Suppose an attacker has a guess for my SSN and poses the query containing the guess (threshold − 1) times.
- What if one user disproportionately influences the log? Solution: limit each user's activity to d queries and d_c clicks. Caveat: a user who searches from multiple computers is treated as two users.
Probability of Release Depending on Frequency

[Figure: probability of release (0–100%) as a function of query frequency (0–200), for Threshold = 100 and Laplace(5) noise; the curve rises steeply near the threshold, crossing 50% at frequency 100.]
Releasing Queries and Clicks Privately

Choose:
- Desired privacy guarantees (ε,δ)
- Limit on user activity: d queries, d_c clicks
- Threshold
- Noise level

Release queries whose noisy frequency counts exceed the threshold.

Release URL click counts: given a released query, the top 10 URLs returned for it are public, so release noisy click counts for those top 10 URLs.
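Putting the steps together, here is a minimal sketch of the release procedure. It simplifies the separate d/d_c activity limits to a single truncation of each user's log, and assumes the click tables are already restricted to the top-10 URLs; the data-structure layout and names are mine, not the paper's.

```python
import math
import random
from collections import Counter, defaultdict

def laplace_noise(scale, rng):
    # Inverse-CDF sample from Laplace(0, scale)
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def release_log(user_logs, d, threshold, b_q, b_c, seed=0):
    """user_logs: {user: [(query, clicked_url_or_None), ...]}.
    Keep each user's first d entries, release the queries whose noisy
    counts exceed the threshold, then release noisy click counts for
    the (assumed top-10) URLs of each released query."""
    rng = random.Random(seed)
    query_counts = Counter()
    click_counts = defaultdict(Counter)
    for entries in user_logs.values():
        for query, url in entries[:d]:  # limit each user's activity
            query_counts[query] += 1
            if url is not None:
                click_counts[query][url] += 1
    released = {}
    for query, count in query_counts.items():
        if count + laplace_noise(b_q, rng) > threshold:
            released[query] = {url: n + laplace_noise(b_c, rng)
                               for url, n in click_counts[query].items()}
    return released
```

Note that the noisy threshold test and the noisy click counts use separate noise scales (b_q and b_c), matching the separate query and click budgets chosen above.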
 
Theorem: Algorithm Provably Private

- Satisfies (ε,δ)-differential privacy when:
  - Threshold K = d(1 − ln(2δ/d)/ε)
  - Noise drawn from the Laplace distribution with scale b = d/ε
  - Keeping only the first d queries per user
- Quantifies what constitutes "sufficiently frequent"

Setting parameters. Given the parameters in the theorems, a natural question is how to set them. As mentioned earlier, it is up to the data releaser to choose ε, while it is advisable to choose δ < 1/n, where n is the number of users. What about the remaining parameters? The lemma offers answers for optimally setting the threshold K and noise b when the desired privacy parameters and the limit d on the number of queries per user are known:

    K = d(1 − ln(2δ/d)/ε)   and   b = d/ε

Table 1 shows how the optimal choices of threshold K and noise b vary as a function of the number of queries allowed per user, d, for fixed privacy parameters ε = 10 and δ = 10^(−5).

We now state formally the (ε, δ)-differential privacy guarantees that our algorithm provides. We then provide a sketch of the proof that each of the individual steps preserves privacy and, further, that their composition preserves privacy.

Let K, d, d_c, b, b_q, b_c be the parameters of the Algorithm such that K ≥ d. Define

    α = max(e^(1/b), 1 + 1/(2e^((K−1)/b) − 1))

and the multiplicative and additive privacy parameters as

    ε = d · ln α + d_c/b_c
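The lemma's closed-form choices K = d(1 − ln(2δ/d)/ε) and b = d/ε can be evaluated directly; a small sketch (the function name is mine, not the paper's):

```python
import math

def optimal_parameters(eps, delta, d):
    """Threshold K and Laplace scale b from the lemma:
    K = d * (1 - ln(2*delta/d) / eps)  and  b = d / eps.
    Since 2*delta/d < 1, the log term is negative, so K > d."""
    b = d / eps
    K = d * (1 - math.log(2 * delta / d) / eps)
    return K, b
```

For example, ε = 10, δ = 10^(−5), and d = 20 give b = 2 and K ≈ 47.6.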
Utility
Released Data Characteristics
Social Science Research
Algorithmic Application
Quantity of Privately Releasable Data

Distinct queries: 2.5 million
Impressions: 3.5 billion

Example queries releasable:
- How to tie a windsor knot
- Girl born with 8 limbs
- Cash register software
- Vintage aluminum Christmas trees
Utility: Studying Human Nature [Tancer "Click" 2008]

"Fear of ..." queries

Rank | Phone Survey          | Original Search Log | Released Queries
1    | Bugs, mice, snakes    | Flying              | Flying
2    | Heights               | Heights             | Heights
3    | Water                 | Snakes, spiders     | Public speaking
4    | Public transportation | Death               | Snakes, spiders
5    | Storms                | Public speaking     | Death
6    | Closed spaces         | Commitment          | Commitment
7    | Tunnels and bridges   | Intimacy            | Abandonment
8    | Crowds                | Abandonment         | The dark
9    | Speaking in public    | The dark            | Intimacy

Social fears (commitment, intimacy, abandonment) surface in the search-log columns but not in the phone survey.
Utility: Recommending Keywords to Online Advertisers

- Launch an online ad campaign around a concept
- Goal: given a seed set of keywords/URLs, suggest relevant keywords
- Solution: random walk on the Query-Click Graph [Fuxman, Tsaparas, Achan, Agrawal, WWW'08]
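A random walk with restart on a toy query-click graph illustrates the idea; the graph, click counts, and all names below are made up for illustration and are not data from the paper.

```python
import random
from collections import defaultdict

# Hypothetical toy graph: query -> {url: click count}
CLICKS = {
    "shoes": {"www.shoes.com": 700, "www.zappos.com": 100},
    "running shoes": {"www.zappos.com": 70, "www.shoes.com": 20},
    "women's shoes": {"www.shoes.com": 333, "www.payless.com": 250},
}

def weighted_choice(weights, rng):
    # Pick a key with probability proportional to its click count
    items = list(weights)
    return rng.choices(items, weights=[weights[i] for i in items], k=1)[0]

def recommend(seed_query, steps=5000, restart=0.15, seed=0):
    """Walk query -> URL -> query, stepping proportionally to click
    counts; the most frequently visited queries are recommended."""
    rng = random.Random(seed)
    # Reverse index: url -> {query: click count}
    by_url = defaultdict(dict)
    for q, urls in CLICKS.items():
        for u, c in urls.items():
            by_url[u][q] = c
    visits = defaultdict(int)
    q = seed_query
    for _ in range(steps):
        if rng.random() < restart:
            q = seed_query  # restart at the seed keyword
        u = weighted_choice(CLICKS[q], rng)
        q = weighted_choice(by_url[u], rng)
        visits[q] += 1
    ranked = sorted(visits, key=visits.get, reverse=True)
    return [r for r in ranked if r != seed_query]
```

Ranking queries by visit frequency approximates the weight the restarted walk assigns to each query relative to the seed; running the same walk on the privately released graph yields the "private" recommendations compared below.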
Recommending Keywords: Original vs. Private (13% of Original)

Original recommendations (dominated by misspellings of "travelocity"): flight travelocity, travelosity, wwwtravelocity com, travalocity, travelosity com, travalosity, travilocity, travel velocity, travleocity, travelacity, travlocity, travellocity, travolicity, travellocity com, travelocity, travolocity, trvelocity, travelocity air fares, ww travelocity com, travelocity ca, www travellocity com, travelocity cheap flight, www travelocity, travelocity com, www travelocity co, travelocity vacations, www travelocity com, travelociy, www travelosity com, travellosity, traveloscity

Private recommendations (13% of Original): aarp passport, air fares, airfare, airfares, cheap flights, cruises, flight travelocity, flights, last minute travel, last minute travel deals, vacation packages, vacations to go
Conclusions

Contributions
- Algorithm for releasing queries and clicks with provable privacy guarantees
- Non-trivial amount of queries, impressions, clicks
- Evidence that released data preserves utility
- Releasing frequent queries works; quantified "frequent"
- Explored the trade-offs between privacy and utility

Future Work
- Grouping similar queries
- Choosing privacy parameters in practice
- Beyond privacy of users
Thank you!
Questions?