Releasing Search Queries and Clicks Privately
Aleksandra Korolova, Stanford University
Krishnaram Kenthapadi, Microsoft Research – Search Labs
Nina Mishra, Microsoft Research – Search Labs
Alexandros Ntoulas, Microsoft Research – Search Labs

All examples are fictitious

Why Release Search Logs
• Online Ad Campaign
• Query Suggestions
• Mining Search Data
• Social Science

Why Search Logs are Private

Previous Approaches
• Anonymize Usernames/Omit IP Addresses
• AOL data release, 2006
  • CTO resigned, 2 employees fired
  • Class action lawsuit pending
  • CNN Money: “101 dumbest moments in business”

Searches by user 4417749
Query                                  Date         Time
landscapers in lilburn ga.             3/6/2006     18:37:26
effects of nicotine                    3/7/2006     19:17:19
jarrett t. arnold eugene oregon        3/23/2006    21:48:01
plastic surgeons in gwinnett county    3/28/2006    15:04:23
60 single men                          3/29/2006    20:11:52
clothes for 60 plus age                4/19/2006    12:44:03
lactose intolerant                     4/21/2006    20:53:51
dog who urinate on everything          4/28/2006    13:24:07
Thelma Arnold, 62, from Lilburn, Georgia

Ad-hoc Techniques do Not Work
• Remove names, dates, numbers, locations
• Token-based hashing fails [Kumar, Novak, Pang, Tomkins WWW’07]
• Release only frequent queries
  • “MIT math major with multiple sclerosis”
  • What’s sufficiently frequent?
• Combining data from multiple sources
  • Previous/future releases useful to break privacy

Our Goal
Can we release search logs with provable privacy guarantees while preserving usefulness?

Rigorous Privacy Definition

Desired Features of Privacy Definition
• No assumptions on attacker’s
  • prior knowledge
  • computational powers
  • access to other datasets
• No assumptions on user’s
  • search patterns
  • what constitutes private information

Differential Privacy [Dwork et al, 2006]
[Diagram: Search Log → Randomized release algorithm A → Released dataset → Knowledge about the user, shown twice; the two outcomes are related by ≈ (ε,δ)]
The risk to one’s privacy increases only slightly from using the search engine

Our Approach
• Query-Click Graph
• Data Release Algorithm
• Privacy Guarantees

Query-Click Graph
[Figure: query-click graph linking queries (Shoes, Running shoes, Women’s shoes), each annotated with # times posed, to URLs (www.shoes.com, en.wikipedia.org/wiki/Shoes, www.zappos.com, www.shoebuy.com, www.payless.com), with edges annotated with # clicks]
Useful for many applications
• Related searches
• Spell corrections
• Expanding acronyms
• Estimating CTRs
Computations on query-click graph

Releasing Queries Privately
• Determined by desired privacy guarantees
• Add random noise from Laplace distribution
• Exceeds specified threshold?

Query                        Count    Noisy Count
Weather in Madrid            1150     1159
WWW 2009                     900      903
Data-mining                  710      698
Report a stolen passport     20       19
Aleksandra (650) 796-4536    2        7
Released? Decided by whether the noisy count exceeds the specified threshold.

Understanding Private Query Release
• Why add random noise?
  • Suppose an attacker has a guess for my SSN and poses a query containing the guess (threshold − 1) times
• What if one user disproportionately influences the log?
  • Solution: limit each user’s activity to d queries and d_c clicks
  • Caveat: a user who searches from multiple computers is treated as separate users

Probability of Release Depending on Frequency
[Plot: probability of release (0% to 100%) against query frequency (0 to 200), for Threshold=100, Noise=Laplace(5)]

Releasing Queries and Clicks Privately
• Choose:
  • Desired privacy guarantees (ε,δ)
  • Limit on user activity d, d_c
  • Threshold
  • Noise Level
• Release Queries:
  • whose noisy frequency counts exceed the threshold
• Release URL Click Counts:
  • Given released query, top 10 URLs returned are public
  • Release noisy click counts for top 10 URLs

Theorem: Algorithm Provably Private
Satisfies (ε,δ)-differential privacy when:
• Keeping the first d queries per user
• Threshold K = d(1 − ln(2δ/d)/ε)
• Noise from Laplace distribution with scale b = d/ε
Quantifies what constitutes “sufficiently frequent” queries

Setting parameters. Given the numerous parameters in the theorems, a natural question is how to set them. As mentioned earlier, it is up to the data releaser to choose ε, while it is advisable that δ < 1/n, where n is the number of users. What about the remaining parameters? Lemma ?? offers answers for optimally setting threshold K and noise b when the desired privacy parameters and the limit for the number of queries per user are known:

\[ K = d\left(1 - \frac{\ln(2\delta/d)}{\varepsilon}\right) \quad\text{and}\quad b = \frac{d}{\varepsilon}. \]

Table 1 shows how the optimal choices of threshold K and noise b vary as a function of the number of queries allowed per user, d, for fixed privacy parameters, e^ε = 10 and δ = 10^−5.
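The threshold-and-noise recipe from the theorem can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the log format (a list of `(user, query)` pairs) are hypothetical, only the query-release step is covered (click counts are omitted), and the sketch returns only which queries pass the threshold, not their counts.

```python
import math
import random


def optimal_threshold_and_scale(eps, delta, d):
    """Threshold K and Laplace scale b as stated in the theorem:
    K = d * (1 - ln(2*delta/d) / eps),  b = d / eps."""
    b = d / eps
    K = d * (1.0 - math.log(2.0 * delta / d) / eps)
    return K, b


def release_probability(count, K, b):
    """P(count + Laplace(0, b) > K), from the Laplace tail formula."""
    t = K - count
    if t >= 0:
        return 0.5 * math.exp(-t / b)
    return 1.0 - 0.5 * math.exp(t / b)


def laplace(b, rng):
    """Laplace(0, b) sample: difference of two exponential draws with mean b."""
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)


def limit_user_activity(log, d):
    """Keep only the first d queries of each user (hypothetical log format)."""
    seen = {}
    kept = []
    for user, query in log:
        seen[user] = seen.get(user, 0) + 1
        if seen[user] <= d:
            kept.append((user, query))
    return kept


def release_queries(log, eps, delta, d, rng):
    """Sorted list of queries whose noisy frequency count exceeds K."""
    K, b = optimal_threshold_and_scale(eps, delta, d)
    counts = {}
    for _, query in limit_user_activity(log, d):
        counts[query] = counts.get(query, 0) + 1
    return sorted(q for q, c in counts.items() if c + laplace(b, rng) > K)


if __name__ == "__main__":
    eps, delta, d = math.log(10), 1e-5, 1   # e^eps = 10, delta = 10^-5
    K, b = optimal_threshold_and_scale(eps, delta, d)
    print(f"K = {K:.2f}, b = {b:.2f}")
    # Fictitious log: one popular query and one user repeating a rare query.
    log = [(f"u{i}", "weather in madrid") for i in range(1000)]
    log += [("attacker", "some rare query")] * 50
    print(release_queries(log, eps, delta, d, random.Random(0)))
```

With d = 1, the 50 repetitions by a single user count once, so that query is released only if the noise alone lifts a count of 1 above K, which `release_probability(1, K, b)` puts at δ.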
We now state formally the (ε, δ)-differential privacy guarantees that our algorithm provides. We then provide a sketch of the proof that each of the individual steps preserves privacy and, further, that their composition preserves privacy.

Let K, d, d_c, b, b_q, b_c be the parameters of the algorithm such that K ≥ d. Define

\[ \alpha = \max\left(e^{1/b},\ 1 + \frac{1}{2e^{(K-1)/b} - 1}\right) \]

and the multiplicative and additive privacy parameters as

\[ \varepsilon = d \cdot \ln \max\left(e^{1/b},\ 1 + \frac{1}{2e^{(K-1)/b} - 1}\right) + \frac{d_c}{b_c} \]

Utility
• Released Data Characteristics
• Social Science Research
• Algorithmic Application

Quantity of Privately Releasable Data
• Distinct Queries: 2.5 million
• Impressions: 3.5 billion
Example queries releasable:
• How to tie a windsor knot
• Girl born with 8 limbs
• Cash register software
• Vintage aluminum Christmas trees

Utility: Studying Human Nature [Tancer “Click” 2008]
“Fear of …” queries
Rank    Phone Survey           Original Search Log    Released Queries
1       Bugs, mice, snakes     Flying                 Flying
2       Heights                Heights                Heights
3       Water                  Snakes, spiders        Public Speaking
4       Public transportation  Death                  Snakes, spiders
5       Storms                 Public speaking        Death
6       Closed spaces          Commitment             Commitment
7       Tunnels and bridges    Intimacy               Abandonment
8       Crowds                 Abandonment            The dark
9       Speaking in public     The dark               Intimacy
Social Fears

Utility: Recommending Keywords to Online Advertisers
• Launch an online ad campaign around a concept
• Goal: given a seed set of keywords/URLs, suggest relevant keywords.
• Solution: Random walk on Query-Click Graph [Fuxman, Tsaparas, Achan, Agrawal, WWW’08]

Recommending Keywords: Original / Private (13% of Original)
flight travelocity travelosity wwwtravelocity com travalocity travelosity com aarp passport travalosity travilocity air fares travel velocity travleocity airfare travelacity travlocity airfares travellocity travolicity cheap flights travellocity com travelocity travolocity trvelocity cruises flight travelocity travelocity air fares ww travelocity com flights travelocity ca www travellocity com last minute travel travelocity cheap flight www travelocity last minute travel deals travelocity com www travelocity co vacation packages travelocity vacations www travelocity com vacations to go travelociy www travelosity com travellosity traveloscity flights

Conclusions
Contributions
• Algorithm for releasing queries and clicks with provable privacy guarantees
• Non-trivial amount of queries, impressions, clicks
• Evidence that released data preserves utility
• Releasing frequent queries works
  • Quantify frequent
• Explored the trade-offs between privacy and utility
Future Work
• Grouping similar queries
• Choosing privacy parameters in practice
• Beyond privacy of users

Thank you! Questions?