Presentation Slides (PDF format, 2744 KB)
Transcription
Involuntary Information Leakage in Social Network Services
Ieng-Fat Lam, Kuan-Ta Chen, and Ling-Jyh Chen
Institute of Information Science, Academia Sinica
Presenter: Ieng-Fat Lam
2009/2/2

Outline
• Introduction
• Motivation
• Research Method
• Results
• Discussion
• Conclusion

Introduction :: Social Networking Services (SNSs)
For example
• MySpace, Facebook, Orkut, Yahoo! 360
• Mixi, GREE (Japan)
• Wretch (Taiwan)
Have become very popular
Host millions of profiles

Introduction :: Users in SNSs
Social activities
• Meet new friends, contact existing friends
• Share resources over the Internet
Personal information is usually published
• Photos
• Identity information
• Contact information

Introduction :: Disclosing personal information
Double‐edged sword
• Lets other people know you and search for you
• But some people may not respond nicely
• Risk of personal information being used by malicious people
[Speech bubble] “I am Lee-Da Nu! I love movie. I am 23 years old, single!!”

Introduction :: Not revealing personal information?
[Speech bubble] “I never disclose my info to the Internet!”

Introduction :: Information revealed by friends

Introduction :: Information revealed
[I got it!]
Real name: Andrew Richman
Gender: Male
Age: 20 ~ 22
Education record:
• Sunrise elementary school
• St. John secondary school
• St. Paul University

Motivation :: Involuntary information leakage
A user may want to protect his/her identity
• But it may be unintentionally revealed by friends
• Such leakage is hard to detect
  - Due to the distributed nature of the Internet
• It is becoming a serious threat to privacy

Motivation :: In this study
We would like to
• Investigate the extent of involuntary information leakage
• Gather data from Wretch (http://www.wretch.cc)
  - The most popular SNS in Taiwan
  - About 4 million user profiles
• Quantify the degree of such leakage
  - Real name, age, and education record
• Discuss potential means to mitigate the problem

Research Method :: Data Collection
[Diagram: crawl loop over the user ID list (john123, Aron, roserose, iamboy, …) and the friend lists (Andy, Orange, …)]
1. Pick an ID randomly
2. Obtain the user profile and friend list (HTML)
3. Parse and save the crawled user data
4. Add user ID to the ID list
5. Update the ID list

Research Method :: An example
[Screenshots: friend list, user profile]

Research Method :: Overview of Crawled Data
Wretch data
  Number of users               766,972 (20%)
  Number of effective users*    592,548 (15%)
  Number of connections         7,619,212
  Avg. connections per user     11.5
* An effective user has at least one “outgoing” friend connection

Research Method :: Analysis of Name Leakage
Friend annotations in Wretch
• Free‐form text describing a friend
• Used for
  - Classification
  - Real name or nickname of a friend
  - A feature of a friend
For example
• *Beauty Cathy Brown – The hottest girl of Nightingale High School
• [[ School Mate ]] Tony MY BUDDY

Research Method :: Name Inference Process
1. Obtain friend annotations (for each profile)
2. Generate name candidates
Infer first name

Research Method :: Generate name candidates
To infer the real name of a profile
• Collect all of its incoming annotations
• Extract name candidates from the annotations
[Diagram: incoming annotations from friends Aron, Andy, and Sammy]
• Andrew!!
• Yo~ Bros. Andrew!!
• Old Mr. Richman!!
• Cool~~ Andrew Richman!!

Research Method :: Generate name candidates (cont.)
Extraction method
• Break the text into tokens by
  - Symbols: <space>, <tab>, ‘#’, ‘@’, etc.
  - Punctuation marks: ‘ ” , . () []
  - Connective words (in Chinese)
• Chinese‐specific naming rules
  - 陳寬達 (Chen Kuan‐Ta)
  - Two‐character tokens as first name candidates
  - Three‐character tokens as full name candidates
• A duplication count is associated with each candidate

Research Method :: An example
[Diagram: annotations from friends, shown in English with the Chinese original]
• Andrew!! (德榮!!)
• Yo~ Andrew~Bros (喔~德榮~德榮兄!!)
• Old Mr. Richman~!! (老劉~!!)
• Cool~~ Andrew Richman!! (超帥~~ 劉德榮!!)
Name candidates [duplication count]
• 德榮 (Andrew) [1]
• 超帥 (Cool) [0]
• 劉德榮 (Andrew Richman) [0]
• 德榮兄 (Bros Andrew) [0]
• 喔 (Yo) [0]
• 老劉 (Old Mr. Richman) [0]
[Diagram labels: full name candidates, first name candidates, duplication count]

Research Method :: Inference of full name (1 / 5)
Common family name
• The family name part is a common family name
• Duplication count is greater than 1
• For example
  - For the full name candidate “Andrew Richman”
  - If “Andrew Richman” appears in more than one annotation
  - If “Richman” is a common family name
[1] Chih-Hao Tsai, “Common Chinese Names”, http://technology.chtsai.org/namefreq/

Research Method :: Inference of full name (2 / 5)
First name as a substring of full name
• A first name candidate appears as a substring
  - In the right position
• Duplication count is greater than 1
• For example
  - For the full name candidate “Andrew Richman”
  - If “Andrew Richman” appears in more than one annotation
  - If “Andrew” is also a first name candidate

Research Method :: Inference of full name (3 / 5)
Common full name
• Compare with an existing full name list
• National college exam enrollment list
  - List maintained from 1994 to 2007
  - 574,010 distinct full names
[2] Chih-Hao Tsai, “A list of Chinese Names”, http://technology.chtsai.org/namelist/

Research Method :: Inference of full name (4 / 5)
Nickname decomposition
• In a Chinese name
  - FN GN1‐GN2 (陳寬達)
• Possible nicknames:
  - Prefix + X
  - Prefix + X + X
  - X + postfix
  - Where X can be FN, GN1 or GN2
[Callouts]
• For “Andrew Richman”, we also have “Bros Andrew”
• “Bros” is a predefined prefix
• Removing “Bros”, we get “Andrew”
• “Andrew” is in “Andrew Richman”

Research Method :: Inference of full name (5 / 5)
Common words removal
• If no matching candidate is found by the above rules
• If the duplication count is greater than 1
• If the full name candidate is not a nickname
  - Does not contain any nickname prefix or postfix
• Not a (or based on a) common word
  - Compared against 100,511 common words
• Select the one with the highest duplication count

Research Method :: Inference of First Name
Uses the same method as the inference of full name
• Common first name
  - Compare with 208,581 first names
  - Requires a duplication count greater than 1
• Nickname decomposition
• Common word removal

Results :: Name Inference Results
Ratio of inferred names
  Type of name               Ratio of name inference
  Nickname                   60%
  Real name (full name)      30%
  First name                 72%
  Real name or first name    78%

Results :: Validation
Manual examination of real names
• Randomly selected 1,000 profiles
• 738 of them are unique and correct
  - Further examinations gave similar results
• Wrong cases: the user’s nickname
• Sufficient to support the conjecture
  - Involuntary real name leakage occurs in real‐life social network systems, and the degree of leakage is significant

Results :: Ratio of Name Leakage
Figure 2: Ratio of name leakage based on users’ gender
Figure 3: Relation of users’ age and ratio of name leakage

Results :: Risk Analysis
To confirm that the identity leakage is involuntary
• We check the inferred name against the user’s profile
  - Fewer than 0.1% of users reveal their real names
To quantify the tendency of using real names
• Degree of Using Real name (DUR)
  - Ratio of a user’s outgoing annotations that contain the real name of the annotation target
• Degree of being Called by Real name (DCR)
  - Ratio of incoming annotations containing the user’s real name

Results :: Example of DUR and DCR
  Criteria      DCR    DUR
  First name    1/5    4/5
  Full name     1/5    1/5
  Either        2/5    5/5
[Diagram: the profile “Andrew” and the annotations exchanged with its friends: [Friend] Raymond Aron, [Friend] Sammy Hagar, Our King!, [Friend] David Jones, Bros Andrew, Cool~ Andrew Richman, Yo~What’s up man, [Friend] Jay leno, [Friend] John Lennon]

Results :: Positive relation between DUR and DCR
Figure 4: Relation of DUR and DCR

Research Method :: Involuntary leakage of age and education records
Inferring age
• Round‐based manner
• If X disclosed an age, and has a friend Y
• If X and Y have a relation of “classmate”, “same class”…
• Assign the age of X to Y
• Then check Y’s “classmates”

Research Method :: Involuntary leakage of age and education records
Inferring education records
• Same as inferring age
• Divided into four education levels, inferred separately
  - Elementary school
  - Junior high school
  - Senior high school
  - College
• Define relation by keyword
  - “same school”, “same college”, etc.
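The round-based inference described in the two slides above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `infer_ages`, the `(source, target, text)` annotation tuples, the English keyword list, the choice to treat the classmate relation as symmetric, and the "first assignment wins" rule are all hypothetical. Running the same propagation once per education level, with keywords such as “same school”, would cover the education-record case.

```python
# Hypothetical keyword list; the slides only name "classmate" and "same class".
CLASSMATE_KEYWORDS = ("classmate", "same class")


def infer_ages(disclosed_ages, annotations, keywords=CLASSMATE_KEYWORDS):
    """Propagate self-disclosed ages along classmate relations, round by round.

    disclosed_ages: dict mapping user ID -> self-disclosed age
    annotations:    iterable of (source_id, target_id, annotation_text)
    Returns a dict of inferred ages for users who did not disclose one.
    """
    # Build classmate links from annotation text (assumed symmetric).
    classmates = {}
    for source, target, text in annotations:
        if any(keyword in text.lower() for keyword in keywords):
            classmates.setdefault(source, set()).add(target)
            classmates.setdefault(target, set()).add(source)

    known = dict(disclosed_ages)    # ages known so far (disclosed + inferred)
    inferred = {}                   # ages assigned by the inference only
    frontier = set(disclosed_ages)  # users whose classmates we check this round

    while frontier:
        next_frontier = set()
        for x in sorted(frontier):
            for y in sorted(classmates.get(x, ())):
                if y not in known:  # assumption: first assignment wins
                    known[y] = known[x]
                    inferred[y] = known[x]
                    next_frontier.add(y)
        frontier = next_frontier    # next round: check the new users' classmates

    return inferred


if __name__ == "__main__":
    sample_annotations = [
        ("amy", "bob", "my classmate Bob"),
        ("bob", "cat", "Same class!!"),
        ("cat", "dan", "best friend"),
    ]
    print(infer_ages({"amy": 20}, sample_annotations))
    # prints {'bob': 20, 'cat': 20}
```

In this toy run only "amy" discloses an age; "bob" is reached in the first round and "cat" in the second, while "dan" is not labelled a classmate and stays unknown.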

Results :: Inference results
Figure 5: Inference results of users' ages
Figure 6: Inference results of users' education records

Results :: Validation
Cross‐validation
• Verify inferred ages
  - Based on self‐disclosed education records
• Verify inferred education records
  - Based on self‐disclosed ages
• The age difference should be small
  - To verify that our inference results are accurate

Results :: Validation Results
Figure 7: The inferred age differences between pairs of self-disclosed schoolmates in the four education levels
Figure 8: The self-disclosed age differences between pairs of inferred schoolmates in the four education levels

Discussion :: Threats caused by identity leakage
Stalking
Spamming
• In our data set
  - 46% of users disclosed a valid email address
  - Spam with friends’ (spoofed) email addresses
Phishing
• Spear phishing / social phishing
  - Includes personal information in the phishing email
  - Spoofs a friend’s email address

Discussion :: Spear Phishing or Spam
[Example message] “Dear Mr. Andrew Richman, You win 100,000,000 USD!! Which from lottery of St. Paul University fund.”
[Example message] “Dear Mr. Richman, We are eBay customer service, we concern about your security, please update your personal information.”

Discussion :: Social Phishing or Spam
[Example message] “Bros, I am David! St. Paul University student association have a party on next month, you need to transfer the registration fee ASAP, see you there.” [email protected]
[Example message] “Hey, Andrew, I am Sammy, I recommend you a cool site!! http://spam.com” [email protected]

Discussion :: Potential Solutions
Four possible ways to mitigate the problem
A. Personal privacy settings
B. Browsing scope settings
C. Owner’s confirmation
D. Applying Disclosure Control of Natural Language information (DCNL)
   ‧ Proposed by Haruno Kataoka et al.

Discussion :: Personal Privacy Settings
1. Hide personal information
   [Diagram label: Profile]
2. Hide social connections (in level)
3. Deny annotations using certain words
   [Speech bubble] “Don’t call my real name, call me 007!”
4. Limit which users can access friend relations or annotations

Discussion :: Browsing Scope Settings
Prevent large‐scale download of user profiles
• Including through third‐party APIs
Limit browsing scope
• Group partitioning / “invitation letter” mechanism
[Diagram label: Malicious man]

Discussion :: Owner’s Confirmation
Applies to every operation related to friend relations
At least prevents unintentional personal information leakage
[Speech bubble] “I want to use ‘Cool Andrew Richman’, may I?”
[Speech bubble] “Sure!!! My name is public, everyone knows me!!”
[Speech bubble] “Hey Mr. Richman, you are the lucky winner!”
[Diagram label: Malicious man]

Discussion :: Applying DCNL (Haruno Kataoka et al.)
Ideal way to preserve
• Searchability
• Availability
• Connectedness
• While no sensitive information is disclosed
• Rather than “insecure” or “un‐enjoyable”
An implementation is expected
• Support for different languages would be best

Conclusion :: Conclusion
We quantify the extent of name leakage
• Using the Wretch data set
• 78% of users are at risk of involuntary name leakage
• Users’ ages and education records are also at risk
  - Inferred from friends’ disclosed information
Beware of Internet scams and phishing

Thank you! :: Questions?

Research Method :: Ratio of self-disclosure
Figure 1: Ratio of Self-disclosure