Presentation Slides (PDF format, 2744 KB)

Transcription

Involuntary Information Leakage in Social Network Services
Ieng-Fat Lam, Kuan-Ta Chen, and Ling-Jyh Chen
Institute of Information Science, Academia Sinica
Presenter: Ieng-Fat Lam
2009/2/2
Outline
■ Introduction
■ Motivation
■ Research Method
■ Results
■ Discussion
■ Conclusion
Introduction ::
Social Networking Services (SNSs)
■ For example
  • Myspace, Facebook, Orkut, Yahoo! 360
  • Mixi, GREE (Japan)
  • Wretch (Taiwan)
■ Have become very popular
■ Host millions of profiles
Introduction ::
Users in SNSs
■ Social activities
  • Meet new friends, contact existing friends
  • Share resources over the Internet
■ Personal information is usually published
  • Photos
  • Identity information
  • Contact information
Introduction ::
Disclosing personal information
■ Double-edged sword
  • Lets other people find you and get to know you
  • But some people may not respond nicely
  • Risk of personal information being used by malicious people
Speech bubbles: "I am Lee-Da Nu!" "I love movies" "I am 23 years old, single!!"
Introduction ::
Not revealing personal information?
Speech bubble: "I never disclose my info to the Internet!"
Introduction ::
Information revealed by friends
Introduction ::
Information revealed
[I got it!]
Real Name: Andrew Richman
Gender: Male
Age: 20 ~ 22
Education record:
  Sunrise elementary school
  St. John secondary school
  St. Paul University
Motivation ::
Involuntary Information Leakage
■ A user may want to protect his/her identity
  • But it may be unintentionally revealed by friends
  • Such leakage is hard to detect
    - Due to the distributed nature of the Internet
  • Becoming a serious threat to privacy
Motivation ::
In this study
■ We would like to
  • Investigate the extent of involuntary information leakage
  • Gather data from Wretch (http://www.wretch.cc)
    - The most popular SNS in Taiwan
    - About 4 million user profiles
  • Quantify the degree of such leakage
    - Real name, age, and education record
  • Discuss potential means to mitigate the problem
Research Method ::
Data Collection
User ID list (crawl): john123, Aron, roserose, iamboy, …
1. Pick an ID randomly (e.g. john123)
2. Obtain user profile and friend list (HTML)
3. Parse and save crawled user data (friend list: Andy, Orange, …)
4. Add user IDs (Andy, Orange, …) to the ID list
5. Update the ID list
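The five crawl steps can be sketched as a small loop. This is only an illustration under stated assumptions: `fetch_profile` is a hypothetical callable standing in for downloading and parsing a profile page, and `max_users` is an assumed stopping condition; neither comes from the slides.

```python
import random

def crawl(seed_ids, fetch_profile, max_users, rng=None):
    """Crawl profiles by repeatedly picking a random ID from the ID list.

    fetch_profile is a hypothetical callable: user_id -> (profile, friend_ids).
    """
    rng = rng or random.Random()
    id_list = list(dict.fromkeys(seed_ids))  # IDs waiting to be crawled
    known = set(id_list)                     # every ID ever added to the list
    crawled = {}
    while id_list and len(crawled) < max_users:
        # 1. Pick an ID randomly from the ID list.
        user_id = id_list.pop(rng.randrange(len(id_list)))
        # 2. Obtain the user profile and friend list.
        profile, friend_ids = fetch_profile(user_id)
        # 3. Save the parsed user data.
        crawled[user_id] = {"profile": profile, "friends": list(friend_ids)}
        # 4-5. Add newly seen friend IDs, which updates the ID list.
        for friend_id in crawled[user_id]["friends"]:
            if friend_id not in known:
                known.add(friend_id)
                id_list.append(friend_id)
    return crawled
```

Passing the fetcher in as a parameter keeps the loop testable without network access; a real crawler would also need rate limiting and error handling.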
Research Method ::
An example
[Figure: a friend list and a user profile]
Research Method ::
Overview of Crawled Data
Wretch Data
Number of users               766,972 (20%)
Number of effective users*    592,548 (15%)
Number of connections         7,619,212
Avg. connections per user     11.5
*An effective user has at least one "outgoing" friend connection
Research Method ::
Analysis of Name Leakage
■ Friend annotations in Wretch
  • A free-form text to describe a friend
  • It is used for
    - Classification
    - Real name or nickname of a friend
    - The feature of a friend
■ For example
  • *Beauty Cathy Brown – The hottest girl of Nightingale High School
  • [[ School Mate ]] Tony MY BUDDY
Research Method ::
Name Inference Process
1. Obtain friend annotations (for each profile)
2. Generate name candidates
Infer First Name
Research Method ::
Generate name candidates
■ To infer the real name of a profile
  • Collect all of its incoming annotations
  • Extract name candidates from annotations
Example incoming annotations (friends shown: Aron, Andy, Sammy):
  "Andrew!!"
  "Yo~ Bros. Andrew!!"
  "Old Mr. Richman!!"
  "Cool~~ Andrew Richman!!"
Research Method ::
Generate name candidates (cont.)
■ Extraction method
  • Break the text into tokens by
    - Symbols: <space>, <tab>, '#', '@', etc.
    - Punctuation marks: ' " , . () []
    - Connective words (in Chinese)
  • Chinese-specific naming rules
    - e.g. 陳寬達 (Chen Kuan-Ta)
    - Two-word (two-character) tokens as first name candidates
    - Three-word (three-character) tokens as full name candidates
  • A duplication count is associated with each candidate
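A minimal sketch of the extraction step, assuming a delimiter set that includes the listed symbols plus '~' and '!' (both appear in the sample annotations but are not named on the slide). Splitting on Chinese connective words is left out because the slides do not give the word list. Following the worked example in the slides, the duplication count is assumed to be the number of annotations containing a token, minus one.

```python
import re
from collections import Counter

# Assumed delimiter set; the slide lists these symbols and marks followed by "etc."
DELIMITER_CHARS = " \t#@~!'\",.()[]"
SPLIT_RE = re.compile("[" + re.escape(DELIMITER_CHARS) + "]+")

def tokenize(annotation):
    """Break a free-form annotation into non-empty tokens."""
    return [tok for tok in SPLIT_RE.split(annotation) if tok]

def name_candidates(annotations):
    """Map each token to its duplication count.

    Assumption: count = (number of annotations containing the token) - 1.
    """
    seen_in = Counter()
    for annotation in annotations:
        seen_in.update(set(tokenize(annotation)))
    return {tok: n - 1 for tok, n in seen_in.items()}

def candidate_type(token):
    """Chinese-specific rule: two characters -> first name, three -> full name."""
    return {2: "first_name", 3: "full_name"}.get(len(token))
```

For instance, `tokenize("喔~德榮~德榮兄!!")` yields the three tokens 喔, 德榮, and 德榮兄.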
Research Method ::
An example
Incoming annotations (English gloss / original):
  Andrew!!                   / 德榮!!
  Yo~ Andrew~Bros Andrew!!   / 喔~德榮~德榮兄!!
  Old Mr. Richman~!!         / 老劉~!!
  Cool~~ Andrew Richman!!    / 超帥~~ 劉德榮!!
Name candidates [duplication count]:
  德榮 (Andrew) [1]
  超帥 (Cool) [0]
  劉德榮 (Andrew Richman) [0]
  德榮兄 (Bros Andrew) [0]
  喔 (Yo) [0]
  老劉 (Old Mr. Richman) [0]
Legend: full name candidates, first name candidates, duplication count
Research Method ::
Inference of full name (1 / 5)
■ Common family name
  • The family name part is a common family name
  • Duplication count is greater than 1
  • For example
    - For full name candidate "Andrew Richman"
    - If "Andrew Richman" exists in more than one annotation
    - If "Richman" is a common family name
[1] Chih-Hao Tsai, "Common Chinese Names", http://technology.chtsai.org/namefreq/
Research Method ::
Inference of full name (2 / 5)
■ First name as a substring of full name
  • A first name candidate appears as a substring
    - In the right position
  • Duplication count is greater than 1
  • For example
    - For full name candidate "Andrew Richman"
    - If "Andrew Richman" exists in more than one annotation
    - If "Andrew" is also a first name candidate
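The first two full-name rules (common family name, and first name as a substring in the right position) might look like the following. The sketch assumes a single-character family name followed by the given name, and treats "the right position" as the given-name part of the candidate. The function names and the `min_dup` threshold parameter are illustrative, not taken from the slides.

```python
def split_chinese_name(full_name):
    """Assume a single-character family name followed by the given name."""
    return full_name[:1], full_name[1:]

def matches_common_family_name(candidate, dup_count, common_family_names, min_dup=1):
    """Rule 1: family name part is common and duplication count exceeds min_dup."""
    family, _ = split_chinese_name(candidate)
    return dup_count > min_dup and family in common_family_names

def matches_first_name_candidate(candidate, dup_count, first_name_candidates, min_dup=1):
    """Rule 2: the given-name part is itself a first name candidate."""
    _, given = split_chinese_name(candidate)
    return dup_count > min_dup and given in first_name_candidates
```

With 劉德榮 as the candidate, rule 2 accepts the first name candidate 德榮 but not 劉德, which is a substring in the wrong position.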
Research Method ::
Inference of full name (3 / 5)
■ Common full name
  • Compare with an existing full name list
  • National college exam enrollment list
    - List maintained from 1994 to 2007
    - 574,010 distinct full names
[2] Chih-Hao Tsai, "A list of Chinese Names", http://technology.chtsai.org/namelist/
Research Method ::
Inference of full name (4 / 5)
■ Nickname decomposition
  • In a Chinese name
    - FN GN1-GN2 (陳寬達)
  • Possible nicknames:
    - Prefix + X
    - Prefix + X + X
    - X + postfix
    - Where X can be FN, GN1 or GN2
Example: for "Andrew Richman"
  We also have "Bros Andrew"
  "Bros" is a predefined prefix
  Removing "Bros", we get "Andrew"
  "Andrew" is in "Andrew Richman"
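Nickname decomposition can be illustrated as stripping a known affix and checking whether what remains is part of the full name candidate. The affix lists here are assumptions drawn from the worked example (老 as in 老劉, 兄 as in 德榮兄); the predefined lists actually used are not shown in the slides.

```python
EXAMPLE_PREFIXES = ("老",)   # assumed, e.g. 老劉 ("Old Mr. Richman")
EXAMPLE_POSTFIXES = ("兄",)  # assumed, e.g. 德榮兄 ("Bros Andrew")

def nickname_core(nickname, prefixes=EXAMPLE_PREFIXES, postfixes=EXAMPLE_POSTFIXES):
    """Return X for nicknames shaped prefix+X, prefix+X+X, or X+postfix, else None."""
    for prefix in prefixes:
        if nickname.startswith(prefix) and len(nickname) > len(prefix):
            core = nickname[len(prefix):]
            half = len(core) // 2
            # prefix + X + X: collapse the repeated part to a single X
            if len(core) % 2 == 0 and core[:half] == core[half:]:
                return core[:half]
            return core
    for postfix in postfixes:
        if nickname.endswith(postfix) and len(nickname) > len(postfix):
            return nickname[: -len(postfix)]
    return None

def nickname_supports(nickname, full_name, **affixes):
    """True if the nickname's core appears inside the full name candidate."""
    core = nickname_core(nickname, **affixes)
    return core is not None and core in full_name
```

Here 德榮兄 reduces to 德榮, which is contained in 劉德榮, so the nickname supports that full name candidate.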
Research Method ::
Inference of full name (5 / 5)
■ Common word removal
  • If no matching candidates are found by the above rules
  • If the duplication count is greater than 1
  • If the full name candidate is not a nickname
    - Does not contain any nickname prefix or postfix
  • Not a (or based on a) common word
    - Compared against 100,511 common words
  • Select the one with the highest duplication count
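A rough sketch of this fallback rule, with two assumptions: "based on a common word" is read as "contains a common word", and the nickname test checks only for a leading prefix or trailing postfix. The function name and parameters are illustrative.

```python
def fallback_full_name(candidates, common_words, prefixes=(), postfixes=(), min_dup=1):
    """Pick the most duplicated candidate that is neither a nickname nor a common word.

    candidates maps each full name candidate to its duplication count.
    """
    def eligible(cand, dup):
        if dup <= min_dup:
            return False
        if cand.startswith(tuple(prefixes)) or cand.endswith(tuple(postfixes)):
            return False  # looks like a nickname
        return not any(word in cand for word in common_words)

    pool = {c: d for c, d in candidates.items() if eligible(c, d)}
    return max(pool, key=pool.get) if pool else None
```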
Research Method ::
Inference of First Name
■ Use the same method as for inference of full name
  • Common first name
    - Compare with 208,581 first names
    - Required duplication count greater than 1
  • Nickname decomposition
  • Common word removal
Results ::
Name Inference Results
■ Ratio of inferred names
Type of name                Ratio of name inference
Nickname                    60%
Real name (full name)       30%
First name                  72%
Real name or first name     78%
Results ::
Validation
■ Manual examination of real names
  • Randomly selected 1,000 profiles
  • 738 of them are unique and correct
    - Further examination was performed, with similar results
  • Wrong cases: the user's nickname
  • Sufficient to support the conjecture
    - Involuntary real name leakage occurs in real-life social network systems, and the degree of leakage is significant
Results ::
Ratio of Name Leakage
Figure 2: Ratio of name leakage based on users' gender
Figure 3: Relation of users' age and ratio of name leakage
Results ::
Risk Analysis
■ To confirm that the identity leakage is involuntary
  • We check the inferred name against the user's profile
    - Fewer than 0.1% of users reveal their real names
■ To quantify the tendency of using real names
  • Degree of Using Real name (DUR)
    - Ratio of a user's outgoing annotations that contain the real name of the annotation target
  • Degree of being Called by Real name (DCR)
    - Ratio of incoming annotations containing the user's real name
Results ::
Example of DUR and DCR
■ DUR and DCR
Criteria      DCR
First name    1/5
Full name     1/5
Either        2/5
Criteria      DUR
First name    4/5
Full name     1/5
Either        5/5
[Figure: a user and five friends (Raymond Aron, Sammy Hagar, David Jones, Jay Leno, John Lennon) with annotations such as "Andrew", "Our King!", "Bros Andrew", "Cool~ Andrew Richman", and "Yo~What's up man"]
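Both DUR and DCR are ratios over a set of annotations, so one helper can compute either. The sketch assumes annotations are already tokenized and that a name counts only when it matches a whole token, which is why a nickname token like "Bros Andrew" does not count as the first name. The function name and the input shape are illustrative.

```python
from fractions import Fraction

def real_name_ratios(annotations):
    """Compute first-name, full-name, and either-name ratios.

    annotations: list of (tokens, first_name, full_name), where the names are
    those of the annotation target. Pass a user's incoming annotations for DCR,
    or a user's outgoing annotations (with each friend's names) for DUR.
    """
    total = len(annotations)
    if total == 0:
        raise ValueError("no annotations to evaluate")
    first = sum(1 for toks, fn, _ in annotations if fn in toks)
    full = sum(1 for toks, _, full_n in annotations if full_n in toks)
    either = sum(1 for toks, fn, full_n in annotations if fn in toks or full_n in toks)
    return {
        "first": Fraction(first, total),
        "full": Fraction(full, total),
        "either": Fraction(either, total),
    }
```

Using `Fraction` keeps results in the same form as the slide's tables (for example 2/5) rather than as floating-point values.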
Results ::
Positive relation between DUR and DCR
Figure 4: Relation of DUR and DCR
Research Method ::
Involuntary leakage of age and education records
■ Inferring age
  • Round-based manner
  • If X disclosed age, and has a friend Y
  • If X and Y have a relation of "classmate", "same class", …
  • Assign the age of X to Y
  • Then check Y's "classmate"
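The round-based age inference reads like a breadth-first propagation over "classmate" links. The sketch below assumes the classmate relation is symmetric and that the first age assigned to a user is kept; how conflicting ages are resolved is not stated in the slides.

```python
from collections import defaultdict

def infer_ages(disclosed_ages, classmate_pairs):
    """Propagate disclosed ages along classmate links, one round at a time."""
    neighbors = defaultdict(set)
    for x, y in classmate_pairs:
        neighbors[x].add(y)
        neighbors[y].add(x)  # assumption: "classmate" is symmetric

    ages = dict(disclosed_ages)  # self-disclosed ages are never overwritten
    frontier = sorted(ages)
    while frontier:
        next_frontier = []
        for x in frontier:
            for y in sorted(neighbors[x]):
                if y not in ages:
                    ages[y] = ages[x]        # assign the age of X to Y
                    next_frontier.append(y)  # next round: check Y's classmates
        frontier = next_frontier
    return ages
```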
Research Method ::
Involuntary leakage of age and education records
■ Inferring education records
  • Same as inferring age
  • Divided into four education levels, inferred separately
    - Elementary school
    - Junior high school
    - Senior high school
    - College
  • Define relation by keyword
    - "same school", "same college", etc.
Results ::
Inference results
Figure 5: Inference results of users' ages
Figure 6: Inference results of users' education records
Results ::
Validation
■ Cross-validation
  • Verify inferred ages
    - Based on self-disclosed education records
  • Verify inferred education records
    - Based on self-disclosed ages
  • The age difference should be small
    - To verify that our inference results are accurate
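The cross-validation check comes down to measuring age differences between pairs of schoolmates. A minimal sketch, assuming ages are integers keyed by user ID and pairs are schoolmates at one education level; the `tolerance` parameter is illustrative, since the slides do not state a threshold.

```python
def age_differences(schoolmate_pairs, ages):
    """Absolute age difference for each pair where both ages are known."""
    return [abs(ages[a] - ages[b]) for a, b in schoolmate_pairs if a in ages and b in ages]

def share_within(diffs, tolerance):
    """Fraction of pairs whose age difference is at most `tolerance` years."""
    if not diffs:
        return None
    return sum(d <= tolerance for d in diffs) / len(diffs)
```

To verify inferred ages, `ages` would hold the inferred values and the pairs would come from self-disclosed education records; to verify inferred education records, the roles are swapped.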
Results ::
Validation Results
Figure 7: The inferred age differences between pairs of self-disclosed schoolmates in the four education levels
Figure 8: The self-disclosed age differences between pairs of inferred schoolmates in the four education levels
Discussion ::
Threats caused by identity leakage
■ Stalking
■ Spamming
  • In our data set
    - 46% of users disclosed a valid email address
    - Spam with friends' (spoofed) email addresses
■ Phishing
  • Spear phishing / Social phishing
    - Includes personal information in the phishing email
    - Spoofs a friend's email address
Discussion ::
Spear Phishing or Spam
"Dear Mr. Andrew Richman, You win 100,000,000 USD!! Which from lottery of St. Paul University fund."
"Dear Mr. Richman, We are eBay customer service, we concern about your security, please update your personal information."
Discussion ::
Social Phishing or Spam
"Bros, I am David! St. Paul University student association have a party on next month, you need to transfer the registration fee ASAP, see you there."
[email protected]
"Hay, Andrew, I am Sammy, I recommend you a cool site!! http://spam.com"
[email protected]
Discussion ::
Potential Solutions
■ Four possible ways to mitigate the problem
  A. Personal privacy settings
  B. Browsing scope settings
  C. Owner's confirmation
  D. Applying Disclosure Control of Natural Language information (DCNL)
     - Proposed by Haruno Kataoka et al.
Discussion ::
Personal Privacy Settings
1. Hide personal information (profile)
2. Hide social connections (in level)
3. Deny annotations using certain words
   Speech bubble: "Don't call my real name, call me 007!"
4. Limit specific users' access to friend relations or annotations
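Setting 3 (denying annotations that use certain words) could be approximated by a simple substring check at the time a friend submits an annotation. This is only a sketch: the function name is hypothetical, and matching is case-insensitive substring matching, which is an assumption rather than anything specified in the slides.

```python
def is_annotation_allowed(annotation, denied_words):
    """Reject an annotation that contains any word the profile owner has denied."""
    text = annotation.casefold()
    return not any(word.casefold() in text for word in denied_words if word)
```

An owner who lists their family name as a denied word would block "Cool~~ Andrew Richman!!" while still allowing "Our King!". Simple matching like this is easy to evade with spacing or spelling variants.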
Discussion ::
Browsing Scope Settings
■ Prevent large-scale download of user profiles
  • Including through third-party APIs
■ Limit browsing scope
  • Group partitioning / "invitation letter" mechanism
[Figure: malicious man]
Discussion ::
Owner’s Confirmation
■ Every operation related to friend relations
■ At least prevents unintentional personal information leakage
"I want to use 'Cool Andrew Richman', may I?"
"Sure!!! My name is public, everyone knows me!!"
"Hay Mr. Richman, you are the lucky winner!" (malicious man)
Discussion ::
Applying DCNL (Haruno Kataoka et al.)
■ Ideal way to preserve
  • Searchability
  • Availability
  • Connectedness
  • While no sensitive information is disclosed
  • Rather than "Insecure" or "Un-enjoyable"
■ Implementation is expected
  • Support for different languages would be best
Conclusion ::
Conclusion
■ We quantify the extent of name leakage
  • Using the Wretch data set
  • 78% of users are at risk of involuntary name leakage
  • Users' age and education records are also at risk
    - Inferred from information disclosed by friends
■ Beware of Internet scams and phishing
Thank you! ::
Questions?
Research Method ::
Ratio of self-disclosure
Figure 1: Ratio of Self-disclosure