Social Identity Linkage

HYDRA
Large-scale Social Identity Linkage via
Heterogeneous Behavior Modeling
Siyuan Liu
Carnegie Mellon University
Siyuan Liu, Shuhui Wang, Feida Zhu, Jinbo Zhang, Ramayya Krishnan.
HYDRA: Large-scale Social Identity Linkage via Heterogeneous Behavior Modeling.
The 41st ACM SIGMOD International Conference on Management of Data. 2014. Snowbird, USA.
Social Identity Linkage
Link up all the data of the same social user across
different social platforms
Background
• The recent boom in social network services of
all kinds has revolutionized our social life
– Many different social networks.
– Many different social users.
– Various information shared like never before (e.g.,
microblogs, images, videos, reviews, check-ins).
Problem
Social Identity Linkage
• Completeness
• Cross-platform user linkage would enrich an otherwise
fragmented user profile to enable an all-around understanding
of a user’s interests and behavior patterns.
• Consistency
• Cross-checking among multiple platforms helps improve the
consistency of user information.
• Continuity
• User identity linkage makes it possible to integrate useful user
information from those platforms that have over time become
less popular or even abandoned.
Example: Easy
Example: Not Easy
Challenges
• Unreliable Usernames
– Traditional approaches that heavily rely on username
parsing to link users may fail on more diversified
communities.
– Statistical models (e.g., SVM) or rule-based models
built on username and attribute analysis alone are
far from robust enough to accurately identify user
linkage across online social communities.
Challenges
• Missing Information
– At least 80% of users are missing at least 2 of the
6 most popular profile attributes, and merely 5% of
users have all attributes filled in.
Challenges
• Misaligned Behavior
– Information Veracity.
– Platform Difference.
– Behavior Asynchrony.
– Data Imbalance.
• Heterogeneous behavior: user behavior can be
represented by various types of media, e.g.,
locations, blogs, tweets, videos and images.
Question
Can we make it?
Social Identity Linkage Across Different Social
Platforms
Demo: HYDRA
We made it!!
Problem Formulation
• Social Identity Linkage (SIL)
– Given two social network platforms S and S', find a
function f to decide if any two users picked from S
and S' respectively correspond to the same natural
person, $f: C_S \times C_{S'} \to \{0, 1\}$, such that for any
pair of users, we have:
$$f(u_i, u'_i) = \begin{cases} 1, & \text{if } \phi_S(u_i) = \phi_{S'}(u'_i) \\ 0, & \text{otherwise} \end{cases}$$
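As a minimal sketch of this formulation, the snippet below builds the ground-truth function f from two lookup tables. It assumes that φ maps an account to an identifier of the natural person behind it; the account names, person identifiers and the helper `make_linkage_oracle` are hypothetical and only serve as illustration. In practice φ is known only for a few labeled pairs, which is why f has to be learned.

```python
from typing import Callable, Dict, Hashable


def make_linkage_oracle(
    phi_s: Dict[str, Hashable], phi_s_prime: Dict[str, Hashable]
) -> Callable[[str, str], int]:
    """Build f(u, u') that returns 1 when both accounts belong to one person."""

    def f(u: str, u_prime: str) -> int:
        # 1 if phi_S(u) == phi_S'(u'), 0 otherwise.
        return int(phi_s[u] == phi_s_prime[u_prime])

    return f


# Hypothetical toy data: account name -> person identifier.
phi_s = {"alice_w": "person_1", "bob88": "person_2"}
phi_s_prime = {"a.wonder": "person_1", "henry_k": "person_3"}

f = make_linkage_oracle(phi_s, phi_s_prime)
print(f("alice_w", "a.wonder"), f("bob88", "henry_k"))
```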
HYDRA Framework
Main Steps
• Behavior Similarity Modeling
– Calculate the multi-dimensional similarity vector between
two users of a pair for all user pairs via heterogeneous
behavior modeling.
• Structure Information Modeling
– We construct the structure consistency graph on user pairs
by considering both the core network structure of the users
and their behavior similarities.
• Multi-objective Optimization with Missing Information
– Learn a two-class classification model by optimizing two
kinds of objective functions simultaneously.
HETEROGENEOUS BEHAVIOR MODEL
User Social Data
• User Attributes
– Demographic information, contact, etc.
• UGC (User-Generated Content)
– Text (reviews, microblogs, etc.), images, videos and so on.
• User Behavior Trajectory
– Befriend, follow/unfollow, retweet, thumb-up/thumb-down.
• User Core Social Network
– The most frequently contacted friends.
User Attribute Modeling
• The relative importance of different attributes is
modeled to avoid over-matching
– Textual Attributes: age, gender, nationality, company
etc.
– Visual Attributes: a safe facial image verification
strategy to judge whether the profile images show the
same person.
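One way to picture importance-weighted attribute matching is sketched below. It is an assumption-laden illustration, not the HYDRA implementation: the weights, the exact-match comparison and the function `attribute_similarity` are hypothetical. The sketch compares only attributes present in both profiles and also reports how much of the total weight was actually observed, so that a match on a few common attributes (e.g., gender) is not mistaken for strong evidence.

```python
from typing import Dict, Optional, Tuple

Profile = Dict[str, Optional[str]]


def attribute_similarity(
    a: Profile, b: Profile, weights: Dict[str, float]
) -> Tuple[Optional[float], float]:
    """Return (weighted match score on shared attributes, observed weight share)."""
    shared = [k for k in weights if a.get(k) is not None and b.get(k) is not None]
    total_weight = sum(weights.values())
    observed = sum(weights[k] for k in shared)
    if not shared:
        return None, 0.0  # nothing to compare: attributes are missing
    matched = sum(weights[k] for k in shared if a[k] == b[k])
    return matched / observed, observed / total_weight


# Hypothetical weights: discriminative attributes count more than common ones.
weights = {"gender": 1.0, "nationality": 2.0, "company": 7.0}
user_a = {"gender": "F", "nationality": "SG", "company": "Acme"}
user_b = {"gender": "F", "nationality": "SG", "company": None}
user_c = {"gender": "F", "nationality": "CN", "company": "Acme"}

score_ab, coverage_ab = attribute_similarity(user_a, user_b, weights)
score_ac, coverage_ac = attribute_similarity(user_a, user_c, weights)
print(score_ab, coverage_ab, score_ac, coverage_ac)
```

Here the pair (user_a, user_b) matches perfectly on what is observed, but the low coverage signals that the agreement rests on weakly discriminative attributes only.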
User Topic Modeling
• Topic Distribution: modeling the topic distribution
in multiple time ranges
• Two kinds of topics are considered:
– Content Genre Distribution.
– Sentiment Pattern Distribution.
[Figure: topic distributions of users i and i' along the time axis t
at multiple resolutions: 16-day, 8-day and 4-day windows, giving
Ct, 2Ct and 4Ct time buckets respectively.]
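The multi-resolution idea can be sketched as follows, under several assumptions: each post already carries a topic label (genre or sentiment), time is a day index, and per-bucket distributions are compared with histogram intersection. The comparison measure and all function names are illustrative choices, not taken from the slides.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

Post = Tuple[int, str]  # (day index, topic label)


def bucket_distributions(posts: Iterable[Post], window_days: int) -> Dict[int, Dict[str, float]]:
    """Normalized topic histogram for every time bucket of the given width."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for day, topic in posts:
        counts[day // window_days][topic] += 1
    return {
        bucket: {t: n / sum(c.values()) for t, n in c.items()}
        for bucket, c in counts.items()
    }


def histogram_intersection(p: Dict[str, float], q: Dict[str, float]) -> float:
    return sum(min(p.get(t, 0.0), q.get(t, 0.0)) for t in set(p) | set(q))


def multi_resolution_similarity(
    posts_a: Sequence[Post], posts_b: Sequence[Post], windows: Sequence[int] = (4, 8, 16)
) -> List[Optional[float]]:
    """One similarity per temporal resolution; None if no bucket is shared."""
    sims: List[Optional[float]] = []
    for w in windows:
        da, db = bucket_distributions(posts_a, w), bucket_distributions(posts_b, w)
        common = sorted(set(da) & set(db))
        if not common:
            sims.append(None)
            continue
        total = sum(histogram_intersection(da[b], db[b]) for b in common)
        sims.append(total / len(common))
    return sims


posts_a = [(0, "sports"), (1, "sports"), (5, "music"), (9, "music")]
posts_b = [(2, "sports"), (6, "music"), (13, "travel")]
sims = multi_resolution_similarity(posts_a, posts_b)
print(sims)
```

The result is one entry of the similarity vector per resolution, so behavior that is not synchronized day by day can still agree at a coarser time scale.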
User Style Modeling
• Comments, tweets and re-tweets usually reflect
a user’s opinions and language style.
• Extract the most unique words of each user by a
simple term frequency analysis on the whole
database.
• Conduct simple matching on the unique
patterns.
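A rough sketch of this style matching is given below. It assumes tokenized text per account, ranks a user's words by how few accounts in the whole database use them (ties broken by the user's own term frequency), and matches two accounts by Jaccard overlap of their most unique words. The ranking rule, the Jaccard measure and the toy data are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Set


def unique_words(user_docs: Dict[str, List[str]], top_k: int = 2) -> Dict[str, Set[str]]:
    """Pick each user's top_k words that are used by the fewest accounts."""
    accounts_using: Counter = Counter()
    for tokens in user_docs.values():
        accounts_using.update(set(tokens))  # count each word once per account
    result: Dict[str, Set[str]] = {}
    for user, tokens in user_docs.items():
        own = Counter(tokens)
        # Rarest across accounts first, then most frequent for this user.
        ranked = sorted(own, key=lambda w: (accounts_using[w], -own[w], w))
        result[user] = set(ranked[:top_k])
    return result


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical accounts pooled from both platforms.
docs = {
    "alice_w": "lol great kthxbai kthxbai movie".split(),
    "a_wonder": "kthxbai great food kthxbai".split(),
    "bob88": "great movie food nice".split(),
}
style = unique_words(docs)
print(jaccard(style["alice_w"], style["a_wonder"]))
print(jaccard(style["alice_w"], style["bob88"]))
```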
User Behavior Trajectory
• Consider the asynchrony of the same
behavior (operation) on different platforms.
$$S_{mr} = \left( \frac{1}{N} \sum_{i=1}^{N} s_{mr}(i)^{q} \right)^{1/q}, \quad q \ge 1$$
• Typical behavior patterns:
– Location and Mobile Trajectory Information
– Multimedia Content Generation and Sharing
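Reading the aggregation formula on this slide as a power mean, S_mr = ((1/N) Σ s_mr(i)^q)^(1/q) with q ≥ 1, a minimal sketch looks like this. It assumes non-negative per-bucket scores s_mr(i); the function name is hypothetical. With q = 1 the result is the plain average, and as q grows the result moves toward the maximum, so a few strong matches are not washed out by many empty time buckets.

```python
from typing import Sequence


def aggregate_similarity(scores: Sequence[float], q: float = 1.0) -> float:
    """Power-mean pooling of per-bucket similarity scores (assumed >= 0)."""
    if q < 1:
        raise ValueError("q must be >= 1")
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    return (sum(s ** q for s in scores) / n) ** (1.0 / q)


bucket_scores = [0.0, 0.0, 1.0, 0.0]  # one matching bucket out of four
print(aggregate_similarity(bucket_scores, q=1))   # 0.25
print(aggregate_similarity(bucket_scores, q=2))   # 0.5
print(aggregate_similarity(bucket_scores, q=50))  # close to the maximum score
```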
Multi-resolution Behavior Modeling
Neural-network-like behavior pattern matching and
similarity aggregation at different temporal resolutions.
Pattern-matching Sensors
• Location Matching Sensor: detect if users
appear in the same location within a time range.
• Near-Duplicate Multimedia Sensor: detect whether the
videos, audio clips or images that two users uploaded
are near-duplicates, using multimedia processing tools.
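A location matching sensor of this kind could be sketched as below. The check-in format (Unix timestamp, latitude, longitude), the distance threshold `max_km` and the time range `max_hours` are assumptions for illustration; the slides do not specify them.

```python
import math
from typing import Iterable, Tuple

CheckIn = Tuple[float, float, float]  # (unix timestamp, latitude, longitude)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometers."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def location_sensor(
    checkins_a: Iterable[CheckIn],
    checkins_b: Iterable[CheckIn],
    max_km: float = 0.5,
    max_hours: float = 24.0,
) -> int:
    """Fire (return 1) if both users appear near the same place within the time range."""
    checkins_b = list(checkins_b)
    for ta, lat_a, lon_a in checkins_a:
        for tb, lat_b, lon_b in checkins_b:
            close_in_time = abs(ta - tb) <= max_hours * 3600
            if close_in_time and haversine_km(lat_a, lon_a, lat_b, lon_b) <= max_km:
                return 1
    return 0


user_a = [(0.0, 1.2966, 103.8520)]
same_place_later = [(5 * 3600.0, 1.2970, 103.8522)]
other_city = [(5 * 3600.0, 40.4433, -79.9436)]
too_late = [(48 * 3600.0, 1.2970, 103.8522)]
print(location_sensor(user_a, same_place_later))
```

The pairwise loop is quadratic in the number of check-ins; at the scale discussed in the talk an index on time or space would be needed, which this sketch leaves out.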
Core Social Network Modeling
• People may share similar habits with their closest
friends.
• Aggregating the behavior similarity of users' most
frequently contacted friends provides an informative
description of how users interact with their
social ties.
MULTI-OBJECTIVE MODEL LEARNING
Multi-objective Optimization
Framework
• Supervised Learning
• Structure Consistency Modeling
• Multi-objective Optimization
Remark: We cast linkage as a two-class classification problem and construct
a multi-objective optimization that jointly optimizes the prediction accuracy
on the labeled user pairs and multiple structure consistency measurements
across different platforms.
A Structure Consistency Modeling Framework
[Figure: users Alice, Bob and Henry on Platform S and on Platform S',
with arrows linking accounts across the two platforms.]
Black arrows: the ground-truth linkage information.
Red arrows: the correct linkage.
Green arrows: the falsely linked persons.
Multi-objective Optimization
Framework
• Decision Model on Pairwise Similarity
– Support vector machine.
• High Order Structure Consistency
– An eigen-decomposition on the structure consistency graph.
• Multi-objective Optimization
– A generalized semi-supervised learning framework that
optimizes the two objective functions above.
Discussion of the Model
• Being a solution of our model is necessary and
sufficient for Pareto optimality.
Proof sketch: see Athan et al. and Yu et al.
• Our model performs social identity linkage on
the core social structure level by MOO rather
than merely person-to-person judgments.
Athan et al. A note on weighted criteria methods for compromise solutions in multiobjective optimization. Engineering Optimization, 27:155–176, 1996.
Yu et al. Multiple-criteria decision making: concepts, techniques, and extensions. Plenum Press, New York, 1985.
EXPERIMENTAL EVALUATION
Experiment Setup
• Chinese (5 million users)
– Sina Weibo
– Tencent Weibo
– Renren
– Douban
– Kaixin
• English (5 million users)
– Twitter
– Facebook
Effectiveness: # Labeled Pairs
Effectiveness: # Unlabeled Pairs
Effectiveness: # Social Communities
Effectiveness: Various Social
Platforms
Efficiency: Running Time
Sensitivity: Missing Data
CONCLUSION
Contribution
• Heterogeneous Behavior Model
– Robustly deal with missing information and
misaligned behavior by long-term behavior
distribution construction
– A multi-resolution temporal behavior matching
paradigm
• Structure Consistency
– Leverage users’ core social network structure
• Multi-objective Model Learning
A Question
• Who knows/understands you better?
• Google?
– Only access 0.001% of all the online information [1]
• Facebook? Twitter?
• Yourself?!
[1] Social Media and Big Data. McKinsey. 2012.
Thank you!