Social Identity Linkage
HYDRA: Large-scale Social Identity Linkage via Heterogeneous Behavior Modeling
Siyuan Liu, Carnegie Mellon University

Siyuan Liu, Shuhui Wang, Feida Zhu, Jinbo Zhang, Ramayya Krishnan. HYDRA: Large-scale Social Identity Linkage via Heterogeneous Behavior Modeling. The 41st ACM SIGMOD International Conference on Management of Data. 2014. Snowbird, USA.

Social Identity Linkage
Link up all the data of the same social user across different social platforms.

Background
• The recent blossoming of social network services of all kinds has revolutionized our social life
  – Many different social networks.
  – Many different social users.
  – Various information shared like never before (e.g., microblogs, images, videos, reviews, check-ins).

Problem

Social Identity Linkage
• Completeness
  – Cross-platform user linkage enriches an otherwise fragmented user profile, enabling an all-around understanding of a user’s interests and behavior patterns.
• Consistency
  – Cross-checking among multiple platforms helps improve the consistency of user information.
• Continuity
  – User identity linkage makes it possible to integrate useful user information from platforms that have over time become less popular or even been abandoned.

Example: Easy

Example: Not Easy

Challenges
• Unreliable Usernames
  – Traditional approaches that rely heavily on username parsing to link users may fail on more diversified communities.
  – Statistical models (e.g., SVM) or rule-based models built on username and attribute analysis alone are not robust enough to accurately identify user linkage across online social communities.

Challenges
• Missing Information
  – At least 80% of users are missing at least 2 of the 6 most popular profile attributes, and merely 5% of users have all attributes filled in.

Challenges
• Misaligned
  – Information Veracity.
  – Platform Difference.
    • Heterogeneous behavior: user behavior can be represented by various types of media, e.g., locations, blogs, tweets, videos, and images.
  – Behavior Asynchrony.
  – Data Imbalance.

Question
Can we make it? Social Identity Linkage Across Different Social Platforms

Demo: HYDRA
We made it!!

Problem Formulation
• Social Identity Linkage (SIL)
  – Given two social network platforms S and S', find a function f that decides whether any two users picked from S and S' respectively correspond to the same natural person,
    $f: C_S \times C_{S'} \to \{0,1\}$,
    such that for any pair of users we have:
    $$f(u_i, u_{i'}) = \begin{cases} 1, & \text{if } \phi_S(u_i) = \phi_{S'}(u_{i'}) \\ 0, & \text{otherwise} \end{cases}$$
    Here $\phi_S(u)$ is read as the natural person behind account $u$ on platform $S$.

HYDRA Framework

Main Steps
• Behavior Similarity Modeling
  – Calculate the multi-dimensional similarity vector for every user pair via heterogeneous behavior modeling.
• Structure Information Modeling
  – Construct the structure consistency graph on user pairs by considering both the core network structure of the users and their behavior similarities.
• Multi-objective Optimization with Missing Information
  – Learn a two-class classification model by optimizing two kinds of objective functions simultaneously.

HETEROGENEOUS BEHAVIOR MODEL

User Social Data
• User Attributes
  – Demographic information, contact, etc.
• UGC (User Generated Content)
  – Text (reviews, microblogs, etc.), images, videos, and so on.
• User Behavior Trajectory
  – Befriend, follow/unfollow, retweet, thumb-up/thumb-down.
• User Core Social Network
  – The most frequently contacted friends.

User Attribute Modeling
• The relative importance of different attributes is modeled to avoid over-matching
  – Textual Attributes: age, gender, nationality, company, etc.
  – Visual Attributes: a safe facial image verification strategy to judge whether the profile images show the same person.
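As a rough illustration of the attribute-weighting idea on the User Attribute Modeling slide, the sketch below scores a user pair on textual attributes, weighting each attribute and skipping any attribute that is missing on either side. The weights, the attribute names, and the case-insensitive exact-match comparison are all assumptions made for illustration; the slides do not say how HYDRA actually computes attribute importance.

```python
from typing import Mapping, Optional

# Hypothetical importance weights; the slides do not give actual values.
ASSUMED_WEIGHTS = {"age": 1.0, "gender": 0.5, "nationality": 1.0, "company": 2.0}


def attribute_similarity(
    profile_a: Mapping[str, Optional[str]],
    profile_b: Mapping[str, Optional[str]],
    weights: Mapping[str, float] = ASSUMED_WEIGHTS,
) -> Optional[float]:
    """Weighted fraction of matching attributes among those filled in on both sides.

    Returns None when no attribute is present in both profiles, so that
    "nothing to compare" is not confused with "everything mismatched".
    """
    matched = 0.0
    comparable = 0.0
    for name, weight in weights.items():
        a, b = profile_a.get(name), profile_b.get(name)
        if a is None or b is None:
            continue  # missing attribute: skip instead of counting a mismatch
        comparable += weight
        if a.strip().lower() == b.strip().lower():
            matched += weight
    if comparable == 0.0:
        return None
    return matched / comparable
```

Skipping missing attributes, rather than treating them as mismatches, is one simple way to keep sparse profiles from being penalized; a lower weight on low-information attributes such as gender limits over-matching.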

User Topic Modeling
• Topic Distribution: modeling the topic distribution in multiple time ranges
  [Figure: time axis t split at multiple resolutions: ranges of 16 days, 8 days, 4 days, …; Ct, 2Ct, 4Ct, … time buckets; shown for users i and i'.]
• Two kinds of topics are considered:
  – Content Genre Distribution.
  – Sentiment Pattern Distribution.

User Style Modeling
• Comments, tweets, and re-tweets usually reflect a user’s opinions and language style.
• Extract the most distinctive words of each user by a simple term frequency analysis on the whole database.
• Conduct simple matching on the unique patterns.

User Behavior Trajectory
• Consider the asynchrony of the same behavior (operation) on different platforms.
  $$S_{mr} = \left( \frac{1}{N} \sum_{i=1}^{N} s_{mr}(i)^{q} \right)^{1/q}, \quad q \ge 1$$
• Typical behavior patterns:
  – Location and Mobile Trajectory Information
  – Multimedia Content Generation and Sharing

Multi-resolution Behavior Modeling
A neural-network-like behavior pattern matching and similarity aggregation at different temporal resolutions.

Pattern-matching Sensors
• Location Matching Sensor: detects whether users appear in the same location within a time range.
• Near Duplicate Multimedia Sensor: detects whether the videos, audio, or images that two users uploaded are duplicates, using multimedia processing tools.

Core Social Network Modeling
• People may share similar habits with their closest friends.
• Aggregating the behavior similarity of users’ most frequently contacted friends gives an informative description of how users interact with their social ties.

MULTI-OBJECTIVE MODEL LEARNING

Multi-objective Optimization Framework
• Supervised Learning
• Structure Consistency Modeling
• Multi-objective Optimization
Remark: Linkage is cast as a two-class classification problem, with a multi-objective optimization that jointly optimizes the prediction accuracy on the labeled user pairs and multiple structure consistency measurements across different platforms.
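
The slides do not spell out the two objectives, so the following is only a sketch of one common way to combine a supervised loss with a graph-based consistency term: a weighted sum of a hinge loss on labeled user pairs and a smoothness penalty over a graph whose nodes are candidate user pairs. The function names, the weight `gamma`, and the choice of a Laplacian-style penalty are assumptions, standing in for whatever HYDRA actually uses.

```python
from typing import Mapping, Sequence, Tuple


def hinge_loss(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean hinge loss; labels are +1 (same person) or -1 (different persons)."""
    losses = [max(0.0, 1.0 - y * s) for s, y in zip(scores, labels)]
    return sum(losses) / len(losses) if losses else 0.0


def structure_penalty(
    scores: Sequence[float], edges: Sequence[Tuple[int, int, float]]
) -> float:
    """Sum of w * (s_i - s_j)^2 over undirected edges (i, j, w).

    This equals s^T L s for the graph Laplacian L, so it is large when
    strongly connected candidate pairs receive very different scores.
    """
    return sum(w * (scores[i] - scores[j]) ** 2 for i, j, w in edges)


def combined_objective(
    scores: Sequence[float],
    labeled: Mapping[int, int],
    edges: Sequence[Tuple[int, int, float]],
    gamma: float = 0.1,
) -> float:
    """Weighted-sum scalarization of the two objectives (gamma is hypothetical)."""
    idx = sorted(labeled)
    supervised = hinge_loss([scores[i] for i in idx], [labeled[i] for i in idx])
    return supervised + gamma * structure_penalty(scores, edges)
```

In this form, unlabeled candidate pairs contribute only through the graph term, which is what makes the setup semi-supervised; varying `gamma` trades off the two objectives.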

A Structure Consistency Modeling Framework
[Figure: users Alice, Bob, and Henry on Platform S and Platform S’.]
Black arrows: the ground-truth linkage information. Red arrows: the correct linkage. Green arrows: the falsely linked persons.

Multi-objective Optimization Framework
• Decision Model on Pairwise Similarity
  – Support vector machine.
• High Order Structure Consistency
  – An eigen-decomposition on the structure consistency graph.
• Multi-objective Optimization
  – A generalized semi-supervised learning framework that optimizes the two objective functions above.

Discussion of the Model
• The solution of our model is necessary and sufficient for Pareto optimality. Proof sketch: see Athan et al. and Yu et al.
• Our model performs social identity linkage at the core social structure level by MOO (multi-objective optimization) rather than by person-to-person judgments alone.

Athan et al. A note on weighted criteria methods for compromise solutions in multiobjective optimization. Engineering Optimization, 27:155–176, 1996.
Yu et al. Multiple-criteria decision making: concepts, techniques, and extensions. Plenum Press, New York, 1985.

EXPERIMENTAL EVALUATION

Experiment Setup
• Chinese (5 million users)
  – Sina Weibo
  – Tencent Weibo
  – Renren
  – Douban
  – Kaixin
• English (5 million users)
  – Twitter
  – Facebook

Effectiveness: # Labeled Pairs
Effectiveness: # Unlabeled Pairs
Effectiveness: # Social Communities
Effectiveness: Various Social Platforms
Efficiency: Running Time
Sensitivity: Missing Data

CONCLUSION

Contribution
• Heterogeneous Behavior Model
  – Robustly deals with missing information and misaligned behavior through long-term behavior distribution construction
  – A multi-resolution temporal behavior matching paradigm
• Structure Consistency
  – Leverages users’ core social network structure
• Multi-objective Model Learning

A Question
• Who knows/understands you better?
• Google?
  – Only accesses 0.001% of all the online information [1]
• Facebook? Twitter?
• Yourself?!

[1] Social Media and Big Data. McKinsey. 2012.

Thank you!