Chapter 6: Web Spam
Transcription
Chapter 6: Web Spam
Web Dynamics Web Spam Web Spam Dr. Marc Spaniol Dr. Marc Spaniol Saarbrücken, June 24, 2010 Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-1/49 Agenda • Web spam Web Spam - Why and what? - Spam taxonomy Overview Strategies in detail Dr. Marc Spaniol o Link spam o Link farms Examples • Countermeasures - Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-2/49 Spam detection Labeling and assessment Combating spam Web spam challenge • Conclusion Dr. Marc Spaniol Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-3/49 “Spam industry had a revenue potential of $4.5 billion in year 2004 if they had been able to completely fool all search engines on all commercially viable queries” [Amitay 2004] Time elapsed to reach hit position Web Spam Time spent looking at hit position Web Spam Why? [Granka, Joachims, Gay 2004] Web Spam What’s the Problem? Web Spam 2004 .de crawl Courtesy: T. Suel Unknown 0.4% Alias 0.3% Empty 0.4% Non-existent 7.9% Dr. Marc Spaniol Ad 3.7% Weborg 0.8% Reputable 70.0% Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-4/49 Spam 16.5% Web Spam • Target of spammers Web Spam Dr. Marc Spaniol - Not end users (directly) - High revenue from customers for search engine “optimization” (especially Google) - Indirect revenue Affiliate programs, Google AdSense Ad display, traffic funneling • Spam taxonomy - Content spam Keywords Popular expressions Mis-spellings - Link spam “farms” Densely connected sites Redirects Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-5/49 - Cloaking and hiding - Spam in social media [Benczúr et al. 2008] Overview Spamming Web Spam Dr. Marc Spaniol Boosting Term Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-6/49 Hiding Links Content Hiding Cloaking Redirection Spammed Ranking Elements Web Spam Dr. Marc Spaniol • Term frequency (tf in the tf.idf, Okapi BM25 etc. ranking schemes) • Term frequency weighted by HTML elements - Title Headers Font size Face • Heaviest weight in ranking - URL, domain name part - Anchor text: <a href=“…”>Best Saarbruecken nightlife</a> • Structural information - Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-7/49 URL length Depth from server root Indegree PageRank Link based centrality ⇒ All Web information retrieval ranking elements spammed Content Spam • Domain name Web Spam adjustableloanmortgagemastersonline.compay.dahannusaprima.co.uk buy-canon-rebel-20d-lens-case.camerasx.com Dr. Marc Spaniol • Anchor text (title, H1, etc) <a href=“target.html”>free, great deals, cheap, inexpensive, cheap, free</a> • Meta keywords <meta name=“keywords” content=“UK Swingers, UK, swingers, swinging, genuine, adult contacts, connect4fun, sex, …”> Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-8/49 [Gyöngyi, Garcia-Molina, 2005] Parking Domain Web Spam Dr. Marc Spaniol <div style="position:absolute; top:20px; width:600px; height:90px; overflow:hidden;"><font size=-1>atangledweb.co.uk currently offline<br>atangledweb.co.uk back soon<br></font><br><br><a href="http://www.atangledweb.co.uk"><font size=-1>atangledweb.co.uk</font></a><br><br><br> Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-9/49 Soundbridge HomeMusic WiFi Media Play<a class=l href="http://www.atangledweb.co.uk/index01.html">-</a>... SanDisk Sansa e250 - 2GB MP3 Player -<a class=l href="http://www.atangledweb.co.uk/index02.html">-</a>... AIGO F820+ 1GB Beach inspired MP3 Pla<a class=l href="http://www.atangledweb.co.uk/index03.html">-</a>... Targus I-Pod Mini Sound Enhancer<a class=l href="http://www.atangledweb.co.uk/index04.html">-</a>... Sony NWA806FP.CE7 4GB video WALKMAN <a class=l href="http://www.atangledweb.co.uk/index05.html">-</a>... Ministry of Sound 512MB MP3 player<a class=l href="http://www.mp3roze.co.uk/cat7000.html">-</a>... Nokia 6125 - Fold Design - 1.3 Megapi<a class=l href="http://www.mp3roze.co.uk/cat7001.html">-</a>... Keyword Stuffing & Generated Copies Web Spam Dr. Marc Spaniol Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-10/49 Google ads Web Spam Dr. Marc Spaniol Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-11/49 Link Spam Web Spam Dr. Marc Spaniol “Hyperlink structure contains an enormous amount of latent human annotation that can be extremely valuable for automatically inferring notions of authority.” (Chakrabarti et. al. ’99) • Hyperlinks: Good, Bad, Ugly Honest link, human annotation No value of recommendation, e.g. “affiliate programs”, navigation, ads … Deliberate manipulation, link spam Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-12/49 PageRank Web Spam PageRank of page p0: Outgoing links from pi p0 = cΣipi/|F(i)| + (1-c) Dr. Marc Spaniol Generalized (vector): damping factor PageRank of pi pointing to p0 random jump (1 – c) 1 cT’p N p= + N Transition matrix Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-13/49 Score vector “1” vector • One page is important if it is pointed to by many other pages • Based on the link structure ⇒ The algorithm of PageRank is vulnerable to link spamming Link Farms • Entry point from the honest web Web Spam Dr. Marc Spaniol - Honey pots: Copies of quality content - Dead links to parking domain - Blog or guestbook comment spam Hijacked WWW Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-14/49 Farm Spam Farm: Pages p1 λ1 λ0 Web Spam Dr. Marc Spaniol p2 pk λ2 ? λk Boosting pages Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-15/49 • Controlled by the spammer • Pointing to the target page in order to increase its PageRank p0 Target page • Each farm has only one • The target of the spammer is to increase this page’s ranking Spam Farm: External Links Web Spam Dr. Marc Spaniol p1 p2 pk λ1 λ2 ? λ0 p0 λk Leakage Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-16/49 • Fractions of PageRank • Link to the pages are added from pages outside the Farm (forum, blog, …) • The spammer has no or limited control on them • λ = λ0 + … + λk Simple Farm Model WWW Web Spam 1 p0 2 Dr. Marc Spaniol • • • • • Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-17/49 PageRank of target page: p0 Number of all pages: N Damping factor: c Leakage contributed by accessible pages: λ PageRank of each farm page: (1-c)/N p0 = λ+ k*c*[(1-c)/N] + (1-c)/N = λ+ [(1-c)(ck+1)]/N ⇒ By making k large, we can make p0 as large as we want ⇒ No multiplier effect for “acquired” page rank k Optimal Farm Model WWW Web Spam Dr. Marc Spaniol Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-18/49 • • • • • Optimal PageRank of target page: q0 Number of all pages: N Damping factor: c Leakage contributed by accessible pages: λ PageRank of each farm page: cq0/k + (1-c)/N q0 = λ+ ck[cq0/k + (1-c)/N] + (1-c)/N = λ+ c2q0 + c(1-c)k/N + (1-c)/N … = λ/(1-c2) + [(1-c)(ck+1)]/N(1-c2) = p0/(1-c2) ⇒ By making k large, we can make q0 as large as we want ⇒ For c = 0.85 “performance” gain: 1/(1-c2) = 3.6 ⇒ Multiplier effect for “acquired” page rank 1 q0 2 k Simple vs. Optimal Farm Web Spam Dr. Marc Spaniol Simple: Each boosting page only points to the target page p0 = cλ + p1 p2 Optimal: Shorttarget The reinforcement points to all loop(s) boosting pages Fewer are There linksno links among boosting pages (1 – c)(ck + 1) N qr00 = p00 / (1 – c22) λ p0 q1 r2 q2 rk Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-19/49 pk λ qk qr00 r1 Optimality without Leakage Web Spam For mathematical simplification only Idea: Interpret leakage as additional boosting pages q1 Dr. Marc Spaniol qk+1 q0 q2 qk+m m = λN / (1-c) qk cλ Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-20/49 (1 – c2) + (1 – c)(ck + 1) (1 – c2)N ! = (1 – c)[c(k + m) + 1] (1 – c2)N Alliance of two Farms Web Spam Dr. Marc Spaniol Intuitive: Each boosting page points to both targets and vice versa p0 p1 q0 p2 pk q1 q2 qm 2(k + m) new links Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-21/49 p0 = cΣ i=1,...,kpi/2 +ofcΣ j=1,...,mqj/2 + (1-c)/N Redistribution PageRank q0 = cΣi=1,...,kpi/2c(k + cΣ qj/2 j=1,...,m+ + m)/2 1 + (1-c)/N pi =pc(p +q (1-c)/N, i= 1,...,k 0 = 0q 0= 0)/(k+m) (1 ++ c)N qj =Convenient c(p0+q0)/(k+m) + (1-c)/N, for the smaller Farmj= 1,...,m Alliance of two Farms Web Spam Dr. Marc Spaniol Better: Only the target pages are interconnected with each other p0 p1 q0 p2 pk q1 q2 only 2 new links Redistribution of PageRank Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-22/49 c(k + m)/2 + 1 p0 = q0 = (1 + c)N Convenient for the smaller Farm qm Alliance between two Farms Optimal: Each target points to the other target The targets have no links to the boosting pages Web Spam Dr. Marc Spaniol p1 p2 p0 q0 MPII-Sp-0710-23/49 q2 qm pk Databases and Information Systems Prof. Dr. G. Weikum q1 p0 = c(Σi=1,...,k 1 ckp+i +c2qm 0) + (1-c)/N = + q0 p=0c(Σ q + p ) + (1-c)/N (1 + c)N N j=1,...,m j 0 pi = (1-c)/N, cm + ci=2k1,...,k 1 + q0q== (1-c)/N, j=1,...,m (1 + c)N N j Multi-Farm Alliance Two fundamental structures: Web ring Complete core Web Spam r1 Dr. Marc Spaniol p1 p2 rn r2 r0 p0 q0 core Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-24/49 pk q1 q2 qm Web Ring Simple and intuitive connection model Web Spam p0 = Dr. Marc Spaniol p1 Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-25/49 1 ck + c2m + c3n + (1 + c + c2)N N r1 r2 rn The distance influences the contribution of each farm to the PageRank of the others r0 p0 q0 q1 p2 q2 pk qm Complete Core The core is a completely connected sub-graph p0 = Web Spam 2ck – c2k + c2m + c2n (2 + c)N r1 Dr. Marc Spaniol r2 1 + N rn The contribution of each farm to the PageRank of the other ones is uniform p1 Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-26/49 r0 p0 q0 q1 p2 q2 pk qm Evaluation Web Spam Web ring: Target scores for ring/complete cores: PageRank of the target 10 farms of sizes 1.000, 2.000, … , 10.000 The of farm 10 decreases 6000 Scaled Target Page Rank Dr. Marc Spaniol MPII-Sp-0710-27/49 non- 5000 4000 Single Farm 3000 2000 1000 Databases and Information Systems Prof. Dr. G. Weikum compared to the connected farm case Non-connected Farm: Web Ring The PageRank ofCore the Complete target page is linear to the dimension of the farm (numer of boosting pages) 0 core: Complete 1 2increase, 3 4 especially 5 6 7those 8 of9 the 10 All PageRanks Farm Number target of farms with low dimensions Evaluation Contribution of farm 1 to the other targets Web Spam Complete core: Preserves the impact of PageRank with respect to the other targets and gives an identical contribution to the other targets that is much lower than to itself 200 180 Page Rank Contribution Dr. Marc Spaniol 160 140 120 Complete Core 100 Web Ring 80 60 40 20 0 1 Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-28/49 2 3 4 5 6 7 8 9 10 Web ring: Farm Number The values of contribution are closer to each other and decrease by the distance from the farm Lessons Learned • Single farm Web Spam Dr. Marc Spaniol - Short loop(s) increase target PageRank - Increase of PageRank is linear with the amount of boosting pages - Leakage should only point to the target page • Leakage can be interpreted as an additional number of boosting pages • Two farms: - Target pages should only link to other targets - In an alliance of two, both participants win • Larger alliances Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-29/49 - Need to be stable to keep all participants happy - Complete core topology: Contribution to the PageRank of others at a relatively “low level” - Web ring topology Contribution to the PageRank of others “slowly” decreasing by distance Link Farms – Example Web Spam • Multi-domain • Multi-IP Dr. Marc Spaniol Honey pot: quality content copy 411amusement.com 411 sites A-Z list Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-30/49 411fashion.com 411 sites A-Z list target 411zoos.com 411 sites A-Z list Cloaking and Hiding • Formatting tricks: Trapping crawlers with simple HTML processing only Web Spam • One-pixel image Dr. Marc Spaniol • White over white • Color, position from stylesheet • Redirection Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-31/49 - Script - Meta-tag with refresh time 0 •… Obfuscated JavaScript Web Spam Dr. Marc Spaniol <SCRIPT language=javascript> var1=100; var3=200; var2=var1 + var3; var4=var1; var5=var4 + var3; if(var2==var5) document.location="http://umlander.info/mega/free_software_ downloads.html"; </SCRIPT> • More sophisticated tricks Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-32/49 - Redirection through window.location - Spam content (text, link) from random looking static data via eval calls - Content generation by document.write HTTP Level Cloaking • User agent, client host filtering Web Spam Dr. Marc Spaniol • Different for users and for GoogleBot • “Collaboration service” of spammers for Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-33/49 - Crawler IPs - Agents - Behavior Spam in Social Media Guest books Web Spam Dr. Marc Spaniol Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-34/49 Fake Blogs Web Spam Dr. Marc Spaniol Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-35/49 Spam Detection • Crawl-time vs. post-processing Web Spam Dr. Marc Spaniol • Simple filters in crawler - Cannot handle unseen sites - Require large bootstrap crawl - Need to run rendering and script execution • Crawl time feature generation and classification - Needs interface in crawler to access content - Needs model from bootstrap or external crawl (may be smaller) - Sounds expensive but needs to be done only once per site • The hard work is done post-processing both cases Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-36/49 Assessment Interface and Collaboration Infrastructure Web Spam Docs (WARC) Dr. Marc Spaniol Local storages access May share features, extracts across institutions Feature generation (crawl-time) “Interaction” Active learning feature feed text files Classifier • Build model • Apply model (crawl-time) Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-37/49 Spam Labeling • Manual labels (black AND white lists) primarily determine quality Web Spam Dr. Marc Spaniol • Can blacklist only a tiny fraction - Recall 10% of sites are spam - Needs machine learning • Central to the service - Aid manual assessment - Aid information and label sharing - Catch spam farms that span different top-level domains Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-38/49 ⇒ No free lunch: No fully automatic filtering Web Spam Interface Web Spam Dr. Marc Spaniol Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-39/49 Combating Spam Web Spam Dr. Marc Spaniol • Refresh detection • Conceal crawling by - Headers: Browser vs. crawler - Access: “Random” vs. BFS • TrustRank method • Supervised learning features - Number of words in the pages Average word length Number of words in the page title Fraction of visible content Amount of anchor text Compressibility • Partition the Web into different blocks Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-40/49 ⇒ Never stop! On-going process Query Marketability Web Spam Dr. Marc Spaniol Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-41/49 Google AdWords Competition 10k 10th wedding anniversary 128mb, 1950s, … abc, abercrombie, … b2b, baby, bad credit, … digital camera earn big money, easy, … f1, family, flower, fantasy gameboy, gates, girl, … hair, harry potter, … ibiza, import car, … james bond, janet jackson karate, konica, kostenlose ladies, lesbian, lingerie, … … Generative Content Models Web Spam Dr. Marc Spaniol Spam topic 7 Honest topic 4 Honest topic 10 loan (0.080) club (0.035) music (0.022) unsecured (0.026) team (0.012) band (0.012) credit (0.024) league (0.009) film (0.011) home (0.022) win (0.009) festival (0.009) Excerpt: 20 spam and 50 honest topic models Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-42/49 [Bíró, Szabó, Benczúr 2008] TrustRank • Basic idea: Approximate isolation Web Spam - Honest pages rarely point to spam - Spam cites many, many spam Dr. Marc Spaniol • Sample a set of “seed pages” from the web • “Oracle” (human) identifies good and spam pages in the seed set • The subset of seed pages that are identified as “good” are called “trusted pages” Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-43/49 ⇒ Expensive! Make seed set as small as possible Trust Propagation • Set trust of each trusted page to 1 Web Spam Dr. Marc Spaniol • Propagate trust through links - Each page gets a trust value between 0 and 1 - Use a threshold value and mark all pages below the trust threshold as spam • Trust attenuation - The degree of trust conferred by a trusted page decreases with distance • Trust splitting Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-44/49 - The larger the number of outlinks from a page, the less scrutiny the page author gives each outlink - Trust is “split” across outlinks Trust Computation Web Spam Dr. Marc Spaniol Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-45/49 1. Predicted spamicity p(v) for all pages 2. Target page u, new feature f(u) by neighbor p(v) aggregation 3. Reclassification by adding the new feature v7 v1 ? v2 u PageRank Supporter Distribution Web Spam ρ=0.61 ρ=0.97 Dr. Marc Spaniol low PageRank Honest: fhh.hamburg.de Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-46/49 high low PageRank high Spam: radiopr.bildflirt.de (part of www.popdata.de farm) [Benczúr, Csalogány, Sarlós, Uher 2005] Web Spam Challenge • WEBSPAM-UK2006 Web Spam Dr. Marc Spaniol - 77M pages 11,402 hosts 7,373 labeled 26% spam • WEBSPAM-UK2007 Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-47/49 100M pages 114,529 hosts 6,479 labeled 6% spam [WebSpam08] Summary • Web Spam Web Spam Dr. Marc Spaniol - Aims at earch engine optimization “Attacks” indexing crawlers Slows down archiving crawlers “Pollutes” the archive Social Web is a “threat” • Spamming techniques - Boosting - Hiding - Combinations of various techniques • Countermeasures - Manual assessment - Machine learning techniques - Hybrid approaches Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-48/49 ⇒ All Web information retrieval ranking elements spammed References [Benc08] A. Benczur: “Web spam survey for the Archivist”. 8th International Web Archiving Workshop (IWAW 2008), Århus, Denkmark, Sept. 18, 2008. http://liwa-project.eu/index.php/video/33/ [last access: June 23, 2010] [BSS*08] A. Benczur, D. Siklosi, J. Szabo et al.: “Web spam survey for the Archivist”. Proceedings of the 8th International Web Archiving Workshop (IWAW 2008), Århus, Denkmark, Sept. 18, 2008. http://iwaw.net/08/IWAW2008-Benczur.pdf [last access: June 23, 2010] [GyGa05] Z. Gyöngyi and H. Garcia-Molina: “Link Spam Alliances”. Technical Report, Stanford University, March 2, 2005. http://www-db.stanford.edu/~zoltan/publications/gyongyi2005link.pdf [last access: June 23, 2010] Web Spam Dr. Marc Spaniol Databases and Information Systems Prof. Dr. G. Weikum MPII-Sp-0710-49/49 [WebSpam08] Web Spam Challenge: “Home page”. 2008. http://webspam.lip6.fr/wiki/pmwiki.php [last access: June 23, 2010]