Chapter 6: Web Spam

Transcription

Chapter 6: Web Spam
Web Dynamics
Web Spam
Web Spam
Dr. Marc Spaniol
Dr. Marc Spaniol
Saarbrücken, June 24, 2010
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-1/49
Agenda
• Web spam
Web Spam
- Why and what?
- Spam taxonomy
 Overview
 Strategies in detail
Dr. Marc Spaniol
o Link spam
o Link farms
 Examples
• Countermeasures
-
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-2/49
Spam detection
Labeling and assessment
Combating spam
Web spam challenge
• Conclusion
Dr. Marc Spaniol
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-3/49
“Spam industry had a revenue
potential of $4.5 billion in year 2004
if they had been able to completely
fool all search engines on all
commercially viable queries”
[Amitay 2004]
Time elapsed to reach hit position
Web Spam
Time spent looking at hit position
Web Spam
Why?
[Granka, Joachims, Gay 2004]
Web Spam
What’s the Problem?
Web Spam
2004 .de crawl
Courtesy: T. Suel
Unknown 0.4%
Alias 0.3%
Empty 0.4%
Non-existent 7.9%
Dr. Marc Spaniol
Ad 3.7%
Weborg 0.8%
Reputable 70.0%
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-4/49
Spam 16.5%
Web Spam
• Target of spammers
Web Spam
Dr. Marc Spaniol
- Not end users (directly)
- High revenue from customers for search engine “optimization” (especially Google)
- Indirect revenue
 Affiliate programs, Google AdSense
 Ad display, traffic funneling
• Spam taxonomy
- Content spam
 Keywords
 Popular expressions
 Mis-spellings
- Link spam “farms”
 Densely connected sites
 Redirects
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-5/49
- Cloaking and hiding
- Spam in social media
[Benczúr et al. 2008]
Overview
Spamming
Web Spam
Dr. Marc Spaniol
Boosting
Term
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-6/49
Hiding
Links
Content Hiding
Cloaking
Redirection
Spammed Ranking Elements
Web Spam
Dr. Marc Spaniol
• Term frequency (tf in the tf.idf, Okapi BM25 etc. ranking schemes)
• Term frequency weighted by HTML elements
-
Title
Headers
Font size
Face
• Heaviest weight in ranking
- URL, domain name part
- Anchor text: <a href=“…”>Best Saarbruecken nightlife</a>
• Structural information
-
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-7/49
URL length
Depth from server root
Indegree
PageRank
Link based centrality
⇒ All Web information retrieval ranking elements spammed
Content Spam
• Domain name
Web Spam
adjustableloanmortgagemastersonline.compay.dahannusaprima.co.uk
buy-canon-rebel-20d-lens-case.camerasx.com
Dr. Marc Spaniol
• Anchor text (title, H1, etc)
<a href=“target.html”>free, great deals, cheap, inexpensive, cheap, free</a>
• Meta keywords
<meta name=“keywords” content=“UK Swingers, UK, swingers, swinging,
genuine, adult contacts, connect4fun, sex, …”>
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-8/49
[Gyöngyi, Garcia-Molina, 2005]
Parking Domain
Web Spam
Dr. Marc Spaniol
<div style="position:absolute; top:20px; width:600px; height:90px; overflow:hidden;"><font size=-1>atangledweb.co.uk
currently offline<br>atangledweb.co.uk back soon<br></font><br><br><a href="http://www.atangledweb.co.uk"><font
size=-1>atangledweb.co.uk</font></a><br><br><br>
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-9/49
Soundbridge HomeMusic WiFi Media Play<a class=l href="http://www.atangledweb.co.uk/index01.html">-</a>...
SanDisk Sansa e250 - 2GB MP3 Player -<a class=l href="http://www.atangledweb.co.uk/index02.html">-</a>...
AIGO F820+ 1GB Beach inspired MP3 Pla<a class=l href="http://www.atangledweb.co.uk/index03.html">-</a>...
Targus I-Pod Mini Sound Enhancer<a class=l href="http://www.atangledweb.co.uk/index04.html">-</a>...
Sony NWA806FP.CE7 4GB video WALKMAN <a class=l href="http://www.atangledweb.co.uk/index05.html">-</a>...
Ministry of Sound 512MB MP3 player<a class=l href="http://www.mp3roze.co.uk/cat7000.html">-</a>...
Nokia 6125 - Fold Design - 1.3 Megapi<a class=l href="http://www.mp3roze.co.uk/cat7001.html">-</a>...
Keyword Stuffing & Generated Copies
Web Spam
Dr. Marc Spaniol
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-10/49
Google ads
Web Spam
Dr. Marc Spaniol
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-11/49
Link Spam
Web Spam
Dr. Marc Spaniol
“Hyperlink structure contains an enormous amount of latent
human annotation that can be extremely valuable for automatically
inferring notions of authority.”
(Chakrabarti et. al. ’99)
• Hyperlinks: Good, Bad, Ugly
Honest link, human annotation
No value of recommendation, e.g. “affiliate programs”, navigation, ads …
Deliberate manipulation, link spam
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-12/49
PageRank
Web Spam
PageRank of page p0:
Outgoing links from pi
p0 = cΣipi/|F(i)| + (1-c)
Dr. Marc Spaniol
Generalized (vector):
damping
factor
PageRank of
pi pointing to p0
random jump
(1 – c) 1
cT’p
N
p=
+
N
Transition matrix
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-13/49
Score vector
“1” vector
• One page is important if it is pointed to by many other pages
• Based on the link structure
⇒ The algorithm of PageRank is vulnerable to link spamming
Link Farms
• Entry point from the honest web
Web Spam
Dr. Marc Spaniol
- Honey pots: Copies of quality content
- Dead links to parking domain
- Blog or guestbook comment spam
Hijacked
WWW
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-14/49
Farm
Spam Farm: Pages
p1
λ1
λ0
Web Spam
Dr. Marc Spaniol
p2
pk
λ2
?
λk
Boosting pages
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-15/49
• Controlled by the spammer
• Pointing to the target page in order
to increase its PageRank
p0
Target page
• Each farm has only one
• The target of the spammer is to increase
this page’s ranking
Spam Farm: External Links
Web Spam
Dr. Marc Spaniol
p1
p2
pk
λ1
λ2
?
λ0
p0
λk
Leakage
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-16/49
• Fractions of PageRank
• Link to the pages are added from pages outside
the Farm (forum, blog, …)
• The spammer has no or limited control on them
• λ = λ0 + … + λk
Simple Farm Model
WWW
Web Spam
1
p0
2
Dr. Marc Spaniol
•
•
•
•
•
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-17/49
PageRank of target page: p0
Number of all pages: N
Damping factor: c
Leakage contributed by accessible pages: λ
PageRank of each farm page: (1-c)/N
p0 = λ+ k*c*[(1-c)/N] + (1-c)/N
= λ+ [(1-c)(ck+1)]/N
⇒ By making k large, we can make p0 as large as we want
⇒ No multiplier effect for “acquired” page rank
k
Optimal Farm Model
WWW
Web Spam
Dr. Marc Spaniol
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-18/49
•
•
•
•
•
Optimal PageRank of target page: q0
Number of all pages: N
Damping factor: c
Leakage contributed by accessible pages: λ
PageRank of each farm page: cq0/k + (1-c)/N
q0 = λ+ ck[cq0/k + (1-c)/N] + (1-c)/N
= λ+ c2q0 + c(1-c)k/N + (1-c)/N
…
= λ/(1-c2) + [(1-c)(ck+1)]/N(1-c2)
= p0/(1-c2)
⇒ By making k large, we can make q0 as large as we want
⇒ For c = 0.85 “performance” gain: 1/(1-c2) = 3.6
⇒ Multiplier effect for “acquired” page rank
1
q0
2
k
Simple vs. Optimal Farm
Web Spam
Dr. Marc Spaniol
Simple:
Each boosting page only points to the
target page
p0 = cλ +
p1
p2
Optimal:
Shorttarget
The
reinforcement
points to all
loop(s)
boosting pages
Fewer are
There
linksno links among boosting pages
(1 – c)(ck + 1)
N
qr00 = p00 / (1 – c22)
λ
p0
q1
r2
q2
rk
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-19/49
pk
λ
qk
qr00
r1
Optimality without Leakage
Web Spam
For mathematical simplification only
Idea: Interpret leakage as additional boosting pages
q1
Dr. Marc Spaniol
qk+1
q0
q2
qk+m
m = λN / (1-c)
qk
cλ
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-20/49
(1 –
c2)
+
(1 – c)(ck + 1)
(1 – c2)N
!
=
(1 – c)[c(k + m) + 1]
(1 – c2)N
Alliance of two Farms
Web Spam
Dr. Marc Spaniol
Intuitive:
Each boosting page points to both targets and vice versa
p0
p1
q0
p2
pk
q1
q2
qm
2(k + m) new links
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-21/49
p0 = cΣ
i=1,...,kpi/2 +ofcΣ
j=1,...,mqj/2 + (1-c)/N
Redistribution
PageRank
q0 = cΣi=1,...,kpi/2c(k
+ cΣ
qj/2
j=1,...,m+
+ m)/2
1 + (1-c)/N
pi =pc(p
+q
(1-c)/N, i= 1,...,k
0 = 0q
0=
0)/(k+m)
(1 ++ c)N
qj =Convenient
c(p0+q0)/(k+m)
+ (1-c)/N,
for the smaller
Farmj= 1,...,m
Alliance of two Farms
Web Spam
Dr. Marc Spaniol
Better:
Only the target pages are interconnected with each other
p0
p1
q0
p2
pk
q1
q2
only 2 new links
Redistribution of PageRank
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-22/49
c(k + m)/2 + 1
p0 = q0 =
(1 + c)N
Convenient for the smaller Farm
qm
Alliance between two Farms
Optimal:
Each target points to the other target
The targets have no links to the boosting pages
Web Spam
Dr. Marc Spaniol
p1
p2
p0
q0
MPII-Sp-0710-23/49
q2
qm
pk
Databases and
Information Systems
Prof. Dr. G. Weikum
q1
p0 = c(Σi=1,...,k
1
ckp+i +c2qm
0) + (1-c)/N
=
+
q0 p=0c(Σ
q
+
p
)
+
(1-c)/N
(1
+
c)N
N
j=1,...,m j
0
pi = (1-c)/N,
cm + ci=2k1,...,k 1
+
q0q== (1-c)/N, j=1,...,m
(1
+
c)N
N
j
Multi-Farm Alliance
Two fundamental structures:
Web ring
Complete core
Web Spam
r1
Dr. Marc Spaniol
p1
p2
rn
r2
r0
p0
q0
core
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-24/49
pk
q1
q2
qm
Web Ring
Simple and intuitive connection model
Web Spam
p0 =
Dr. Marc Spaniol
p1
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-25/49
1
ck + c2m + c3n
+
(1 + c + c2)N
N
r1
r2
rn
The distance influences the
contribution of each farm to
the PageRank of the others
r0
p0
q0
q1
p2
q2
pk
qm
Complete Core
The core is a completely connected sub-graph
p0 =
Web Spam
2ck – c2k + c2m + c2n
(2 + c)N
r1
Dr. Marc Spaniol
r2
1
+
N
rn
The contribution of each farm to
the PageRank of the other ones
is uniform
p1
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-26/49
r0
p0
q0
q1
p2
q2
pk
qm
Evaluation
Web Spam
Web ring:
Target scores for ring/complete cores:
PageRank of the target
10 farms of sizes 1.000, 2.000, … , 10.000 The
of farm 10 decreases
6000
Scaled Target Page Rank
Dr. Marc Spaniol
MPII-Sp-0710-27/49
non-
5000
4000
Single Farm
3000
2000
1000
Databases and
Information Systems
Prof. Dr. G. Weikum
compared to the
connected farm case
Non-connected
Farm:
Web Ring
The PageRank
ofCore
the
Complete
target page is linear to
the dimension of the
farm (numer of boosting
pages)
0 core:
Complete
1
2increase,
3
4 especially
5
6
7those
8 of9 the
10
All PageRanks
Farm Number
target of farms with low dimensions
Evaluation
Contribution of farm 1 to the other targets
Web Spam
Complete core:
Preserves the impact of PageRank with
respect to the other targets and gives an
identical contribution to the other targets
that is much lower than to itself
200
180
Page Rank Contribution
Dr. Marc Spaniol
160
140
120
Complete Core
100
Web Ring
80
60
40
20
0
1
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-28/49
2
3
4
5
6
7
8
9
10
Web ring:
Farm Number
The values of contribution are closer to each other and
decrease by the distance from the farm
Lessons Learned
• Single farm
Web Spam
Dr. Marc Spaniol
- Short loop(s) increase target PageRank
- Increase of PageRank is linear with the amount of boosting pages
- Leakage should only point to the target page
• Leakage can be interpreted as an additional number of boosting pages
• Two farms:
- Target pages should only link to other targets
- In an alliance of two, both participants win
• Larger alliances
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-29/49
- Need to be stable to keep all participants happy
- Complete core topology:
Contribution to the PageRank of others at a relatively “low level”
- Web ring topology
Contribution to the PageRank of others “slowly” decreasing by distance
Link Farms – Example
Web Spam
• Multi-domain
• Multi-IP
Dr. Marc Spaniol
Honey pot: quality content copy
411amusement.com
411 sites A-Z list
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-30/49
411fashion.com
411 sites A-Z list
target
411zoos.com
411 sites A-Z list
Cloaking and Hiding
• Formatting tricks: Trapping crawlers with simple HTML processing only
Web Spam
• One-pixel image
Dr. Marc Spaniol
• White over white
• Color, position from stylesheet
• Redirection
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-31/49
- Script
- Meta-tag with refresh time 0
•…
Obfuscated JavaScript
Web Spam
Dr. Marc Spaniol
<SCRIPT language=javascript>
var1=100;
var3=200;
var2=var1 + var3;
var4=var1;
var5=var4 + var3;
if(var2==var5)
document.location="http://umlander.info/mega/free_software_ downloads.html";
</SCRIPT>
• More sophisticated tricks
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-32/49
- Redirection through window.location
- Spam content (text, link) from random looking static data via eval calls
- Content generation by document.write
HTTP Level Cloaking
• User agent, client host filtering
Web Spam
Dr. Marc Spaniol
• Different for users and for GoogleBot
• “Collaboration service” of spammers for
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-33/49
- Crawler IPs
- Agents
- Behavior
Spam in Social Media
Guest books
Web Spam
Dr. Marc Spaniol
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-34/49
Fake Blogs
Web Spam
Dr. Marc Spaniol
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-35/49
Spam Detection
• Crawl-time vs. post-processing
Web Spam
Dr. Marc Spaniol
• Simple filters in crawler
- Cannot handle unseen sites
- Require large bootstrap crawl
- Need to run rendering and script execution
• Crawl time feature generation and classification
- Needs interface in crawler to access content
- Needs model from bootstrap or external crawl (may be smaller)
- Sounds expensive but needs to be done only once per site
• The hard work is done post-processing both cases
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-36/49
Assessment Interface and
Collaboration Infrastructure
Web Spam
Docs
(WARC)
Dr. Marc Spaniol
Local
storages
access
May share
features,
extracts
across
institutions
Feature
generation
(crawl-time)
“Interaction”
Active learning
feature
feed
text files
Classifier
• Build model
• Apply model
(crawl-time)
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-37/49
Spam Labeling
• Manual labels (black AND white lists) primarily determine quality
Web Spam
Dr. Marc Spaniol
• Can blacklist only a tiny fraction
- Recall 10% of sites are spam
- Needs machine learning
• Central to the service
- Aid manual assessment
- Aid information and label sharing
- Catch spam farms that span different top-level domains
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-38/49
⇒ No free lunch: No fully automatic filtering
Web Spam Interface
Web Spam
Dr. Marc Spaniol
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-39/49
Combating Spam
Web Spam
Dr. Marc Spaniol
• Refresh detection
• Conceal crawling by
- Headers: Browser vs. crawler
- Access: “Random” vs. BFS
• TrustRank method
• Supervised learning features
-
Number of words in the pages
Average word length
Number of words in the page title
Fraction of visible content
Amount of anchor text
Compressibility
• Partition the Web into different blocks
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-40/49
⇒ Never stop! On-going process
Query Marketability
Web Spam
Dr. Marc Spaniol
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-41/49
Google AdWords
Competition
10k
10th wedding anniversary
128mb, 1950s, …
abc, abercrombie, …
b2b, baby, bad credit, …
digital camera
earn big money, easy, …
f1, family, flower, fantasy
gameboy, gates, girl, …
hair, harry potter, …
ibiza, import car, …
james bond, janet jackson
karate, konica, kostenlose
ladies, lesbian, lingerie, …
…
Generative Content Models
Web Spam
Dr. Marc Spaniol
Spam topic 7
Honest topic 4
Honest topic 10
loan (0.080)
club (0.035)
music (0.022)
unsecured (0.026)
team (0.012)
band (0.012)
credit (0.024)
league (0.009)
film (0.011)
home (0.022)
win (0.009)
festival (0.009)
Excerpt: 20 spam and 50 honest topic models
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-42/49
[Bíró, Szabó, Benczúr 2008]
TrustRank
• Basic idea: Approximate isolation
Web Spam
- Honest pages rarely point to spam
- Spam cites many, many spam
Dr. Marc Spaniol
• Sample a set of “seed pages” from the web
• “Oracle” (human) identifies good and spam pages in the seed set
• The subset of seed pages that are identified as “good” are called
“trusted pages”
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-43/49
⇒ Expensive! Make seed set as small as possible
Trust Propagation
• Set trust of each trusted page to 1
Web Spam
Dr. Marc Spaniol
• Propagate trust through links
- Each page gets a trust value between 0 and 1
- Use a threshold value and mark all pages below the trust threshold as spam
• Trust attenuation
- The degree of trust conferred by a trusted page decreases with distance
• Trust splitting
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-44/49
- The larger the number of outlinks from a page, the less scrutiny the page author
gives each outlink
- Trust is “split” across outlinks
Trust Computation
Web Spam
Dr. Marc Spaniol
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-45/49
1. Predicted spamicity
p(v) for all pages
2. Target page u,
new feature f(u)
by neighbor p(v)
aggregation
3. Reclassification by
adding the new feature
v7
v1
?
v2
u
PageRank Supporter Distribution
Web Spam
ρ=0.61
ρ=0.97
Dr. Marc Spaniol
low
PageRank
Honest:
fhh.hamburg.de
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-46/49
high
low
PageRank
high
Spam:
radiopr.bildflirt.de
(part of www.popdata.de farm)
[Benczúr, Csalogány, Sarlós, Uher 2005]
Web Spam Challenge
• WEBSPAM-UK2006
Web Spam
Dr. Marc Spaniol
-
77M pages
11,402 hosts
7,373 labeled
26% spam
• WEBSPAM-UK2007
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-47/49
100M pages
114,529 hosts
6,479 labeled
6% spam
[WebSpam08]
Summary
• Web Spam
Web Spam
Dr. Marc Spaniol
-
Aims at earch engine optimization
“Attacks” indexing crawlers
Slows down archiving crawlers
“Pollutes” the archive
Social Web is a “threat”
• Spamming techniques
- Boosting
- Hiding
- Combinations of various techniques
• Countermeasures
- Manual assessment
- Machine learning techniques
- Hybrid approaches
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-48/49
⇒ All Web information retrieval ranking elements spammed
References
[Benc08]
A. Benczur: “Web spam survey for the Archivist”. 8th International Web Archiving
Workshop (IWAW 2008), Århus, Denkmark, Sept. 18, 2008.
http://liwa-project.eu/index.php/video/33/
[last access: June 23, 2010]
[BSS*08]
A. Benczur, D. Siklosi, J. Szabo et al.: “Web spam survey for the Archivist”.
Proceedings of the 8th International Web Archiving Workshop
(IWAW 2008), Århus, Denkmark, Sept. 18, 2008.
http://iwaw.net/08/IWAW2008-Benczur.pdf
[last access: June 23, 2010]
[GyGa05]
Z. Gyöngyi and H. Garcia-Molina: “Link Spam Alliances”. Technical Report,
Stanford University, March 2, 2005.
http://www-db.stanford.edu/~zoltan/publications/gyongyi2005link.pdf
[last access: June 23, 2010]
Web Spam
Dr. Marc Spaniol
Databases and
Information Systems
Prof. Dr. G. Weikum
MPII-Sp-0710-49/49
[WebSpam08] Web Spam Challenge: “Home page”. 2008.
http://webspam.lip6.fr/wiki/pmwiki.php
[last access: June 23, 2010]