The Web has Spam…

The effects of Web Spam on the Evolution of Search Engines
CS315-Web Search and Mining

Have you ever used the Web…
- to get informed?
- to help you make decisions?
  - Financial
  - Medical
  - Political
  - Religious…

The Web is huge
- > 1 trillion (!?) static pages publicly available, … and growing every day
- Much larger, if you count the “deep web”
- Infinite, if you count pages created on-the-fly

The Web has Spam…
- We depend on search engines to find information
- Any controversial issue will be spammed
[Figure: search results for the steroid drug HGH (human growth hormone)]
[Figure: search results for the mental disease ADHD (attention-deficit/hyperactivity disorder)]

Political issues will be spammed
- … you like it or not!
[Figure: search results for Senatorial candidate John N. Kennedy, 2008 USA Elections]
[Figure: famous search results for “miserable failure”]

Why is there Web Spam?
- Web Spam: attempt to modify the web (its structure and contents), and thus influence search engine results in ways beneficial to web spammers

A Brief History of Search Engines
- 1st Generation (ca 1994):
  - AltaVista, Excite, Infoseek…
  - Ranking based on Content:
    - Pure Information Retrieval
- 2nd Generation (ca 1996):
  - Lycos
  - Ranking based on Content + Structure
    - Site Popularity
- 3rd Generation (ca 1998):
  - Google, Teoma, Yahoo
  - Ranking based on Content + Structure + Value
    - Page Reputation
- In the Works
  - Ranking based on “the user’s need behind the query”

1st Generation: Content Similarity
- Content Similarity Ranking: the more rare words two documents share, the more similar they are
- Documents are treated as “bags of words” (no effort to “understand” the contents)
- Similarity is measured by vector angles
- Query results are ranked by sorting the angles between query and documents
- How To Spam?
[Figure: document vectors d1 and d2 in term space (axes t1, t2, t3), separated by angle θ]

1st Generation: How to Spam
- “Keyword stuffing”: add keywords, text, to increase content similarity
[Figure: page stuffed with casino-related keywords]
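
The 1st-generation ranking and its weakness can be sketched in a few lines. This is a minimal illustration, not the method of any particular search engine: the smoothed inverse-document-frequency weighting (to make rare words count more), the whitespace tokenizer, and the three sample documents are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Bag of words: lowercase and split on whitespace, no "understanding".
    return text.lower().split()

def idf_weights(docs):
    # Rarer words get larger weights (smoothed inverse document frequency).
    n = len(docs)
    df = Counter(word for doc in docs for word in set(tokenize(doc)))
    return {word: math.log((1 + n) / (1 + count)) + 1.0 for word, count in df.items()}

def vectorize(text, idf):
    # Sparse term vector; words unseen in the collection get weight 0.
    counts = Counter(tokenize(text))
    return {word: count * idf.get(word, 0.0) for word, count in counts.items()}

def cosine(u, v):
    # Cosine of the angle between two sparse vectors (1.0 means angle 0).
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    # Sorting by decreasing cosine is the same as sorting by increasing angle.
    idf = idf_weights(docs)
    q = vectorize(query, idf)
    scored = [(cosine(q, vectorize(doc, idf)), i) for i, doc in enumerate(docs)]
    return sorted(scored, reverse=True)

docs = [
    "review of a casino hotel with poker tables and a good restaurant",
    "city guide to museums parks and restaurants",
    "casino poker casino poker casino poker casino poker",  # keyword-stuffed page
]
ranking = rank("casino poker", docs)
for score, i in ranking:
    print(f"doc {i}: {score:.3f}")
```

The stuffed page contains nothing but the query terms, so its vector points in the same direction as the query and it outranks the genuine review.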

2nd Generation: Add Popularity
- A hyperlink from a page in site A to some page in site B is considered a popularity vote from site A to site B
- Rank similar documents according to popularity
- How To Spam?
[Figure: sites with popularity scores: www.aa.com 1, www.bb.com 2, www.cc.com 1, www.dd.com 2, www.zz.com 0]

2nd Generation: How to Spam
- Create “Link Farms”: heavily interconnected owned sites spam popularity
[Figure: interconnected sites owned by vespro.com promote main site]
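
Site popularity as vote counting, and the effect of a link farm on it, can be sketched as follows. The sketch assumes one vote per distinct linking site and ignores links within a site; the `farmN.example` and `main.example` host names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def site_of(url):
    # The site is taken to be the host part of the URL.
    return urlparse(url).netloc

def site_popularity(links):
    # links: iterable of (source_url, target_url) pairs.
    # Each distinct site linking to another site casts one vote for it.
    voters = defaultdict(set)
    for src, dst in links:
        a, b = site_of(src), site_of(dst)
        if a != b:
            voters[b].add(a)
    return {site: len(sites) for site, sites in voters.items()}

honest_links = [
    ("http://www.aa.com/a.html", "http://www.bb.com/"),
    ("http://www.cc.com/x.html", "http://www.bb.com/news.html"),
    ("http://www.bb.com/", "http://www.aa.com/a.html"),
]

# Link farm: five owned sites link to each other and to the main site.
farm = [f"farm{i}.example" for i in range(5)]
farm_links = [(f"http://{a}/", f"http://{b}/") for a in farm for b in farm if a != b]
farm_links += [(f"http://{a}/", "http://main.example/") for a in farm]

before = site_popularity(honest_links)
after = site_popularity(honest_links + farm_links)
print(before)
print(after["main.example"], after["www.bb.com"])
```

With no honest in-links at all, the promoted site collects more votes than any site in the honest part of the graph.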

3rd Generation: Add Reputation…
- The reputation “PageRank” of a page Pi = the sum of a fraction of the reputations of all pages Pj that point to Pi
- Idea similar to academic co-citations
- Beautiful Math behind it
  - PR = principal eigenvector of the web’s link matrix
  - PR equivalent to the chance of randomly surfing to the page
- HITS algorithm tries to recognize “authorities” and “hubs”
- How To Spam?

3rd Generation: How to Spam
- Organize Mutual Admiration Societies: “link farms” of irrelevant reputable sites
[Figure: Mutual Admiration Societies via Link Exchange]

An Industry is Born
- “Search Engine Optimization” Companies
- Advertisement Consultants
- Conferences
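
The PageRank definition from the 3rd Generation slide (a page's reputation is the sum of a fraction of the reputations of the pages pointing to it) can be sketched with power iteration. The damping factor of 0.85 and the even redistribution of reputation from pages without out-links are assumptions taken from the usual random-surfer formulation, not from the slides; the four-page web is made up for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    # links: dict mapping each page to the list of pages it points to.
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random jump: every page receives a small baseline reputation.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                # Each page passes an equal fraction of its reputation
                # to every page it points to.
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Page without out-links: spread its reputation evenly.
                share = damping * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank

# S is a spam page that links out but that nobody links to.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "S": ["A"]}
before = pagerank(web)

# Link exchange: the reputable page C agrees to link to S as well.
exchanged = dict(web, C=["A", "S"])
after = pagerank(exchanged)

print(f"S before: {before['S']:.4f}, S after: {after['S']:.4f}")
```

A single link from a reputable page is enough to lift the spam page well above the baseline, which is why link exchanges among "mutual admiration societies" pay off.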

3rd Generation: Reputation & Anchor Text
- Anchor text tells you what the reputation is about
- How To Spam?
[Figure: Page A links to Page B (www.ibm.com) through anchor text. Example anchors: “Big Blue today announced record profits for the quarter”; “Joe’s computer hardware links: Compaq, HP, IBM”; “Armonk, NY-based computer giant IBM announced today”]

“Google-bombs” spam Anchor Text…
- Business weapons
  - “more evil than satan”
- Political weapon in pre-election season
  - “miserable failure”
  - “waffles”
  - “Clay Shaw” (+ 50 Republicans)
- Misinformation
  - Promote steroids
  - Discredit AD/HD research
- Activism / online protest
  - “Egypt”
  - “Jew”
- Other uses we do not know?
  - … mostly for political purposes
“views expressed by the sites in your results are not in any way endorsed by Google…”
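
How anchor text gets credited to the target page, and why that makes Google-bombs possible, can be sketched with a toy index. The scoring rule (count of matching anchor words), the page names, and the `victim.example` target are assumptions for the example; real engines combine anchor text with many other signals.

```python
from collections import Counter, defaultdict

def anchor_index(links):
    # links: iterable of (source_page, target_page, anchor_text).
    # Every anchor word is credited to the page the link points TO.
    index = defaultdict(Counter)
    for _source, target, anchor in links:
        for word in anchor.lower().split():
            index[word][target] += 1
    return index

def search(index, query):
    # Rank target pages by how many anchor words match the query.
    scores = Counter()
    for word in query.lower().split():
        scores.update(index.get(word, {}))
    return scores.most_common()

links = [
    ("news-page", "www.ibm.com", "Big Blue today announced record profits for the quarter"),
    ("joes-links", "www.ibm.com", "IBM"),
]
# Google-bomb: many cooperating pages use the same anchor for one target.
links += [(f"blog-{i}", "victim.example", "miserable failure") for i in range(20)]

index = anchor_index(links)
print(search(index, "big blue"))
print(search(index, "miserable failure"))
```

The target page never has to contain the words itself: "Big Blue" finds www.ibm.com, and by the same mechanism twenty coordinated links make the victim the top result for "miserable failure".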
[Figure: “miserable failure” hits Obama in January 2009]
[Figure: activists openly collaborating to Google-bomb search results of political opponents in 2006]

Search Engines vs Web Spam

Search Engine’s Action | Web Spammers’ Reaction
1st Generation: Similarity (Content) | Add keywords so as to increase content similarity
2nd Generation: + Popularity (Content + Structure) | + Create “link farms” of heavily interconnected sites
3rd Generation: + Reputation + Anchor Text (Content + Structure + Value) | + Organize “mutual admiration societies” of irrelevant reputable sites; + Googlebombs
4th Generation (in the Works): ranking based on the user’s “need behind the query” | ?? Can you guess what they will do?

Is there a pattern on how to spam?

And Now For Something Completely(?) Different
- Propaganda:
  - Attempt to modify human behavior, and thus influence people’s actions in ways beneficial to propagandists
- Theory of Propaganda
  - Developed by the Institute for Propaganda Analysis 1938-42
- Propagandistic Techniques (and ways of detecting propaganda)
  - Word games - associate good/bad concept with social entity
    - Glittering Generalities / Name Calling
  - Transfer - use special privileges (e.g., office) to breach trust
  - Testimonial - famous non-experts’ claims
  - Plain Folk - people like us think this way
  - Bandwagon - everybody’s doing it, jump on the wagon
  - Card Stacking - use of bad logic

Societal Trust is (also) a Graph
- Web Spam: attempt to modify the Web Graph, and thus influence users through search engine results in ways beneficial to web spammers
- Propaganda: attempt to modify the Societal Trust Graph, and thus influence people in ways beneficial to propagandist

Web Spammers as Propagandists
- Web Spammers can be seen as employing propagandistic techniques in order to modify the Web Graph
- There is a pattern on how to spam!

Propaganda in Graph Terms
- Word Games: Modify Node weights
  - Name Calling: Decrease node weight
  - Glittering Generalities: Increase node weight
- Transfer: Modify Node content + keep weights
- Testimonial: Insert Arcs b/w irrelevant nodes
- Plain Folk: Modify Arcs
- Card stacking: Mislabel Arcs
- Bandwagon: Modify Arcs & generate nodes