Report on "Advances in Web Archiving - LiWA

Transcription

Report on "Advances in Web Archiving - LiWA
European Commission Seventh Framework Programme
Call: FP7-ICT-2007-1, Activity: ICT-1-4.1
Contract No: 216267
Report on “Advances in Web Archiving Technologies”
D6.5
Version 1.0

Editor: EA
Work Package: WP6
Status: Final Version
Date: M22
Dissemination Level: PU
Project Overview

Project Name: LiWA – Living Web Archives
Call Identifier: FP7-ICT-2007-1
Activity Code: ICT-1-4.1
Contract No: 216267

Partners:
1. Coordinator: Universität Hannover, L3S Research Center, Germany
2. European Archive Foundation (EA), Netherlands
3. Max-Planck-Institut für Informatik (MPG), Germany
4. Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI), Hungary
5. Stichting Nederlands Instituut voor Beeld en Geluid (BeG), Netherlands
6. Hanzo Archives Limited (HANZO), United Kingdom
7. National Library of the Czech Republic (NLP), Czech Republic
8. Moravian Library (MZK), Czech Republic
Document Control

Title: D6.5 Report on “Advances in Web Archiving Technologies”
Author/Editor: Radu Pop, Julien Masanes (EA); Mark Williamson (HANZO); Andras Benczur (MTA); Marc Spaniol (MPG); Thomas Risse (L3S)
Document History

Version  Date        Author/Editor  Description/Comments
0.1      04/05/2009  Radu Pop       Document plan
0.2      01/06/2009  all            First draft
0.3      22/06/2009  all            Updates
0.4      12/08/2009  all            First version
0.5      22/11/2009  JM             Introduction, Section 2 completed
Legal Notices
The information in this document is subject to change without notice.
The LiWA partners make no warranty of any kind with regard to this document, including, but
not limited to, the implied warranties of merchantability and fitness for a particular purpose. The
LiWA Consortium shall not be held liable for errors contained herein or for direct, indirect, special, incidental or consequential damages in connection with the furnishing, performance, or use of this material.
Table of Contents

1 Introduction
2 Current Challenges in Web Archiving
  2.1 Archive Fidelity
  2.2 Archive Coherence
  2.3 Archive Interpretability
3 Archive's Completeness
  3.1 A new crawling paradigm
  3.2 Capturing Streaming Multimedia
  3.3 State of the Art on the Streaming Capture Software
  3.4 Rich Media Capture Module
  3.5 Integration into the LiWA Architecture
  3.6 Evaluation and Optimizations
  3.7 References
4 Spam Cleansing
  4.1 State of the Art on Web Spam
  4.2 Spam Filter Module
  4.3 Evaluation
  4.4 Integration into the LiWA Architecture
  4.5 References
5 Temporal Coherence
  5.1 State of the Art on Archive Coherence
  5.2 Temporal Coherence Module
  5.3 Evaluation and Visualization
  5.4 Integration into the LiWA Architecture
  5.5 References
6 Semantic Evolution
  6.1 State of the Art on Terminology Evolution
  6.2 Detecting Evolution
  6.3 Terminology Evolution Module
  6.4 Evaluation
  6.5 Integration into the LiWA Architecture
  6.6 References
1 Introduction
Archiving the web is now on the agenda of many organizations. From companies that are required by regulation to preserve their websites or intranets, to national libraries whose collecting mission encompasses entire national domains, to national archives required to preserve the entire governmental web presence, more and more organizations are engaging in the preservation of Web content.
Yet, Web preservation is still a very challenging task. In addition to the “usual” challenges of digital preservation (media decay, technological obsolescence, authenticity and integrity issues, etc.), Web preservation has its own unique difficulties:
- rapidly evolving publishing and encoding technologies, which challenge the ability to capture Web content in an authentic and meaningful way that guarantees long-term preservation and interpretability;
- distribution and temporal properties of online content, with unpredictable aspects such as transient unavailability;
- the huge number of actors (organizations and individuals) contributing to the Web;
- the large variety of needs that Web content preservation will have to serve.
The Living Web Archive (LiWA) project is the first extensive R&D project entirely devoted to addressing some of these challenges. This document presents a report on the advances made in this domain at the mid-term of the project. The focus of LiWA is described in Section 2. Section 3 addresses the completeness of archives, explaining the challenges and achievements in capturing the entire content of sites. Section 4 describes the problem that Web spam raises for Web archives and the methods developed to detect and filter it. Section 5 deals with archive temporal coherence, with methods to measure, evaluate and visualize it. Finally, Section 6 describes the issue of semantic evolution in Web archives and proposes methods to make archives easier to search in the future.
2 Current Challenges in Web Archiving
This section presents the main research challenges that LiWA is addressing. We have grouped them into three main problem areas: archive fidelity, temporal coherence and interpretability.
2.1 Archive Fidelity
The first problem area is the archive's fidelity and authenticity to the original. Fidelity comprises, on the one hand, the ability to capture all types of content, including non-standard types of Web content such as streaming media, which often cannot be captured at all by existing Web crawler technology. In Web archiving today, state-of-the-art crawlers, based on page parsing for link extraction and human monitoring of crawls, are at their intrinsic limits. Highly skilled and experienced staff and technology-dependent incremental improvement of crawlers are permanently required to keep up with the evolution of the Web; this raises the barrier to entry in this field and often produces dissatisfying results due to poor fidelity. It also leads to increased costs of storage and bandwidth due to the unnecessary capture of irrelevant content.
Current crawlers fail to capture all Web content because the current Web comprises much more than simple HTML pages: dynamically created pages, e.g. based on JavaScript or Flash; multimedia content that is delivered using media-specific streaming protocols; and hidden Web content that resides in data repositories and content-management systems behind Web site portals.
In addition to the resulting completeness challenges, one also needs to avoid useless content, typically Web spam. Spam classification and page-quality assessment are difficult issues for search engines; for archival systems they are even more challenging, as archives lack information about usage patterns (e.g., click profiles) at capture time, which would ideally be used to filter spam during the crawl process.
LiWA has developed novel methods for content gathering of high-quality Web archives. They are presented in Section 3 (on completeness) and Section 4 (on filtering Web spam) of this report.
2.2 Archive Coherence
The second problem area is a consequence of the Web's intrinsic organization and of the design of Web archives. Current capture methods, for instance, are based on snapshot crawls and “exact duplicate” detection. The archive's integrity and temporal coherence (proper dating of content and proper cross-linkage) are therefore entirely dependent on the temporal characteristics (duration, frequency, etc.) of the crawl process. Without judicious measures that address these issues, proper interpretation of archived content would be very difficult, if possible at all.
Ideally, the result of a crawl is a snapshot of the Web at a given time point. In practice,
however, the crawl itself needs an extended time period to gather the contents of a Web
site. During this time span, the Web continues to evolve, which may cause
incoherencies in the archive. Current techniques for content dating are not sufficient for
archival use, and require extensions for better coherence and reduced cost of the
gathering process. Furthermore, the desired coherence across repeated crawls, each
one operating incrementally, poses additional challenges, but also opens up
opportunities for improved coherence, specifically to improve crawl revisit strategies.
These issues will be addressed in Section 5 of this report (Temporal coherence).
2.3 Archive Interpretability
The third problem area is related to the factors that will affect Web archives over the
long-term, such as the evolution of terminology and the conceptualization of domains
underlying and contained by a Web archive collection. This has the effect that users
familiar with and relying upon up-to-date terminology and concepts will find it
increasingly difficult to locate and interpret older Web content. This is particularly
relevant for long-term preservation of Web archives, since it is not sufficient to just be
able to store and read Web pages in the long run – a "living" Web archive is required,
which will also ensure accessibility and coherent interpretation of past Web content in
the distant future.
Methods for extracting key terms and their relations from a document collection produce
a terminology model at a given time. However, they do not consider the semantic
evolution of terminologies over time. Three challenges have to be tackled to capture this
evolution: 1) extending existing models to take the temporal aspect into account, 2)
developing algorithms to create relations between terminology snapshots in view of
changing meaning and usage of terms, 3) presenting the semantic evolution to users in
an easily comprehensible manner. The availability of temporal information opens new opportunities to produce higher-quality terminology models. Advances in this domain are presented in Section 6 of this report (Semantic evolution).
3 Archive's Completeness
One of the key problems in Web archiving is the discovery of the resources that need to be fetched. Starting from known pages, tools that capture Web content have to discover all linked resources, including embeds (images, CSS, etc.), even when they belong to the same site, since no listing function is implemented in the HTTP protocol. This is traditionally done by 'crawlers': software tools that automatically parse known pages to extract links from the HTML code and add them to a queue called the frontier.
This method was designed at a time when the Web was entirely made of simple HTML pages, and it worked perfectly in that context. When navigational links started being encoded with more sophisticated means, such as scripts or executable code, embedded or not in HTML, this method showed its limits.
We can classify navigational links into broadly three categories, depending on the type of code in which they are encoded.
1. Explicit links (source code is available and full path is explicitly stated)
2. Variable links (source code is available but use variables to encode the path)
3. Opaque links (source code not available)
Current crawling technologies only address the first category and, partially, the second. For the latter, crawlers use heuristics to append file and path names to reconstruct URLs. Heritrix even has a mode in which every possible combination of path and file name found in embedded JavaScript is generated and tested. This method has a high cost in terms of the number of fetches. Besides, it still misses the cases where variables are used as parameters to encode the URL.
For those cases, as well as for navigational links of the third category, the only solution is to actually execute the code to obtain the links. This is what LiWA has been exploring. Although the result of this research is proprietary technology, we describe the approach taken at a general level.
3.1 A new crawling paradigm
Executing pages in order to capture sites mainly requires three things.
The first is to run an execution environment (HTML plus JavaScript, Flash, etc.) in a controlled manner so that discoverable links can be extracted systematically. Web browsers can provide this functionality, but they are designed to execute and fetch links one at a time, following user interaction. The solution consists in tethering these browsers so that they execute all code containing links and extract these links without directly fetching the linked resources, adding them instead to a list (similar to a crawler frontier).
The second challenge is to encapsulate these headless browsers in a crawler-like workflow whose main purpose is to systematically explore all the branches of the hypertext tree. The difficulty comes from the fact that contextual variables can be used in some places, which makes a simple one-pass execution of the target code (HTML plus JavaScript, Flash, etc.) incomplete. This challenge has been called non-determinism [MBD*07].
The last, but not least, of the challenges is to optimize this process so that it can scale to the size required for archiving sites.
In most of the documented cases, these challenges have been addressed separately in the literature, for different purposes: for instance malware detection [WBJR06, MBD*07, YKLC08], site adaptation [JHBa08] and site testing [BTGH07]. However, to the best of our knowledge, LiWA is the first attempt to address all three together, and to do so for archiving purposes.
This is currently being implemented in the new crawler that one of the partners, Hanzo Archives Ltd, has been developing, and it is already used in production by them to archive a wide range of sites that cannot be archived by pre-existing crawlers, as well as in testing by another LiWA partner, the European Archive.
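To make the idea concrete, the sketch below illustrates the general principle of link discovery by execution; it is not the proprietary LiWA/Hanzo implementation. It assumes the Selenium WebDriver library with its HtmlUnit headless browser: the page is executed, the links visible after script execution are extracted, and they are added to a frontier-like queue instead of being fetched by the browser itself.

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.htmlunit.HtmlUnitDriver;

    /** Illustrative only: executes a page in a headless browser and adds the
     *  discovered links to a frontier queue instead of fetching them directly. */
    public class ExecutingLinkExtractor {

        private final Queue<String> frontier = new ArrayDeque<String>();
        private final Set<String> seen = new HashSet<String>();

        public void discover(String url) {
            // HtmlUnitDriver runs JavaScript in-process; no browser window is needed.
            WebDriver browser = new HtmlUnitDriver(true);
            try {
                browser.get(url);
                // Collect candidate navigational links after script execution.
                for (WebElement a : browser.findElements(By.tagName("a"))) {
                    enqueue(a.getAttribute("href"));
                }
                // Embedded resources (images, frames, scripts) are discovered the same way.
                for (WebElement img : browser.findElements(By.tagName("img"))) {
                    enqueue(img.getAttribute("src"));
                }
            } finally {
                browser.quit();
            }
        }

        private void enqueue(String link) {
            if (link != null && seen.add(link)) {
                frontier.add(link);   // fetched later by the crawler, not by the browser
            }
        }
    }

A production system would additionally have to trigger event handlers, re-execute the page under different contextual states to cope with non-determinism, and distribute the work over many browser instances.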
3.2 Capturing Streaming Multimedia
The Internet is becoming an important medium for the dissemination of multimedia
streams. However, the protocols used for traditional applications were not designed to
account for the specificities of multimedia streams, namely their size and real-time
needs. At the same time networks are shared by millions of users and have limited
bandwidth, unpredictable delay and availability. The design of real-time protocols for
multimedia applications is a challenge that multimedia networking must face.
Multimedia applications need a transport protocol to handle a common set of services. The transport protocol does not have to be as complex as TCP. The goal of the transport protocol is to provide end-to-end services that are specific to multimedia applications and that can be clearly distinguished from conventional data services:
- a basic framing service is needed, defining the unit of transfer, typically common with the unit of synchronization;
- multiplexing (combining two or more information channels onto a common transmission medium) is needed to identify separate media in streams;
- timely delivery is needed;
- synchronization is needed between different media, and it is also a common service for networked multimedia applications.
In streaming technologies, the transfer protocols carry the message packets, and communication takes place only through them.
Despite the growth in multimedia, there have been few studies that focus on characterizing the streaming audio and video stored on the Web. Mingzhe Li et al. presented in [LCKN05] the results of an investigation of nearly 30,000 streaming audio and video clips identified on 17 million Web pages from diverse geographic locations. The streaming media objects were analyzed to determine attributes such as media type, encoding format, playout duration, bitrate, resolution and codec. The streaming media content encountered is dominated by proprietary audio and video formats, with the top four commercial products being RealPlayer, Windows Media Player, MP3 and QuickTime. Like similar Web phenomena, the duration of streaming media follows a power-law distribution.
A more focused study was conducted in [BaSa06], analyzing a crawl sample of the media collections of several Dutch radio and TV Web sites. Three quarters of the streaming media were RealMedia files and almost one quarter were Windows Media files. The detection of streaming objects during the crawl proved to be difficult, as there are no conventions on file extensions and MIME types.
Another extensive data-driven analysis of the popularity distribution of user-generated video content is presented by Meeyoung Cha et al. in [CKR*07]. Video content in standard Video-on-Demand (VoD) systems has historically been created and supplied by a limited number of media producers. The advent of User-Generated Content (UGC) has reshaped the online video market enormously, as well as the way people watch video and TV. The paper analyses YouTube, the world's largest UGC VoD system, serving 100 million distinct videos and 65,000 uploads daily. The study focuses on the nature of user behaviour, different cache designs and the implications of different UGC services for the underlying infrastructures.
YouTube alone is estimated to carry 60% of all videos online, corresponding to a massive 50-200 Gb/s of server access bandwidth in a traditional client-server model.
3.3 State of the Art on the Streaming Capture Software
There are many tools, usually called streaming media recorders, that allow recording streaming audio and video content from the Internet. Most of them are commercial software, mostly running on Microsoft Windows, and few of them are really able to capture all kinds of streams.
Several research prototypes related to video streaming applications also include recording or capturing functionalities. In general, however, the proper capture and storage of the video content is not a central feature of the proposed systems. Each prototype typically deals with a particular type of stream distribution or stream analysis.
In the following we give a brief overview of existing tools and software related to stream capturing, grouped into commercial software, open-source software and research projects.
3.3.1 Off-the-shelf commercial software
Some commercial software, such as GetASFStream [ASFS] and CoCSoft Stream Down [CCSD], is able to capture streaming content through various streaming protocols. However, such software is usually not free, and the free tools have often run into legal difficulties, like StreamBox VCR [SVCR], which was taken to court.
Some useful information on capturing streaming media is summarised on the following Web sites:
http://all-streaming-media.com
http://www.how-to-capture-streaming-media.com
The most interesting tools listed on these sites are those running on Linux, as they are all free, open-source and command-line based. This last point is very important, as command-line software can easily be integrated into simple shell scripts or Java programs, whereas GUI tools (most Windows software) cannot.
3.3.2 Open-source software: MPlayer Project
MPlayer [MPlP] is an open-source media player project developed by volunteer programmers around the world. The MPlayer project is also supported by the Swiss Federal Institute of Technology in Zürich (ETHZ), which hosts the www4.mplayerhq.hu mirror, an alias for mplayer.ethz.ch. MPlayer is a command-line based media player, which also comes with an optional GUI. It allows playing and capturing a wide range of streaming media formats over various protocols. As of now, it supports streaming via HTTP/FTP, RTP/RTSP, MMS/MMST, MPST and SDP. In addition, MPlayer can dump streams (i.e. download them and save them to files on disk) and supports the HTTP, RTSP and MMS protocols to record Windows Media, RealMedia and QuickTime video content.
Since the MPlayer project is under constant development, new features, modules and codecs are constantly added. MPlayer also offers good documentation and a manual on its Web site, with ongoing help (e.g. for bug reports) through its mailing list and archives.
MPlayer runs on many platforms (Linux, Windows and MacOS) and includes a large set of codecs and libraries.
3.3.3 Research projects
Research projects usually focus on one of the following two aspects: the analysis of the
streaming content (audio/video encoding codecs, compression and optimizations) or
architectures for the distribution or efficient broadcast of the streams (content delivery
networks, P2P overlays, network traffic analysis, etc.).
However, several research projects provide capturing capabilities for the streaming
media and deal with the real-time protocols used for the broadcast.
A complex system for video streaming and recording is proposed by the HYDRA (High-performance Data Recording Architecture) project [ZPD*04]. It focuses on the acquisition, transmission, storage and rendering of high-resolution media such as high-quality video and multiple channels of audio. HYDRA consists of multiple components to achieve its overall functionality. Among these, the data-stream recorder includes two interfaces to interact with data sources: a session manager to handle RTSP communications and multiple recording gateways to receive RTP data streams. A data source connects to the recorder by initiating an RTSP session with the session manager, which performs the following functions: it controls admission for new streams, maintains RTSP sessions with sources and manages the recording gateways.
Malanik et al. describe a modular system which provides the capability to capture videos and screencasts from lectures and presentations in any academic or commercial environment [MDDC08]. The system is based on a client-server architecture. The client node sends streams from the available multimedia devices to the local area network. The server provides functions for capturing video from the streams and for distributing the captured video files via torrent.
The FESORIA system [PMV*08] is an analysis tool that is able to process the logs gathered from streaming servers and proxies. It combines the extracted information with other types of data, such as content metadata, content distribution network architecture, user preferences, etc. All this information is analyzed in order to generate reports on service performance, access evolution and user preferences, and thus to improve the presentation of the services.
With regard to TCP streaming delivered over HTTP, a recent measurement study [WKST08] indicated that a significant fraction of Internet streaming media is currently delivered over HTTP. TCP generally provides good streaming performance when the achievable TCP throughput is roughly twice the media bitrate, with only a few seconds of startup delay.
3.4 Rich Media Capture Module
The Rich Media Capture module is designed to enhance the capturing capabilities of the crawler with regard to different multimedia content types. The current version of Heritrix is mainly based on the HTTP/HTTPS protocol and cannot handle other content transfer protocols widely used for multimedia content (such as streaming protocols).
The Rich Media Capture module delegates the multimedia content retrieval to an external application (such as MPlayer) that is able to handle a larger spectrum of transfer protocols. The main performance indicator for this module is therefore the number of additionally archived multimedia types.
3.5 Integration into the LiWA Architecture
The module is constructed as an external plugin for Heritrix. Using this approach, the
identification and retrieval of streams is completely decoupled, allowing the use of more
efficient tools to analyze video and audio content. At the same time, using the external
tools helps in reducing the burden on the crawling process.
The module is composed of several subcomponents that communicate through
messages. We use an open standard communication protocol called Advanced
Message Queuing Protocol (AMQP).
The integration of the Rich Media Capture module is shown in Figure 3.2, and the workflow of the messages can be summarized as follows.
The plugin connected to Heritrix detects the URLs referencing streaming resources and constructs an AMQP message for each of them. This message is passed to a central Messaging Server. The role of the Messaging Server is to decouple the Heritrix crawler from the clustered streaming downloaders (i.e. the external capturing tools). The Messaging Server stores the URLs in queues and, when one of the streaming downloaders is available, it sends the next URL for processing.
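As an illustration of this message flow, the sketch below publishes a discovered streaming URL to an AMQP queue. It uses the RabbitMQ Java client; the broker location and the queue name ("liwa.streaming.urls") are assumptions made for the example, not part of the LiWA specification.

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    /** Illustrative only: publish a streaming URL discovered by the crawler
     *  to an AMQP queue read by the streaming downloaders. */
    public class StreamUrlPublisher {

        private static final String QUEUE = "liwa.streaming.urls"; // hypothetical queue name

        public void publish(String streamingUri) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost"); // assumed broker location
            Connection connection = factory.newConnection();
            try {
                Channel channel = connection.createChannel();
                // Durable queue so pending URLs survive a broker restart.
                channel.queueDeclare(QUEUE, true, false, false, null);
                channel.basicPublish("", QUEUE, null, streamingUri.getBytes("UTF-8"));
            } finally {
                connection.close();
            }
        }
    }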
In the software architecture of the module we identify three distinct sub-modules:
- a control module responsible for accessing the Messaging Server, starting new jobs, stopping them and sending alerts;
- a module used for stream identification and download (here an external tool is used, such as MPlayer);
- a module which repacks the downloaded stream into a format recognized by the access tools.
When available, a streaming downloader connects to the Messaging Server to request a new streaming URL to capture. Upon receiving the new URL, an initial analysis is done in order to detect some parameters, among others the type and the duration of the stream. If the stream is live, a fixed, configurable duration may be chosen.
After a successful identification, the actual download starts. The control module generates a job which is passed to MPlayer, along with safeguards to ensure that the download will not take longer than the initial estimation.
After a successful capture, the last step consists in wrapping the captured stream into the ARC/WARC format and moving it to the final storage.
Figure 3.2: Streaming capture module interacting with the crawler
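The sketch below illustrates how the download sub-module might drive MPlayer and enforce the duration safeguard mentioned above, using MPlayer's -dumpstream/-dumpfile options. It is a simplified illustration under these assumptions; the actual module also performs the initial stream analysis and the ARC/WARC repackaging.

    import java.io.File;
    import java.util.concurrent.TimeUnit;

    /** Illustrative only: dump a stream to disk with MPlayer and enforce a time limit. */
    public class StreamDumpJob {

        public boolean capture(String streamUri, File output, long maxSeconds) throws Exception {
            ProcessBuilder pb = new ProcessBuilder(
                    "mplayer", "-really-quiet", "-dumpstream",
                    "-dumpfile", output.getAbsolutePath(), streamUri);
            pb.redirectErrorStream(true);
            Process mplayer = pb.start();
            // Safeguard: do not let the download run longer than the estimated duration.
            if (!mplayer.waitFor(maxSeconds, TimeUnit.SECONDS)) {
                mplayer.destroyForcibly();
                return false;
            }
            return mplayer.exitValue() == 0;
        }
    }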
3.6 Evaluation and Optimizations
We conducted several test crawls using the new capturing module on the GOV.UK collection. This collection of UK governmental Web sites is regularly crawled and enriched monthly by the European Archive. During the last three monthly crawls, the capturing module was successfully used to retrieve the multimedia content accessible from these Web sites, which it had not been possible to archive with conventional archiving technologies.
The table below gives some examples of the discovered URIs. They use the standard
RTSP or MMS schemes, but one can notice that the video files are generally hosted on
a different Web server than the Web site.
Protocol  Web site                        URI
rtsp      http://www.epsrc.ac.uk          rtsp://rn.groovygecko.net/groovy/epsrc/EPSRC_Aroll_041208_hb.rv
                                          rtsp://rn.groovygecko.net/groovy/epsrc/Pioneers09_hb.rv
                                          ...
mms       http://www2.cimaglobal.com      mms://groovyg.edgestreams.net/groovyg/clients/Markettiers4dc/Video%20Features/11725/11725_cima_employers2_HowToTV.wmv
                                          ...
mms       http://www.businesslink.gov.uk  mms://msnvideo.wmod.llnwd.net/a392/d1/cmg/prb/Solutions_EPM_Final.wmv
                                          mms://msvcatalog-2.wmod.llnwd.net/a2249/e1/ft/share2/b297/0/Solutions – Sales_OSC_English-1.wmv
                                          ...
In the next test round we extended the testbed collection to some television Web sites, such as www.swr.de or www.ard.de, where the number of video streams is considerably larger.
We performed two complete crawls of these Web sites, including the video collections, as well as several weekly crawls capturing the latest updates of the sites and the newly published video content. The table below gives an overview of the total number of video URIs discovered and captured using the RTSP, RTMP and MMS protocols.
Protocol  Date of capture     Number of video content URIs
rtsp      May crawl           7089
mms       May crawl           74
rtmp      May crawl           2636
rtsp      July crawl          5058
rtmp      July crawl          3320
rtsp      20th July update    184
rtmp      20th July update    354
rtmp      24th July update    282
rtmp      30th July update    341
rtmp      06th August update  458
These tests have highlighted several issues raised by video capture on a larger scale, such as for the media collection of a large Web site:
- the number of video/audio URIs is relatively large (in the order of several thousands) and they represent an essential part of the site content. Missing the capture of these resources would greatly impact the quality of the archive;
- the video content changes frequently (several hundred new videos are published per week, replacing older ones), so the capturing mechanisms need optimizations in order to ensure the complete capture of all the videos;
- the scheduling policy for video capture should differ from the one used for the other types of resources (HTML, text, images, etc.). The size of a video stream, in terms of data, generally ranges from several KB to 100 MB or more, according to the length (in seconds) and quality of the video. The total downloading time is therefore unpredictable at the start of the Web site crawl. Moreover, the downloading speed may vary while capturing a long list of video streams.
A brief reflection on these aspects led us to several possible optimizations of the module.
The main issues emerging from the initial tests were related to the synchronization between the crawler and the external capture module. In the case of a large video collection hosted on a Web site, a sequential download of each video would take much longer than crawling the text pages. The crawler would therefore have to wait for the external module to finish the video download.
A speed-up of the video capture process can indeed be obtained by multiplying the number of downloaders. On the other hand, parallelising this process is limited by the maximum bandwidth available at the streaming server.
A feasible solution for managing the video downloaders would be to completely decouple the video capture module from the crawler and launch it in a post-processing phase. This implies replacing the crawler plugin with a log reader and an independent manager for the video downloaders.
The advantages of this approach would be:
- a global view of the total number of video URIs;
- a better management of the resources (the number of video downloaders sharing the bandwidth).
The main drawback of the method is related to the incoherencies that might appear between the crawl time of the Web site and the video capture in the post-processing phase:
- some video content might disappear (during the one- or two-day delay);
- the video download is blocked waiting for the end of the crawl.
Therefore, there is a trade-off to be made when managing the video downloads between shortening the time needed for the complete download, handling errors (for video content served by slow servers), and optimizing the total bandwidth used by multiple downloaders.
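A minimal sketch of this trade-off is given below: a fixed-size pool bounds the number of parallel downloaders so that they share the available bandwidth, while each capture job is delegated to the external tool. The pool size, the per-job time bound and the output file naming are assumptions to be tuned per collection.

    import java.io.File;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    /** Illustrative only: run stream captures with a bounded number of parallel downloaders. */
    public class DownloaderPool {

        public void captureAll(List<String> streamingUris, int maxDownloaders) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(maxDownloaders);
            for (final String uri : streamingUris) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            // Delegate to the external capture tool (see the StreamDumpJob sketch above);
                            // the output file name and the one-hour bound are placeholders.
                            new StreamDumpJob().capture(uri, new File(uri.hashCode() + ".dump"), 3600);
                        } catch (Exception e) {
                            // Error handling for slow or unreachable servers would go here.
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(24, TimeUnit.HOURS); // generous overall bound, assumption
        }
    }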
3.7 References
[ASFS] GetASFStream – Windows Media streams recorder. http://yps.nobody.jp/getasf.html

[BaSa06] N. Baly, F. Sauvin. Archiving Streaming Media on the Web, Proof of Concept and First Results. In the 6th International Web Archiving Workshop (IWAW'06), Alicante, Spain, 2006.

[BTGH07] C. Titus Brown, Gheorghe Gheorghiu and Jason Huggins. An Introduction to Testing Web Applications with twill and Selenium. O'Reilly, 2007.

[CCSD] CoCSoft Stream Down – Streaming media download tool. http://stream-down.cocsoft.com/index.html

[CKR*07] Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn and Sue Moon. "I tube, you tube, everybody tubes: analyzing the world's largest user generated content video system". In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, California, 2007.

[JHBa08] Jeffrey Nichols, Zhigang Hua and John Barton. Highlight: a system for creating and deploying mobile web applications. In Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology, 249-258, Monterey, CA, USA: ACM, 2008.

[LCKN05] Mingzhe Li, Mark Claypool, Robert Kinicki and James Nichols. "Characteristics of streaming media stored on the Web". In ACM Transactions on Internet Technology (TOIT), 2005.

[MBD*07] Alexander Moshchuk, Tanya Bragin, Damien Deville, Steven D. Gribble and Henry M. Levy. SpyProxy: execution-based detection of malicious web content. In Proceedings of the 16th USENIX Security Symposium, 1-16, Boston, MA: USENIX Association, 2007.

[MDDC08] David Malaník, Zdenek Drbálek, Tomáš Dulík and Miroslav Červenka. "System for capturing, streaming and sharing video files". In Proceedings of the 8th WSEAS International Conference on Distance Learning and Web Engineering, Santander, Spain, 2008.

[MPlP] MPlayer Project – http://www.mplayerhq.hu

[PMV*08] Xabiel García Pañeda, David Melendi, Manuel Vilas, Roberto García, Víctor García, Isabel Rodríguez. "FESORIA: An integrated system for analysis, management and smart presentation of audio/video streaming services". In Multimedia Tools and Applications, Volume 39, 2008.

[RFC3550] A Transport Protocol for Real-Time Applications (RTP). IETF Request for Comments 3550: http://tools.ietf.org/html/rfc3550

[RFC2326] Real Time Streaming Protocol (RTSP). IETF Request for Comments 2326: http://tools.ietf.org/html/rfc2326

[SVCR] StreamBox VCR – Video stream recorder. http://www.afterdawn.com/software/audio_software/audio_tools/streambox_vcr.cfm

[WBJR06] Yi-Min Wang, Doug Beck, Xuxian Jiang and Roussi Roussev. Automated Web Patrol with Strider HoneyMonkeys: Finding Web Sites that Exploit Browser Vulnerabilities. Microsoft Research, 2006. http://research.microsoft.com/apps/pubs/default.aspx?id=70182

[WKST08] Bing Wang, Jim Kurose, Prashant Shenoy, Don Towsley. "Multimedia streaming via TCP: An analytic performance study". In ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), 2008.

[YKLC08] Yang Yu, Hariharan Kolam, Lap-Chung Lam and Tzi-cker Chiueh. Applications of a feather-weight virtual machine. In Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 171-180, Seattle, WA, USA: ACM, 2008.

[ZPD*04] Roger Zimmermann, Moses Pawar, Dwipal A. Desai, Min Qin and Hong Zhu. "High resolution live streaming with the HYDRA architecture". In Computers in Entertainment (CIE), 2004.
4 Spam Cleansing
The ability to identify and prevent spam is a top priority issue for the search engine industry [HMS], but it has been less studied by Web archivists. The apparent lack of widespread dissemination of Web spam filtering methods in the archival community is surprising in view of the fact that, according to various measurements and estimates, roughly 10% of Web sites and 20% of individual HTML pages constitute spam. These figures directly translate into a 10-20% waste of archive resources in storage, processing and bandwidth.
Spam filtering is essential in Web archives even if we acknowledge the difficulty of defining the boundary between Web spam and honest search engine optimization. Archives may have to tolerate more spam than search engines in order not to lose content misclassified as spam that users may want to retrieve later. They might also want to keep some representative spam, either to preserve an accurate image of the Web or to provide a spam corpus for researchers. In any case, we believe that the quality of an archive with no spam filtering policy in place will deteriorate greatly and that a significant amount of resources will be wasted as a result of Web spam.
Spam classification and page-quality assessment are difficult issues for search engines; for archival systems they are even more challenging, as archives lack information about usage patterns (e.g., click profiles) at capture time. We survey the methods that best fit the needs of an archive, namely those capable of filtering spam during the crawl process or in a bootstrap sequence of crawls. Our methods combine classifiers based on the terms of a page and on features built from content, linkage and site structure.
Web spam filtering know-how has become widespread with the success of the Adversarial Information Retrieval Workshops (airweb.cse.lehigh.edu), held since 2005, which have hosted the Web Spam Challenges since 2007. Our mission is to disseminate this know-how and adapt it to the special needs of archival institutions. This implies putting a particular emphasis on periodic recrawls and on the time evolution of spam, such as quality sites that disappear and become parking domains used for spamming purposes, or spam that, once blacklisted, reappears under a new domain. In order to strengthen the ties between the two communities, we intend to provide time-aware Web spam benchmark data sets for future Web Spam Challenges.
4.1 State of the Art on Web Spam
As Web spammers manipulate several aspects of content as well as linkage [GGM2], effective spam hunting must combine a variety of content-based [FMN, FMN2, NNFM] and link-based [GGMP, WGD, BCS] methods.
4.1.1 Content features
The first generation of search engines relied mostly on the classic vector space model of information retrieval. Thus Web spam pioneers manipulated the content of Web pages by stuffing them with keywords repeated several times. A large number of machine-generated spam pages, such as the one in Figure 4.1, are still present on today's Web. These pages can be characterized as outliers through statistical analysis [FMN] targeting their template-like nature: their term distribution, entropy or compressibility distinguishes them from normal content. Large numbers of phrases that also appear in other Web pages likewise characterize spam [FMN2]. Sites exhibiting excessive phrase reuse are either template-driven or spam employing the so-called stitching technique. Ntoulas et al. [NNFM] describe content spamming characteristics, including an overly large number of words in the entire page, the title or the anchor text, as well as the fraction of the page drawn from popular words and the fraction of the most popular words that appear in the page.
Figure 4.1: Machine generated page with copied content
As most spammers act for financial gain [GGM], spam target pages are stuffed with a large number of keywords that are either of high advertisement value or highly spammed, including misspelled popular words such as “googel” or “accomodation”, as seen among the top hits of a major search engine in Figure 4.2. A page full of Google ads, possibly with no other content at all, is also a typical spammer technique to misuse Google AdSense for financial gain [BBCS], as seen in Figure 4.3. Similar misuse of eBay or the German Scout24.de affiliate program is also common practice [BCSV]. It is observed in [BBCS] that spam is characterized by its success, over popular or monetizable queries, in a search engine that does not deploy spam filtering. Lists of such queries can be obtained from search engine query logs or via AdWords, Google's flagship pay-per-click advertising product (http://adwords.google.com).
Figure 4.2: Spam in search engine hit list
Figure 4.3: Parked domain filled with Google ads.
Community content is particularly sensitive to so-called comment spam: responses, posts or tags that are unrelated to the topic and contain a link to a target site or an advertisement. This form of spam appears wherever users can add their own content without restriction, such as blogs [MCL], bookmarking systems [KHS] and even YouTube [BRAAZR]. We have experimented with tools based on language model disagreement [MCL].
Based on the existing literature on content spam, a sample of the LiWA baseline features includes:
- the number of pages in the host;
- the number of characters in the host name;
- the number of words in the home page and in the maximum PageRank page;
- the average word length and the average length of the title;
- precision and recall for frequent and monetizable queries.
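As an illustration, the sketch below computes three simple content measures of the kind listed above (word count, average word length and compression rate) for the text of a page. The exact feature definitions used in LiWA may differ.

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;

    /** Illustrative only: a few simple content features computed from page text. */
    public class ContentFeatures {

        public static int wordCount(String text) {
            return text.trim().isEmpty() ? 0 : text.trim().split("\\s+").length;
        }

        public static double averageWordLength(String text) {
            String[] words = text.trim().split("\\s+");
            long chars = 0;
            for (String w : words) chars += w.length();
            return words.length == 0 ? 0.0 : (double) chars / words.length;
        }

        /** Compression rate: compressed size divided by original size (lower = more redundant). */
        public static double compressionRate(String text) throws Exception {
            byte[] input = text.getBytes("UTF-8");
            if (input.length == 0) return 1.0;
            Deflater deflater = new Deflater();
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                out.write(buffer, 0, deflater.deflate(buffer));
            }
            deflater.end();
            return (double) out.size() / input.length;
        }
    }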
4.1.2 Link features
Following Google's success, all major search engines quickly incorporated link analysis algorithms such as HITS [K] and PageRank [PBMW] into their ranking schemes. The birth of the highly successful PageRank algorithm [PBMW] was indeed partially motivated by the easy spammability of the simple in-degree count. Unfortunately, PageRank (together with probably all known link-based ranking schemes) is prone to spam. Spammers build so-called link farms: large collections of tightly interconnected Web sites over diverse domains that eventually all point to the targeted page. The rank of the target will be large regardless of the ranking method, due to the large number of links and the tightly connected structure. An example of a well-known link farm in operation for several years now is the 411Web page collection; the content of these sites is likely not spam (indeed they are not excluded from Google), but they form a strongly optimized sub-graph that illustrates the operation of a link farm well.
Based on the existing literature on link spam, a sample of the LiWA baseline features includes link-based features for the hosts, measured both for the home page and for the page with the maximum PageRank in each host, such as:
- in-degree, out-degree;
- PageRank, TrustRank, Truncated PageRank;
- edge reciprocity, assortativity coefficient;
- estimation of supporters;
along with simple numeric transformations of the link-based features for the hosts.
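The sketch below illustrates how a few of these link features (in-degree, out-degree and edge reciprocity) can be computed on a host-level graph; the adjacency-set representation is an assumption made for the example.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    /** Illustrative only: degree and reciprocity features on a simple host-level link graph. */
    public class HostGraphFeatures {

        // host -> set of hosts it links to
        private final Map<String, Set<String>> outLinks = new HashMap<String, Set<String>>();

        public void addLink(String from, String to) {
            Set<String> targets = outLinks.get(from);
            if (targets == null) {
                targets = new HashSet<String>();
                outLinks.put(from, targets);
            }
            targets.add(to);
        }

        public int outDegree(String host) {
            Set<String> targets = outLinks.get(host);
            return targets == null ? 0 : targets.size();
        }

        public int inDegree(String host) {
            int in = 0;
            for (Set<String> targets : outLinks.values()) {
                if (targets.contains(host)) in++;
            }
            return in;
        }

        /** Fraction of a host's outlinks that are reciprocated by a link back. */
        public double edgeReciprocity(String host) {
            Set<String> targets = outLinks.get(host);
            if (targets == null || targets.isEmpty()) return 0.0;
            int reciprocal = 0;
            for (String t : targets) {
                Set<String> back = outLinks.get(t);
                if (back != null && back.contains(host)) reciprocal++;
            }
            return (double) reciprocal / targets.size();
        }
    }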
4.1.3 Stacked Graphical Learning
Recently, several results have appeared that apply rank propagation to extend initial trust or distrust judgments over a small set of seed pages or sites to the entire Web, such as trust propagation [GGMP, WGD], distrust propagation in the neighborhood, their combination [WGD], as well as graph-based similarity measures [BCS]. These methods are based on propagating either trust forward or distrust backwards along the hyperlinks, following the idea that honest pages predominantly point to honest ones or, stated the other way around, that spam pages are backlinked only by spam pages.
Figure 4.4: Schematic idea of stacked graphical learning
Stacked graphical learning, introduced by Kou and Cohen [KC], is a simple implementation of propagation that outperforms the computationally more expensive variants. It is performed within the classifier combination framework as follows (see Figure 4.4 above). First, the base classifiers are built and combined, giving a prediction p(u) for every unlabeled node u. Next, for each node v we construct new features based on the predictions p(u) of its neighbors and the weight of the connection between u and v, as described in [CBSL], and classify the nodes with a decision tree. Finally, classifier combination is applied to the augmented set of classification results; this procedure is repeated for two iterations as suggested by [CDGMS].
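A minimal sketch of one stacking step is given below: for a node v, the predictions p(u) of its neighbours are aggregated into a single new feature using the edge weights. The weighted-average aggregation is one possible choice made for the example; the actual feature construction follows [CBSL].

    import java.util.List;
    import java.util.Map;

    /** Illustrative only: one stacked-graphical-learning step that turns neighbour
     *  predictions into a new feature for a node. */
    public class StackedFeature {

        /** A weighted edge from the node being featurized to a neighbour. */
        public static class Edge {
            final String neighbour;
            final double weight;
            public Edge(String neighbour, double weight) {
                this.neighbour = neighbour;
                this.weight = weight;
            }
        }

        /** Weighted average of the predicted spamicity p(u) over the neighbours u of node v. */
        public static double neighbourhoodSpamicity(List<Edge> edgesOfV, Map<String, Double> predictions) {
            double weighted = 0.0, total = 0.0;
            for (Edge e : edgesOfV) {
                Double p = predictions.get(e.neighbour);
                if (p == null) continue;          // nodes without a prediction are skipped
                weighted += e.weight * p;
                total += e.weight;
            }
            // The resulting value is appended to v's feature vector before re-classification.
            return total == 0.0 ? 0.0 : weighted / total;
        }
    }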
As new results, we used stacked graphical features based on the “Connectivity Sonar” of Amitay et al. Our new features include the distribution of in- and outlinks labeled spam within the site, the average level of spam in- and outlinks, and the top- and leaf-level link spamicity. As a further novelty, we also tested various edge weights, in particular weights inferred from a generative “linked LDA” model.
4.2 Spam Filter Module
The main objective of spam cleansing is to reduce the amount of fake content the archive will have to deal with. The envisioned toolkit will help prioritize crawls by automatically detecting content of value and excluding artificially generated, manipulative and useless content, possibly based on models built in a bootstrap procedure.
In addition to individual solutions for specific archives, the LiWA services intend to provide collaboration tools to share known spam hosts and features across participating archival institutions. A common interface to a central knowledge base will be built in which archive operators may label sites or pages as spam, based on their own experience or on suggestions by the spam classifier applied to the local archives. The purpose of the planned LiWA Web spam assessment interface is twofold:
- it aids the Archive operator in selecting and blacklisting spam sites, possibly in conjunction with an active learning environment where human assistance is requested, for example in case of contradictory outcomes from the classifier ensemble;
- it provides a collaboration tool for the Archives, with a possible centralized knowledge base through which the Archive operators are able to share their labels, comments and observations, as well as start discussions on the behaviour of certain questionable hosts.
The Spam Filter module described in D3.1 Archive Filtering Technology V1 takes WARC-format crawls as input and outputs a list of the sites with a predicted spamicity (the strength of similarity in content or behaviour to spam sites) as a value between 0 and 1.
The current LiWA solution is based on the lessons learned from the Web Spam Challenges [CCD]. As it has turned out, the feature set described in [CDGMS] and the bag-of-words representation of the site content [ACC] give a very strong baseline, with only minor improvements achieved by the Challenge participants. We use the combination, listed in the observed order of their strength, of the following classifiers: SVM over tf.idf; an augmented set of the statistical spam features of [CDGMS] together with transformed feature variants; graph stacking [CBSL]; text classification by latent Dirichlet allocation [BSB] as well as by compression [BFCLZ, C].
The LiWA baseline content feature set consists of the following language-independent measures:
- the number of pages in the host;
- the number of characters in the host name, in the text, in the title, in the anchor text, etc.;
- the fraction of code vs. text;
- the compression rate and entropy;
- the rank of a page for popular queries.
Whenever a feature refers to a page instead of the host, we select the home page as well as the maximum PageRank page of the host, in addition to host-level averages and standard deviations. We also classify based on the average tf.idf vector of the host.
In the LiWA baseline link feature set we use measures of in- and out-degree, reciprocity, assortativity, (truncated) PageRank, TrustRank [GGMP] and neighborhood sizes, together with the logarithm and other derivatives of most values.
Next we describe our main findings related to the use of these methods by archives. As a key element, archives may possess a large number of different time snapshots of the same domain. In this setup, we observe that our classifiers are:
- stable across snapshots. We may apply an old model to a more recent collection without major deterioration of quality, despite the fact that there is a relatively large change due to the appearance and disappearance of hosts over time;
- unstable across different crawl strategies. The WEBSPAM-UK2007 test data was collected by a very different crawl strategy and contains only 14K sites, whereas all other research snapshots of the .uk domain have more than 100,000. Here the WEBSPAM-UK2007 model fails badly for all other crawls.
In conclusion, we may reuse the same classifier with little modification for a near-future crawl, but applying a model generated by another institution under a different domain or crawling strategy needs further research.
Beyond the state of the art, we were able to improve classification quality by exploiting the change of features over time, including their variance, stability and normalized versions, as well as by selecting stable hosts for training. As new features we investigate the:
- creation and disappearance of sites and pages;
- burst and decay of the neighborhood;
- change in degree and rank;
- percentage of change in content.
Prior to our research in LiWA, only statistical characteristics of this collection were investigated [BSV, BBDSV].
4.3 Evaluation
Novel to the LiWA project, a sequence of periodic recrawls is made available for the purposes of spam filtering development, for the first time outside the major search engine operators. The data set of 13 UK snapshots (UK-2006-05 … UK-2007-05, where the first snapshot is WEBSPAM-UK2006 and the last is WEBSPAM-UK2007), provided by the Laboratory for Web Algorithmics of the Università degli Studi di Milano [BCSV] with support from the DSI-DELIS project, was processed. The LiWA test bed consists of more than 10,000 manual labels that proved to be useful over this data.
We conducted the test on 16 April 2009 over the WEBSPAM-UK2007 data set converted into WARC 0.19 by SZTAKI as part of the LiWA test bed. For testing and training we used the predefined labeled subsets of WEBSPAM-UK2007.
For testing purposes, the output of the evaluation script is a Weka classifier output that
contains a summary of the relevant performance measures over a predefined labelled
test set. The results of the test are given in Table 4.1 below.
                         Training set   Test set
Size                     4000           2053
True positive            236            72
True negative            2461           1242
False positive           2              24
False negative           1301           715
Correctly classified     2697           1314
Incorrectly classified   1303*          739*
Precision                0.154*         0.091*
Recall                   0.992          0.75
F1                       0.266          0.163
AUC (ROC area)           0.895          0.756

Table 4.1: Quality measures of spam classification. Starred values may be improved by increasing the threshold and reducing recall.
When comparing to the baseline, we use the AUC measure, since all other measures are sensitive to changing the threshold used to separate the spam and non-spam classes in the prediction. The best performing Web Spam Challenge 2008 participant reached an AUC of 0.85 [GJW], while our result reached 0.80. Some of the research codes still require industry-level, tested implementations and will gradually be added to the LiWA code base. We also expect progress in reducing the resource needs of the feature generation code.
4.4 Integration into the LiWA Architecture
The LiWA Spam Filtering Architecture is summarized in Figure 4.5.
1. The data source is always local, in the form of a WARC archive. When acting as a crawl-time plug-in, the WARC at a checkpoint may be analysed to build a new model with an updated blacklist for the next crawling phase. Local data is typically huge and cannot be transferred to another location.
2. When accessing the raw data, host-level (or, in certain applications, page-level) features are generated. This data portion is small and can be easily stored, retrieved and shared, even across different institutions.
3. The main step of the procedure is the model building and classification step. Training a model is costly and is done in batch between crawls. Models are small and easy to distribute, and they can be applied in crawl-time plug-ins.
4. A key aspect of a successful spam filter is the quality of the manually labelled
training data. To this end, the design involves an active learning environment in
which the classifier presents cases where the decision is uncertain so that the
largest accuracy gain is achieved by the new labels.
Figure 4.5: Overview of LiWA Spam Filter Architecture.
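The selection step of point 4 above can be illustrated by the following sketch, which ranks hosts by how close their predicted spamicity is to the decision boundary; the 0.5 boundary and the plain distance measure are simplifying assumptions, not the LiWA implementation.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    /** Illustrative only: pick the hosts whose predicted spamicity is most uncertain
     *  so that manual labels bring the largest accuracy gain. */
    public class ActiveLearningSelector {

        public static List<String> mostUncertain(Map<String, Double> spamicity, int howMany) {
            List<String> hosts = new ArrayList<String>(spamicity.keySet());
            // Distance from the 0.5 decision boundary; smaller means more uncertain.
            hosts.sort(Comparator.comparingDouble((String h) -> Math.abs(spamicity.get(h) - 0.5)));
            return hosts.subList(0, Math.min(howMany, hosts.size()));
        }
    }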
4.4.1 Batch feature generation and classification
In the LiWA Spam Classifier Ensemble we split the features into related sets and use the best-fitting classifier for each. These classifiers are then combined by a random forest, a method that, in our cross-validation experiment, outperformed the logistic regression suggested by [HMS]. We used the classifier implementations of the machine learning toolkit Weka [WF], as it is open source, mature in quality, and provides high-quality implementations of most state-of-the-art classifiers.
The computational resources for the filtering procedure are moderate. Content features are generated by reading the collection once. For link features, typically only a host graph has to be built, which is very small even for large batch crawls. Training the classifier for a few hundred thousand sites can be completed within a day on a single CPU on a commodity machine with 4-16 GB of RAM; here costs strongly depend on the classifier implementation. Given the trained classifier, a new site can be classified even at crawl time, provided the crawler is able to compute the required feature set for the newly encountered site.
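As an illustration of the Weka-based batch step, the sketch below trains a random forest on host-level features exported to an ARFF file and reads off a spamicity score for a host. The file name and the position of the spam class are assumptions made for the example.

    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    /** Illustrative only: train a Weka random forest on host-level spam features. */
    public class SpamModelTrainer {

        public static RandomForest train(String arffPath) throws Exception {
            Instances data = DataSource.read(arffPath);      // e.g. "host-features.arff" (hypothetical)
            data.setClassIndex(data.numAttributes() - 1);    // last attribute holds the spam label
            RandomForest forest = new RandomForest();
            forest.buildClassifier(data);
            return forest;
        }

        /** Predicted probability that a host (one Weka instance) is spam. */
        public static double spamicity(RandomForest forest, Instances data, int row) throws Exception {
            double[] dist = forest.distributionForInstance(data.instance(row));
            return dist.length > 1 ? dist[1] : dist[0];      // index of the "spam" class is an assumption
        }
    }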
4.4.2 Assessment interface design
While no single Web archive is likely to have spam filtering resources comparable to a major search engine, our envisioned method facilitates collaboration and knowledge sharing between specialized archives, in particular for spam that spans domain boundaries. To illustrate, assume that an archive targets the .uk domain. The crawl encounters the site www.discountchildrensclothes.co.uk, which contains a redirection to the .com domain that further redirects to .it. These .it sites were already flagged as spam by another partner; hence, that knowledge can be incorporated into the current spam filtering procedure.
The LiWA developments are planned to aid the international Web archiving community in building and maintaining a worldwide data set of Web spam. Since most features, and in particular link features, are language independent, a global collection will help all archives regardless of their target top-level domain.
The need for manual labeling is the single most important blocker of high-quality spam filtering. In addition to label sharing, the envisioned solution will also help coordinate the labeling efforts in an active learning environment: manual assessment will be supported by a target selection method that proposes sites of a target domain that are ambiguously classified based on the existing common knowledge.
The mockup of the assessment interface is modeled on the Web Spam Challenge 2007 volunteer interface [CCD]. The right side of Figure 4.6 is for browsing in a tabbed fashion. In order to integrate the temporal dimension of an archive, the available crawl times are shown (the access bar). Upon clicking, the page that appears is the one whose crawl date is closest to the crawl date of the linking page.
The selected version of the linked page can be either one cached at a partner archive or the current version downloaded from the Web. We use Firefox extension techniques similar to Zotero to note and organize information without interfering with rendering, frames and redirection. The possibility to choose between a stored and the currently available page also helps in detecting cloaking.
The right side also contains in- and outlinks as well as a list or sample of the pages of the site. By clicking on an in- or outlink, we may obtain all available information in all the subwindows from the central service.
The upper part of the left side is used for the assessment. The button “NEXT” links to the next site to be assessed in the active learning framework and “BACK” to the review page. When “NEXT” or “BACK” is pushed, the assigned label is saved. Before saving a “spam” or “borderline” label, a popup window appears requesting an explanation and the spam type. The spam type can be general, link or content, and the appropriate types should be ticked. The ticked types appear as part of the explanation. Although not shown in the figure, a text field is available for commenting on the site. The explanations and comments appear in the review page.
The lower right part of the assessment page contains four windows in a tabbed fashion:
• The labels already assigned to this site (the number of each of the four possible types), with comments if any.
• Various spam classifier scores, and an LDA-based content model [BSB].
• Various site attributes selected by the classification model as most appropriate for deciding the label of the site.
• Whois information, with links to other sites of the owner.
In a first implementation we fill this interface with the 12 crawl snapshots of the .uk domain gathered by the UbiCrawler [BCSV] between June 2006 and May 2007 [CCD]. In the “Links and pages” tab we show 12 bits for the presence of the page, while the access bar at the bottom of the assessment page shows “jun, jul, . . . , may” and “now”, color-coded for availability and navigation.
Figure 4.6: LiWA Spam Assessment Interface.
The above interface will be part of the LiWA WP10 Community Platform that forms a
centralized service and will act as a knowledge base also for crawler traps, crawling
strategies for various content management technologies and other issues related to
Web host archival behavior.
4.4.3 Crawl-time filtering infrastructure design
In order to save bandwidth and storage, we have to filter out spam at crawl time. For a host that is already blacklisted, we may simply discard all URIs to save bandwidth. However, for a yet unseen host we have to obtain a few pages to build a model and then apply our classifier, again at crawl time. From then on, no more URIs will be retrieved from spam hosts.
The crawl-time spam filter accepts/rejects URIs and/or domains based on the spam
analysis results of either earlier crawls or the previous checkpoint of the current crawl.
The crawler(s) continuously write WARC(s) that, at some points in time, the spam filter
also reads and processes. We synchronize concurrent access at checkpoints where the
crawlers start writing a new WARC and the earlier ones are then free to be used by the
filter.
The main function
public class SpamDecideRule extends DecideRule {
    @Override
    protected DecideResult innerDecide(ProcessorURI uri) {...}
}
simply checks the black- and whitelists and commands the crawler to discard the URI if it belongs to a possible spam host. The lists are updated from text files whenever their content changes (push method).
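For illustration, a hypothetical sketch of what a concrete innerDecide body might look like is shown below. The Heritrix imports are omitted, the DecideResult.REJECT/NONE semantics are assumed from Heritrix 3, and the hostOf() helper is an invented placeholder rather than the actual LiWA implementation.

// Heritrix types (DecideRule, DecideResult, ProcessorURI) come from the crawler; imports omitted.
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;

public class SpamDecideRuleSketch extends DecideRule {

    // host names flagged as spam; refreshed whenever the backing text file changes (push)
    private final Set<String> blacklist = new ConcurrentSkipListSet<String>();

    @Override
    protected DecideResult innerDecide(ProcessorURI uri) {
        // reject URIs from blacklisted hosts, leave all other URIs to subsequent rules
        return blacklist.contains(hostOf(uri)) ? DecideResult.REJECT : DecideResult.NONE;
    }

    // placeholder host extraction on the raw URI string (assumed helper)
    private String hostOf(ProcessorURI uri) {
        String s = uri.toString().replaceFirst("^[a-z]+://", "");
        int slash = s.indexOf('/');
        return slash < 0 ? s : s.substring(0, slash);
    }
}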
The spam filter also interacts with the host queue prioritization of the crawler.
4.5 References
[ACC] J. Abernethy, O. Chapelle, and C. Castillo. WITCH: A New Approach to Web Spam Detection. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
[BBC] A. A. Benczúr, I. Bíró, K. Csalogány, and T. Sarlós. Web spam detection via commercial intent analysis. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), held in conjunction with WWW2007, 2007.
[BCS] A. A. Benczúr, K. Csalogány, and T. Sarlós. Link-based similarity search to fight web spam. In Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), held in conjunction with SIGIR2006, 2006.
[BRAAZR] F. Benevenuto, T. Rodrigues, V. Almeida, J. Almeida, C. Zhang, and K. Ross. Identifying video spammers in online social networks. In AIRWeb '08: Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web. ACM Press, 2008.
[BSB] I. Bíró, J. Szabó, and A. A. Benczúr. Latent Dirichlet Allocation in Web Spam Filtering. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
[BCSV] P. Boldi, B. Codenotti, M. Santini, and S. Vigna. UbiCrawler: A scalable fully distributed Web crawler. Software: Practice & Experience, 34(8):721–726, 2004.
[BSV] P. Boldi, M. Santini, and S. Vigna. A Large Time Aware Web Graph. SIGIR Forum, 42, 2008.
[BBDSV] I. Bordino, P. Boldi, D. Donato, M. Santini, and S. Vigna. Temporal evolution of the uk Web, 2008.
[BFCLZ] A. Bratko, B. Filipič, G. Cormack, T. Lynam, and B. Zupan. Spam Filtering Using Statistical Data Compression Models. The Journal of Machine Learning Research, 7:2673–2698, 2006.
[CCD] C. Castillo, K. Chellapilla, and L. Denoyer. Web spam challenge 2008. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
[CDGMS] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the Web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 423–430, 2007.
[C] G. Cormack. Content-based Web Spam Detection. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2007.
[CBSL] K. Csalogány, A. Benczúr, D. Siklósi, and L. Lukács. Semi-Supervised Learning: A Comparative Study for Web Spam and Telephone User Churn. In Graph Labeling Workshop in conjunction with ECML/PKDD 2007, 2007.
[FMN] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics – Using statistical analysis to locate spam Web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB), pages 1–6, Paris, France, 2004.
[FMN2] D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the World Wide Web. In Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR), Salvador, Brazil, 2005.
[GJW] G.-G. Geng, X.-B. Jin, and C.-H. Wang. CASIA at Web Spam Challenge 2008 Track III. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
[GGM] Z. Gyöngyi and H. Garcia-Molina. Spam: It's not just for inboxes anymore. IEEE Computer Magazine, 38(10):28–34, October 2005.
[GGM2] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.
[GGMP] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), pages 576–587, Toronto, Canada, 2004.
[HMS] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in Web search engines. SIGIR Forum, 36(2):11–22, 2002.
[K] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[KC] Z. Kou and W. W. Cohen. Stacked graphical models for efficient inference in Markov random fields. In SDM 07, 2007.
[KHS] B. Krause, A. Hotho, and G. Stumme. The anti-social tagger – detecting spam in social bookmarking systems. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
[MCL] G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model disagreement. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.
[NNMF] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International World Wide Web Conference (WWW), pages 83–92, Edinburgh, Scotland, 2006.
[PBMW] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Technical Report 1999-66, Stanford University, 1998.
[WF] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.
[WGD] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote Web spam. In Workshop on Models of Trust for the Web, Edinburgh, Scotland, 2006.
5 Temporal Coherence
The coherence of data in terms of proper dating and proper cross-linkage is influenced
by the temporal characteristics (duration, frequency, etc.) of the crawl process. Web
archiving is commonly understood as a continuous process that aims at archiving the
entire Web (broad scope). However, a typical scenario in archiving institutions or
companies is to periodically (e.g. monthly) create high quality captures of a certain Web
site. These periodic domain scope crawls of Web sites aim at obtaining a best possible
representation of a site.
Figure 5.1 contains an abstract representation of such a domain scope crawling
process. The Web site consists of n pages (p1,…,pn). Each of them has several successive versions, indicated by the horizontal lines (e.g., pn has three different
versions in [t;t’]). Ideally, the result of a crawl would be a complete and instantaneous
snapshot of all pages at a given point of time.
Figure 5.1: Web site crawling process (domain scope). (The figure shows the crawling progress over the crawl interval [t;t'] across the pages p1,…,pn of the Web site.)
In reality, one crawl requires an extended time period to gather all pages of a site, while the site is potentially being modified in parallel, thus causing incoherencies in the archive. The risk of incoherence increases further due to politeness constraints and the need for sophisticated time stamping mechanisms.
An ideal approach to Web archiving would be to have captures for every domain at any point in time whenever there is a (small) change in any of the domain's pages. Of course, this is absolutely infeasible given the enormous size of the Web, high content-production rates in blogs and other Web 2.0 venues, the disk and server costs of a Web archive, and also the politeness rules that Web sites impose on crawlers.
We therefore settle for the realistic goal of capturing Web sites at convenient points
(whenever the crawler decides to devote resources to the site and the site does not
seem to be highly loaded), but when doing so, the capture should be as “authentic” as
possible. In order to ensure an “as of time point x (or interval [x;y])“ capture of a Web
site, we therefore develop strategies that ensure coherence of crawls regarding a time
point or interval, and identify those contents of the Web site that violate coherence
[SDM*09].
5.1 State of the Art on Archive Coherence
The most comprehensive overview on Web archiving is given by Masanès [Masa06].
He describes the various aspects involved in the archiving as well as the subsequent
accessing process. The issue of coherence is introduced as well, but only some heuristics for measuring and improving the archiving crawler's front line are suggested.
Other related research mostly focuses on aligning crawlers towards more efficient and
fresher Web indexes. B. E. Brewington and G. Cybenko [BrCy00] analyze changes of
Web sites and draw conclusions about how often they must be reindexed. The issue of
crawl efficiency is addressed by Cho et al. [CGPa98]. They state that the design of a
good crawler is important for many reasons (e.g. ordering and frequency of URLs to be
visited) and present an algorithm that obtains more relevant pages (according to their
definition) first. In a subsequent study Cho and Garcia-Molina describe the development
of an effective incremental crawler [ChGa00]. They aim at improving the collection's
freshness by bringing in new pages in a timelier manner. Their studies on effective page refresh policies for Web crawlers [ChGa03a] head in the same direction; here, they introduce a Poisson-process-based change model of data sources. In another study, they estimated the frequency of change of online data [ChGa03b]. For that purpose, they developed several frequency estimators in order to improve Web crawlers and Web caches. Research by Olston and Pandey [OlPa08], who propose a recrawl schedule based on information longevity in order to achieve good freshness, goes in a similar direction. Another study about crawling strategies is presented by Najork and Wiener [NaWi01]. They found that breadth-first search downloads hot pages first, but also that the average quality of the pages decreases over time. Therefore, they suggest performing strict breadth-first search in order to enhance the likelihood of retrieving important pages first.
However, aiming at an improved crawl performance against the background of assured
crawl coherence requires a slightly different alignment. Our task therefore is to achieve
both: Increase the probability of obtaining largely coherent crawls and identify those
contents violating coherence.
5.2 Temporal Coherence Module
In order to identify contents violating coherence and improve the crawling strategy with
respect to temporal coherence, proper dating of Web contents is needed. Hence,
techniques for (meta-)data extraction of Web contents have been implemented and the
correctness of these methods has been tested. Unfortunately, the reliability of last-modified stamps cannot be guaranteed, since Web servers cannot be fully trusted. In the following, we first introduce our strategies to ensure proper dating of Web contents and subsequently describe our coherence-improving crawling strategy on top of these properly dated Web contents.
Proper Dating of Web Contents
Proper dating technologies are required to know how fresh a Web page is – that is, the date (and time) of its last modification. The canonical way of time stamping a Web page is to use its Last-Modified HTTP header, which is unfortunately unreliable.
For that reason, another dating technique is to exploit the content’s semantic
timestamps. This might be a global timestamp (for instance, a date preceded by “Last
modified:” in the footer of a Web page) or a set of timestamps for individual items in the
page, such as news stories, blog posts, comments, etc. However, the extraction of
semantic timestamps requires the application of heuristics, which imply a certain level of
uncertainty. Finally, the most costly – but 100% reliable – method is to compare a page with its previously downloaded version. Due to cost and efficiency reasons we pursue a potentially multistage change measurement procedure (a code sketch follows the list):
1) Check HTTP timestamp. If it is present and is trustworthy, stop here.
2) Check content timestamp. If it is present and is trustworthy, stop here.
3) Compare a hash of the page with previously downloaded hash.
4) Elimination of non-significant differences (ads, fortunes, request timestamp):
a) only hash text content, or “useful” text content
b) compare distribution of n-grams (shingling)
c) or even compute edit distance with previous version.
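A minimal sketch of this cascade is given below; the PageSnapshot container, its trust flags, and the "useful text" field are hypothetical, and stage 4 (shingling or edit distance) is only indicated by a comment.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Date;

public class ChangeDetector {

    /** Minimal container for the fields the cascade needs (hypothetical). */
    public static class PageSnapshot {
        Date fetchTime, lastModified, contentTimestamp;
        boolean trustHttpHeaders, trustContentTimestamp;
        String usefulText;              // text content with ads, fortunes, request timestamps stripped
    }

    /** Multistage test: returns true if the new capture should be considered changed. */
    public static boolean hasChanged(PageSnapshot oldPage, PageSnapshot newPage) throws Exception {
        // 1) HTTP Last-Modified header, if present and trusted
        if (newPage.lastModified != null && newPage.trustHttpHeaders) {
            return newPage.lastModified.after(oldPage.fetchTime);
        }
        // 2) semantic timestamp extracted from the content, if present and trusted
        if (newPage.contentTimestamp != null && newPage.trustContentTimestamp) {
            return newPage.contentTimestamp.after(oldPage.fetchTime);
        }
        // 3) hash comparison of the "useful" text content
        return !sha1(oldPage.usefulText).equals(sha1(newPage.usefulText));
        // 4) shingling or edit distance could refine the decision further (not shown)
    }

    private static String sha1(String text) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(text.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}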
On the basis of these dating technologies we are able to develop coherence-improving capturing strategies that allow us to reconcile temporal information across multiple captures and/or multiple archives.
Coherence Measurement
Due to the aforementioned unreliability of last-modified stamps caused by the missing trustworthiness of Web servers, the only 100% reliable method is to create a “virtual time stamp” ourselves by comparing the page's ETag or content hash with its previously downloaded version. To this end, we introduce an induced coherence measure that allows the crawler to gain full control over the contents being compared.
We apply a crawl-revisit sequence σ(c;r), where c denotes a crawl of the Web pages (p1,…,pn) and r a subsequent revisit. In this consecutive revisit process we obtain a second (and potentially different) version of the previously crawled pages, denoted as (p1',…,pn'). Hence, the crawl-revisit sequence σ(c;r) consists of n crawl-revisit tuples τ(pi;pi') with i ∈ {1,…,n}. The time of downloading page pi is denoted as t(pi) = tj with j ∈ [1;n], and the time of revisiting page pi is denoted as t(pi') = tk with k ∈ [n;2n−1]. Technically, the last crawled page pn with t(pn) = tn is not revisited again, but is considered as crawled and revisited page at the same time. Hence, the revisits take place in the time interval [tn+1;t2n−1]. For convenience, [ts;te] denotes the crawl interval, where ts = t1 is the starting point of the crawl (download of the first page) and te = tn is its ending point (download of the last page). Similarly, we denote by [ts';te'] the revisit interval, where ts' = te = tn is the starting point of the revisit (download of the last visited page, which is at the same time the first revisited page) and te' is its ending point (download of the last revisited page). In addition, we define the ETag or content hash of a crawled or revisited page as ζ(m) with m ∈ {pi;pi'}. Overall, a complete crawl-revisit sequence σ(c;r) spans the interval [t1;t2n−1]. It starts at ts = t1 with the first download of the crawl and ends at te' = t2n−1 with the last revisit download.
Now, coherence of two or more pages exists if there is a time point tcoherence between the visit of the pages t(pi) and the subsequent revisit t(pi') at which the ETag or content hash ζ of the corresponding pages (ζ(m) with m ∈ {pi;pi'}) has not changed. This is formally denoted as:

∃ tcoherence ∀ i ∈ {1,…,n}: t(pi) ≤ tcoherence ≤ t(pi') ∧ ζ(pi) = ζ(pi')
Figure 5.2 highlights the functioning of our coherence measure applied to a Web site
consisting of n pages. We assume a download sequence p1,…,p4 spanning the crawl
interval [t1;t4] and an inverted subsequent revisit sequence p3,…, p1 spanning the revisit
interval [t5;t7]. This figure depicts n successful coherence tests. This results in an
assurable coherence statement for the entire Web site valid at time point tcoherence = t4.
Figure 5.2: Web site crawling with successful coherence tests
By contrast, Figure 5.3 indicates a failed inducible coherence test for the crawl-revisit tuple τ(p2;p2'). In this case, page p2 was modified at some point between t(p2) = t2 and t(p2') = t6, which results in a failed inducible coherence test. Due to non-existing or non-reliable last-modified stamps we are not able to determine the exact time point of the modification. We only obtain a boolean result from the failed ETag or hash comparison for the crawl-revisit tuple τ(p2;p2'). The whole interval is flagged as insecure, even though the modification might have taken place far beyond the aspired coherence time point (tcoherence = t4). Thus, even if the site is coherent from a global point of view for tcoherence = t4, a real-life crawler is not able to figure this out. Consequently, no assurable coherence statement can be given for the entire Web site, since there is an insecure time interval with respect to the crawl-revisit tuple τ(p2;p2').
Figure 5.3: Web site crawling with failed coherence test for crawl-revisit tuple τ(p2;p2')
In reality, and against the background of large Web sites, it is almost infeasible to achieve an assurable coherence statement for an entire Web site based on this coherence measure. Still, we might be interested in specifying how “coherent” the remaining parts of our crawl c are. For that purpose, we introduce a metric that allows us to express the quality of a crawl c. The error function f(τ(pi,pi')) that counts the occurring incoherences for a crawl-revisit tuple τ(pi,pi') of the crawl-revisit sequence σ(c,r) is defined as:

f(τ(pi,pi')) = 0 if ζ(pi) = ζ(pi'), and 1 otherwise

The overall quality of a crawl c is then evaluated as:

C(c) = 1 − (1/n) · Σ(i=1..n) f(τ(pi,pi')),   n ≥ 1
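A minimal sketch of this quality computation, assuming the ETag or content hashes of the visit and the revisit of each page are available as two parallel lists:

import java.util.List;

public class CrawlQuality {

    /** C(c) = 1 - (1/n) * sum of f(tau(pi, pi')), where f is 0 if the hashes match and 1 otherwise. */
    public static double quality(List<String> visitHashes, List<String> revisitHashes) {
        int n = visitHashes.size();
        int incoherent = 0;
        for (int i = 0; i < n; i++) {
            if (!visitHashes.get(i).equals(revisitHashes.get(i))) {
                incoherent++;                           // failed coherence test for tuple (pi, pi')
            }
        }
        return 1.0 - (double) incoherent / n;
    }
}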
Since the risk of a single Web page pi being incoherent heavily depends on its position in the crawl-revisit sequence σ(c,r), we will now introduce our coherence improving capturing strategy.
Coherence Improving Capturing Strategy
In order to increase the overall quality of a crawl, we examine the probability (and thus the risk) of crawling incoherent contents. The probability of a single page pi being incoherent with respect to the reference time point or time interval tcoherence is an important parameter to consider when scheduling a crawl. Incoherence occurs when a page pi is subject to one or more modifications µi* that are in “conflict” with the ongoing crawl. A conflict with respect to coherence occurs if:

∃ µi*: µi* ∈ [t(pi), t(pi')]

That means the page has been modified at least once between its download during the crawling phase at t(pi) and its revisit at t(pi'). Given a page's change probability λi (which can be statistically predicted based on page type [e.g., MIME types], depth within the site [e.g., distance to site entry points], and URL name [e.g., manually edited user homepages vs. pages generated by content management systems]), its download time t(pi) and its revisit time t(pi'), the probability of conflict P(pi) is given as:

P(pi) = 1 − (1 − λi)^(t(pi') − t(pi))
Potentially conflicting slots in applying inducible coherence are shown in Figure 5.4. In
this example, a crawl ordering from top to bottom (p1,…,pn) and revisits from bottom to top (pn−1,…,p1) are applied. The illustration differentiates between those slots where
a change of page pi affects the coherence of crawl c and others that do not. This results
in a set of concatenated slots (different in size) that represents (overall) the risk of a
crawl being affected by changes.
Figure 5.4: Periled slots in crawl/revisit pairs applying “virtual time stamping”. (Legend: periled slots enter the exponent of P(pi); visit/revisit of pi marked per slot; D – don't-care slot.)
Coherence Improving Crawl Scheduling
As a consequence of the previous observations, we can identify two factors that influence the potential incoherence of a page pi with respect to the reference coherence time point tcoherence: the change probability λi of page pi and its download (and revisit) time t(pi) (and t(pi')). Hence, our coherence-optimized crawling strategy incorporates both factors.
The starting point is a list of pages pi to be crawled, sorted in descending order of their change probabilities λi. Like before, the intention is to identify those pages that might overstep the readiness-to-assume-risk threshold θ. Since now all pages need to be scheduled with respect to the reference time point treference = tn, the download time of the last page crawled during the crawling phase, we need a different queuing strategy: we try to create a V-like access schedule having the (large) slots of stable pages on top and the (small) slots of unstable ones at the bottom (cf. Figure 5.4). Again, we start by assigning the uncritical slot of length 0 (as we assume that changes of Web pages may occur at most once per time unit, immediately before a download) to the most critical content at the first position (pslot = 1) of our queue. Since, initially, the length of the slot in the “joker” position (tn) to be assigned is zero, the threshold condition does not hold. However, from then on t (and thus the size of the slots) increases stepwise, so that any download bears the risk of being incoherent. To this end, we evaluate the current page's conflict probability P(pslot) against the user-defined threshold (P(pslot) ≥ θ). As it is rarely possible to include all pages in this V-like structure, we split the download schedule into a promising section and a hopeless section. In case the given threshold is exceeded, we move the page at
pslot to the lastpromising position, which is (at this point in time) the first position after those pages not exceeding the conflict threshold θ. Otherwise, the page is scheduled for download at pslot. This process continues until all pages pi have been scheduled either in the promising section or in the hopeless section. In the next stage, the crawl itself starts. During the crawling phase, we begin with the most hopeless pages first and then continue with those pages that have been allocated in the promising section. After completion, we directly initiate the revisit phase in reverse order: we begin with the first element after the “joker” position (pslot = 2) and continue until the revisit of the remaining pages has been completed. A pseudo code implementation of the described strategy is shown in Figure 5.5.
input:    p1,…,pn – list of pages in descending order of λi
          θ – readiness-to-assume-risk threshold
begin
    Start with: slot = 1, lastpromising = n
    while slot ≤ lastpromising do
        if P(pslot) ≥ θ then
            /* conflict expected! */
            Move pslot to position lastpromising
            Decrease promising boundary: lastpromising −−
        else
            Increase slot counter: slot ++
        end
    end
    slot = n
    while slot ≥ 1 do
        /* visit from hopeless to promising */
        Download page pslot
        Decrease slot counter: slot −−
    end
    slot = 2
    while slot ≤ n do
        /* revisit from promising to hopeless */
        Revisit page pslot
        Increase slot counter: slot ++
    end
end
Figure 5.5: Pseudo code of coherence improved crawl scheduling
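For illustration, a minimal Java sketch of the scheduling step follows. It assumes an exposure of 2·(slot−1) time units between visit and revisit (one time unit per download and one per revisit), and the Page container and list-based bookkeeping are simplifications, not the project's implementation.

import java.util.ArrayList;
import java.util.List;

public class CoherenceScheduler {

    /** A page with its statistically predicted change rate lambda_i (illustrative container). */
    public static class Page {
        final String url;
        final double changeRate;
        Page(String url, double changeRate) { this.url = url; this.changeRate = changeRate; }
    }

    /**
     * Builds the download order of Figure 5.5: pages must be given in descending order of
     * their change rates, theta is the readiness-to-assume-risk threshold. The revisit order
     * is the reverse of the returned list, skipping its first element (the joker page).
     */
    public static List<Page> downloadOrder(List<Page> pagesByChangeRateDesc, double theta) {
        List<Page> promising = new ArrayList<Page>();   // pages kept at their slot (slot 1 = joker)
        List<Page> hopeless = new ArrayList<Page>();    // pages whose conflict risk exceeds theta
        int slot = 1;
        for (Page p : pagesByChangeRateDesc) {
            // exposure between visit and revisit grows with the slot index
            // (assumption: one time unit per download and one per revisit)
            double conflict = 1.0 - Math.pow(1.0 - p.changeRate, 2.0 * (slot - 1));
            if (conflict >= theta) {
                hopeless.add(p);                        // moved behind the promising boundary
            } else {
                promising.add(p);                       // scheduled at the current slot
                slot++;
            }
        }
        // crawl phase: hopeless section first, then the promising section from its
        // largest slot down to slot 1, so the most critical page is fetched last
        List<Page> order = new ArrayList<Page>(hopeless);
        for (int i = promising.size() - 1; i >= 0; i--) {
            order.add(promising.get(i));
        }
        return order;
    }
}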
5.3 Evaluation and Visualization
The main performance indicator for the temporal coherence module is the fraction of
accurately dated content and the crawl cost measures. The cost of the crawl can be
measured by parameters such as the number of downloads, bandwidth consumed or
crawl duration. As we have outlined before, a full guarantee on properly dated content
requires the more sophisticated “virtual time stamping mechanism” (compared with a
single access strategy relying on the accuracy of last modified time stamps). This
implies that temporal coherence and crawl cost are contradictory objectives. However, it
is possible to ensure and evaluate proper dating of contents and reduce the crawl cost
in a subsequent step.
Evaluation of Coherence Improved Crawling
In terms of coherence improved crawling, we measure the percentage of content in a
Web site that is coherently crawled (that means “as of the same time point or time
interval”). Conventional implementations of archiving crawlers are based on a prioritydriven variant of the breadth-first-search (BFS) crawling strategy and do not incorporate
revisits. However, “virtual time stamping” is unavoidable in order to determine
coherence under real life crawling conditions. Therefore, the performance of our
algorithms is evaluated against comparable modifications of conventional crawling
strategies such as BFS-LIFO (breadth-first-search combined with last-in-first-out) or
BFS-FIFO (breadth-first-search combined with first-in-first-out). In addition, we indicate
baselines for optimal and worst case crawling strategies, which are obtained from full
knowledge about changes within all pages pi during the entire crawl-revisit interval.
Hence, these baselines can only be regarded as theoretically achievable limits of coherence.
Experiments were run on synthetic data in order to investigate the performance of various crawling strategies within a controlled test environment. All parameters are
freely adjustable in order to resemble real life Web site behavior. Each experiment
follows the same procedure, but varies in size of Web contents and change rate. We
model site changes by Poisson processes with page-specific change rates. These rates
can be statistically predicted based on page type (e.g., MIME types), depth within the
site (e.g., distance to site entry points), and URL name (e.g., manually edited user
homepages vs. pages generated by content management systems).
Each page of the data set has a change rate λi. Based on a Poisson process model, the time between two successive changes of page pi is then exponentially distributed with parameter λi:

P[time between changes of pi is less than time unit τ] = 1 − e^(−λi·τ)

Equivalently, the probability that pi changes k times within one time unit follows a Poisson distribution:

P[k changes of pi in one time unit] = (λi^k · e^(−λi)) / k!
Within the simulation environment a change history is generated, which registers every change per time unit. The probability that page pi changes at ti is then:

P[pi has at least one change] = 1 − e^(−λi)

In order to resemble real-life conditions, we simulated small to medium size crawls of Web sites consisting of 10,000 – 50,000 contents. In addition, we simulated the sites' change behavior to vary from nearly static to almost unstable. All experiments followed the same procedure, but varied in the size of the Web contents and the change rate. Each page of the data set has a change probability λi in the interval [0;1]. Within the simulation environment a change history was generated, which registered every change per time unit. The probability that page pi changed at tj is P(µi) = P[ρ(tj) ≤ λi], where ρ(tj) is a function that generates per time unit a uniformly distributed random number in [0;1].
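A sketch of this per-time-unit change simulation is shown below; whether the per-step change probability is taken directly as λi (as in the experiments) or as 1 − e^(−λi) (as in the Poisson model) is left to the caller.

import java.util.Random;

public class ChangeSimulator {

    /**
     * Generates a change history for one page over 'timeUnits' steps: entry t is true if the
     * page changed at time t. The per-step change probability is passed in directly
     * (lambda_i in [0;1] as in the simulation, or 1 - exp(-lambda_i) for the Poisson model).
     */
    public static boolean[] changeHistory(double pChangePerUnit, int timeUnits, Random rnd) {
        boolean[] changed = new boolean[timeUnits];
        for (int t = 0; t < timeUnits; t++) {
            changed[t] = rnd.nextDouble() <= pChangePerUnit;   // rho(t) <= lambda_i
        }
        return changed;
    }
}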
Figure 5.6: Comparison of inducible crawling strategies in a Web site of 10,000 contents
Figure 5.6 depicts the results of our improved inducible crawling strategy compared with its “competitors” BFS-LIFO and BFS-FIFO. Our improved crawling strategy always performs better than the best possible conventional crawling strategy. The experiments are based on a Web site containing 10,000 contents and different readiness-to-assume-risk thresholds θ in the range [0.45;0.7]. In addition, our strategy performs about 10% better given non-pathological Web site behaviour (neither completely static nor almost unstable). Values of θ in [0;0.45) or (0.7;1] perform less effectively. They induce either a too “risk-avoidant” (θ ∈ [0;0.45)) or a too “risk-ignorant” (θ ∈ (0.7;1]) scheduling with minor (or even zero) performance gain, e.g. when acting “risk-ignorant” on heavily changing sites or “risk-avoidant” on mostly static sites. Comparable results have also been produced for larger (and smaller) Web sites with similar change distributions in numerous experiments.
Figure 5.7: Excerpt of a crawl’s automatically generated temporal coherence report
Analysis and Visualization of Crawl Coherence
The analysis of coherence defects measures the quality of a capture either directly at
runtime or between two captures. To this end, we have developed methods for
automatically generating sophisticated statistics per capture (e.g. number of defects
occurred sorted by defect type) as part of our analysis environment. Figure 5.7 contains
a screenshot that contains an excerpt of such a temporal coherence report. In addition,
the capturing process is traced and enhanced with statistical data for exports in
graphML. Hence, it is also possible to layout a capture’s spanning tree and visualize its
coherence defects by applying graphML compliant software. This visual metaphor is
intended as an additional means to automated statistics for understanding the problems
that occurred during capturing. Figure 5.8 depicts a sample visualization of an mpi-inf.mpg.de domain capture (about 65,000 Web contents) with the Visone software (cf. http://visone.info/ for details). From a node's size, shape, and color the user gets an immediate overview of the success or failure of the capturing process. In particular, a node's size is proportional to the amount of coherent Web contents contained in its sub-tree. In the same sense, a node's color highlights its “coherence status”: while green stands for coherence, the signal colors yellow and red indicate defects (content modifications and/or link structure changes). The most serious defect class of
missing contents is colored in black. Finally, a node’s shape indicates its MIME type
ranging from circles (HTML contents), hexagons (multimedia contents), rounded
rectangles (Flash or similar), squares (PDF contents and other binaries) to triangles
(DNS lookups). Altogether, the analysis and visualization features developed aim at
helping the crawl engineer to better understand the nature of change(s) within or
between Web sites and – consequently – to adapt the crawling strategy/frequency for
future captures. As a result, this will also help increase the overall archive’s coherence.
Figure 5.8: Coherence defect visualization of a sample domain capture (mpi-inf.mpg.de) by Visone
5.4 Integration into the LiWA Architecture
Technically, the LiWA Temporal Coherence module is subdivided into a modified version
of the Heritrix crawler (including a LiWA coherence processor V1) and its associated
(Oracle) database. Here, (meta-)data extracted within the modified Heritrix crawler are
stored and made accessible as distinct capture-revisit tuples. In addition, arbitrary
captures can be combined as artificial capture-revisit tuples of “virtually” decelerated
captures. In parallel, we created a simulation environment that employs the same
algorithms we have developed in the measuring environment, but gives us full control
over the content (changes) and allows us to perform extreme tests (in terms of change
frequency, crawling speed and/or crawling strategy). Thus, experiments employing our
coherence ensuring crawling algorithms can be carried out with different expectations
about the status of Web contents and can be compared against ground truth.
Figure 5.9 depicts a flowchart highlighting the main aspects of the LiWA Temporal
Coherence processor V1 in Heritrix. The elements shown in green are unchanged compared with the standard Heritrix crawler. The bluish items represent
methods of the existing crawler that have been adapted to our revisit strategy. Finally,
the red unit represents an additional processing step of the LiWA temporal coherence
module.
Figure 5.9: Flowchart of LiWA Temporal Coherence processor V1 in Heritrix
The second component of the LiWA Temporal Coherence module is the analysis and
visualization environment. It serves as a means to measure the quality of a capture either
directly at runtime (online) or between two captures (offline). For that purpose, statistical
data per capture (e.g. number of defects occurred sorted by defect type) is computed
from the associated (Oracle) database after crawl completion as part of LiWA Temporal
Coherence processor V1 in Heritrix.
5.5 References
[BrCy00] B. E. Brewington and G. Cybenko. Keeping up with the changing Web. Computer, 33(5):52-58, May 2000.
[ChGa00] J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In VLDB '00: Proc. of the 26th intl. conf. on Very Large Data Bases, pages 200-209, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
[ChGa03a] J. Cho and H. Garcia-Molina. Effective Page Refresh Policies for Web Crawlers. ACM Transactions on Database Systems, 28(4), 2003.
[ChGa03b] J. Cho and H. Garcia-Molina. Estimating Frequency of Change. ACM Trans. Inter. Tech., 3(3):256-290, Aug. 2003.
[CGPa98] J. Cho, H. Garcia-Molina, and L. Page. Efficient Crawling through URL ordering. In WWW7: Proc. of the 7th intl. conf. on World Wide Web, pages 161-172, Amsterdam, The Netherlands, 1998. Elsevier Science Publishers B. V.
[Masa06] J. Masanès. Web Archiving. Springer-Verlag New York, Inc., Secaucus, NJ, 2006.
[NaWi01] M. Najork and J. L. Wiener. Breadth-First Search Crawling Yields High-Quality Pages. In Proc. of the 10th intl. World Wide Web conf., pages 114-118, 2001.
[OlPa08] C. Olston and S. Pandey. Recrawl Scheduling based on Information Longevity. In WWW '08: Proceedings of the 17th intl. conf. on World Wide Web, pages 437-446. ACM, 2008.
[SDM*09] M. Spaniol, D. Denev, A. Mazeika, P. Senellart and G. Weikum. Data Quality in Web Archiving. In Proceedings of the 3rd Workshop on Information Credibility on the Web (WICOW 2009) in conjunction with the 18th World Wide Web Conference (WWW 2009), Madrid, Spain, April 20, pp. 19-26, 2009.
6 Semantic Evolution
Preserving knowledge for future generations is a major reason for collecting all kinds of
publications, web pages, etc. in archives. However, ensuring the archival of content is
just the first step toward “full” content preservation. It also has to be guaranteed that
content can be found and interpreted in the long run.
This type of semantic accessibility of content suffers due to changes in language over
time, especially if we consider time frames beyond ten years. Language changes are
triggered by various factors including new insights, political and cultural trends, new
legal requirements, high-impact events, etc. Due to this terminology development over time, searches with standard information retrieval techniques using current language or terminology will not be able to find all relevant content created in the past, when other terms were used to express the sought content. To keep archives semantically accessible, it is necessary to develop methods for automatically dealing
with terminology evolution.
6.1 State of the Art on Terminology Evolution
The act of automatically detecting terminology evolution given a corpus can be divided
into two subtasks. The first one is to automatically determine, from a large digital
corpus, the senses of terms. This task is generally referred to as Word Sense
Discrimination. For this task we present the state of the art and give a description of the different approaches available. The second task takes place once several snapshots of words and their senses have been created using corpora from different periods of time. After having obtained these snapshots, terminology evolution can be detected between any two (or a series of) instances. To our knowledge, little or no previous work has been done directly on this topic, and thus we investigate the state of the art in related areas such as the evolution of clusters in dynamic networks and ontology evolution.
6.1.1 Word Sense Detection
In this section we give a general overview of the state of the art in Word Sense Discrimination (WSD) as well as related fields.
Word Sense Discrimination is a subtask of Word Sense Disambiguation. The task of Word Sense Disambiguation is, given an occurrence of an ambiguous word and its context (usually the sentence or the surrounding words in a window), to determine which sense is referred to. Usually the senses used in Word Sense Disambiguation come from explicit knowledge sources such as thesauri or ontologies. Word Sense Discrimination, on the other hand, is the task of automatically finding the senses of the words present in a collection. If an explicit knowledge source is not used, Word Sense Discrimination can be considered a subtask of Word Sense Disambiguation.
Using WSD instead of a thesaurus or other explicit knowledge sources has several
advantages. Firstly, the method can be applied to domain-specific corpora where few or no knowledge sources can be found. Examples are detailed technical data,
for instance in biology or chemistry, and, at the other end of the spectrum, blogs, where much slang and many gadget names are used. Secondly, man-made thesauri often contain specific word senses which might not be relevant to the particular corpus. For example, WordNet [Mil95] lists for the word “computer” the sense “an expert in operating calculating machines”, which is certainly a less frequently used meaning.
The output from WSD is normally a set of terms to describe senses found in a
collection. This grouping of terms is derived from clustering. We refer to such an
automatically found sense as a cluster and throughout this document we will be using
the terms clusters and senses interchangeably. Clustering techniques can be divided
into hard and soft clustering algorithms. In hard clustering an element can only appear
in one cluster, while soft clustering allows each element to appear in several. Due to the
polysemous property of words, soft clustering is most appropriate for Word Sense
Discrimination.
6.1.2 Word Sense Discrimination
Word Sense Discrimination techniques can be divided into two major groups,
supervised and unsupervised. Due to the vast amounts of data available on the Web
and – as a consequence – stored in Web archives, we will be focusing on unsupervised
techniques only.
Automatic Word Sense Discrimination
According to Schütze [Sch98], the basic idea of context group discrimination is to
induce senses from contextual similarities. Each occurrence of an ambiguous word in a
training set is mapped to a point in word space, called first order co-occurrence vectors.
The similarity between two points is measured by cosine similarity. A context vector is
then considered as the centroid (or sum) or the vectors of the words occurring in the
context. This set of context vectors, also considered second order co-occurrence
vectors, are then clustered into a number of coherent clusters or contexts using
Buckshot, which is a combination of the Expectation Maximization-algorithm and
agglomerative clustering. The representation of a sense is the centroid of its cluster.
Occurrences of ambiguous words from the test set are mapped in the same way as
words from the training set and labelled using the closest sense vector. The method is
completely unsupervised since manual tagging is not required. However, disadvantages
are that the clustering is hard and the number of clusters has to be predetermined.
A systematic comparison of unsupervised WSD techniques for clustering instances of words using both vector and space similarity is conducted by Purandare et al. [PP04]. The authors compare the aforementioned method of Schütze and the method of Pedersen et al. [PB97], which uses first-order co-occurrence vectors. The result is twofold: second-order context vectors have an advantage over first-order vectors for small training data; however, for larger amounts of homogeneous data such as the “Line, Hard and Serve” data [HLS], a first-order context vector representation with the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) clustering algorithm is
the most effective at WSD.
Word Sense Discrimination using dependency triples
The use of dependency triples is one alternative for WSD algorithms, first described in [Lin98]. In this paper a word similarity measure is proposed and an automatically created thesaurus using this similarity is evaluated. The measure is based on one proposed by Lin in 1997 [Lin97]. This method has the restriction of using hard clustering. The author reports the method to work well, but no formal evaluation is done.
In 2002, Pantel et al. published “Discovering Word Senses from Text” [PL02]. In the paper a clustering algorithm called Clustering By Committee (CBC) is presented. The paper also proposes a method for evaluating the output of a word sense clustering algorithm against WordNet. Since then, the method has been widely used [Dor07, DfML07, RKZ08, Fer04]. In addition, it has been implemented in the WordNet::Similarity package by Ted Pedersen et al. (a description of this package can be found at http://www.d.umn.edu/~tpederse/similarity.html; more papers using the Lin measure as implemented by the WordNet::Similarity package are listed at http://www.d.umn.edu/~tpederse/wnsim-bib/). Pantel et al. implemented several other algorithms such as Buckshot, K-means and Average Link, and showed that CBC outperforms all implemented algorithms in both recall and precision.
Graph algorithms for Word Sense Discrimination
“Using curvature and Markov clustering in graphs for lexical acquisition and word sense discrimination” by Dorow et al. [DWL+05] represents the third category of unsupervised Word Sense Discrimination techniques. A word graph G is built using nouns and noun phrases extracted from the British National Corpus [BNC]. Each noun or noun phrase becomes a node in the graph. Edges exist between all pairs of nodes that co-occur in the corpus, more precisely if the terms are separated by “and”, “or” or commas. The curvature curv of a node w is defined as follows:

curv(w) = (# triangles w participates in) / (# triangles w could participate in)

As a triangle we consider three nodes, of which w is one, that are all pairwise connected. This is also referred to as the clustering coefficient of a node. Curvature is a way of measuring the semantic cohesiveness of the neighbours of a word. If a word has a stable meaning, the curvature value will be high. On the other hand, if a word is ambiguous, the curvature value will be low because the word is linked to members of different senses which are not interrelated. The results show that the curvature value is particularly suited for measuring the degree of ambiguity of words.
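A sketch of the curvature computation on an adjacency-set representation of the word graph is given below; the Map-based graph representation is an assumption for illustration.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class Curvature {

    /** Word graph as adjacency sets: noun (phrase) -> co-occurring nouns (undirected). */
    public static double curvature(Map<String, Set<String>> graph, String w) {
        Set<String> neighbours = graph.getOrDefault(w, new HashSet<String>());
        int k = neighbours.size();
        if (k < 2) {
            return 0.0;                                  // no triangle is possible
        }
        int triangles = 0;
        for (String u : neighbours) {
            for (String v : neighbours) {
                // count each unordered pair once; a triangle exists if u and v are connected
                if (u.compareTo(v) < 0 && graph.getOrDefault(u, new HashSet<String>()).contains(v)) {
                    triangles++;
                }
            }
        }
        return triangles / (k * (k - 1) / 2.0);          // triangles found / triangles possible
    }
}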
A more thorough investigation of the curvature measure and the curvature clustering
algorithm is given in [Dor07]. An analysis of the curvature algorithm is made on the BNC
corpus and the evaluation method proposed in [PL02] is employed. A high performance
of the curvature clustering algorithm is noted which comes at the expense of low
coverage.
Another work related to clustering networks is the paper by Palla et al. [PD+05] on “Uncovering the overlapping community structure of complex networks in
nature and society”. They define a cluster as a “k-clique community”, i.e. a union of k-cliques (complete subgraphs of size k) that can be reached from each other through a series of adjacent k-cliques. Adjacent k-cliques share k−1 nodes. They also conclude that relaxing this criterion is often equivalent to decreasing k. The experiments are conducted by clustering three types of graphs: co-authorship, word association and protein interaction. For the first two graphs the average clustering coefficient of the communities found is 0.44 and 0.56, respectively. This leads us to believe that 0.5 could be a good threshold for the clustering coefficient in curvature clustering, as used in [Dor07, DWL+05].
6.1.3 Summary
We have investigated three main methods for automatically discovering word senses from large corpora. The method presented by Pantel et al. in [PL02] gives clusters where each element has some likelihood of belonging to the cluster; this has the advantage of assigning more significant elements to a cluster. The third method, presented by Dorow et al. in [DWL+05], uses a graph-theoretical approach and reports higher precision than the one found in [PL02]. The findings of [PD+05] give us an indication of which value to use as the curvature threshold for the curvature clustering algorithm.
6.2 Detecting Evolution
Analysis of communities and their temporal evolution in dynamic networks has been a
well studied field in recent years [LCZ+08, PBV07]. A community can be modelled as a
graph where each node represents an individual and each edge represents an interaction among individuals. As an example, in a co-authorship graph each author is considered as a node and collaboration between any two authors is represented by an edge. When it comes to detecting evolution, the traditional approach has been to first detect the community structure for each time slice and then compare these structures to determine correspondences. These methods tend to introduce dramatic evolutions in short periods of time and can hence be less appropriate for noisy data.
A different path for detecting evolution is to model the community structure at the current time while also taking previous structures into account [LCZ+08]. This can help prevent dramatic changes introduced by noise.
A naïve way of determining cluster evolution in the traditional setting would be to simply consider each cluster from a time slice as a set of terms or nodes and then align it with all the clusters from the consecutive time slice using the Jaccard similarity. This measures the number of overlapping nodes between two clusters divided by the total number of distinct nodes in the two clusters. We could then conclude that the clusters with the highest overlap from two consecutive time slots are connected and that one evolved into the other. A more sophisticated way to detect evolution, which also takes into account the edge structure within clusters, has been proposed by Palla et al. [PBV07].
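A sketch of this naïve alignment, representing each cluster simply as a set of terms:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ClusterAlignment {

    /** Jaccard similarity: overlapping terms divided by all distinct terms of the two clusters. */
    public static double jaccard(Set<String> clusterA, Set<String> clusterB) {
        Set<String> intersection = new HashSet<String>(clusterA);
        intersection.retainAll(clusterB);
        Set<String> union = new HashSet<String>(clusterA);
        union.addAll(clusterB);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    /** Index of the best matching cluster in the next time slice, or -1 if nothing overlaps. */
    public static int bestMatch(Set<String> cluster, List<Set<String>> nextSliceClusters) {
        int best = -1;
        double bestSimilarity = 0.0;
        for (int i = 0; i < nextSliceClusters.size(); i++) {
            double similarity = jaccard(cluster, nextSliceClusters.get(i));
            if (similarity > bestSimilarity) {
                bestSimilarity = similarity;
                best = i;                                // candidate for "one sense evolved into the other"
            }
        }
        return best;
    }
}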
6.2.1 Quantifying social group evolution
Using the clique percolation method (CPM) [PD+05], communities in a network are discovered. The communities of each time step are extracted from two types of graphs using CPM and are then tracked over the time steps. The basic events that can occur in the lifetime of a community are the following:
1. A community can grow or contract
2. A community can merge or split
3. A community can be born while others may disappear
Similar cluster events are used in “MONIC – Modeling and Monitoring Cluster
Transitions” presented in 2006 by Spiliopoulou et al. [SNTS06]. The methods proposed
in MONIC are not based on topological properties of clusters, but on the contents of the
underlying data stream. A typification of cluster transitions and a transition detection
algorithm are proposed. The approach assumes a hard clustering where clusters are non-overlapping regions described through a set of attributes. In this framework, internal transitions are monitored only for clusters that exist for more than one time point. Size, compactness and shift of the centre point are monitored. The disadvantages of the method
are that the algorithm assumes a hard clustering and that each cluster is considered a
set of elements without respect to the links between the elements of the cluster.
A method for describing and tracking evolutions can be found in “Discovering
Evolutionary Theme Patterns from Text” by Mei et al. [MZ05]. In this paper discovering
and summarizing the evolutionary patterns of themes in a text stream is investigated.
The problem is defined as follows:
1. Discover latent themes from text
2. Construct an evolution graph of themes
3. Analyze life cycles of themes
The proposed method is suitable for text streams in which meaningful time stamps exist and is tested in the newspaper domain as well as on abstracts of scientific papers. The theme evolution graphs proposed in this paper seem particularly suitable for describing terminology evolution. However, the technology we use for finding clusters corresponding to word senses will differ from the probabilistic topic model chosen there. Therefore the definitions must be modified to suit our purposes; for example, our clusters cannot be defined in the same way as themes are in this paper, and similarity between clusters cannot be measured like similarity between two themes. In FacetNet [LCZ+08]
communities are discovered from social network data and an analysis of the community
evolution is made in a manner that differs from the view of static graphs. The main
difference to the traditional method is that FacetNet discovers communities that jointly
maximize the correspondence of the observed data and the temporal evolution.
FacetNet discovers community structure at a given time step t which is determined both
by the data at t and by the historic community pattern. The method is unlikely to
discover community structure that introduces dramatic evolutions in a very short time
period. The method also uses a soft community membership where each element
belongs to any community with a certain probability. The method is shown to be more
robust to noise. It also introduces a method where all members do not contribute
equally to the evolution.
A third method of finding evolutions in networks is influenced by “Identification of Time-Varying Objects on the Web” by Oyama et al. [OST08]. They proposed a method to determine whether data found on the Web originate from the same or from different objects. The
method takes into account the possibility of changes in the attribute values over time.
The identification is based on two estimated probabilities. The probability that observed
data is from the same object that has changed over time and the probability that
observed data are from different objects. Each object is assigned certain attributes like
age, job, salary etc. Once the object schema is determined, objects are clustered using
an agglomerative clustering approach. Observations in the same cluster are assumed
to belong to the same object. The experiments conducted show that the proposed method improves precision and recall of object identification compared to methods that regard all values as constants. If the objects are people, the method is able to
identify a person even when he/she has aged or changed jobs.
The disadvantage of this method is that the attribute types, as well as the probability models and their parameters, must be determined using domain knowledge. For terminology
evolution, once clusters are found from different time periods, each sense can be
considered an object. Each cluster found in the snapshot can be considered an
observation with added terms, removed terms etc. The terms in the cluster can be
considered the attributes. We can then cluster observations from different snapshots in
order to determine which senses are likely to belong to the same object and be
evolutions of one another. An observation outside of a cluster can be considered similar
to the sense represented by the cluster, but not as an evolved version of that sense.
6.2.2 Summary
Several approaches for tracking the evolution of clusters have been investigated. Based on the structure of the terminology graph and its evolution with respect to noise, and on how fast clusters change and appear/disappear, a decision will be made for one of the approaches. If our data contain much noise, we will initially start with the methods introduced in [LCZ+08]; if the clusters seem stable, the methods in [PD+05] seem more
suitable. As mentioned in Section 6.1.1, methods that assume hard clustering are not
well suited for WSD. Moreover the internal structuring of the clusters should be taken
into consideration. It is appropriate to consider two clusters with almost the same
elements as different based on the edge structure of the cluster members.
On top of the clustering process, a method for describing the overall evolution is
needed. For this purpose, the model described in [MZ05] seems particularly suitable. It
is likely that the model is not directly applicable for describing terminology evolution, but
needs to be modified or extended for our purposes.
6.3 Terminology Evolution Module
The problem of automatically detecting terminology evolution can be split into three
different sub problems: terminology snapshot creation, merging of terminology
snapshots and mapping concepts to graphs. First, we need to identify and represent the
relation between terms and their intended meanings (concepts) at a given time. We call
such a representation a term-concept graph and a set of these a terminology snapshot. Such a snapshot is always based on a given document collection Dδti, which is the set of documents d taken from a domain δ in the time interval [ti−1, ti], where i = 0, . . . , n and ti ∈ T, with T being the set of timestamps.
6.3.1 Terminology Snapshot Creation
Each document dj " D" ti contains a set of terms w " W"ti. The set W"ti is an ideal domain
specific term set containing the complete set of terms, independent of the considered
corpora, that were ever used in domain " from time t0 since time ti. Since W#ti is not
known, we define its approximation W'#ti. At time t0 the set is empty and W'" ti = W'" ti!1 U
terms(D" ti ) for i = 1, . . . , N, where terms(D" ti ) = {w : (d w " d + d " D" ti }.
To represent the relation between terms and their meanings we introduce the notion of
concept and represent meanings as connections between term and concept nodes in a
graph. Let C be the universe of all concepts. The semantics of a term w " W" ti is
represented by connecting it to its concepts. The edges between terms and concepts
inherit a temporal annotation depending on the collection’s time point. For every term w
" W" ti, at least one term-concept edge has to exist.
We introduce the function φ as a representation of term-concept relations as follows:
φ : W × T → (W × P(C × P(T))).
P denotes a power set, i.e. the set of all subsets. Although φ only generates one
timestamp for each term-concept relation, we introduce the power set already at this
point to simplify terminology snapshot fusion. The term-concept relations defined by φ
can be seen as a graph consisting of a term and edges to all its concepts, referred to as a
term-concept graph.
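A minimal sketch of how a terminology snapshot could be held in memory is given below; the variable names mirror the definitions above, while the concept identifiers are placeholders that would in practice be produced by the word sense discrimination step, and the sense_of callback is an assumption of the example.

from collections import defaultdict

W_approx = set()        # W'_{t_i}: grows as W'_{t_i} = W'_{t_i-1} U terms(D_{t_i})
phi = defaultdict(set)  # term -> set of (concept_id, frozenset of timestamps)

def add_snapshot(documents, t_i, sense_of):
    """documents: iterable of term sets for the interval ending at t_i.
    sense_of: assumed callback mapping (term, document) to a concept id."""
    for d in documents:
        for w in d:
            W_approx.add(w)
            phi[w].add((sense_of(w, frozenset(d)), frozenset({t_i})))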
6.3.2 Terminology Snapshot Fusion
After we have created several separate terminology snapshots, we want to merge them
to detect terminology evolution. A term’s meaning has evolved if its concept relations
have changed from one snapshot to another.
The fusion of two terminology snapshots is in general more complicated than a simple
graph merging. For example, we might merge two concepts from the source snapshots
into a single concept in the target graph. As part of the fusion process we also need to
merge the timestamps of the edges. When term and concept are identical in both
snapshots, the new annotation is the union of both source annotations. Thus, we
represent the concept relations of a term w ∈ W as a set of pairs (c_i, {t_{i1}, …, t_{ik}}). To
shorten the notation we define T_i as the set of timestamps t_{ik}, and the pairs can be written
as (c_i, T_i). We note that a concept does not have to be continuously related to a term;
instead, the respective term meaning/usage can lose popularity and gain it again after
some time has passed. Therefore, T_i is not necessarily a set of consecutive timestamps.
We introduce the function ψ, which fuses two term-concept graphs and thereby represents
relations between concepts from different snapshots:
ψ : (W × P(C × P(T))) × (W × P(C × P(T))) → (W × P(C × P(T))).
It should be clear that the set of concepts in the resulting graph of ψ does not
necessarily have to be a subset of the set of concepts from the source graphs.
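The sketch below illustrates only the simple part of the fusion: when term and concept identifiers coincide in both snapshots, their timestamp annotations are unioned. Merging two source concepts into a new target concept, which ψ allows in general, is deliberately left out, and the term and concept identifiers in the example are invented for illustration.

def fuse(snap_a, snap_b):
    """snap_*: dict term -> dict(concept_id -> set of timestamps).
    Returns the fused snapshot with unioned timestamp annotations."""
    fused = {}
    for snapshot in (snap_a, snap_b):
        for term, concepts in snapshot.items():
            target = fused.setdefault(term, {})
            for concept, times in concepts.items():
                target.setdefault(concept, set()).update(times)
    return fused

a = {"wall": {"c_barrier": {2001}}}
b = {"wall": {"c_barrier": {2005}, "c_profile_page": {2009}}}
print(fuse(a, b))  # "wall" now carries both concepts with unioned timestamps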
6.3.3 Mapping Concepts to Terms
The graph resulting from snapshot fusion allows us to identify all concepts which have
been related to a given term over time. We cannot directly exploit these relations for
information retrieval; instead, we need to map the concepts back to the terms used to express
them. To represent this mapping, we introduce a function χ as follows:
χ : C → P(W × P(T))
For a given concept c, χ returns the set of terms used to express c, together with
timestamp sets which denote when the respective terms were in use.
The characteristics of χ clearly depend on the merging operation of the concepts
in ψ. For instance, in case two concepts are merged, the term assignment has to reflect
this merge.
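The concept-to-term mapping can be obtained by inverting the fused snapshot, as in the following sketch. The fireman/firefighter example is taken from the test set described in Section 6.4; the concept identifier and timestamps are invented.

def chi(fused):
    """fused: dict term -> dict(concept_id -> set of timestamps).
    Returns dict concept_id -> set of (term, frozenset of timestamps)."""
    inverted = {}
    for term, concepts in fused.items():
        for concept, times in concepts.items():
            inverted.setdefault(concept, set()).add((term, frozenset(times)))
    return inverted

fused = {"fireman": {"c_fire_service": {1990}},
         "firefighter": {"c_fire_service": {2005, 2008}}}
print(chi(fused)["c_fire_service"])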
6.4 Evaluation
To the best of our knowledge there are no published evaluation methods or benchmark
datasets to rely on. Therefore we have developed the following strategy for evaluating
the Terminology Evolution module developed in LiWA:
The starting point is the manual creation of a set of example terms that we call the
test set. This test set should contain approximately 60-80 terms for which there is evidence
of an evolution. Example terms are St. Petersburg or fireman/firefighter; for
more examples see [TIRNS08]. As a counter-reference we will include in
the test set a set of terms for which there is no evidence of evolution. Using the
LiWA Terminology Evolution Module we will identify how many of the evolving
terms we are able to find in this automatic fashion. Every term for which the module's
result (evolution found or not found) correctly corresponds to our knowledge of the term will
be considered a success.
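This first evaluation step therefore amounts to a simple detection rate over the test set, as sketched below; the function and variable names as well as the example data are illustrative.

def detection_accuracy(test_set, detected):
    """test_set: dict term -> True if an evolution is known to exist.
    detected: set of terms for which the module reports an evolution."""
    correct = sum(1 for term, evolves in test_set.items()
                  if (term in detected) == evolves)
    return correct / len(test_set)

test_set = {"St. Petersburg": True, "fireman": True, "bicycle": False}
print(detection_accuracy(test_set, detected={"St. Petersburg"}))  # 2 of 3 correct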
The procedure described above aims at discovering terminology evolution detectable in
the archive. Because this does not address information extraction or search, we will
also apply the following strategy:
Corresponding to the test set of terms described above, we will
manually identify the relevant documents present in the archive. These
documents will constitute our target set. In order to create a baseline, we use
only one of these query terms to search the target set and note how
many of the relevant results are returned. Then we extend the query term
with additional terms found by the Terminology Evolution Module and query the
target set again using this extended set of terms. If we are now able to retrieve
more of the relevant documents than with the baseline query, we
count this query as a success. We can also measure the percentage of successful
queries as well as an average success rate.
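This second evaluation can be summarised as a comparison of recall over the target set before and after query expansion. The sketch below assumes a search callback over the archive index is available; its name and signature are placeholders, not an existing LiWA interface.

def recall(retrieved, target_set):
    """Fraction of the manually identified target documents that were retrieved."""
    return len(retrieved & target_set) / len(target_set)

def evaluate_query(term, expansions, target_set, search):
    """search: assumed callback taking a set of query terms and returning
    a set of document identifiers from the archive."""
    baseline = recall(search({term}), target_set)
    extended = recall(search({term} | set(expansions)), target_set)
    return baseline, extended, extended > baseline  # success if recall improves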
This evaluation will measure two related aspects. Firstly, we measure how well our
module can model the evolution found in an archive; here the focus lies on finding all
evolutions present in the archive. Secondly, we measure to what extent the information
found by the module can aid information retrieval: are we able to return a higher
number of relevant documents for a query using the information found by the
Terminology Evolution Module?
6.5 Integration into the LiWA Architecture
The Terminology Evolution Module is subdivided into terminology extraction and tracing
of terminology evolution. Both sub-modules are integrated via UIMA pipelines as
presented in Figure 6.1. The terminology extraction sub-module is automatically
triggered when a crawl or a partial crawl is finished. The terminology evolution
sub-module is manually triggered by the archive curator, based on the crawl statistics
gathered during terminology extraction.
“The Unstructured Information Management Architecture (UIMA) is an architecture and
software framework for creating, discovering, composing and deploying a broad range
of multi-modal analysis capabilities and integrating them with search technologies.”
([UIMA])
Figure 6.1: Work flow for the Semantic Evolution module V1
[Diagram: crawler post-processing asynchronously hands finished (partial) crawls to a job queue via put(crawlId, List<newWarcFiles>, Boolean final); a first UIMA processing chain performs WARC extraction, POS tagging, lemmatization and co-occurrence analysis, gathers crawl statistics and writes to the TermEvolv DB; a second UIMA processing chain, triggered by the curator, performs word sense discrimination, cluster tracking and evolution detection.]
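The hand-over between crawler post-processing and the extraction chain in Figure 6.1 can be pictured as a simple asynchronous job queue. The sketch below mirrors the put(crawlId, newWarcFiles, final) call from the figure; the two worker functions are placeholders standing in for the actual UIMA processing chain and the crawl statistics update, not existing LiWA interfaces.

import queue
import threading

jobs = queue.Queue()

def run_terminology_extraction(crawl_id, warc_files):
    # Placeholder for the UIMA terminology extraction chain.
    print(f"extracting terminology for crawl {crawl_id}: {len(warc_files)} WARC files")

def update_crawl_statistics(crawl_id):
    # Placeholder for the crawl statistics gathering step.
    print(f"updating crawl statistics for crawl {crawl_id}")

def put(crawl_id, new_warc_files, final):
    """Called by crawler post-processing whenever new WARC files are ready."""
    jobs.put((crawl_id, list(new_warc_files), final))

def extraction_worker():
    while True:
        crawl_id, warc_files, final = jobs.get()
        run_terminology_extraction(crawl_id, warc_files)
        if final:
            update_crawl_statistics(crawl_id)
        jobs.task_done()

threading.Thread(target=extraction_worker, daemon=True).start()
put("crawl-042", ["example-00001.warc.gz"], final=True)
jobs.join()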
6.5.1 Terminology Extraction Pipeline
The WARC Collection Reader (WARC Extraction) extracts the text and time metadata
for each site archived in the input crawl. The POS (part-of-speech) Tagger is an
aggregate analysis engine from Dextract ([Dextract]). It consists of a tokenizer and a
language-independent part-of-speech tagger and lemmatizer ([TreeTagger]). In the Term
Extraction sub-module, we read the annotated sites and extract the lemmas and the
different parts of speech that were identified for the archived sites. After that,
we index the terms in a database (MySQL) index, see Figure 6.2. In the
Co-occurrence Analysis step we extract lemma or noun co-occurrence matrices for the indexed
crawl from the database index.
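As an illustration of the co-occurrence analysis step, the following sketch builds lemma co-occurrence counts from per-document lemma sequences; the fixed-size sliding window and the example data are assumptions made for the illustration, not the module's actual parameters.

from collections import Counter

def cooccurrence_counts(documents, window=5):
    """documents: list of lemma sequences for one crawl.
    Returns symmetric co-occurrence counts within the given window."""
    counts = Counter()
    for lemmas in documents:
        for i, w in enumerate(lemmas):
            for v in lemmas[i + 1:i + window]:
                if v != w:
                    counts[tuple(sorted((w, v)))] += 1
    return counts

docs = [["mouse", "click", "button", "screen"],
        ["mouse", "cheese", "trap"]]
print(cooccurrence_counts(docs).most_common(3))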
Figure 6.2: Database Terminology Index
6.5.2 Terminology Evolution Pipeline
After extracting the lemma co-occurrence matrices for crawls captured at
different moments in time, we cluster the lemmas for each time interval with a
curvature clustering algorithm. The clusters from different time intervals are then
analyzed and compared in order to detect term evolution.
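The curvature clustering step follows the idea of [DWL+05] and [Dor07]: nodes with a low clustering coefficient (typically ambiguous hub words) are removed from the co-occurrence graph, and the remaining connected components are taken as sense clusters. The sketch below uses networkx and an illustrative threshold; it is a sketch of the idea, not the module's actual implementation.

import networkx as nx

def curvature_clusters(cooccurrence_edges, threshold=0.3):
    """cooccurrence_edges: iterable of (lemma, lemma) pairs for one time interval."""
    g = nx.Graph()
    g.add_edges_from(cooccurrence_edges)
    curvature = nx.clustering(g)  # per-node clustering coefficient ("curvature")
    g.remove_nodes_from([n for n, c in curvature.items() if c < threshold])
    return [set(component) for component in nx.connected_components(g)]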
6.6 References
[BGBZ07] Jordan Boyd-Graber, David Blei, and Xiaojin Zhu. A Topic Model for Word Sense Disambiguation. 2007
[BNC] BNC Consortium: British National Corpus, http://www.natcorp.ox.ac.uk/corpus/index.xml.ID=products
[Brown] Brown University: Brown Corpus, http://khnt.hit.uib.no/icame/manuals/brown/INDEX.HTM
[DfML07] Koen Deschacht and Marie-Francine Moens. Text analysis for automatic image annotation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. East Stroudsburg: ACL, 2007
[dMD04] Marie-Catherine de Marneffe and Pierre Dupont. Comparative study of statistical word sense discrimination techniques, 2004
[Dor07] B. Dorow. A graph model for words and their meanings. PhD thesis, University of Stuttgart, 2007
[DW03] Beate Dorow and Dominic Widdows. Discovering corpus specific word senses. In EACL '03: Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics, Morristown, NJ, USA, 2003
[DWL+05] B. Dorow, D. Widdows, K. Ling, J.-P. Eckmann, D. Sergi, and E. Moses. Using curvature and Markov clustering in graphs for lexical acquisition and word sense discrimination. In MEANING-2005, 2nd Workshop organized by the MEANING Project, Trento, Italy, February 3-4, 2005
[Fer04] Olivier Ferret. Discovering word senses from a network of lexical cooccurrences. In COLING '04: Proceedings of the 20th international conference on Computational Linguistics, page 1326, Morristown, NJ, USA, 2004. Association for Computational Linguistics.
[HLS] Hard and Serve datasets (Leacock, Chodorow and Miller); Line dataset (Leacock, Towell and Voorhees). http://www.d.umn.edu/~tpederse/data.html
[KH07] Gabriela Kalna and Desmond J. Higham. A clustering coefficient for weighted networks, with application to gene expression data. 2007
[LADS06] Yaozhong Liang, Harith Alani, David Dupplaw, and Nigel Shadbolt. An approach to cope with ontology changes for ontology-based applications. The 2nd AKT Doctoral Symposium and Workshop, 2006
[LCZ+08] Yu-Ru Lin, Yun Chi, Shenghuo Zhu, Hari Sundaram, and Belle L. Tseng. FacetNet: a framework for analyzing communities and their evolutions in dynamic networks. In WWW '08: Proceedings of the 17th international conference on World Wide Web, pages 685-694, New York, NY, USA, 2008. ACM.
[Lin97] Dekang Lin. Using syntactic dependency as local context to resolve word sense ambiguity. In ACL-35: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 64-71, Morristown, NJ, USA, 1997. Association for Computational Linguistics.
[Lin98] Dekang Lin. Automatic retrieval and clustering of similar words. In Proceedings of the 17th international conference on Computational Linguistics, pages 768-774, Morristown, NJ, USA, 1998. Association for Computational Linguistics.
[LSB06] Esther Levin, Mehrbod Sharifi, and Jerry Ball. Evaluation of utility of LSA for word sense discrimination. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, New York City, USA, June 2006. Association for Computational Linguistics.
[Mil95] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 1995.
[MKWC04] Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. Finding predominant word senses in untagged text. In ACL '04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 279, Morristown, NJ, USA, 2004. Association for Computational Linguistics.
[MZ05] Qiaozhu Mei and ChengXiang Zhai. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, New York, NY, USA, 2005. ACM.
[Nav09] Roberto Navigli. Word sense disambiguation: A survey. ACM Comput. Surv., 2009.
[OST08] Satoshi Oyama, Kenichi Shirasuna, and Katsumi Tanaka. Identification of time-varying objects on the web. In JCDL '08: Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries, New York, NY, USA, 2008. ACM.
[PB97] Ted Pedersen and Rebecca Bruce. Distinguishing word senses in untagged text. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 1997
[PBV07] Gergely Palla, Albert-Laszlo Barabasi, and Tamas Vicsek. Quantifying social group evolution. Nature, April 2007.
[PD+05] Gergely Palla, Imre Derenyi, Illes Farkas, and Tamas Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 2005.
[PL02] Patrick Pantel and Dekang Lin. Discovering word senses from text. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2002.
[PP04] Amruta Purandare and Ted Pedersen. Word Sense Discrimination by Clustering Contexts in Vector and Similarity Spaces. In Proceedings of CoNLL-2004, Boston, MA, USA, 2004, pp. 41-48.
[Rap04] Reinhard Rapp. A practical solution to the problem of automatic word sense induction. In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions, page 26, Morristown, NJ, USA, 2004. Association for Computational Linguistics.
[RKZ08] Sergio Roa, Valia Kordoni, and Yi Zhang. Mapping between compositional semantic representations and lexical semantic resources: Towards accurate deep semantic parsing. In Proceedings of ACL-08: HLT, Short Papers, Columbus, Ohio, June 2008. Association for Computational Linguistics.
[Sch98] Hinrich Schütze. Automatic word sense discrimination. Computational Linguistics, 1998.
[SemCor] Princeton University. SemCor, http://multisemcor.itc.it/semcor.php
[Senseval] Senseval: description at http://www.senseval.org/; Senseval 1, Senseval 2 and Senseval 3 data for download at http://www.d.umn.edu/~tpederse/data.html
[SNTS06] Myra Spiliopoulou, Irene Ntoutsi, Yannis Theodoridis, and Rene Schult. MONIC: modeling and monitoring cluster transitions. In KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, New York, NY, USA, 2006. ACM.
[TIRNS08] Nina Tahmasebi, Tereza Iofciu, Thomas Risse, Claudia Niederee, and Wolf Siberski. Terminology Evolution in Web Archiving: Open Issues. In Proc. of the 8th International Web Archiving Workshop in conjunction with ECDL 2008, Aarhus, Denmark
[TreeTagger] TreeTagger – part-of-speech tagger. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
[UIMA] UIMA Overview. Apache, http://incubator.apache.org/uima/
[vD00] Stijn van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.
[Yar95] David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd annual meeting on Association for Computational Linguistics, Morristown, NJ, USA, 1995. Association for Computational Linguistics.