Open Source Web Information Retrieval
COMPIÈGNE, FRANCE • SEPTEMBER 19, 2005
Edited by Michel Beigbeder and Wai Gen Yee
Sponsors
In Conjunction with the
2005 IEEE/WIC/ACM International Conference on
Web Intelligence and Intelligent Agent Technology
ISBN: 2-913923-19-4
OSWIR 2005
First workshop on
Open Source Web Information Retrieval
Edited by Michel Beigbeder and Wai Gen Yee
ISBN: 2-913923-19-4
OSWIR 2005 Organization
Workshop chairs
Michel Beigbeder
G2I department
École Nationale Supérieure des Mines de Saint-Étienne, France
Wai Gen Yee
Department of Computer Science
Illinois Institute of Technology, USA
Program Committee
Abdur Chowdhury, America Online Search and Navigation, USA
Ophir Frieder, Illinois Institute of Technology, USA
David Grossman, Illinois Institute of Technology, USA
Donald Kraft, Louisiana State University, USA
Clement Yu, University of Illinois at Chicago, USA
Reviewers
Jefferson Heard, Illinois Institute of Technology, USA
Dongmei Jia, Illinois Institute of Technology, USA
Linh Thai Nguyen, Illinois Institute of Technology, USA
OSWIR 2005
Open Source Web Information Retrieval
The World Wide Web has grown to be a primary source of information for millions of people. Due to the size of the Web,
search engines have become the major access point for this information. However, "commercial" search engines use hidden
algorithms that put the integrity of their results in doubt, so there is a need for open source Web search engines.
On the other hand, the Information Retrieval (IR) research community has a long history of developing ideas, models and
techniques for finding results in data sources, but finding one's way through all of them is not an easy task. Moreover, their
applicability to the Web search domain is uncertain.
The goal of the workshop is to survey the fundamentals of the IR domain and to determine the techniques, tools, or models
that are applicable to Web search.
This first workshop was organized by Michel BEIGBEDER from École Nationale Supérieure des Mines de Saint-Étienne1, France, and Wai Gen YEE from Illinois Institute of Technology2, USA. It was held on September 19th, 2005 at the UTC (Compiègne University of Technology3), in conjunction with WI and IAT 20054, the 2005 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology.
We want to thank all the authors of the submitted papers, the members of the program committee: Abdur Chowdhury, Ophir Frieder, David Grossman, Donald Kraft, Clement Yu, and the reviewers: Jefferson Heard, Dongmei Jia, and Linh Thai
Nguyen.
Michel Beigbeder and Wai Gen Yee
1 http://www.emse.fr/
2 http://www.iit.edu/
3 http://www.hds.utc.fr/
4 http://www.hds.utc.fr/WI05/
Table of contents
Pre-processing Text for Web Information Retrieval Purposes by Splitting Compounds into
their Morphemes
Sven Abels and Axel Hahn
7
Fuzzy Querying of XML documents – The Minimum Spanning Tree
Abdeslame Alilaouar and Florence Sedes
11
Link Analysis in National Web Domains
Ricardo Baeza-Yates and Carlos Castillo
15
Web Document Models for Web Information Retrieval
Michel Beigbeder
19
Static Ranking of Web Pages, and Related Ideas
Wray Buntine
23
WIRE: an Open Source Web Information Retrieval Environment
Carlos Castillo and Ricardo Baeza-Yates
27
Nutch: an Open-Source Platform for Web Search
Doug Cutting
31
Towards Contextual and Structural Relevance Feedback in XML Retrieval
Lobna Hlaoua and Mohand Boughanem
35
An Extension to the Vector Model for Retrieving XML Documents
Fabien Laniel and Jean-Jacques Girardot
39
Do Search Engines Understand Greek or User Requests "Sound Greek" to them?
Fotis Lazarinis
43
Use of Kolmogorov Distance Identification of Web Page Authorship, Topic and Domain
David Parry
47
Searching Web Archive Collections
Michael Stack
51
XGTagger, an Open-Source Interface Dealing with XML Contents
Xavier Tannier, Jean-Jacques Girardot and Mihaela Mathieu
55
The Lifespan, Accessibility and Archiving of Dynamic Documents
Katarzyna Wegrzyn-Wolska
59
SYRANNOT: Information Retrieval Assistance System on the Web by Semantic Annotations
Re-use
Wiem Yaiche Elleuch, Lobna Jeribi and Abdelmajid Ben Hamadou
63
Search in Peer-to-Peer File-Sharing System: Like Metasearch Engines, But Not Really
Wai Gen Yee, Dongmei Jia and Linh Thai Nguyen
67
Pre-processing text for web information retrieval purposes
by splitting compounds into their morphemes
Sven Abels, Axel Hahn
Department of Business Information Systems,
University of Oldenburg, Germany
{ abels | hahn } @ wi-ol.de
Abstract

In web information retrieval, the interpretation of text is crucial. In this paper, we describe an approach to ease the interpretation of compound words (i.e., words that consist of other words, such as "handshake" or "blackboard"). We argue that in the web information retrieval domain a fast decomposition of those words is necessary, one that splits as many words as possible, while we believe that, on the other side, a small error rate is acceptable.
Our approach allows the decomposition of compounds within a very reasonable amount of time. It is language independent and currently available as an open source realization.

1. Motivation

In web information retrieval, it is often necessary to interpret natural text. For example, in the area of web search engines, a large amount of text has to be interpreted and made available for requests. In this context, it is beneficial to not only provide a full text search of the text as it is on the web page but to analyze the text of a website in order to, e.g., provide a classification of web pages in terms of defining its categories (see [1], [2]).
A major problem in this domain is the processing of natural language (NLP; see e.g. [3], [4] for some detailed descriptions). In most cases, text is pre-processed before it is analyzed. A very popular method is text stemming, which creates a basic word form out of each word. For example, the word "houses" is replaced with "house", etc. (see [5]). A popular approach for performing text stemming is the Porter stemmer [6]. Apart from stemming text, the removal of so-called "stop words" is another popular approach for pre-processing text (see [7]). It removes all unnecessary words such as "he", "well", "to", etc. While both approaches are well established and quite easy to implement, the problem of splitting compounds into their morphemes is more difficult and less often implemented. Compounds are words that consist of two or more other words (morphemes). They can be found in many of today's languages. For example, the English word "handshake" is composed of the two morphemes "hand" and "shake". In German, the word "Laserdrucker" is composed of the words "Laser" (laser) and "Drucker" (printer). In Spanish we can find "Ferrocarril" (railway), consisting of "ferro" (iron) and "carril" (lane).
Splitting compounds into their morphemes is extremely useful when preparing text for further analysis (see e.g. [10]). This is especially true for the area of web information retrieval because in those cases one often has to deal with a huge amount of text information located on a large number of websites. Splitting compounds makes it easier to detect the meaning of a word. For example, when looking for synonyms, a decomposition of compound words will help because it is usually easier to find word relations and synonyms of morphemes than of the whole compound word.
Another important advantage of splitting compounds is the capability of stemming compounds. Most stemming algorithms are able to stem compounds correctly if their last morpheme differs in its grammatical case or in its grammatical number. For example, "sunglasses" will correctly be stemmed to "sunglass" by most algorithms. However, there are cases where stemming does not work correctly for compounds. This happens whenever the first morpheme changes its case or number. For example, the first morpheme in "Götterspeise" (English: jelly) is plural, and the compound will therefore not be stemmed correctly. Obviously, this will result in problems when processing the text for performing, e.g., text searches.
A nice side effect of splitting compounds is that the decomposition can be used to get a unified way of processing words that might be spelled differently. For example, one text might use "containership", while another one uses "container ship" (two words). Using a decomposition will lead to a unified representation.
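As an illustration of where compound splitting sits among these pre-processing steps, here is a minimal sketch (not from the paper): the tiny stop-word list and the splitCompound placeholder are assumptions made only for the example.

```java
import java.util.*;

// Minimal pre-processing pipeline sketch: lower-case the text, drop stop words,
// then split each remaining token into its morphemes.
public class Preprocess {
    // Tiny illustrative stop-word list; a real list would be much larger.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("he", "well", "to", "the"));

    // Placeholder for a compound splitter such as jWordSplitter; here it simply
    // returns the token unchanged.
    static List<String> splitCompound(String token) {
        return Collections.singletonList(token);
    }

    public static List<String> preprocess(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            result.addAll(splitCompound(token)); // e.g. "containership" -> ["container", "ship"]
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(preprocess("He sent the containership to the harbour"));
    }
}
```

With a real splitter plugged in, "containership" and "container ship" would map to the same token sequence, giving the unified representation mentioned above.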
2. Difficulties and Specific Requirements
Splitting compounds in the domain of web information
retrieval has some specific requirements. Since web
information retrieval tasks usually have to deal with a
large amount of text information, an approach for
splitting compounds into their morphemes has to provide a
high speed in order to keep the approach applicable in practice. Additionally, a low failure rate is of course crucial
for the success. Hence, it is necessary to use an approach
that provides a very high speed with an acceptable
amount of errors.
Another specific requirement for applying a
decomposition of words in web information retrieval is
the large number of proper names and nouns that are concatenated with proper nouns or figures. For example, we
might find words such as “MusicFile128k” or
“FootballGame2005”. Because of that, it is necessary to
use an approach that can deal with unknown strings at the
end or at the beginning of a compound.
The main difficulty in splitting compounds is of course the detection of the morphemes. However, even when detecting a pair of morphemes, it does not mean that there is a single splitting that is correct for a word. For example, the German word "Wachstube" has two meanings and could be decomposed into "Wachs" and "Tube" ("wax tube") or into "Wach" (guard) and "Stube" (house), which means "guardhouse". Obviously, both decompositions are correct but have a different meaning.
In the following section, we will describe an approach for realizing a decomposition of compounds into morphemes, which is designed for dealing with a large amount of text in order to be suited for web information retrieval. In this approach, we focus on providing a high speed with a small failure rate, which we believe is acceptable.

3. Description of the Approach

In our approach, we sequentially split compound words in three phases:
1. direct decomposition of a compound,
2. truncation of the word from left to right, and
3. truncation of the word from right to left.
In the first phase, we try to directly split the composed word by using a recursive method findTupel, which aims at detecting the morphemes of the word and returns them as an ordered list. In case of not being able to completely decompose the word, we truncate the word by removing characters starting at the left side of the word. After removing a character, we repeat the recursive findTupel method. If this does not lead to a successful decomposition, we use the same methodology in the third step to truncate the word from right to left. This enables us to successfully split the word "HouseboatSeason2005" into the tokens { "House", "Boat", "Season", "2005" } as discussed in the last section.
Before starting with the analysis of the word, all non-alphanumeric characters are removed and the word is transformed into lower case.
The main task of our approach is performed in a recursive way. It is realized as a single findTupel method with one parameter, which is the current compound that should be decomposed into morphemes. In case this word is shorter than 3 characters (or null), we simply return the word as it is. In all other cases, it is decomposed into a left part and a right part in a loop. Within each loop, the right part gets one character longer. For example, the word "houseboat" will be handled like this:

Loop No   Left part   Right part
1         Houseboa    t
2         Housebo     at
3         Houseb      oat
4         House       boat
...       ...         ...
Table 1. Decomposition

Within each loop, it is checked whether the right part is a meaningful word that appears in a language-specific wordlist or not. In our implementation, we provided a wordlist of 206,877 words containing different words in singular and plural. In case the right part represents a word of this wordlist, it is checked whether the left part can still be decomposed. In order to do this, the findTupel method is called again with the left part as a new parameter (recursively). In case the right part never represents a valid word, the method returns a null value. If the recursive call returns a value different from null, its result is added to a resulting list, together with the right part. Otherwise the loop continues. This ensures that the shortest decomposition of the compound is returned.
For some languages, compounds are composed by adding a connecting character between the morphemes. For example, in the German language, one can find an "s" as a connection between words. In order to consider those connecting characters, they are removed when checking whether the right part is a valid word or not.
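The following Java sketch (not the published jWordSplitter code) illustrates the recursion and the three phases described above; the tiny wordlist, the termination check for a left part that is itself a known word, the minimum morpheme length and the handling of the connecting "s" are assumptions based on this description.

```java
import java.util.*;

// Sketch of the recursive splitter described above. WORDS stands for the
// language-specific wordlist (206,877 entries in the paper); MIN_LEN is the
// minimum morpheme length used to avoid splits such as "Las|er|druck|er".
public class CompoundSplitter {
    static final Set<String> WORDS = new HashSet<>(List.of("house", "boat", "season", "hand", "shake"));
    static final int MIN_LEN = 4;

    // Phase 1: direct recursive decomposition; returns null if the word cannot
    // be completely decomposed into known morphemes.
    static List<String> findTupel(String word) {
        if (word == null || word.isEmpty()) return new ArrayList<>();
        if (word.length() < 3) return new ArrayList<>(List.of(word)); // paper: return very short words as they are
        if (WORDS.contains(word)) return new ArrayList<>(List.of(word)); // assumption: a known left part ends the recursion
        // The right part gets one character longer in every loop iteration.
        for (int split = word.length() - 1; split > 0; split--) {
            String left = word.substring(0, split);
            String right = word.substring(split);
            // Strip a possible connecting character (e.g. the German "s") before the lookup.
            if (right.startsWith("s") && WORDS.contains(right.substring(1))) right = right.substring(1);
            if (right.length() >= MIN_LEN && WORDS.contains(right)) {
                List<String> leftParts = findTupel(left);
                if (leftParts != null) {
                    leftParts.add(right);
                    return leftParts;
                }
            }
        }
        return null; // no complete decomposition found
    }

    // Phases 2 and 3: truncate from the left, then from the right, keeping the
    // unknown prefix/suffix (such as "2005") as a token of its own.
    static List<String> split(String word) {
        String w = word.toLowerCase().replaceAll("[^\\p{L}\\p{Nd}]", "");
        List<String> parts = findTupel(w);
        for (int i = 1; parts == null && i <= w.length() - 3; i++) {
            List<String> rest = findTupel(w.substring(i));
            if (rest != null) { parts = new ArrayList<>(List.of(w.substring(0, i))); parts.addAll(rest); }
        }
        for (int i = w.length() - 1; parts == null && i >= 3; i--) {
            List<String> rest = findTupel(w.substring(0, i));
            if (rest != null) { parts = new ArrayList<>(rest); parts.add(w.substring(i)); }
        }
        return parts != null ? parts : new ArrayList<>(List.of(w));
    }

    public static void main(String[] args) {
        System.out.println(split("houseboat"));           // [house, boat]
        System.out.println(split("HouseboatSeason2005")); // [house, boat, season, 2005]
    }
}
```

A production splitter would need a full wordlist (kept in a hash table, as described in the next section) and more careful handling of connecting characters, but the control flow follows the three phases above.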
4. Managing problems
Basically, the success rate is highly influenced by the
quality and completeness of the language-specific wordlist. The approach benefits from a large number of words.
In order to ensure a high speed, we use a hash table for
managing the wordlist.
A major problem of this approach is the existence of
very small words in a language that might lead to a wrong
decomposition. For example the word “a” in English or
the word “er” (he) in German can lead to results that
change the meaning of the word. For example, our
approach would decompose the word “Laserdrucker” into
“Las”, “Er”, “Druck”, “Er” (read-he-print-he) instead of
"Laser", "Drucker" (laser printer). In order to avoid this, we use a minimum length of words, which in the current implementation is a length of 4. This made the problem almost disappear in practical scenarios.

5. Related work

There are of course several other approaches that can be used for decomposing compounds. One example is the software Machinese Phrase Tagger from Connexor1, which is a "word tagger" used for identifying the type of a word. It can, however, also be used to identify the morphemes of a word but is quite slow on large texts. Another example is SiSiSi (Si3) as described in [8] and [9]. It was not developed for decomposing compounds but for performing hyphenation. It does, however, identify main hyphenation areas for each word, which in most cases are identical with the morphemes of a compound. More examples can be found in [10] and [11].2
Existing solutions were, however, not developed for usage in the web information retrieval domain. This means that many of them have a low failure rate but also need a lot of time compared to our approach. In the following section we will therefore perform an evaluation, analyzing the time and quality of our approach.

6. Realization and Evaluation

The approach was realized in Java with the name jWordSplitter and was published as an open source solution using the GPL license. We used the implementation to perform an evaluation of the approach. We analyzed the implementation based on (i) its speed and (ii) its quality, since we think that both are important for web information retrieval.
The speed of our approach was measured in three different cases.
• Using compounds with a large number of morphemes (i.e. consisting of 5 or more morphemes). In this case, our approach was able to split about 50,000 words per minute.
• Using compounds that consist of 1 or 2 morphemes (e.g. "handwriting"). In this case, our approach was able to split about 150,000 words per minute.
• Using words that do not make any sense and cannot be decomposed at all (e.g. "Gnuavegsdfweeerr"). In this test, jWordSplitter was able to process 120,000 words per minute.
In order to test the quality of the approach, we took a list of 200 randomly chosen compounds, consisting of 456 morphemes. The average time for splitting a word was about 80 milliseconds. Within this test set, jWordSplitter was unable to completely split about 5% of the words. Another 6% were decomposed incorrectly. Hence, 89% were decomposed completely and without any errors, and about 94% were either decomposed correctly or at least not decomposed incorrectly.
We performed the same test with SiSiSi, which took about twice as long and which was unable to split 16% of the words. However, its failure rate was slightly lower (3%).

7. Conclusion

We have presented an approach which we argue is well suited as a method for preparing text information in web information retrieval scenarios. The approach offers a good compromise between failure rate, speed and ability to split words. We think that in this domain it is most important to split as many words as possible in a short period of time, while we believe that a small amount of incorrect decompositions is acceptable for achieving this.
Our approach can be used to ease the interpretation of text. It could for example be used in search engines and classification methods.

8. Further research and language independence

In order to test the effectiveness of our approach, we intend to integrate it into the "Apricot" project, which is intended to offer a complete open source based solution for finding product information on the internet. We therefore provide a way of analyzing product data automatically in order to allow a fast search of products and in order to classify the discovered information. The integration of jWordSplitter in this real-world project will help to evaluate its long-term application and will hopefully also lead to an identification of problematic areas of the approach.
An interesting question is the language independence of the approach. jWordSplitter itself is designed to be fully language independent. It is obvious that its benefit does, however, vary between languages. While word splitting is very important for languages with many compounds, it might lead to fewer advantages in other languages. We therefore intend to extend Lucene3, an open source search engine, by preprocessing text with jWordSplitter. Afterwards, we will rate its search results before and after the integration. We intend to repeat this test for different languages in order to get a language-dependent statement about the benefits of jWordSplitter.

1 http://www.connexor.com
2 Please note that pure hyphenation is not the same as word splitting. It is only equivalent in those cases where each root word has only one syllable.
3 http://lucene.apache.org

9. Acknowledgement and Sources

The current implementation contains a German wordlist. Since the approach itself is language independent, it can be used for other languages as well if a wordlist is provided.
The sources for jWordSplitter are available online as an open source project using the GPL at: http://www.wi-ol.de/jWordSplitter

10. References

[1] Eliassi-Rad, T.; Shavlik, J. W.: Using a Trained Text Classifier to Extract Information, Technical Report, University of Wisconsin, 1999
[2] Jones, R.; McCallum, A.; Nigam, K.; Riloff, E.: Bootstrapping for Text Learning Tasks, In IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, pp. 52-63, 1999
[3] Wilcox, A.; Hripcsak, G.: Classification Algorithms Applied to Narrative Reports, in: Proceedings of the AMIA Symposium, 1999
[4] Harman, D.; Schäuble, P.; Smeaton, A.: Document Processing, in: Survey of the state of the art in human language technology, Cambridge Univ. Press, 1998
[5] Jones, S.; Willet, K.; Willet, P.: Readings in Information Retrieval, Morgan Kaufmann, 1997
[6] Porter, M.F.: An algorithm for suffix stripping, Program, 14(3), 1980
[7] Heyer, G.; Quasthoff, U.; Wolff, C.: Möglichkeiten und Verfahren zur automatischen Gewinnung von Fachbegriffen aus Texten. In: Proceedings des Innovationsforums Content Management, 2002
[8] Kodydek, G.: A Word Analysis System for German Hyphenation, Full Text Search, and Spell Checking, with Regard to the Latest Reform of German Orthography. In: Proceedings of the Third International Workshop on Text, Speech and Dialogue (TSD 2000), Springer-Verlag, 2000
[9] Kodydek, G.; Schönhacker, M.: Si3Trenn and Si3Silb: Using the SiSiSi Word Analysis System for Pre-Hyphenation and Syllable Counting in German Documents, Proceedings of the 6th International Conference on Text, Speech and Dialogue (TSD 2003), Springer-Verlag, 2003
[10] Andersson, L.: Performance of Two Statistical Indexing Methods, with and without Compound-word Analysis, www.nada.kth.se/kurser/kth/2D1418/uppsatser03/LindaAndersson_compound.pdf, 2003
[11] Neumann, G.; Piskorski, J.: A Shallow Text Processing Core Engine. In: Proceedings of Online 2002, 25th European Congress Fair for Technical Communication, 2002
Fuzzy Querying of XML documents – The Minimum Spanning Tree
Abdeslame Alilaouar and Florence Sedes
[Pages 11-14: the text of this paper could not be recovered from the transcription.]
Link Analysis in National Web Domains
Ricardo Baeza-Yates
ICREA Professor
University Pompeu Fabra
[email protected]

Carlos Castillo
Department of Technology
University Pompeu Fabra
[email protected]

Abstract

The Web can be seen as a graph in which every page is a node, and every hyper-link between two pages is an edge. This Web graph forms a scale-free network: a graph in which the distribution of the degree of the nodes is very skewed. This graph is also self-similar, in the sense that a small part of the graph shares most properties with the entire graph.
This paper compares the characteristics of several national Web domains, by studying the Web graph of large collections obtained using a Web crawler; the comparison unveils striking similarities between the Web graphs of very different countries.

1 Introduction

Large samples from specific communities, such as national domains, have a good balance between diversity and completeness. They include pages inside a common geographical, historical and cultural context that are written by diverse authors in different organizations. National Web domains also have a moderate size that allows good accuracy in the results; because of this, they have attracted the attention of several researchers.
In this paper, we study eight national domains. The collections studied include four collections obtained using WIRE [3]: Brazil (BR domain) [18, 15], Chile (CL domain) [1, 8, 4], Greece (GR domain) [12] and South Korea (KR domain) [7]; three collections obtained from the Laboratory of Web Algorithmics1: Indochina (KH, LA, MM, TH and VN domains), Italy (IT domain) and the United Kingdom (UK domain); and one collection obtained using Akwan [10]: Spain (ES domain) [6]. Our 104-million page sample is less than 1% of the indexable Web [13] but presents characteristics that are very similar to those of the full Web.
Table 1 summarizes the characteristics of the collections. The number of unique hosts was measured by the ISC2; the last column is the number of pages actually downloaded.

Table 1. Characteristics of the collections.
Collection     Year   Available hosts [mill] (rank)   Pages [mill]
Brazil         2005   3.9  (11th)                      4.7
Chile          2004   0.3  (42nd)                      3.3
Greece         2004   0.3  (40th)                      3.7
Indochina      2004   0.5  (38th)                      7.4
Italy          2004   9.3  (4th)                      41.3
South Korea    2004   0.2  (47th)                      8.9
Spain          2004   1.3  (25th)                     16.2
U.K.           2002   4.4  (10th)                     18.5

By observing the number of available hosts and the downloaded pages in each collection, we consider that most of them have a high coverage. The collections of Brazil and the United Kingdom are smaller samples in comparison with the others, but their sizes are large enough to show results that are consistent with the others.
Zipf's law: the graph representing the connections between Web pages has a scale-free topology. Scale-free networks, as opposed to random networks, are characterized by an uneven distribution of links. For a page p, we have Pr(p has k links) ∝ k^-θ. We find this distribution on the Web in almost every aspect, and it is the same distribution found by economist Vilfredo Pareto in 1896 for the distribution of wealth in large populations, and by George K. Zipf in 1932 for the frequency of words in texts. This distribution later turned out to be applicable to several domains [19] and was called by Zipf the law of minimal effort.
Section 2 studies the Web graph, and section 3 the Hostgraph. The last section presents our conclusions.

1 Laboratory of Web Algorithmics, Dipartimento di Scienze dell'Informazione, Università degli studi di Milano, <http://law.dsi.unimi.it/>.
2 Internet Systems Consortium's domain survey, <http://www.isc.org/ds/>.
2 Web graph

2.1 Degree

The distributions of the indegree and outdegree are shown in Figure 1; both are consistent with a power-law distribution. When examining the distribution of outdegree, we found two different curves: one for smaller outdegrees (less than 20 to 30 out-links) and another one for larger outdegrees. They both show a power-law distribution and we estimated the exponents for both parts separately.
For the in-degree, the average power-law exponent θ we observed was 1.9 ± 0.1; this can be compared with the value of 2.1 observed by other authors [9, 11] in samples of the global Web. For the out-degree, the exponent was 0.6 ± 0.2 for small outdegrees, and 2.8 ± 0.8 for large out-degrees; the latter can be compared with the parameters 2.7 [9] and 2.2 [11] found for samples of the global Web.

Figure 1. Histograms of the indegree and outdegree of Web pages, including a fit for a power-law distribution.
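As an illustration of how such exponents can be obtained (this sketch is not from the paper; the edge-list input format and the simple log-log least-squares fit are assumptions), one can count in-degrees from a page-level edge list and fit a line to the log-log histogram:

```java
import java.util.*;

// Build an indegree histogram from (source, target) edges and estimate the
// exponent theta of P(indegree = k) ~ k^(-theta) by least squares in log-log space.
public class DegreeHistogram {
    public static double fitExponent(List<String[]> edges) {
        Map<String, Integer> indegree = new HashMap<>();
        for (String[] e : edges) indegree.merge(e[1], 1, Integer::sum);

        // Histogram: how many pages have indegree k (k >= 1).
        Map<Integer, Integer> hist = new TreeMap<>();
        for (int k : indegree.values()) hist.merge(k, 1, Integer::sum);

        // Least-squares fit of log(count) = a - theta * log(k).
        double n = 0, sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (Map.Entry<Integer, Integer> h : hist.entrySet()) {
            double x = Math.log(h.getKey()), y = Math.log(h.getValue());
            n++; sx += x; sy += y; sxx += x * x; sxy += x * y;
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        return -slope; // theta
    }

    public static void main(String[] args) {
        List<String[]> edges = Arrays.asList(
                new String[]{"a", "b"}, new String[]{"c", "b"}, new String[]{"d", "b"},
                new String[]{"a", "c"}, new String[]{"d", "c"},
                new String[]{"b", "d"}, new String[]{"a", "e"});
        System.out.printf("estimated theta = %.2f%n", fitExponent(edges)); // toy estimate
    }
}
```

On real collections a maximum-likelihood estimator on the tail is usually preferred over this simple regression, but the sketch conveys the kind of fit such exponents refer to.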
2.2 Ranking
One of the main algorithms for link-based ranking of
Web pages is PageRank [16]. We calculated the PageRank
distribution for several collections and found a power-law in
the distribution of the obtained scores, with average exponent 1.86 ± 0.06. In theory, the PageRank exponent should
be similar to the indegree exponent [17] (the value they
measured for the exponent was 2.1), and this is indeed the
case. The distribution of PageRank values can be seen in
Figure 2.
We also calculated a static version of the HITS scores
[14], counting only external links and calculating the scores
in the whole graph, instead of only on a set of pages.
The tail of the distribution of authority-score also follows
a power law. In the case of hub-score, it is difficult to assert
that the data follows a power-law because the frequencies
seem to be much more dispersed. The average exponent
observed was 3.0 ± 0.5 for hub score, and 1.84 ± 0.01 for
authority score.
3 Hostgraph
We studied the hostgraph [11], that is, the graph created by collapsing all the nodes representing Web pages of the same Web site into a single node representing the Web site.
The hostgraph is a graph in which there is a node for each
Web site, and two nodes A and B are connected iff there is
at least one link on site A pointing to a page in site B. In
this section, we consider only the collections from which
we have a hostgraph.
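A minimal sketch of this collapsing step (an illustration, not the authors' code; the URL-to-host mapping and the edge-list input are assumptions):

```java
import java.net.URI;
import java.util.*;

// Collapse a page-level edge list into a hostgraph: one node per Web site (host),
// and an edge A -> B iff at least one page on host A links to a page on host B.
public class HostGraph {
    public static Map<String, Set<String>> build(List<String[]> pageEdges) {
        Map<String, Set<String>> hostEdges = new HashMap<>();
        for (String[] e : pageEdges) {
            String a = host(e[0]), b = host(e[1]);
            if (a == null || b == null || a.equals(b)) continue; // keep only inter-site links
            hostEdges.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        }
        return hostEdges;
    }

    static String host(String url) {
        try {
            return URI.create(url).getHost();
        } catch (IllegalArgumentException ex) {
            return null; // malformed URL
        }
    }

    public static void main(String[] args) {
        List<String[]> edges = Arrays.asList(
                new String[]{"http://a.example.org/p1", "http://b.example.org/x"},
                new String[]{"http://a.example.org/p2", "http://b.example.org/y"},   // duplicate site-level link
                new String[]{"http://a.example.org/p3", "http://a.example.org/p1"}); // internal link, ignored here
        System.out.println(build(edges)); // {a.example.org=[b.example.org]}
    }
}
```

Internal links, which the next section measures separately, are simply skipped in this sketch.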
Figure 2. Histograms of the scores using PageRank (top), hubs (middle) and authorities (bottom).
3.1 Degree

The average indegree per Web site (average number of different Web sites inside the same country linking to a given Web site) was 3.5 for Brazil, 1.2 for Chile, 1.6 for Greece, 37.0 for South Korea and 1.5 for Spain. The histogram of indegree is consistent with a Zipfian distribution, with parameter 1.8 ± 0.3.
By manual inspection we observed that in Brazil and especially in South Korea, there is a significant use (and abuse) of DNS wildcarding. DNS wildcarding is a way of configuring DNS servers so they reply with the same IP address no matter which host name is used in a DNS query.
The average outdegree per Web site (average number of different Web sites inside the same country linked by a given Web site) was 2.2 for Brazil, 2.4 for Chile, 4.8 for Greece, 16.5 for South Korea and 11.2 for Spain. The distribution of outdegree also exhibits a power law with parameter 1.6 ± 0.3.
We also measured the number of internal links, that is, links going to pages inside the same Web site. We normalized this by the number of pages in each Web site, to be able to compare values. We observed a combination of two power-law distributions: one for Web sites with up to 10 internal links per Web page on average, and one for Web sites with more internal links per Web page. For the sites with less than 10 internal links per page on average, the parameter for the power law was 1.1 ± 0.3, and for sites with more internal links per page on average, 3.0 ± 0.3.

3.2 Web structure

Broder et al. [9] proposed a partition of the Web graph based on the relationship of pages with the largest strongly connected component (SCC) of the graph. The pages in the largest strongly connected component belong to the category MAIN. All the pages reachable from MAIN by following links forwards belong to the category OUT, and by following links backwards to the category IN. The rest of the Web that is weakly connected (disregarding the direction of links) to MAIN is in a component called TENDRILS.
In [2] we showed that this macroscopic structure is similar at the hostgraph level: the hostgraphs we examined are scale-free networks and have a giant strongly connected component. The distribution of the sizes of their strongly connected components is shown in Figure 3.
Figure 3. Histograms of the sizes of SCCs.
The parameter for the power-law distribution was 2.7 ±
0.7. In Chile, Greece and Spain, a sole giant SCC appears
having at least 2 orders of magnitude more Web sites than
the second largest SCC component. In the case of Brazil,
there are two giant SCCs. The larger one is a “natural” one,
containing Web sites from different domains. The second largest is an "artificial" one, containing only Web sites under
a domain that uses DNS wildcarding to create a “link farm”
(a strongly connected community of mutual links). In the
case of South Korea, we detected at least 5 large link farms.
Regarding the Web structure, the distribution of sites among the components in general gives the component called OUT a large
share. If we do not consider sites that are weakly connected
to MAIN, IN has on average 8% of the sites, MAIN 28%,
OUT 58% and TENDRILS 6%. The sites that are disconnected from MAIN are 40% on average, but contribute less
than 10% of the pages.
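A sketch of how such a bow-tie partition can be computed once the giant SCC is known (an illustration, not the authors' code; the adjacency-list representation and taking MAIN as input are assumptions):

```java
import java.util.*;

// Partition the nodes of a directed graph into MAIN, IN, OUT, TENDRILS and
// DISCONNECTED, given the giant strongly connected component (MAIN).
public class BowTie {
    public static Map<String, Set<String>> partition(Map<String, Set<String>> fwd, Set<String> main) {
        Map<String, Set<String>> bwd = reverse(fwd);
        Set<String> reachableFwd = bfs(fwd, main);   // MAIN plus everything reachable forwards
        Set<String> reachableBwd = bfs(bwd, main);   // MAIN plus everything that reaches it
        Set<String> out = diff(reachableFwd, main);
        Set<String> in = diff(reachableBwd, main);

        // Weakly connected to MAIN: BFS ignoring edge direction.
        Map<String, Set<String>> undirected = new HashMap<>(fwd);
        bwd.forEach((k, v) -> undirected.merge(k, v, (a, b) -> { Set<String> s = new HashSet<>(a); s.addAll(b); return s; }));
        Set<String> weak = bfs(undirected, main);

        Set<String> tendrils = diff(diff(diff(weak, main), in), out);
        Set<String> disconnected = new HashSet<>(fwd.keySet());
        fwd.values().forEach(disconnected::addAll);
        disconnected.removeAll(weak);

        Map<String, Set<String>> result = new LinkedHashMap<>();
        result.put("MAIN", main); result.put("IN", in); result.put("OUT", out);
        result.put("TENDRILS", tendrils); result.put("DISCONNECTED", disconnected);
        return result;
    }

    static Set<String> bfs(Map<String, Set<String>> adj, Set<String> seeds) {
        Set<String> seen = new HashSet<>(seeds);
        Deque<String> queue = new ArrayDeque<>(seeds);
        while (!queue.isEmpty())
            for (String next : adj.getOrDefault(queue.poll(), Collections.emptySet()))
                if (seen.add(next)) queue.add(next);
        return seen;
    }

    static Map<String, Set<String>> reverse(Map<String, Set<String>> adj) {
        Map<String, Set<String>> rev = new HashMap<>();
        adj.forEach((u, vs) -> vs.forEach(v -> rev.computeIfAbsent(v, k -> new HashSet<>()).add(u)));
        return rev;
    }

    static Set<String> diff(Set<String> a, Set<String> b) {
        Set<String> d = new HashSet<>(a); d.removeAll(b); return d;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> g = new HashMap<>();
        g.put("a", Set.of("b")); g.put("b", Set.of("a", "c")); g.put("d", Set.of("a"));
        System.out.println(partition(g, new HashSet<>(Set.of("a", "b"))));
    }
}
```

With the giant SCC obtained from any standard SCC algorithm (e.g. Tarjan's), this yields the MAIN/IN/OUT/TENDRILS shares of the kind reported above.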
4 Conclusions
Even though the collections were obtained from countries with different economic, historical and geographical contexts, and with different languages, we observed that the results across different collections are always consistent when the observed characteristic exhibits a power law in one collection. In this class we include the distribution of
degrees, link-based scores, internal links, etc.
Besides links, we are working on a detailed account of
the characteristics of the contents and technologies used in
several collections [5].
Acknowledgments: We worked with Vicente López in
the study of the Spanish Web, with Efthimis N. Efthimiadis
in the study of the Greek Web, with Felipe Ortiz, Bárbara
Poblete and Felipe Saint-Jean in the studies of the Chilean
Web and with Felipe Lalanne in the study of the Korean
Web. We also thank the Laboratory of Web Algorithmics
for making their Web collections available for research.
References
[1] R. Baeza-Yates and C. Castillo. Caracterizando la Web
Chilena. In Encuentro chileno de ciencias de la computación,
Punta Arenas, Chile, 2000. Sociedad Chilena de Ciencias de
la Computación.
[2] R. Baeza-Yates and C. Castillo. Relating Web characteristics
with link based Web page ranking. In Proceedings of String
Processing and Information Retrieval SPIRE, pages 21–32,
Laguna San Rafael, Chile, 2001. IEEE CS Press.
[3] R. Baeza-Yates and C. Castillo. Balancing volume, quality
and freshness in Web crawling. In Soft Computing Systems Design, Management and Applications, pages 565–572, Santiago, Chile, 2002. IOS Press Amsterdam.
[4] R. Baeza-Yates and C. Castillo. Características de la Web
Chilena 2004. Technical report, Center for Web Research,
University of Chile, 2005.
[5] R. Baeza-Yates and C. Castillo. Characterization of national
Web domains. Technical report, Universitat Pompeu Fabra,
July 2005.
[6] R. Baeza-Yates, C. Castillo, and V. López. Características
de la Web de España. Technical report, Universitat Pompeu
Fabra, 2005.
[7] R. Baeza-Yates and F. Lalanne. Characteristics of the Korean
Web. Technical report, Korea–Chile IT Cooperation Center
ITCC, 2004.
[8] R. Baeza-Yates and B. Poblete. Evolution of the Chilean Web
structure composition. In Proceedings of Latin American Web
Conference, pages 11–13, Santiago, Chile, 2003. IEEE CS
Press.
[9] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web: Experiments and models. In Proceedings
of the Ninth Conference on World Wide Web, pages 309–320,
Amsterdam, Netherlands, May 2000. ACM Press.
[10] A. S. da Silva, E. A. Veloso, P. B. Golgher, A. H. F. Laender,
and N. Ziviani. CoBWeb - A crawler for the Brazilian Web. In
Proceedings of String Processing and Information Retrieval
(SPIRE), pages 184–191, Cancún, México, 1999. IEEE CS
Press.
[11] S. Dill, R. Kumar, K. S. Mccurley, S. Rajagopalan,
D. Sivakumar, and A. Tomkins. Self-similarity in the web.
ACM Trans. Inter. Tech., 2(3):205–223, 2002.
[12] E. Efthimiadis and C. Castillo. Charting the Greek Web.
In Proceedings of the Conference of the American Society
for Information Science and Technology (ASIST), Providence,
Rhode Island, USA, November 2004. American Society for
Information Science and Technology.
[13] A. Gulli and A. Signorini. The indexable Web is more than
11.5 billion pages. In Poster proceedings of the 14th international conference on World Wide Web, pages 902–903, Chiba,
Japan, 2005. ACM Press.
[14] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[15] M. Modesto, Á. Pereira, N. Ziviani, C. Castillo, and R. Baeza-Yates. Un novo retrato da Web Brasileira. In Proceedings of
SEMISH, São Leopoldo, Brazil, 2005.
[16] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the Web. Technical
report, Stanford Digital Library Technologies Project, 1998.
[17] G. Pandurangan, P. Raghavan, and E. Upfal. Using Pagerank to characterize Web structure. In Proceedings of the 8th
Annual International Computing and Combinatorics Conference (COCOON), volume 2387 of Lecture Notes in Computer
Science, pages 330–390, Singapore, August 2002. Springer.
[18] E. A. Veloso, E. de Moura, P. Golgher, A. da Silva,
R. Almeida, A. Laender, R. B. Neto, and N. Ziviani. Um
retrato da Web Brasileira. In Proceedings of Simposio
Brasileiro de Computacao, Curitiba, Brasil, 2000.
[19] G. K. Zipf. Human behavior and the principle of least effort: An introduction to human ecology. Addison-Wesley,
Cambridge, MA, USA, 1949.
Web Document Models for Web Information Retrieval
Michel Beigbeder
G2I department
École Nationale Supérieure des Mines
158, cours Fauriel
F 42023 SAINT ETIENNE CEDEX 2
Abstract
Different Web document models in relation to the hypertext nature of the Web are presented. The Web graph is the
most well known and used data extracted from the Web hypertext. The ways it has been used in works in relation with
information retrieval are surveyed. Finally, some considerations about the integration of these works in a Web search
engine are presented.
1. Web document models
Flat independent pages The immediate reuse of the long
lived Information Retrieval (IR) techniques led to the most
simple model of Web documents. It was used by the first
search engines: Excite (1993), Lycos (1994), AltaVista
(1994), etc. In this model, HTML pages are converted to
plain text by removing the tags and keeping the text between the tags. The content of some tags can easily be ignored. Then pages are indexed as flat plain text. The
prevailing IR model used with this document model is the
vector model, though AltaVista introduced a combination
of a Boolean model to select a set of documents which
is then ranked with a vector model.
The main advantage of this model is that many of the traditional IR tools and techniques could be straightforwardly
used.
Structured independent pages The enhancement from
the first model is that some structure about the pages is kept
either in the index or considered in the indexing step. For
instance, with Boolean-like models, words could be looked for only in the title tag. Such capabilities have been proposed for some time, but, like other Boolean capabilities,
did not get much public success. With the vector model
the words appearing in the title or sectioning tag (for instance) could receive a greater weight than others. Some
search engines mentioned this peculiarity, but, as far as I
know, no details were given and no experiments were conducted to prove the effectiveness of these different weighting schemes.
These uses of the internal structure of the Web documents are very weak compared to the strong internal structure allowed by HTML. But the documents found on the
Web are not strongly structured because many structural elements are misused to obtain page layouts. So, the works
in IR on structured documents are not useful in the actual
Web.
Linked pages In this model the hypertext links represented by the <a href="..."> tags are used to build
a directed graph: the Web graph. The nodes of the graph
are the pages themselves, and there is one arc from the node
P to the node P′ iff there is somewhere in the HTML code of P an href link to the page P′. Note that this is a simplification of what is really coded in the HTML, because if there are many href links in P to P′, there is only one arc
(otherwise, we would define a multigraph).
But the most difficult point here is to define precisely
what the nodes are: pages, URLs or sets of pages. Let us make this choice precise.
The pages are identified by their URL, and URLs themselves are composed of nine fields:
<scheme>://<user>:<passwd>@<host>:<port>/
<path>;<parameters>?<query>#<fragment>
If the user and passwd fields can be safely ignored,
what to do with the parameters and query ones is not
trivial. By ignoring them to define the nodes, a graph with
fewer nodes and more connectivity is obtained, but the question is which of the many contents is to be associated with the node.
Moreover, using the fragment field would lead either
to consider the page as composed of smaller units or to consider these smaller units as the documents to be returned
by the search engine. However, due to the poor use of HTML, many of the opening <A NAME="..."> tags are
not closed with a </A>, so many fragments are not fully
delimited. So I think that this field should be ignored.
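A sketch of one possible node-identity choice following this discussion (an illustration, not the paper's own method; keeping scheme, host, port and path while dropping user, password, parameters, query and fragment is the assumption made here):

```java
import java.net.URI;
import java.net.URISyntaxException;

// One possible normalization of a URL into a Web-graph node identifier:
// keep scheme, host, port and path; drop user info, query and fragment.
public class NodeId {
    public static String nodeOf(String url) throws URISyntaxException {
        URI u = new URI(url);
        URI node = new URI(u.getScheme(), null /* no user:passwd */, u.getHost(), u.getPort(),
                           u.getPath(), null /* no query */, null /* no fragment */);
        return node.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // Both variants collapse to the same node.
        System.out.println(nodeOf("http://user:pass@www.example.org/page.html?lang=fr#section2"));
        System.out.println(nodeOf("http://www.example.org/page.html"));
    }
}
```

Whether the query field should also be kept is exactly the open question raised above; the sketch simply picks one answer.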
Another difficulty is the replication of pages, either actual replication on different servers or replication through different names on a single server. As an example of the second case, both http://rim.emse.fr/ and http://www.emse.fr/fr/transfert/g2i/depscientifiques/rim/index.html point to the same page. When it is possible to recognize this replication, I think it is better to merge the different URLs into a single node, because a graph with higher density is obtained and, as they refer to the same content, there is not the problem of choosing or building such a content.
Given some choices regarding the quoted questions, the directed graph can be built. It has been extensively studied [3] [10] and used for information retrieval in particular. We will review some of its usages in section 2.
Anchor linked pages This model takes into account more
of the HTML code.
Each anchor, delimited with a <a
href="..."> tag and the corresponding </a> tag, is used to
index the page pointed by the href attribute. This idea is still in use
in some search engines. Moreover in the Web context where spidering is an essential part of the information retrieval system (IRS)
to keep the index up to date, it allows the association of an index to
a document (a page) before it is actually loaded. Variations consist
in heuristics that take into account not only the anchor text itself but also its neighborhood. Note that this is not very different from
the first point exposed in section 2 about relevance propagation.
2. Link usage
The Web graph between pages is used by many works. In relation to IR it has been used for different goals.
Index enhancement and relevance propagation One of
the first ideas tested in hypertext environments [8] consists in using the index of neighbors of a node either to index the node in
the indexing step, or to use the relevance score values (RSV) of
these nodes in the querying step. Both of these methods are based
on the idea that the text in a node (a page in the Web context) is
not self contained and that the text of the neighbors can give either
a context or some precision to the text of the nodes. Savoy conducted many experiments to test this idea. He reports that effectiveness improvements are low with vector and probabilistic models [16] and higher with the Boolean model [17]. Marchiori uses a
propagation with some fading for fuzzy metadata [13]. The same
scheme could be applied to the term weights in the vector model.
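One simple reading of such a propagation scheme (a sketch, not Savoy's or Marchiori's actual method; the single propagation step and the fading factor lambda are assumptions):

```java
import java.util.*;

// Propagate retrieval status values (RSV) from neighbors with a fading factor:
// score'(p) = rsv(p) + lambda * sum of rsv(q) over the neighbors q of p.
public class RsvPropagation {
    public static Map<String, Double> propagate(Map<String, Double> rsv,
                                                Map<String, Set<String>> neighbors,
                                                double lambda) {
        Map<String, Double> out = new HashMap<>();
        for (String p : rsv.keySet()) {
            double s = rsv.get(p);
            for (String q : neighbors.getOrDefault(p, Collections.emptySet()))
                s += lambda * rsv.getOrDefault(q, 0.0);
            out.put(p, s);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> rsv = Map.of("a", 0.9, "b", 0.1, "c", 0.0);
        Map<String, Set<String>> neighbors = Map.of("c", Set.of("a", "b")); // c is linked from a and b
        System.out.println(propagate(rsv, neighbors, 0.5)); // c inherits part of a's and b's scores
    }
}
```

The same step, applied to term weights instead of RSVs, gives the variant mentioned in the last sentence above.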
Page ranking: PageRank [2] and HITS [9] We will not
describe once more here these two methods. The first one attributes
a (popularity) score to every page, the second one attributes two
(hubbiness and authority) scores to them. The key point is that
these scores are independent of the words used either in the documents or in the query.
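For reference, a compact power-iteration sketch of the PageRank computation (standard textbook form, not taken from [2]; the damping factor of 0.85 and the uniform handling of dangling nodes are the usual assumptions):

```java
import java.util.*;

// Basic PageRank by power iteration: each node gets (1-d)/N plus d times the rank
// flowing in over links, with the rank of dangling nodes spread uniformly.
// The resulting scores are independent of any query terms.
public class PageRank {
    public static Map<String, Double> compute(Map<String, Set<String>> outLinks, int iterations) {
        final double d = 0.85;
        Set<String> nodes = new HashSet<>(outLinks.keySet());
        outLinks.values().forEach(nodes::addAll);
        int n = nodes.size();

        Map<String, Double> rank = new HashMap<>();
        for (String v : nodes) rank.put(v, 1.0 / n);

        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            double dangling = 0.0;
            for (String v : nodes) next.put(v, (1.0 - d) / n);
            for (String u : nodes) {
                Set<String> targets = outLinks.getOrDefault(u, Collections.emptySet());
                if (targets.isEmpty()) { dangling += rank.get(u); continue; }
                double share = d * rank.get(u) / targets.size();
                for (String v : targets) next.merge(v, share, Double::sum);
            }
            for (String v : nodes) next.merge(v, d * dangling / n, Double::sum); // dangling mass
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> g = Map.of("a", Set.of("b", "c"), "b", Set.of("c"), "c", Set.of("a"));
        System.out.println(compute(g, 50)); // popularity scores summing to 1
    }
}
```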
Page gathering The page ranking algorithms can be used on any graph, and hence on any subgraph of the Web. The PageRank algorithm has been used to focus gathering on a given topic [5].
Page categorization If some pages are categorized, it can help to categorize their neighbors; this idea has been used in combination with the content analysis of the pages [4].
Page classification Classification is different from categorization in the sense that classes are not predefined. A method based
on co-citation, which was first used in library science [18], is presented in the Web context by Prime et al. [15]; it aims to semi-automatically qualify Web pages with metadata.
Similar page discovery Dean et al. [6] proposed two solutions to this problem. The first one is based on the HITS algorithm
and the second one is based on co-citation [18].
Replica discovery Bharat et al. present a survey of techniques
to find replicas on the Web [1]. One of them is based on the link
structure of the Web.
Logical Units Discovery The idea here is akin to that of index enhancement: if pages are not self contained, they need to be
indexed or searched with other ones. But here, the context is not
built with a breadth first search algorithm on the Web graph, but
with other algorithms.
Three methods are aimed at augmenting the recall, with the
idea that not all the concepts of a conjunctive query are present in
a page, but some of them are in neighbor pages [7] [19] [11]. Note
that Dyreson's method [7] does not use the Web graph but a
graph derived from it by taking into account the directory hierarchy coded in the URL. These three methods share the drawback
that they take place in a Boolean framework.
Tajima et al. [20] propose to discover the logical units of information by clustering. To take into account the structure, the
similarity between two clusters is zero when there are no links between any page of one cluster and any page of the second cluster,
otherwise the similarity is computed with Salton’s model. So there
is not a strong use of the link structure.
Communities discovery Another approach by Li et al. [12]
attempts to discover logical domains — as opposed to the physical
domains tied to the host addresses. These domains are of a greater
granularity than the logical units of the previous paragraph. Their
goal is to cluster the results of a search task. In order to build these
domains, the first step consists in finding k (an algorithm parameter) entry points with criteria that take into account the title
tag content, the textual content of the URL1 , the in and out degree
within the Web graph, etc. In the second step, pages that are not
entry points are linked to the first entry point located above considering their URL path (as a result, some pages may stay orphan).
Moreover some conditions — minimal size of a domain, radius —
influence the constitution of domains.
1 Some words such as welcome, home, people, user, etc. are important.
3. Link tools
There are rather few basic methods used in the link usage:
1. graph search (mainly breadth first search);
2. PageRank and HITS algorithms (which are matrix based);
3. co-citation (building the co-citation data is also a matrix; manipulation)
4. clustering (many methods can be used).
4. URL use
We already noted that Dyreson [7] uses the URL data to discover logical units. In the study conducted by Mizuuchi et al. [14] the URL-coded paths are used to discover for every page P one (and only one) entry page, i.e. a page through which a reader is supposed to pass before arriving at P. A page tree is defined by these entry pages. This tree is used to enhance the index of a page with the content of some tags of the ancestors of P.
5. Conclusion and proposition
IR integration The works quoted above are not all dealing directly with the IR problem. Many of them were not tested with test
collections which are standard in the IR community, such as those of TREC2. So some work has to be done on how to integrate and test
these methods in a search engine.
Precision enhancement Now, I present some qualitative considerations. Many of these methods are aimed at dealing with the
huge size of the Web: everything about some kind of classification
or categorization are of this kind. Most often, these methods can
be applied either before the query as a preprocessing step or on the
results of a query.
While not explicitly presented in this direction, the PageRank algorithm can be considered of this kind. Due to the huge size of the
Web, many queries, especially the very short queries submitted
by the Web users, have many, many answers. The polysemy is
much higher than in traditional IR collections. So the use of clues
external to the vocabulary can be seen as a discrimination factor to
select documents when the collection is very large.
Recall enhancement The other usages (Index enhancement and Logical Units Discovery) are aimed, on the contrary, at enhancing recall, which is not often required, or not a priority when too many irrelevant answers are given to the queries.
Though, in my view, the Logical Units Discovery methods can be considered from an IR point of view as trying to access different levels of granularity of documents in the Web space. If we consider that an IR system returns pointers to documents, the notion
of document is what is returned by the IR system. So if an IR system returns a Logical Unit which is composed of several pages,
this is a higher level of granularity.
2 http://trec.nist.gov/
Proposition: a hierarchical presentation of the Web
Many of the queries submitted to search engines have many many
answers. The IR traditional relevance and the popularity produce
lists of answers. But presenting the results as an ordered list increases the likelihood of missing important, and in some sense
rarer, information. This is true especially if the ranking is only
done with popularity as this has the effect that the best ranked
documents have the more likelihood to get better ranked.
I suggest that the results should be presented by clusters, with a number of clusters manageable by the user (from ten to one hundred; it could be a user preference). With iterative clustering, any document would be at distance log(n) from the root rather than at distance n from the beginning of a sorted list.
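As a purely illustrative sketch of such an iterative clustering of results (the clustering algorithm and the similarity clue below, k-means over TF-IDF vectors, are arbitrary choices of ours, not part of the proposal):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_tree(docs, branching=10, leaf_size=10):
    """Recursively cluster result documents so that the user browses a tree of
    manageable width (log-depth) instead of a flat ranked list."""
    if len(docs) <= leaf_size:
        return docs  # leaf: a short list the user can scan directly
    vectors = TfidfVectorizer().fit_transform([d["text"] for d in docs])
    k = min(branching, len(docs))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
    groups = [[d for d, lab in zip(docs, labels) if lab == c] for c in range(k)]
    groups = [g for g in groups if g]
    if len(groups) <= 1:        # no progress (e.g. near-identical texts): stop splitting
        return docs
    return [cluster_tree(g, branching, leaf_size) for g in groups]

# toy usage: each result carries the text (title + snippet) used for similarity
topics = ["football match goal", "jazz concert music", "election vote politics"]
results = [{"url": f"http://example.org/{i}", "text": topics[i % 3]} for i in range(60)]
tree = cluster_tree(results, branching=3, leaf_size=10)
```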
To help do that, many possibilities can be considered:
• some of the clustering techniques could be applied, either on the Web or on the results of a query; this clustering could be done with similarities based on different clues according to the user's information need (text similarity, co-citation similarity, co-occurrence, etc.);
• some categorization could be used (particularly open ones 3 );
• Entry Points Discovery and Logical Units Discovery could be used to merge several URLs into a single node in the graph. Merging several URLs into a single node has two beneficial effects: it reduces the size of the graph and the resulting graph has a higher density. Reducing the size of the graph has an influence on the run time of the algorithms, which is important due to the size of the Web and the complexity of some algorithms (clustering for example). Increasing the density is important because the Web graph is rather sparse, and only a small proportion of pages are cited (and even fewer are co-cited), so the benefit of the link-based algorithms is not well spread.
• recall enhancement methods could be used when queries
give no answers.
References
[1] K. Bharat, A. Broder, J. Dean, and M. R. Henzinger. A comparison of techniques to find mirrored hosts on the www.
IEEE Data Engineering Bulletin, 23(4):21–26, 2000.
[2] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN
Systems, 30(1–7):107–117, 1998.
[3] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web: experiments and models. In 9th International World Wide Web Conference, The Web: The Next Generation, 5 2000.
[4] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In L. M. Haas and
A. Tiwary, editors, Proceedings ACM SIGMOD International Conference on Management of Data, pages 307–318.
ACM Press, 1998.
[5] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1–7):161–172, 1998.
3 http://www.dmoz.org/ for instance.
[6] J. Dean and M. R. Henzinger. Finding related pages in
the world wide web. Computer Networks, 31(11-16):1467–
1479, 1999.
[7] C. E. Dyreson. A jumping spider: Restructuring the WWW
graph to index concepts that span pages. In A.-M. Vercoustre, M. Milosavljevi, and R. Wilkinson, editors, Proceedings
of the Workshop on Reuse of Web Information, held in conjunction with the 7th WWW Conference, pages 9–20, 1998.
CSIRO Report Number CMIS 98-11.
[8] M. E. Frisse. Searching for information in a hypertext medical handbook. Communications of the ACM, 31(7):880–886,
1988.
[9] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[10] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar,
A. Tomkins, and E. Upfal. The web as a graph. In Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART
Symposium on Principles of Database Systems, 2000.
[11] W.-S. Li, K. S. Candan, Q. Vu, and D. Agrawal. Retrieving
and organizing web pages by ”information unit”. In M. R.
Lyu and M. E. Zurko, editors, Proceedings of the Tenth International World Wide Web Conference, 2001.
[12] W.-S. Li, O. Kolak, Q. Vu, and H. Takano. Defining logical
domains in a web site. In HYPERTEXT ’00, Proceedings of
the eleventh ACM on Hypertext and hypermedia, pages 123–
132, 2000.
[13] M. Marchiori. The limits of web metadata and beyond. Computer Networks and ISDN Systems, 30(1–7):1–9, 1998.
[14] Y. Mizuuchi and K. Tajima. Finding context paths for web
pages. In HYPERTEXT ’99, Proceedings of the tenth ACM
Conference on Hypertext and hypermedia: returning to our
diverse roots, pages 13–22, 1999.
[15] C. Prime-Claverie, M. Beigbeder, and T. Lafouge. Transposition of the co-citation method with a view to classifying
web pages. Journal of the American Society for Information
Science and Technology, 55(14):1282–1289, 2004.
[16] J. Savoy. Citation schemes in Hypertext information retrieval, pages 99–120. Kluwer Academic Publishers, 1996.
in Agosti M. and Smeaton A. editors, Information Retrieval
and Hypertext.
[17] J. Savoy. Ranking schemes in hybrid boolean systems: a new
approach. Journal of the American Society for Information
Science, 48(3):235–253, 1997.
[18] H. Small. Co-citation in the scientific literature: a new measure of the relationship between two documents. Journal
of the American Society for information Science, 24(4):265–
269, 1973.
[19] K. Tajima, K. Hatano, T. Matsukura, R. Sano, and K. Tanaka.
Discovery and retrieval of logical information units in web.
In R. Wilensky, K. Tanaka, and Y. Hara, editors, Proc. of
Workshop of Organizing Web Space (in conjunction with
ACM Conference on Digital Libraries ’99), pages 13–23,
1999.
[20] K. Tajima, Y. Mizuuchi, M. Kitagawa, and K. Tanaka. Cut as
a querying unit for WWW, netnews, e-mail. In HYPERTEXT
’98, Proceedings of the ninth ACM conference on Hypertext
and hypermedia: links, objects, time and space–structure in
hypermedia systems, pages 235–244, 1998.
Static Ranking of Web Pages, and Related Ideas
Wray Buntine
Complex Systems Computation Group,
Helsinki Institute for Information Technology
P.O. Box 9800, FIN-02015 HUT, Finland
[email protected]
Abstract
This working paper reviews some different ideas in
link-based analysis for search. First, results about static
ranking of web pages based on the so-called random-surfer model are reviewed and presented in a unified
framework. Second, a topic-based hubs and authorities model using a discrete component method (a variant of ICA and PCA) is developed, and illustrated on
the 500,000 page English language Wikipedia collection. Third, a proposal is presented to the community for a Links/Ranking consortium extracted from the
Web Intelligence paper Opportunities from Open Source
Search.
1 Introduction
PageRank™, used by Google, and the Hypertext-Induced Topic Selection (HITS) model, developed at IBM [9], are the best known of the ranking models, although they represent a very recent part of a much older bibliographic literature (discussed, for instance, in [5]). PageRank ranks all pages in a collection and is then used as a static (i.e., query-free) part of query evaluation, whereas HITS is intended to be applied to just the subgraph of pages retrieved for a query, and perhaps some of their neighbors. There is nothing, however, to stop HITS being applied like PageRank to a full collection rather than just query results.
PageRank is intended to measure the authority of
a webpage on the basis that high authority pages have
other high authority pages linked to them. HITS is also
referred to as the hubs and authorities model: a hub is
a web page that is viewed as a reliable source for links
to other web pages, whereas an authority is viewed as
a reliable content page itself. Generally speaking, good
hubs should point to good authorities and vice versa.
The literature about these methods is substantial [2, 1].
Here I review these two models, and then discuss
their use in an Open Source environment.
2 Random Surfers versus Random Seekers
The PageRank model is based on the notion of an
idealised random surfer. The random surfer starts off
by choosing from some selection of pages according to
an initial probability vector ~s. When at a new page,
the surfer can take one of the outgoing links from the
current page, or with some smaller probability restart
afresh at a completely new page again using the initial probability vector ~s. The general start-restart process is depicted in the graph in Figure 1, where the
initial state is labelled start, and the pages themselves
form a subgraph T.

Figure 1. Start-Restart for the Random Surfer

Every page in the collection has
a link to a matching restart state leading directly to
start, and start links back to those pages in the collection with a non-zero entry in ~s. Note the restart states
could be eliminated, but are retained for comparison
with the later model. This represents a Markov model
once we attach probabilities to outgoing arcs, and the
usual analysis of Markov chains and linear systems (see
for instance [12]) applies [1]. The computed static rank
is the long run probability of visiting any page in the
collection according to the Markov model.
Extensions to the model include making the initial
probability vector ~s dependent on topic [7, 11], providing a back button so the surfer can reject a new page
based on its unsuitability of topic [11, 10], and handling
the way in which pages with no outgoing links can be
dealt with [6, 1]. These extensions make the idealised
surfer more realistic, yet no real analysis of the Markov
models on real users has been attempted. A fragment of
a graph illustrating the Markov model from the point of
view of surfing from one page, is given in Figure 2.

Figure 2. Local view of Primitive States

From a page $j$, the surfer decides to either restart with probability $r_j$, or to click on a link to a new page. Once they decide to click, they try different pages $k$ with probability given by the matrix $p$, where

$$p_{j,k} = \begin{cases} 0 & \text{page } j \text{ has no out link to } k \\ 1/L & \text{page } j \text{ has } L \text{ outlinks, one to } k \end{cases}$$

but have a one time opportunity (to check) to either accept the new page $k$, given by $a_k$, or to try again and go back to the intermediate click state. Folding in the various intermediate states (click and the check states) and just keeping the pages and the start and restart states yields a transition matrix starting from a page $j$ of

$$p(\text{state} \mid \text{page } j) = \begin{cases} r_j & \text{state} = \text{restart} \\ (1 - r_j)\,\dfrac{p_{j,k}\, a_k}{\sum_k p_{j,k}\, a_k} & \text{state} = \text{page } k \end{cases} \qquad (1)$$

Note in this formulation, if a page $j$ has no outgoing links, then $r_j = 1$ necessarily. The parameters are summarised in the following table.

    Symbol      Description
    $\vec{s}$   initial probabilities for pages, normalised
    $\vec{r}$   restart probabilities for pages
    $\vec{a}$   acceptance probabilities for pages

With appropriate choice of these, all of the common models in the literature can be handled [7, 11, 1].

A new model proposed by Amati, Ounis, and Plachouras [13] is the static absorbing model for the web. The absorbing model is instead based on the notion of a random seeker. The random seeker again surfs the web, but instead of continuously surfing, can "find" a page and thus stop. The general model comparable to Figure 1 is now given by Figure 3.

Figure 3. Start-Stop for the Random Seeker

In the random seeker model, the computed static rank is the long run probability of stopping at ("finding") any given page. It is thus given by the probabilities for the absorbing states in the Markov model, and again the usual analysis applies. The page to page transition probabilities, however, can otherwise be modelled in various ways using Equation (1).

The structure of the graphs suggests that these two models (random surfer versus random seeker) should have a strong similarity in their results. We can work out the exact probabilities by folding the transition matrices. The following lemmas do this.

Lemma 1. Given the random seeker model with parameters $\vec{s}$, $\vec{r}$, $\vec{a}$ and $p$, where $r_j = 1$ for any page $j$ without outgoing links, let $\mathbf{P}$ denote the transition matrix

$$P_{j,k} = \begin{cases} 0 & j \text{ not linked to } k \\ \dfrac{p_{j,k}\, a_k}{\sum_k p_{j,k}\, a_k} & \text{page } j \text{ linked to } k \end{cases}$$

and let $D_r$ denote the diagonal matrix formed using the entries of $\vec{r}$. The total probability of the stop states for paths of length less than or equal to $n + 2$ is given by

$$D_r \left( I + \sum_{i=1}^{n} \big( (I - D_r)\mathbf{P} \big)^i \right) \vec{s} \qquad (2)$$
This can be proven by straightforward enumeration of states. Equation (2) is evaluated in practice using a recurrence relation such as $\vec{q}_0 = \vec{s}$, $\vec{p}_{i+1} = \vec{p}_i + D_r \vec{q}_i$ and $\vec{q}_{i+1} = (I - D_r)\mathbf{P}\,\vec{q}_i$.
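For concreteness, a small NumPy sketch of this recurrence on toy matrices (the data is invented; only the update rule follows Equation (2)):

```python
import numpy as np

def absorbing_rank(P, r, s, n_steps=100):
    """Truncated Equation (2): q_0 = s, p_{i+1} = p_i + D_r q_i,
    q_{i+1} = (I - D_r) P q_i, starting from p_0 = 0."""
    Dr, I = np.diag(r), np.eye(len(r))
    q, p = s.copy(), np.zeros_like(s)
    for _ in range(n_steps):
        p, q = p + Dr @ q, (I - Dr) @ P @ q
    return p

# toy data: page 2 has no outgoing links, so r[2] = 1 and its row of P is zero
P = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
r = np.array([0.15, 0.15, 1.0])
s = np.array([1/3, 1/3, 1/3])

approx = absorbing_rank(P, r, s)
exact = np.diag(r) @ np.linalg.solve(np.eye(3) - (np.eye(3) - np.diag(r)) @ P, s)
print(np.allclose(approx, exact))   # True: the truncated sum approaches Equation (3)
```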
Lemma 2. Given the random seeker model with parameters $\vec{s}$, $\vec{r}$, $\vec{a}$ and $p$, where $r_j = 1$ for any page $j$ without outgoing links, and let $\mathbf{P}$ and $D_r$ be defined as above. Assume $r_j > 0$ for all pages $j$. The total absorbing probability of the stop states is given by

$$D_r \big( I - (I - D_r)\mathbf{P} \big)^{-1} \vec{s} \qquad (3)$$

The matrix inverse exists. Moreover, the $L_1$ norm of this minus Equation (2) is less than $(1 - r_0)^{n+1}/r_0$, where $r_0 = \min_j r_j > 0$.
Note in the standard PageRank interpretation, $r_0 = 1 - \alpha$, so the remainder is $\alpha^{n+1}/(1 - \alpha)$, the same order as for the PageRank calculation [1].

Proof. Consider $\vec{q}_i = ((I - D_r)\mathbf{P})^i \vec{s}$, and prove the recursion $\|\vec{q}_i\|_1 \le (1 - r_0)^i$. Since $\mathbf{P}$ is a probability matrix with some rows zero, $\|\mathbf{P}\vec{q}_i\|_1 \le \|\vec{q}_i\|_1$ and hence $\|\vec{q}_{i+1}\|_1 \le (1 - r_0)^{i+1}$. Consider $\vec{q}_{n,m} = \left( \sum_{i=n+1}^{m} ((I - D_r)\mathbf{P})^i \right) \vec{s}$. Hence $\|\vec{q}_{n,m}\|_1 \le \sum_{i=n+1}^{m} (1 - r_0)^i$, which is $((1 - r_0)^{n+1} - (1 - r_0)^m)/r_0$. Thus $\vec{q}_{n,\infty}$ is well defined, and has an upper bound of $(1 - r_0)^{n+1}/r_0$. Thus the total absorbing probability is given by Equation (2) as $n \to \infty$, with $L_1$ error after $n$ steps bounded by $(1 - r_0)^{n+1}/r_0$. Since the sum is well defined and converges, it follows that $(I - (I - D_r)\mathbf{P})^{-1}$ exists.
Lemma 3. Given the random surfer model with parameters $\vec{s}$, $\vec{r}$, $\vec{a}$ and $p$, where $r_j = 1$ for any page $j$ without outgoing links, and let $\mathbf{P}$ and $D_r$ be defined as above. Assume $r_j > 0$ and $s_j > 0$ for all pages $j$. Then the long run probability over pages exists independently of the initial probability over pages and is proportional to

$$\big( I - (I - D_r)\mathbf{P} \big)^{-1} \vec{s} \qquad (4)$$

Proof. Eliminate the start and restart states, and then the transition matrix becomes as follows: given a probability over pages of $\vec{p}_i$, at the next cycle

$$\vec{p}_{i+1} = \vec{s}\,(\vec{r}^{\,\dagger} \vec{p}_i) + (I - D_r)\mathbf{P}\,\vec{p}_i$$

Since $\vec{r}$ and $\vec{s}$ are strictly positive, the Markov chain is ergodic and irreducible [12], and thus the long run probability over pages exists independently of the initial probability over pages. Consider the fixed point of these equations. Make a change of variables to $\vec{p}\,' = D_r \vec{p}/(\vec{r}^{\,\dagger} \vec{p})$. This is always well defined since the positivity constraints on $\vec{r}$ ensure $\vec{r}^{\,\dagger} \vec{p} > 0$. Then

$$\vec{p}\,' = D_r \vec{s} + D_r (I - D_r)\mathbf{P}\, D_r^{-1} \vec{p}\,'$$

Rewriting,

$$D_r \big( I - (I - D_r)\mathbf{P} \big) D_r^{-1} \vec{p}\,' = D_r \vec{s}$$

We know from above that the inverse of the middle matrix expression exists. Thus

$$\vec{p}\,' = D_r \big( I - (I - D_r)\mathbf{P} \big)^{-1} \vec{s}$$

Substituting back for $\vec{p}$ yields the result.

Note the usual recurrence relation for computing this is

$$\vec{p}_{i+1} = \vec{s}\,(\vec{r}^{\,\dagger} \vec{p}_i) + (I - D_r)\mathbf{P}\,\vec{p}_i,$$

and due to the correspondence between Equations (3) and (4), the alternative recurrence for the absorbing model could be adapted as well. The recurrence relation holds: $\vec{q}_0 = \vec{s}$, $\vec{p}_{i+1} = \vec{p}_i + \vec{q}_i$ and $\vec{q}_{i+1} = (I - D_r)\mathbf{P}\,\vec{q}_i$, noting that the final estimate $\vec{p}_{i+1}$ so obtained needs to be normalised. This can, in fact, be supported on a graphical basis as well.
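A similar toy check of the alternative recurrence just mentioned: the truncated series reaches the solution of Equation (4), and normalising it gives the long-run page probabilities of Lemma 3 (again with invented matrices, reusing the data from the previous sketch):

```python
import numpy as np

# same toy data: page 2 has no outgoing links, so r[2] = 1 and its row of P is zero
P = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
r = np.array([0.15, 0.15, 1.0])
s = np.array([1/3, 1/3, 1/3])

Dr, I = np.diag(r), np.eye(3)
M = (I - Dr) @ P

# alternative recurrence: q_0 = s, p_{i+1} = p_i + q_i, q_{i+1} = M q_i
q, p = s.copy(), np.zeros(3)
for _ in range(200):
    p, q = p + q, M @ q

exact = np.linalg.solve(I - M, s)      # (I - (I - D_r)P)^{-1} s, Equation (4)
print(np.allclose(p, exact))           # True: the series converges to Equation (4)
print(p / p.sum())                     # normalised: the surfer static rank of Lemma 3
```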
This correspondence gives us insight into how to improve these models. How might we make the Markov
models more realistic? Could the various parameters
be learned from click stream data? While in the surfing model ~r corresponds to the probability of restarting,
in the seeking model it is the probability of accepting a
page and stopping. One is more likely to use the back
button on such pages, and thus perhaps the acceptance
probabilities ~a should be modified. Some versions are
suggested in [6].
3 Probabilistic Hubs and Authorities
A probabilistic authority model for web pages, based
on PLSI [8], was presented by [5]. By using the Gamma-Poisson version of Discrete PCA [4, 3], a generalisation
of PLSI using independent Gamma components, this
can be extended to a probabilistic version of the hubs
and authorities model. The method is topic based in
that hubs and authorities are produced for K different
topics. An authority matrix Θ gives the authority score
for each page j for the k-th topic, θj,k , normalised for
each topic. Each page j is a potential hub, with hub
scores lj,k for topic k taken from the hub matrix l. The
links in a page are modelled independently using the
Gamma(1/K, 1) distribution. The occurrences of link j
in page i are then Poisson distributed with a mean given
by the authority scores for the link weighted by the hub scores for the page, $\mathrm{Poisson}(\sum_k l_{i,k}\,\theta_{j,k})$. More details of the model, and the estimation of the authority matrix and hub matrix, are given in [3].
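For concreteness, a rough generative sketch of this kind of Gamma-Poisson hub/authority model in NumPy; the sizes, the prior used for Θ, and the variable names are placeholders of ours, and the actual model and its estimation are specified in [3]:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pages, n_links, K = 1000, 500, 10     # toy sizes, not the Wikipedia run

# authority matrix Theta: one authority score vector per topic, normalised per topic
# (the Gamma(0.5, 1) prior here is only a placeholder)
theta = rng.gamma(0.5, 1.0, size=(n_links, K))
theta /= theta.sum(axis=0, keepdims=True)

# hub scores: each page's K topic intensities, Gamma(1/K, 1) as stated in the text
l = rng.gamma(1.0 / K, 1.0, size=(n_pages, K))

# occurrences of link j in page i ~ Poisson(sum_k l[i, k] * theta[j, k])
mean = l @ theta.T
counts = rng.poisson(mean)
print(counts.shape, counts.sum())
```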
To investigate this model, the link structure of
the English language Wikipedia from May 2005 was
used as data. The output of this analysis is given at
http://cosco.hiit.fi/search/MPCA/HA100.html.
This is about 500,000 documents and K = 100 hub and
authority topics are given. The authority scores are the
highest values for a topic k from the authority matrix
Θ, and the hub scores are the highest component
estimates for topic k for lj,k for a page j.
Note a variety of hub and authority models have been
investigated in the context of query evaluation [2]. It
is not clear if this is the right approach for using these
models. Nevertheless, these represent another family
of link-based systems that can be used in a search engine, and an alternative definition of authority to that of the previous section.
4 A Trust/Reputation Consortium for Open Source Ranking
Having reviewed some methods for link analysis, let us now consider their use. Opportunities abound once the right infrastructure is in place for open source search. Here I describe one general kind of system that could exist in this framework, intended either as an academic or a commercial project.
On Google the ranking of pages is influenced by the
PageRank of websites. Sites appearing in the first page
of results for popular and commercially relevant queries
get a significant boost in viewership, and thus PageRank has become critical for marketing purposes. This
method for computing authority for a web page borrows
from early citation analysis, and the broader fields of
trust, reputation, and social networks (which blog links
could be interpreted to represent) provide new opportunities for this kind of input to search. Analysis of large
and complex networks such as the Internet is readily
done on today's grid computing networks.
What are some scenarios for the use of new kinds of data about authority, trust and reputation, with standards perhaps set up by a consortium? A related example is the new OpenID¹, a distributed identity system.
ACM could develop a ”computer science site rank”
that gives web sites an authority ranking according to
”computer science” relevance and reputation. In this
ranking the BBC Sports website would be low, Donald
Knuth’s home page high, and Amazon’s Computer Science pages perhaps medium. Our search engines can
then incorporate this authority ranking into their own
scores when asked to do so. ACM might pay for the development and maintenance of this ranking as a service
to its members, possibly incorporating its rich information about citations as well, thus using a sophisticated
reputation model well beyond simple PageRank. In an
open source search network, consumers of these kinds
of organisational or professional ranks could be found.
To take advantage of such a system, a user could
choose to search Australian university web sites via a
P2P universities search engine and then enrol with the
ACM ranking in order to help rank their results.
Yahoo could develop a vendor web site classification
that records all websites according to whether they primarily or secondarily perform retail or wholesale services, product information, or product service, extending its current Mindset demonstration2 . This could be
coupled with a vendor login service so that vendors can
manage their entries, and trust capabilities so that some
measure of authority exists about the classifications.
Using this, search engines then have a trustworthy way
of placing web pages into different product genres, and
thus commercial and product search could be far more
predictable.
To take advantage of this, a user could search for
product details, but enrol with the Yahoo service classification to restrict their search to relevant pages.
Network methods for trust, reputation, community
groups, and so forth, could all be invaluable to small
local search engines, that cannot otherwise gain a global
perspective on their content. They would also serve as
a rich area for business potential.
References
[1] A. N. Langville and C. D. Meyer. Deeper inside PageRank. Internet Mathematics, 1(3):335–400, 2004.
[2] A. Borodin, G. Roberts, J. Rosenthal, and P. Tsaparas.
Link analysis ranking: algorithms, theory, and experiments. ACM Trans. Inter. Tech., 5(1):231–297, 2005.
[3] W. Buntine. Discrete principal component analysis.
submitted, 2005.
[4] J. Canny. GaP: a factor model for discrete data. In
SIGIR 2004, pages 122–129, 2004.
[5] D. Cohn and H. Chang. Learning to probabilistically
identify authoritative documents. In Proc. 17th International Conf. on Machine Learning, pages 167–174.
Morgan Kaufmann, San Francisco, CA, 2000.
[6] N. Eiron, K. McCurley, and J. Tomlin. Ranking the
web frontier. In WWW ’04: Proceedings of the 13th
international conference on World Wide Web, pages
309–318, 2004.
[7] T. Haveliwala. Topic-specific pagerank. In 11th World
Wide Web, 2002.
[8] T. Hofmann. Probabilistic latent semantic indexing. In
Research and Development in Information Retrieval,
pages 50–57, 1999.
[9] J. Kleinberg. Authoritative sources in a hyperlinked
environment. Journal of the ACM, 46(5):604–632,
1999.
[10] F. Mathieu and M. Bouklit. The effect of the back
button in a random walk: application for pagerank. In
WWW Alt. ’04: Proceedings of the 13th international
World Wide Web conference on Alternate track papers
& posters, pages 370–371, 2004.
[11] M. Richardson and P. Domingos. The intelligent surfer:
Probabilistic combination of link and content information in pagerank. In NIPS*14, 2002.
[12] S. Ross. Introduction to Probability Models. Academic
Press, fourth edition, 1989.
[13] I. Ounis, V. Plachouras, and G. Amati. The static absorbing
model for hyperlink analysis on the web. Journal of
Web Engineering, 4(2):165–186, 2005.
1 http://www.openid.net/
2 Search for Mindset at Yahoo Research.
WIRE: an Open Source Web Information Retrieval Environment
Carlos Castillo
Center for Web Research, Universidad de Chile
[email protected]
Ricardo Baeza-Yates
Center for Web Research, Universidad de Chile
[email protected]
Abstract
In this paper, we describe the WIRE (Web Information
Retrieval Environment) project and focus on some details
of its crawler component. The WIRE crawler is a scalable, highly configurable, high performance, open-source
Web crawler which we have used to study the characteristics of large Web collections.
1. Introduction
At the Center for Web Research (http://www.cwr.cl/) we are developing a software suite for research in Web Information Retrieval,
which we have called WIRE (Web Information Retrieval
Environment). Our aim is to study the problems of Web
search by creating an efficient search engine. Search
engines play a key role on the Web, as searching currently
generates more than 13% of the traffic to Web sites [1].
Furthermore, 40% of the users arriving to a Web site for the
first time clicked a link from a search engine’s results [14].
The WIRE software suite generated several sub-projects,
including some of the modules depicted in Figure 1. So
far, we have developed an efficient general-purpose Web
crawler [6], a format for storing the Web collection, a tool
for extracting statistics from the collection and generating
reports and a search engine based on SWISH-E using PageRank with non-uniform normalization [3].
In some sense, our system is aimed at a specific segment:
our objective was to use it to download and analyze collections having in the order of 10^6–10^7 documents. This is
bigger than most Web sites, but smaller than the complete
Web, so we worked mostly with national domains (ccTLDs:
country-code top level domains such as .cl or .gr). The
main characteristics of the WIRE crawler are:
High-performance and scalability: It is implemented
using about 25,000 lines of C/C++ code and designed to
work with large volumes of documents and to handle up to
a thousand HTTP requests simultaneously. The current implementation would require further work to scale to billions of documents (e.g.: process some data structures on disk instead of in main memory). Currently, the crawler is parallelizable, but unlike [8], it has a central point of control.

Figure 1. Some of the possible sub-projects of WIRE, highlighting the completed parts.
Configurable and open-source: Most of the parameters for crawling and indexing can be configured, including
several scheduling policies. Also, all the programs and the
code are freely available under the GPL license.
The details about commercial search engines are usually kept as business secrets, but there are a few
examples of open-source Web crawlers, for instance
Nutch http://lucene.apache.org/nutch/. Our
system is designed to focus more on evaluating page quality, using different crawling strategies, and generating data
for Web characterization studies. Due to space limitations, in this paper we describe only the crawler in some detail. Source code and documentation are available at
http://www.cwr.cl/projects/WIRE/.
The rest of this paper is organized as follows: section 2
details the main programs of the crawler and section 3 how
statistics are obtained. The last section presents our conclusions.
2 Web crawler
In this section, we present the four main programs that are run in cycles during the crawler's execution: manager, harvester, gatherer and seeder, as shown in Figure 2.

Figure 2. Modules of the crawler.

2.1 Manager: long-term scheduling

The "manager" program generates the list of K URLs to be downloaded in this cycle (we used K = 100,000 pages by default). The procedure for generating this list is based on maximizing the "profit" of downloading a page [7]. The current value of a page depends on an estimation of its intrinsic quality, and an estimation of the probability that it has changed since it was crawled.

The process for selecting the pages to be crawled next includes (1) filtering out pages that were downloaded too recently, (2) estimating the quality of Web pages, (3) estimating the freshness of Web pages and (4) calculating the profit for downloading each page. This balances the process of downloading new pages and updating the already-downloaded ones. For example, in Figure 3, the behavior of the manager for K = 2 is depicted. In the figure, it should select pages P1 and P3 for this cycle as they give the highest profit.

Figure 3. Operation of the manager program.

2.2 Harvester: short-term scheduling

The "harvester" program receives a list of K URLs and attempts to download them from the Web. The politeness policy chosen is to never open more than one simultaneous connection to a Web site, and to wait a configurable number of seconds between accesses (default 15). For the larger Web sites, over a certain quantity of pages (default 100), the waiting time is reduced (to a default of 5 seconds).

As shown in Figure 4, the harvester creates a queue for each Web site and opens one connection to each active Web site (sites 2, 4, and 6). Some Web sites are "idle", because they have transferred pages too recently (sites 1, 5, and 7) or because they have exhausted all of their pages for this batch (3). This is implemented using a priority queue in which Web sites are inserted according to a time-stamp for their next visit.

Figure 4. Operation of the harvester program.

Our first implementation used Linux threads and did blocking I/O on each thread. It worked well, but was not able to go over 500 threads even on PCs with 1 GHz processors and 1 GB of RAM. It seems that the entire thread system was designed for only a few threads at the same time, not for higher degrees of parallelization. Our current implementation uses a single thread with non-blocking I/O over an array of sockets. The poll() system call is used to check for activity in the sockets. This is much harder to implement than the multi-threaded version, as in practical terms it involves programming context switches explicitly, but the performance is much better, allowing us to download from over 1000 Web sites at the same time with a very lightweight process.
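A minimal sketch of the per-site politeness scheduling described in Section 2.2 (one queue per site, plus a priority queue keyed by the next allowed visit time); the class, the parameter names and the "large site" test are our illustration, not WIRE's C/C++ code:

```python
import heapq
import time
from collections import deque

class PoliteScheduler:
    """One queue of pending URLs per site; sites sit in a priority queue
    ordered by the earliest time they may be contacted again."""

    def __init__(self, wait_small=15, wait_large=5, large_threshold=100):
        self.wait_small = wait_small        # seconds between accesses (default 15)
        self.wait_large = wait_large        # reduced wait for large sites (default 5)
        self.large_threshold = large_threshold
        self.queues = {}                    # site -> deque of pending URLs
        self.heap = []                      # (next_visit_time, site)

    def add(self, site, url):
        if site not in self.queues:
            self.queues[site] = deque()
            heapq.heappush(self.heap, (time.time(), site))
        self.queues[site].append(url)

    def next_url(self):
        """Return (site, url) for the next allowed download, or None if every
        site is still idle or has exhausted its pages for this batch."""
        now = time.time()
        while self.heap and self.heap[0][0] <= now:
            _, site = heapq.heappop(self.heap)
            if self.queues[site]:
                url = self.queues[site].popleft()
                # approximate the "large site" rule by the number of pending pages
                wait = self.wait_large if len(self.queues[site]) > self.large_threshold else self.wait_small
                heapq.heappush(self.heap, (now + wait, site))
                return site, url
            # site exhausted for this batch: simply drop it from the heap
        return None
```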
2.3 Gatherer: parsing of pages
The “gatherer” program receives the raw Web pages
downloaded by the harvester and parses them. In the current implementation, only HTML and plain text pages are
accepted by the harvester.
The parsing of HTML pages is done using an events-oriented parser. An events-oriented parser (such as SAX [12] for XML) does not build a structured representation of the documents: it just generates function calls whenever certain conditions are met. We found that a substantial
amount of pages were not well-formed (e.g.: tags were not
balanced), so the parser must be very tolerant to malformed
markup.
The contents of Web pages are stored in variable-sized
records indexed by document-id. Insertions and deletions
are handled using a free-space list with first-fit allocation.
This data structure also implements duplicate detection:
whenever a new document is stored, a hash function of its
contents is calculated. If there is another document with the
same hash function and length, the contents of the documents are compared. If they are equal, the document-id of
the original document is returned, and the new document is
marked as a duplicate.
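The duplicate check can be sketched as follows; the in-memory dictionaries here stand in for WIRE's on-disk records and are not the actual implementation:

```python
import hashlib

class ContentStore:
    """Keeps, for each (hash, length) pair, the doc-ids already stored, so a
    new document can be detected as a duplicate before writing it."""

    def __init__(self):
        self.by_signature = {}   # (sha1, length) -> list of doc-ids
        self.contents = {}       # doc-id -> raw bytes (stand-in for the disk records)
        self.next_id = 0

    def store(self, raw):
        sig = (hashlib.sha1(raw).hexdigest(), len(raw))
        for doc_id in self.by_signature.get(sig, []):
            if self.contents[doc_id] == raw:    # same hash and length: compare contents
                return doc_id, True             # duplicate: return the original doc-id
        doc_id = self.next_id
        self.next_id += 1
        self.contents[doc_id] = raw
        self.by_signature.setdefault(sig, []).append(doc_id)
        return doc_id, False

store = ContentStore()
print(store.store(b"<html>hello</html>"))   # (0, False)
print(store.store(b"<html>hello</html>"))   # (0, True): detected as a duplicate
```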
Figure 5. Storing the contents of a document.

The process for storing a document, given its contents and document-id, is depicted in Figure 5. For storing a document, the crawler has to check first if the document is a duplicate, then search for a place in the free-space list, and then write the document to disk. This module requires support to create large files, as for large collections the disk storage grows over 2GB, and the offset cannot be provided in a variable of type "long". In Linux, the LFS standard [10] provides offsets of type "long long" that are used for disk I/O operations. The usage of continuous, large files for millions of pages, instead of small files, can save a lot of disk seeks, as noted also by Patterson [16].

2.4 Seeder: URL resolver

The "seeder" program receives a list of URLs found by the gatherer, and adds some of them to the collection, according to criteria given in the configuration file. These criteria include patterns for accepting, rejecting, and transforming URLs.

Patterns for accepting URLs include domain name and file name patterns. The domain name patterns are given as suffixes (e.g.: .cl, .uchile.cl, etc.) and the file name patterns are given as file extensions. Patterns for rejecting URLs include substrings that appear on the parameters of known Web applications (e.g. login, logout, register, etc.) that lead to URLs which are not relevant for a search engine. Finally, to avoid duplicates from session ids, patterns for transforming the URLs are used to remove known session-id variables such as PHPSESSID from the URLs.

The structure that holds the URLs is highly optimized for the most common operations during the crawling process: given the name of a Web site, obtain a site-id; given the site-id of a Web site and a local link, obtain a document-id; and given a full URL, obtain both its site-id and document-id. The process for converting a full URL is shown in Figure 6. This process is optimized to exploit the locality of Web links, as most of the links found in a page point to other pages co-located in the same Web site. For this, the implementation uses two hash tables: the first for converting Web site names into site-ids, and the second for converting "site-id + path name" to a doc-id.

Figure 6. For checking a URL: (1) the host name is searched in the hash table of Web site names. The resulting site-id (2) is concatenated with the path and filename (3) to obtain a doc-id (4).
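For illustration, the two-table lookup of Figure 6 could look like this (plain Python dictionaries standing in for WIRE's hash tables):

```python
class UrlIndex:
    """First table: site name -> site-id.  Second table: (site-id, path) -> doc-id."""

    def __init__(self):
        self.sites = {}      # "www.example.cl" -> site-id
        self.docs = {}       # (site-id, "/path/page.html") -> doc-id

    def site_id(self, host):
        return self.sites.setdefault(host, len(self.sites))

    def doc_id(self, host, path):
        key = (self.site_id(host), path)
        return self.docs.setdefault(key, len(self.docs))

    def resolve(self, url):
        """Full URL -> (site-id, doc-id), as in Figure 6."""
        host, _, path = url.partition("://")[2].partition("/")
        return self.site_id(host), self.doc_id(host, "/" + path)

index = UrlIndex()
print(index.resolve("http://www.example.cl/papers/index.html"))   # (0, 0)
print(index.resolve("http://www.example.cl/papers/wire.html"))    # (0, 1)
```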
3 Obtaining statistics
To run the crawler on a large collection, the user must
specify the site suffix(es) that will be crawled (e.g.: .kr
or .upf.edu), and has to provide a starting list of “seed”
URLs. Also, the crawling limits have to be provided, including the maximum number of pages per site (the default
is 25,000) and the maximum exploration depth (default is 5
levels for dynamic pages and 15 for static pages).
There are several configurable parameters, including the
amount of time the crawler waits between accesses to a
Web site –that can be fine-tuned by distinguishing between “large” and “small” sites– the number of simultaneous downloads, the timeout for downloading pages, among
many others. On a standard PC with a 1 GHz Intel 4 processor and 1 GB of RAM, using standard IDE disks, we usually
download and parse about 2 million pages per day.
WIRE stores as much metadata as possible about Web
pages and Web sites during the crawl, and includes several
tools for extracting this data and for obtaining statistics. The
analysis includes running link analysis algorithms such as
Pagerank [15] and HITS [11], aggregating this information
by documents and sites, and generating histograms for almost every property that is stored by the system. It also
includes a module for detecting the language of a document
based on a dictionary of stopwords in several languages that
is included with WIRE.
The process for generating reports includes the analysis of the data, its extraction, the generation of gnuplot
scripts for plotting, and the compilation of automated reports using LaTeX. The generated reports include: distribution of language, histograms of in- and out-degree, link
scores, page depth, HTTP response codes, age (including
per-site average, minimum and maximum), summations of
link scores per site, histogram of pages per site and bytes
per site, an analysis by components in the Web structure [5],
the distribution of links to multimedia files, and of links to
domains that are outside the delimited working set for the
crawler.
4 Conclusions
So far, we have used WIRE to study large Web collections including the national domains of Brazil [13],
Chile [2], Greece [9] and South Korea [4]. We are currently
developing a module for supporting multiple text encodings
including Unicode.
While downloading a few thousand pages from a bunch of Web sites is relatively easy, building a Web crawler that
has to deal with millions of pages and also with misconfigured Web servers and bad HTML coding requires solving a
lot of technical problems. The source code and the documentation of WIRE, including step-by-step instructions for
running a Web crawl and analysing the results, are available
at http://www.cwr.cl/projects/WIRE/doc/.
References
[1] Search Engine Referrals Nearly Double Worldwide.
http://websidestory.com/pressroom/pressreleases.html?id=181,
2003.
[2] R. Baeza-Yates and C. Castillo. Características de la Web
Chilena 2004. Technical report, Center for Web Research,
University of Chile, 2005.
[3] R. Baeza-Yates and E. Davis. Web page ranking using link
attributes. In Alternate track papers & posters of the 13th
international conference on World Wide Web, pages 328–
329, New York, NY, USA, 2004. ACM Press.
[4] R. Baeza-Yates and F. Lalanne. Characteristics of the Korean Web. Technical report, Korea–Chile IT Cooperation
Center ITCC, 2004.
[5] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web: Experiments and models. In Proceedings of
the Ninth Conference on World Wide Web, pages 309–320,
Amsterdam, Netherlands, May 2000. ACM Press.
[6] C. Castillo. Effective Web Crawling. PhD thesis, University
of Chile, 2004.
[7] C. Castillo and R. Baeza-Yates. A new crawling model.
In Poster proceedings of the eleventh conference on World
Wide Web, Honolulu, Hawaii, USA, 2002.
[8] J. Cho and H. Garcia-Molina. Parallel crawlers. In Proceedings of the eleventh international conference on World Wide
Web, pages 124–135, Honolulu, Hawaii, USA, 2002. ACM
Press.
[9] E. Efthimiadis and C. Castillo. Charting the Greek Web. In
Proceedings of the Conference of the American Society for
Information Science and Technology (ASIST), Providence,
Rhode Island, USA, November 2004. American Society for
Information Science and Technology.
[10] A. Jaeger. Large File Support in Linux. http://www.suse.de/aj/linux lfs.html, 2004.
[11] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[12] D. Megginson.
Simple API for XML (SAX 2.0).
http://sax.sourceforge.net/, 2004.
[13] M. Modesto, Á. Pereira, N. Ziviani, C. Castillo, and
R. Baeza-Yates. Un novo retrato da Web Brasileira. In Proceedings of SEMISH, São Leopoldo, Brazil, 2005.
[14] J. Nielsen. Statistics for Traffic Referred by Search Engines
and Navigation Directories to Useit. http://www.useit.com/about/searchreferrals.html, 2003.
[15] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the Web. Technical
report, Stanford Digital Library Technologies Project, 1998.
[16] A. Patterson. Why writing your own search engine is hard.
ACM Queue, April 2004.
Nutch: an Open-Source Platform for Web Search
Doug Cutting
Internet Archive
[email protected]
Abstract
Nutch is an open-source project providing both
complete Web search software and a platform for the
development of novel Web search methods. Nutch is built
on a distributed storage and computing foundation, such
that every operation scales to very large collections. Core
algorithms crawl, parse and index Web-based data.
Plugins extend functionality at various points, including
network protocols, document formats, indexing schemas
and query operators.
1. Introduction
Nutch is an open-source project hosted by the Apache
Software Foundation [1]. Nutch provides a complete,
high-quality Web search system, as well as a flexible,
scalable platform for the development of novel Web search
engines. Nutch includes:
• a Web crawler;
• parsers for Web content;
• a link-graph builder;
• schemas for indexing and search;
• distributed operation, for high scalability;
• an extensible, plugin-based architecture.
Nutch is implemented in Java and thus runs on many
operating systems and a wide variety of hardware.
2. Architecture
Nutch has a set of core interfaces implemented by plugins. Plugins implement such things as network protocols,
document formats and indexing schemas. Generic algorithms combine the plugins to create a complete system.
These algorithms are implemented on a distributed computing platform, making the entire system extremely scalable.
3. Distributed Operation
Distributed operation is built in two layers: storage and computation.
3.1 Nutch Distributed File System (NDFS)
Storage is provided by the Nutch Distributed File System (NDFS), which is modeled after the Google File
System [2] (GFS). NDFS provides reliable storage across
a network of PCs. Files are stored as a sequence of blocks.
Each block is replicated on multiple hosts. Replication and
fail-over are handled automatically, providing applications
with an easy-to-manage, efficient file system that scales to
multi-petabyte installations.
For small deployments, without large storage requirements, Nutch is easily configured to simply use a local
hard drive for all storage, in place of NDFS.
3.2 MapReduce
MapReduce is Nutch's distributed computing layer,
again inspired by Google [3]. MapReduce, as its name implies, is a two-step operation, map followed by reduce. Input and output data are files containing sequences of key-value pairs.
During the map step, input data is split into contiguous
chunks that are processed on separate nodes. A user-supplied map function is applied to each datum, producing an
intermediate data set.
Each intermediate datum is then sent to a reduce node,
based on a user-supplied partition function. Partitioning is
typically a hash function, so that all equivalently keyed intermediate data are all sent to a single reduce node. For example, if a map function outputs URL-keyed data, then partitioning by URL hash code sends intermediate data associated with a given URL to a single reduce node.
Reduce nodes sort all their input data, then apply a
user-supplied reduce function to this sorted map output,
producing the final output for the MapReduce operation.
All entries with a given key are passed to the reduce function at once. Thus, with URL-keyed data, all data associated with a URL is passed to the reduce function and may be
used to generate the final output.
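For illustration only, a toy single-process rendition of this map/partition/reduce flow, using URL-keyed data as in the example (Nutch's real implementation is Java code running over NDFS; the helper below is not part of Nutch):

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn, n_partitions=4):
    """Minimal single-process MapReduce: map each record to (key, value)
    pairs, partition by hash(key), then reduce all values of each key."""
    partitions = [defaultdict(list) for _ in range(n_partitions)]
    for record in records:
        for key, value in map_fn(record):
            partitions[hash(key) % n_partitions][key].append(value)   # partition by key hash
    output = {}
    for part in partitions:             # each partition would run on its own reduce node
        for key in sorted(part):        # reduce nodes sort their input by key
            output[key] = reduce_fn(key, part[key])
    return output

# example: URL-keyed data -- count the links pointing to each URL
links = [("pageA", "http://x/1"), ("pageB", "http://x/1"), ("pageB", "http://x/2")]
counts = map_reduce(links,
                    map_fn=lambda rec: [(rec[1], 1)],
                    reduce_fn=lambda url, values: sum(values))
print(counts)   # {'http://x/1': 2, 'http://x/2': 1}
```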
The MapReduce system is robust in the face of machine failures and application errors. Thus one may reliably run long-lived applications on tens, hundreds or even
thousands of machines in parallel.
A single-threaded, in-process implementation of
MapReduce is also provided. This is useful not just for debugging, but also to simplify small, single-machine installations of Nutch.
4. Plugins
An overview of the primary plugin interfaces is provided below.
4.1 URL Normalizers and Filters
These are called on each URL as it enters the system.
A URL normalizer transforms URLs to a standard form.
Basic implementations perform operations such as lowercasing protocol names (since these are case-independent)
and removing default port numbers (e.g., port 80 from
HTTP URLs). If an application has more knowledge of
particular URLs, then it can easily implement things such
as removal of session ids within a URL normalizer.
URL filters are used to determine whether a URL is
permitted to enter Nutch. One may, for example, wish to exclude URLs with query parameters, since these are likely to be dynamically generated content. Or one may use a
URL filter to restrict crawling to particular domains, to implement an intranet or vertical search engine.
Nutch provides regular-expression based implementations of both URL normalizer and URL filter. Thus most
applications need only modify a configuration file containing regular expressions in order to alter URL normalization
and filtering. However, if, e.g., an application needs to
consult an external database in order to process URLs, that
may easily be implemented as a plugin.
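As an illustration of the kind of regular-expression rules involved, a small Python sketch follows; the patterns are invented examples, not Nutch's shipped configuration:

```python
import re

NORMALIZERS = [
    (re.compile(r"^HTTP://", re.IGNORECASE), "http://"),    # lowercase the protocol name
    (re.compile(r":80/"), "/"),                              # drop the default HTTP port
    (re.compile(r"([;&?])PHPSESSID=\w+"), r"\1"),            # strip a session-id parameter
    (re.compile(r"\?$"), ""),                                # drop an empty query string
]
FILTERS = [
    re.compile(r"\?.*="),             # exclude URLs with query parameters
    re.compile(r"\.(gif|jpg|zip)$"),  # exclude some non-textual content
]

def normalize(url):
    for pattern, replacement in NORMALIZERS:
        url = pattern.sub(replacement, url)
    return url

def accept(url):
    return not any(f.search(url) for f in FILTERS)

url = normalize("HTTP://example.com:80/page.php?PHPSESSID=abc123")
print(url, accept(url))   # http://example.com/page.php True
```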
4.2 Protocol Plugins
A protocol plugin is invoked to retrieve the content of
URLs with a given scheme, e.g., HTTP, FTP, FILE, etc. A
protocol implementation, given a URL, returns the raw, binary content of that URL, along with metadata (e.g., protocol headers).
4.3 Parser Plugins
Parser plugins, given the output of a protocol plugin
(raw content and metadata), extract text, links and metadata
(author, title, etc.). Links are represented as a pair of
strings: the URL that is linked to; and the “anchor” text of
the link.
Nutch includes parsers for formats such as HTML,
PDF, Word, RTF, etc. Since Web content is frequently
malformed, robust parsers are required. Nutch currently
uses the NekoHTML [4] parser for HTML, which can successfully parse most pages, even those with mismatched
tags, those which are truncated, etc.
The HTML parser also produces an XML DOM parse
tree of each page's content. Plugins may be specified to
process this parse tree. For example, a Creative Commons
plugin scans this parse tree for Creative Commons license
RDF embedded within the HTML. If found, the license
characteristics are added to the metadata for the parse so
that they may subsequently be indexed and searched.
4.4 Indexing and Query Plugins
Nutch uses Lucene for indexing and search. When indexing, each parsed page (along with a list of incoming
links, etc.) is passed to a sequence of indexing plugins in
order to generate a Lucene document to be indexed. Thus
plugins determine the schema used; which fields are indexed and how they are indexed. By default, the content,
URL and incoming anchor texts are indexed, but one may
enable other plugins to index such things as date modified,
content-type, language, etc.
Queries in Nutch are parsed into an abstract syntax
tree, then passed to a sequence of query plugins, in order to
generate the Lucene query that is executed. The default indexing plugin generates queries that search the content,
URL and anchor fields. Other plugins permit field-specific
search, e.g., searching within the URL only, date-range
searching, restricting results to particular document types
and/or languages, etc.
5. Algorithms
Generic algorithms are implemented in terms of the
plugins outlined above, in order to perform user-level tasks
such as crawling, indexing etc. Each algorithm, except
search, is implemented as one or more MapReduce operations. All persistent data may be stored in NDFS for completely distributed operation.
5.1 Crawling
The crawling state is kept in a data structure called the
crawldb. It consists of a mapping from URLs to a CrawlDatum record. Each CrawlDatum contains a date to next
fetch the URL, the status of the URL (fetched, unfetched,
gone, etc.), the number of links found to this URL, etc.
The crawldb is bootstrapped by inserting a few root URLs.
The Nutch crawler then operates in a cycle:
1. generate URLs to fetch from crawldb;
2. fetch these URLs;
3. parse the fetched content;
4. update crawldb with results of fetch and new URLs
found when parsing.
These steps are repeated. Each step is described in
more detail below.
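A compact sketch of this cycle over an in-memory crawldb (the field and function names are illustrative; Nutch keeps the crawldb as MapReduce-processed files, not Python objects):

```python
import time
from dataclasses import dataclass

@dataclass
class CrawlDatum:
    status: str = "unfetched"     # unfetched, fetched, gone, ...
    next_fetch: float = 0.0       # earliest time this URL is due again
    link_count: int = 0           # number of links found to this URL

def generate(crawldb, limit):
    """Pick URLs that are due, preferring the most-linked ones."""
    due = [u for u, d in crawldb.items()
           if d.status != "gone" and d.next_fetch <= time.time()]
    return sorted(due, key=lambda u: -crawldb[u].link_count)[:limit]

def update(crawldb, fetched, outlinks, refetch_interval=30 * 24 * 3600):
    """Merge fetch results and newly discovered URLs into the crawldb."""
    for url in fetched:
        crawldb[url].status = "fetched"
        crawldb[url].next_fetch = time.time() + refetch_interval
    for url in outlinks:
        crawldb.setdefault(url, CrawlDatum()).link_count += 1

# bootstrap with a root URL, then run generate -> fetch -> parse -> update
crawldb = {"http://example.org/": CrawlDatum()}
batch = update(crawldb, fetched=generate(crawldb, limit=10),
               outlinks=["http://example.org/a", "http://example.org/b"])
```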
5.1.1 Generate
URLs are generated which are due to be fetched (status
is not 'gone' and next fetch date is before now). This set of
URLs may further be limited so that only the top most
linked pages are requested, and so that only a limited number of URLs per host are generated.
5.1.2 Fetch
The fetcher is a multi-threaded application that employs protocol plugins to retrieve the content of a set of
URLs.
5.1.3 Parse
Parser plugins are employed to extract text, links and other metadata from the raw binary content.
5.1.4 Update
The status of each URL fetched along with the list of
linked URLs discovered while parsing are merged with the
previous version of the crawldb to generate a new version.
URLs which were successfully fetched are marked as such,
incoming link counts are updated, and new URLs to fetch
are inserted.
5.2 Link Inversion
All of the parser link outputs are processed in a single
MapReduce operation to generate, for each URL, the set of
incoming anchor texts. Associating incoming anchor text
with URLs has been demonstrated to dramatically increase
the quality of search results. [5]
5.3 Indexing
A MapReduce operation is used to combine all information known about each URL: page text, incoming anchor text, title, metadata, etc. This data is passed to the indexing plugins to create a Lucene document that is then
added to a Lucene index.
5.4 Search
Nutch implements a distributed search system, but, unlike other algorithms, search does not use MapReduce.
Separate indexes are constructed for partitions of the collection. Indexes are deployed to search nodes. Each query
is broadcast to all search nodes. The top-scoring results
over all indexes are presented to the user.
6. Status
Nutch has an active set of users and developers. Many
sites are using Nutch today, for both intranet and vertical
search applications, scaling to tens of millions of pages. [6]
Nutch's search quality rivals that of commercial alternatives [7] at considerably lower costs. [8] Soon we hope
that Nutch's public deployments will include multi-billion
page search engines.
The MapReduce-based version of Nutch described
here is under active development. In the course of Summer
2005 we expect to index a billion-page collection using
Nutch at the Internet Archive.
7. Acknowledgments
The author wishes to thank The Internet Archive, Yahoo!, Michael Cafarella and all who contribute to Nutch.
8. References
[1] http://lucene.apache.org/nutch/
[2] Ghemawat, Gobioff, and Leung, The Google File
System, 19th ACM Symposium on Operating Systems
Principles, Lake George, NY, October, 2003, http://labs.google.com/papers/gfs.html
[3] Dean and Ghemawat, MapReduce: Simplified
Data Processing on Large Clusters, OSDI'04: Sixth Symposium on Operating System Design and Implementation,
San Francisco, CA, December, 2004, http://labs.google.com/papers/mapreduce.html
[4] Clark, CyberNeko HTML Parser, http://people.apache.org/~andyc/neko/doc/html/index.html
[5] Craswell, Hawking, and Robertson, Effective site finding using link anchor information, Proceedings of ACM SIGIR 2001, http://research.microsoft.com/users/nickcr/pubs/craswell_sigir01.pdf
[6] http://wiki.apache.org/nutch/PublicServers
[7] http://www.nutch.org/twiki/Main/Evaluations/OSU_Queries.pdf
[8] http://osuosl.org/news_folder/nutch
Towards Contextual and Structural Relevance Feedback in XML Retrieval
Lobna Hlaoua and Mohand Boughanem
IRIT-SIG, 118 route de Narbonne 31068 Toulouse Cedex 4, France
{hlaoua,bougha}@irit.fr
Abstract
XML Retrieval is a process whose objective is to give the most exhaustive and specific information for a given query. Relevance Feedback in XML retrieval has recently been investigated; it consists in considering both content and structural information, extracted from elements judged relevant, in query reformulation. In this paper, we describe a
preliminary approach to select the most
expressive keywords and the most appropriate
generative structure to be added to the user
query.
1. Introduction
Relevance Feedback (RF) is an interactive and evaluative process. It usually consists in enriching an
initial request by adding terms extracted from documents
judged as relevant by the user.
Recently, new standards of document representation have appeared, in particular XML (eXtensible Markup Language), developed by the W3C [10]. By exploring the characteristics of this new standard, traditional Information Retrieval (IR), which treats a document as a single atomic unit, has been extended to better manage this kind of document. Indeed, due to the structure of XML documents, XML retrieval approaches try to select the most relevant part, represented by an XML element, instead of the whole document. As a consequence, XML retrieval systems offer two types of query expression: the CO (Content Only) query, where the user expresses his needs with simple keywords, and the CAS (Content And Structure) query, where the user can also add structural constraints.
Due to the structure of XML documents, the traditional RF task becomes more complicated. Indeed, RF in traditional IR consists in adding the most expressive keywords extracted from the relevant documents. In XML retrieval the situation is quite different. The two main questions are:
- first, how to extract the best terms from elements that have different roles (semantics);
- second, how to select the best generative structure that can be added to the query.
In this paper we will present a preliminary work on how
one can incorporate the content and the structural
information when reformulating the user query. We first
give a brief related works in RF and XML retrieval then
we present our approach in section 3. The proposed treats
the content and the structure separately. In the last section
we will describe how we will evaluate our approaches in
the framework of INEX.
2. Previous Works
In traditional Information Retrieval, RF consists of reformulating the original query according to the user's judgment, or automatically (so-called blind RF). It has been applied in different IR models: the vector space model, as presented by Rocchio [7]; Tamine [9] has defined RF in a connexionist model; Croft and Haines [1] described RF in an alternative probabilistic model. In XML retrieval the number of RF works is still small. Most works are presented within the framework of the INEX [2] campaign (Initiative for the Evaluation of XML retrieval).
The working group of V. Mihajlovic and G. Ramirez [6] has proposed a reformulation strategy applied to the TIJAH [3] model. The latter has the same architecture as a database system: the model is composed of three levels, conceptual, logical and physical. At the conceptual level, the authors have adopted the query language Narrowed Extended XPath (NEXI) proposed by INEX in 2003. The logical level is based on the "score area algebra", in which documents are regarded as a sequence of tokens. At the physical level, the MonetDB system is applied to calculate similarity, based on three measures: tf (term frequency), cf (collection frequency) and lp (length prior). Reformulation is carried out in two stages: the first consists in extracting from the documents the most relevant elements. This information represents the journal in which the most relevant element is found, the tag of the element and the size which one wishes to find.
Another proposal for reformulation was presented by the IBM group [4]. This proposal adapted the Rocchio [7] algorithm to the vector model [5], whose vectors are composed of sub-vectors, each one representing a level of granularity. They applied the Lexical Affinity method for separating the relevant documents from the non-relevant ones.
3. Relevance Feedback in XML documents

Up until now, in Information Retrieval, simple keywords have been applied in query expansion. But XML retrieval offers the opportunity to express the user's needs with structural information. The main goal of this preliminary work is to present our investigation of CO and CAS queries. More precisely, we discuss how one can introduce structural constraints in the CO query and how one can correct the structural constraints in the CAS query. These two issues are described separately in the following subsections.

3.1. Contextual relevance feedback

According to the previous works in traditional Information Retrieval, we have noticed that the most appropriate method to expand a query in the vector space model is to add weighted keywords that represent the most relevant documents and to reject those that express the irrelevant documents. This method is represented by the formula of Rocchio [7]. In the same way, we no longer use the keywords of whole documents but those of the various components of these documents.

Our approach is expressed in the following formula:

$$Q' = Q + \sum_{i=1}^{n_p} C_{p_i} - \sum_{j=1}^{n_{np}} C_{np_j}$$

with:
Q: vector of the initial request,
Q': vector of the new request,
$C_p$ (resp. $C_{np}$): vector of a relevant (resp. non-relevant) component,
$n_p$ (resp. $n_{np}$): number of components considered relevant (resp. non-relevant).

To apply this method, we have to select the most important keywords: it is clear that if we added all the keywords representing the elements, the resulting set would be enormous and we would have various concepts that can bring noise to the retrieval result. For this reason, we give a higher weight to the keywords that are repeated in more than one element. The keyword weight is proportional to the number of appearances in the elements judged relevant.

The CO query represents a simple application of contextual RF, but for a CAS query we have to add the keywords to (or reject them from) the most generative structure. We explain in the following how to restitute the most generative structure.

3.2. Structural Relevance Feedback

We have seen that contextual RF is based on adding keywords, but in structural RF it is not possible to simply add a structure. Thus we are obliged to look for an appropriate method to restitute the generative structure that can help the user to get an improvement in retrieval. Our goal is to define and to restitute the appropriate generative structure. We have to notice that the appropriate generative structure should not be the most generative one, because the latter represents the totality of the documents. That is why we have to define the smallest common ancestor (sca).

If we consider the following example, we notice that an XML document is represented as a tree in which the root is the totality of the document. The nodes represent the different elements of various granularities and the leaf nodes are the textual information.

Figure 1: Example of XML document representation

Let us consider the tree structure T:
- Anc[n] is the set of the ancestors of node n; it is the set of the nodes which make up the path from the root to n.
- Des[n] is the set of the descendants of node n in T; it is the set of the nodes having n as an ancestor.
- Sca(m, n) is the smallest common ancestor of the nodes m and n; it is the first node common to the paths from m and from n towards the root.

If we assume that for a given query the IR system returns the nodes 13, 8 and 4, the task is to decide which structure can be introduced in the query: « book/chapter »
or
« book/chapter/section »
or
«book/chapter/section/para ».
We
notice
that
« book/chapter »is more generative but the criterion that
we have respect in IR is that the information must be
exhaustive1 and specific2. So, in our approach, we get
more advantage to the structure that is represented by a
big number of relevant elements and by considering
elements scores. The function that calculates the score `
SScore' of each structure candidate (i.e. which can be
injected into the request) is given in follow:
which can be candidates are presented in the following
table with their scores (α=0.8). We have chosen this
value arbitrarily that will be varied in the following
experiments.
We have to notice that if α smaller, we give advantage to
the more specific structure and if α is bigger, we give
advantage to the more generative structure.
/A/K/C/B/
/A/F/L/B/
/A/K/B/
/A/
/A/K/
0.5
0.2
0.35
0.58
0.6
Sscore= ∑in Si · αd
Table 1: Measurement of the candidate structure scores
With:
Si score of a relevant element having a joint base
with the structure candidate,
n a number of the relevant judged elements,
α a constant varying between 0 and 1,
d is the distance which separates the turned over
node, of the last node on the left of the structure
candidate.
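To make the scoring concrete, here is a small sketch (our own illustration, not the authors' code) that computes SScore for candidate structures from the relevant element paths and scores used in Example 2 below:

def sscore(candidate, relevant, alpha=0.8):
    """candidate: a path prefix such as ('A', 'K').
    relevant: list of (path, score) pairs for elements judged relevant."""
    total = 0.0
    for path, score in relevant:
        if path[:len(candidate)] == candidate:      # shares the candidate as a base
            d = len(path) - len(candidate)          # distance to the candidate's last node
            total += score * alpha ** d
    return total

relevant = [(("A", "K", "C", "B"), 0.5),
            (("A", "F", "L", "B"), 0.2),
            (("A", "K", "B"), 0.35)]
for cand in [("A",), ("A", "K")]:
    print("/".join(cand), round(sscore(cand, relevant), 2))  # A: 0.58, A/K: 0.6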
Example 1
Q is a CO (Content Only) query composed of the simple keywords "X, Y".
Suppose that three components are judged relevant, having respectively the structures « /A/B/C », « /A/B/F/L/P » and « /A/B/ » and various weights. The structure « /A/B » is the common factor of the three components, so the reformulated query Q' will be a query of type CAS (Content And Structure):
Q': /A/B[about(X, Y)].
Example 2
Q is a CAS query expressed in the query language of XFIRM [8]:
//A[about(..., X)]//ce:B[Y],
where A and B are tag names of XML document components and X and Y are keywords. This query seeks a sub-component B that contains the keyword Y and belongs to the descendants of an A that talks about X. Three components are considered relevant, with the structures « /A/K/C/B », « /A/F/L/B » and « /A/K/B/ », whose corresponding elements have the weights 0.5, 0.2 and 0.35 respectively.
We then apply the formula that calculates the score SScore of each candidate structure. The candidate structures are presented in Table 1 with their scores (α = 0.8); this value was chosen arbitrarily and will be varied in the coming experiments. For example:
SScore(/A/K/) = 0.5 · 0.8² + 0.35 · 0.8¹ = 0.6
SScore(/A/) = 0.5 · 0.8³ + 0.2 · 0.8³ + 0.35 · 0.8² = 0.58

Table 1: Measurement of the candidate structure scores

Structure    SScore
/A/K/C/B/    0.5
/A/F/L/B/    0.2
/A/K/B/      0.35
/A/          0.58
/A/K/        0.6
According to this table, the structure to be inserted is « /A/K/ ». To introduce it into the structured query we use the aggregation function already used for composed CAS queries in the XFIRM model [8]. Note that if the structure with the highest score is the same as the structure of the initial query, we instead use the second-ranked structure in the aggregation.
Suppose that N and M are two different elements. The node resulting from the aggregation of N and M and its relevance are represented by the pair (l, rl), where l is their nearest common ancestor and
rl = aggrand(rn, rm, dist(l,n), dist(l,m))
with:
aggrand(rn, rm, dist(l,n), dist(l,m)) = rn / dist(l,n) + rm / dist(l,m),
where dist(x, y) is the distance in depth that separates x and y, and ri is the relevance value of element i.
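A minimal sketch of this aggregation (our own illustration; the node depths and relevance values are hypothetical):

def aggr_and(r_n, r_m, dist_ln, dist_lm):
    """Relevance of the aggregated node l, computed from the relevances of n and m
    discounted by their depth distance to l."""
    return r_n / dist_ln + r_m / dist_lm

# Hypothetical example: n is 2 levels below l, m is 1 level below l.
print(aggr_and(0.6, 0.4, dist_ln=2, dist_lm=1))  # 0.7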
The final reformulated query is the result of the aggregated structure, whose content condition is the initial keywords augmented with the expansion given by contextual RF.
4. Experiments
The reformulation is applied within XFIRM, a flexible information retrieval model for the storage and querying of XML documents developed within our team. It is based on a data storage layer and a simple query language, allowing the user to formulate his need with simple keywords or, more precisely, by integrating structural constraints on the documents. The similarity measure is based on tf (term frequency) and ief (inverse element frequency).
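By analogy with tf·idf, a tf·ief weight can be sketched as follows (a simplified illustration under our own assumptions, not the exact XFIRM formula):

import math

def tf_ief(term, element_terms, all_elements):
    """element_terms: list of terms in the scored element.
    all_elements: list of term lists, one per indexed element."""
    tf = element_terms.count(term)
    n_elements = len(all_elements)
    df = sum(1 for e in all_elements if term in e)  # elements containing the term
    ief = math.log((n_elements + 1) / (df + 1))     # inverse element frequency, smoothed
    return tf * ief

elements = [["xml", "retrieval", "xml"], ["relevance", "feedback"], ["xml", "query"]]
print(tf_ief("xml", elements[0], elements))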
To evaluate our contribution, we rely on the INEX (Initiative for the Evaluation of XML Retrieval) campaign [2]. The purpose of this campaign is to evaluate XML retrieval systems by providing test collections of XML documents, evaluation procedures and a forum, and it allows the participating organizations to compare their results. The test collections for XML retrieval evaluation deal with elements of various granularities. The corpus is composed of papers from the IEEE Computer Society marked up in XML; it constitutes a collection of approximately 750 MB containing more than 13,000 articles published between 1995 and 2004, coming from 21 journals. An average article is composed of approximately 1,500 XML nodes. The evaluation is based on two criteria, exhaustiveness and specificity, whose degrees are decided by the participants' judgments.
We have implemented our approach and it will be evaluated at INEX 2005. The final results will be available in November 2005 and, since this is our first participation in the RF task, we do not yet have an official result.
5. Conclusion
In this paper we have presented our research work in XML retrieval. It represents a new approach to the Relevance Feedback task, in which we apply a new strategy of contextual query expansion and propose to reconstruct the appropriate generative structure in order to obtain the most exhaustive and specific information. In future work we will evaluate our approaches at INEX 2005.
6. References
[1] W. Croft and D. Harper. Using probabilistic models of information retrieval without relevance information. Journal of Documentation, 35(4): 285-295, 1979.
[2] INEX 2004 Workshop Pre-Proceedings. http://inex.is.informatik.uni-duisburg.de:2004/
[3] J. A. List, V. Mihajlovic, A. P. de Vries and G. Ramirez. The TIJAH XML-IR system at INEX 2003 (draft). Proceedings of the INEX 2003 Workshop: 102-109, 2003.
[4] Y. Mass and M. Mandelbrod. Relevance Feedback for XML Retrieval. INEX 2004 Workshop Pre-Proceedings: 154-157, 2004.
[5] Y. Mass, M. Mandelbrod, E. Amitay, Y. Maarek and A. Soffer. JuruXML - an XML retrieval system at INEX'02. Proceedings of the First Workshop of the INitiative for the Evaluation of XML Retrieval (INEX): 73-80, 2002.
[6] V. Mihajlovic, G. Ramirez, A. de Vries and D. Hiemstra. TIJAH at INEX 2004: Modeling Phrases and Relevance Feedback. INEX 2004 Workshop Pre-Proceedings: 141-148, 2004.
[7] J. J. Rocchio. Relevance feedback in information retrieval. The SMART Retrieval System - Experiments in Automatic Document Processing: 313-323, 1971.
[8] K. Sauvagnat, M. Boughanem and C. Chrisment. Searching XML documents using relevance propagation. SPIRE 04: 242-254, 2004.
[9] L. Tamine and M. Boughanem. Query Optimization Using An Improved Genetic Algorithm. CIKM 2000: 368-373, 2000.
[10] Extensible Markup Language (XML). http://www.w3.org/TR/1998/REC-xml-19980210.
An extension to the vector model for retrieving XML documents
Fabien LANIEL, Jean-Jacques GIRARDOT
École Nationale Supérieure des Mines de Saint-Étienne
158 Cours Fauriel
42023 Saint-Étienne CEDEX 2, FRANCE
Email: {laniel,girardot}@emse.fr
Abstract
The information retrieval community has worked extensively on the combination of content and structure for creating high-performing information retrieval systems. With the development of new standards like XML or DocBook, researchers have a growing base of data for building and testing such systems.
Many XML query engines have been proposed, but most of them do not include a ranking system because, among all the criteria that can be extracted from a document, it is not easy to know which ones cause a document to be more relevant than another.
This paper describes a reverse engineering method to determine which criteria are the best for optimizing system effectiveness.
1 Introduction
During the last twenty years, research in Information Retrieval (IR) has concentrated on two domains: flat documents, mainly textual documents, and structured data such as that managed by relational databases. With the advent of XML [8], a new standard for semi-structured data, and the very fast development of corpora of XML documents, new challenges are offered to the research community.
As a matter of fact XML, which offers a very versatile format for exchanging and storing information and data, can handle a large range of usages, from lightly structured textual documents to strongly typed and structured data. There is however a hidden flaw behind this versatility: while most applications know how to read and write XML documents, there exists no tool that can efficiently search large quantities of XML documents.
Actually, XML documents that are mainly textual with
little structured information (like the text of a novel) can
be easily handled as flat documents using the IR approach;
similarly, very structured documents (like the output of a
program) can be easily mapped to relations, represented in
a classical relational data-base, and queried with SQL. Between these extremes, documents that mix textual contents
with complex structures are not satisfactorily handled with
these approaches. These include most ”digital documents”,
such as literal text transcribed with TEI [7] or Shakespeare’s
plays [6], scientific documents represented in the DocBook
[1] format, and most semi-structured information, like those
used to constitute catalogs of industrial products, food, furniture, travels etc.
In this last case, we expect to use both contents and structures information of the document to reply efficiently to
a query. Many models and methods have been proposed
[3, 5, 9, 4] with many criteria (often chosen arbitrarily).
So, what criteria should we take into account? The number of appearances of terms? The proximity between them? The relative height between elements? Furthermore, is any criterion more important than the others, and in what proportion?
In this paper we present a reverse-engineering approach to try to answer these questions. We first present the context of this work, the INitiative for the Evaluation of XML Retrieval (INEX); next we describe the methodology we used; finally we show some results and discuss the approach.
2 Context
The INEX [2] test collection consists of a set of XML
documents, topics and relevance assessments. The documents are articles of the IEEE, which are quite structured
and the topics relate to the content only or the structure and
the content of the documents.
INEX has defined a query language: NEXI, which proposes an important operator for us: about(). For example
the NEXI query:
//article[about(.,java)]
//sec[about(.,implementing threads)]
represents the sections about implementing threads within articles that are about java in general. The about() operator is exactly an IR operator; in other words, the query //article[about(.,java)] can be processed with a classical flat IR model.
If we return to the first example, we can solve the first part with a classical model (//article[about(.,java)]), giving a global relevance for the document, and we can solve the second part too if we consider all the sections as separate documents. But it would be a mistake to overlook the fact that the sections are descendants of article, since this may have an impact on the ranking; i.e., if two sections have the same score but the articles which contain them have different global relevancies, it seems logical to rank the section in the more relevant article before the other.
3 Our Approach
It is clear that we need an expression of the relevance of a document that takes into account the contents of the individual elements of the document and the structure of the document itself. If we suppose that (as in flat document retrieval) we can express the relevance Rx of the textual part of any XML element Ex of the document, the relevance of the document is to be expressed as a function of these values Rx that reflects the structure of the document.
The relevance of the document to a specific request is therefore of the form R = FD−R(R1, ..., Ri, ..., Rn), where the function FD−R is specific to the document and the request itself.
In our very simple example, we could think that it is
pertinent to select only documents where both conditions
on article and section are satisfied. However, relevance is
a strange function, and documents that are not detected as
talking about java or that contain no section with the words
”implementing threads” can be judged as relevant by the
user.
Computing relevance is therefore not a matter of just
”anding” or ”oring” results, but rather a problem of finding
a convenient equation with appropriate coefficients.
Starting with a simple topic such as //article[about(., java)]//sec[about(., implementing threads)], where Rai and Rsi,j are the computed relevancies of article i and of each of its sections j, we can say that the relevance of a given section is a function of Rai and Rsi,j. Different models have been proposed in the past, combining Rai and Rsi,j with functions such as addition, multiplication, etc. We can note that "and" conditions are typically represented by a multiplication or a minimum. If we use these combinations, the generic equation corresponding to our model is:
Ri = α·Rai + β·Rsi,j + γ·Rai·Rsi,j + δ·min(Rai, Rsi,j)
The question is: how should we choose the coefficients?
Fortunately, INEX provides us not only with queries but also with assessments of these queries.
The idea presented here is to say that, for a specific topic, we can compute Rai and Rsi,j using some well-established evaluation method (such as the vector model) for each assessed document, and state that the result Ri is equal to the user-estimated relevance of the document¹. In our case this gives us a set of 2473 equations with 4 unknown quantities: this over-determined system can be solved with mathematical methods (a linear least squares method in our case), giving the values of α, β, γ and δ that minimize the system. For the chosen model and the specific query, we therefore discover the most appropriate values to represent the relevance of any unrated new document.
With the query and the assessment table:

article   section   user relevance
1         1         1/3
1         2         0
1         3         0
...       ...       ...
1         m1        1/3
2         1         1
2         2         0
...       ...       ...
i         j         k
...       ...       ...
n         mn        2/3
We create the system:
α·Ra1 + β·Rs1,1 + γ·Ra1·Rs1,1 + δ·min(Ra1, Rs1,1) = 1/3
α·Ra1 + β·Rs1,2 + γ·Ra1·Rs1,2 + δ·min(Ra1, Rs1,2) = 0
α·Ra1 + β·Rs1,3 + γ·Ra1·Rs1,3 + δ·min(Ra1, Rs1,3) = 0
...
α·Ra1 + β·Rs1,m1 + γ·Ra1·Rs1,m1 + δ·min(Ra1, Rs1,m1) = 1/3
α·Ra2 + β·Rs2,1 + γ·Ra2·Rs2,1 + δ·min(Ra2, Rs2,1) = 1
α·Ra2 + β·Rs2,2 + γ·Ra2·Rs2,2 + δ·min(Ra2, Rs2,2) = 0
...
α·Rai + β·Rsi,j + γ·Rai·Rsi,j + δ·min(Rai, Rsi,j) = k
...
α·Ran + β·Rsn,mn + γ·Ran·Rsn,mn + δ·min(Ran, Rsn,mn) = 2/3
4 Results
We used these three similar topics for testing this
method:
• Topic 128: (1623 Equations)
//article[about(., intelligent transport
systems)]//sec[about(., on-board route planning
navigation system for automobiles)]
• Topic 141: (2473 Equations)
//article[about(., java)]//sec[about(.,
implementing threads)]
• Topic 145: (2687 Equations)
//article[about(., information
retrieval)]//p[about(.,relevance feedback)]
1 For most INEX documents, relevancies have not been estimated. We
use only documents for which the relevance has been estimated; the corresponding values (0, 1, 2, and 3) are normalized to 0, 1/3, 2/3 and 1.
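Before looking at the results, here is a small sketch of the least-squares fitting step (our own illustration with made-up relevance values, not the authors' code):

import numpy as np

# Hypothetical computed relevancies (Ra_i, Rs_ij) and user assessments for a few sections.
ra = np.array([0.8, 0.8, 0.8, 0.2, 0.2])      # article relevance, repeated per section
rs = np.array([0.9, 0.1, 0.0, 0.5, 0.0])      # section relevance
user = np.array([1/3, 0.0, 0.0, 1/3, 0.0])    # normalized user assessments

# Each row is [Ra, Rs, Ra*Rs, min(Ra, Rs)]; the model is their weighted sum.
A = np.column_stack([ra, rs, ra * rs, np.minimum(ra, rs)])
coeffs, residuals, rank, _ = np.linalg.lstsq(A, user, rcond=None)
alpha, beta, gamma, delta = coeffs
print(alpha, beta, gamma, delta)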
By solving these three systems we obtain values for the unknown quantities:

Topic   α       β       γ        δ
128     0.051   0.085   -0.753   0.567
141     0.014   0.059   0.769    0.265
145     0.009   0.123   0.506    0.134
Now we can reintroduce these values into each system, compute the score of each answer, order the answers by their scores and draw the Precision-Recall graphs. Figure 1 shows the Precision-Recall graph for each topic.
Figure 1. Precision-Recall graphs for topics 128, 141 and 145, comparing the relevance of the section only with our method.

5 Conclusion and Perspective
What conclusions can we draw from these very first experiments? While we have chosen three similar requests, which look like they might be solved by "adding" two conditions, the experimental results only partially validate this hypothesis.
However, there are many aspects that impact the results and which are difficult to take into account:
• Clearly, the relevance assessment made by the user is rarely a strict interpretation of the NEXI formulation: a user can also make errors, incorrect judgments², etc.
• The function that we use to evaluate the relevance of a passage is quite simple, being based on the vector model. It does not take into account synonymy or homonymy between words, etc.
• The equation system that we obtain is usually ill-conditioned, and sometimes gives unstable results.
• Taking into account the profile of the user.
Many more experiments (including different evaluation functions for textual elements) clearly need to be conducted before firm conclusions can be drawn. However, we believe that the approach may lead to progress in many directions, including:
• discovering the best usages of structure for XML information retrieval;
• adapting relevance feedback systems.
More generally, we can expect such an approach to help in designing acceptable models for the "and" and "or" operations used in typical requests on the structure and contents of XML documents, therefore allowing us to build better information retrieval systems.
² Actually, when two INEX experts evaluate the same set of documents, they usually totally disagree about which are relevant and which are not.
References
[1] DocBook. http://www.docbook.org/.
[2] Initiative for the evaluation of xml retrieval. http://
inex.is.informatik.uni-duisburg.de/.
[3] G. Navarro. A Language for Queries on Structure and Contents of Textual Databases. PhD thesis, University of Chile,
1995.
[4] K. Sauvagnat, M. Boughanem, and C. Chrisment. Searching
XML documents using relevance propagation. SPIRE, 2004.
[5] T. Schlieder and H. Meuss. Querying and ranking XML documents. JASIST, 53(6):489–503, 2002.
[6] XML corpus of Shakespeare’s plays.
http://www.ibiblio.org/xml/examples/
shakespeare/.
[7] TEI Consortium. Text Encoding Initiative, 1987. http:
//www.tei-c.org/.
[8] World Wide Web Consortium (W3C). Extensible Markup
Language (XML), February 1998. http://www.w3.org/XML/.
[9] R. Wilkinson. Effective retrieval of structured documents. In
Research and Development in Information Retrieval, pages
311–317, 1994.
Do search engines understand Greek or user requests “sound Greek” to them?
Fotis Lazarinis
Department of Technology Education & Digital Systems
University of Piraeus
80 Karaoli & Dimitriou,185 34 Piraeus, Greece
[email protected]
Abstract
This paper presents the outcomes of initial Greek Web
searching experimentation. The effects of localization
support and standard Information Retrieval techniques
such as term normalization, stopword removal and simple
stemming are studied in international and local search
engines. Finally, evaluation points and conclusions are
discussed.
1. Introduction
The Web has rapidly gained popularity and has
become one of the most widely used services of the
Internet. Its friendly interface and its hypermedia features
attract a significant number of users. Finding information
that satisfies a particular user need is one of the most
common and important operations on the WWW. Data are dispersed across an immense number of locations, so the use of a search engine is necessary.
Although international search engines like Google and Yahoo are preferred over the local ones, as they employ better searching mechanisms and interfaces, they do not really value spoken languages other than English. Especially for languages like Greek, which has inflections and accent marks, the majority of the international search engines seem to have no internal (indexing) or external (interface) localization support. Thus users have to devise alternative ways to discover the desired information and to adapt themselves to the search engine's interface.
This paper reports the results of initial experimentation
in Greek Web searching. The effect of localization
support, upper or lower case queries, stopword removal
and simple stemming is studied and evaluation points are
presented. The conclusions could be readily adapted to
other spoken languages with similar characteristics to the
Greek language.
2. Experimentation and evaluation
Interface simplicity and adaptation is maybe the most
important issue which influences user satisfaction and
acceptance of Web sites and thus search engines [1, 2].
User acceptance factor is obviously increased when a
search engine changes the language and maybe its
appearance to satisfy its diversified user basis. This is
significant especially to novice users.
Stopword removal, stemming and capitalization or
more generally normalization of index and query terms
are amongst the oldest and most widely used IR
techniques [3]. All academic systems support them.
Commercial search engines, like Google, explicitly state
that they remove stopwords, while capitalization support
is easily inferred. Stemming seems to not be supported
though. This may be due to the fact that WWW document
collection is so huge and diverse that stemming would
significantly increase recall and possibly reduce
precision. However simple stemming, like final sigma
removal which will be presented later in the paper, may
play an important role when seeking information in the
Web using Greek query terms.
These four issues were examined with respect to the
Greek language. For conducting our assessment we used
most of the predominately known worldwide .com search
engines: Google, Yahoo, MSN, AOL, Ask, Altavista. The
.com search engines were selected based on their
popularity [4]. Also, for comparison reasons, we
considered using some native Greek search engines: In
(www.in.gr), Pathfinder (www.pathfinder.gr) and Phantis
(www.phantis.gr).
2.1. Interface issues
Ten users participated in the interface-related experiment, and they also constructed some sample queries for the subsequent experiments. The users had varying degrees of computer expertise. We needed end users with technical expertise and correspondingly higher demands on web search tools; on the other hand, we should also measure the difficulties of, and listen to, people who have only just been introduced to search engines. This combination reflects the real everyday needs of web "surfers".
The following sub-issues were extracted from a more complete evaluation study of user effort when searching the Greek Web space with international search engines [5]. Here we extend that study (with more users and search engines) and present only the issues connected with whether search engines really value spoken languages other than English, such as Greek.
2.1.1. Localization support. The first issue in our study
was the importance of a localized interface. All the
participants (100%) rated this feature as highly important
as many users have basic or no knowledge of English.
Although search engines have uncomplicated and
minimalist interfaces their adaptation to the local
language is essential as users could easily comprehend the
available options.
From the .com ones only Google automatically detects
local settings and adapts to Greek. Altavista allows
manual selection of the presentation language with a
limited number of language choices and setup instructions
in English. Also if you select another language, search is
automatically confined to this country’s websites (this
must be altered manually again).
2.1.2. Searching capability. In this task users were asked to search using queries with all terms in Greek. All search engines but AOL and Ask were capable of running the queries and retrieving possibly relevant documents. AOL pops up a new window when a user requests some information, but it cannot correctly pass the Greek terms from the one window to the other, so no results are returned. When requests are typed directly into the popped-up window the queries do run, but the presentation of the ranked list is again problematic.
Ask does not retrieve any results, meaning that indexing of Greek documents is not supported; for example, zero documents were retrieved for all five queries of section 2.2. For these reasons AOL and Ask were left out of the subsequent tests.
2.1.3. Output presentation. An important point made by
the participants is that some of the search engines rank
English web pages first, although search requests were in
Greek. For example in the query “Ολυµπιακοί αγώνες
στην Αθήνα” (Olympic Games in Athens) Yahoo, MSN
and Altavista ranked some English pages first. This
depends on the internal indexing and ranking algorithm
but it is one of the points that increase user effort because
one has to scroll down to the list of pages to find the
Greek ones.
2.2. Term normalization, Stemming, Stopwords
Trying to understand how term normalization, stemming and stopwords affect retrieval, we ran some sample queries. We used 5 queries (table 1) suggested by the participants of the previous test. They were typed in lower-case sentence form with accent marks, leaving the default options of each search engine. A modified version of Recall and Precision [6] is used for comparing the results of the sample queries: recall refers to the number of retrieved pages, as reported by the search engines, while precision (relevance) was measured over the first 10 results.
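For clarity, the precision figure used in the tables below can be sketched as follows (our own illustration; the relevance judgments are hypothetical):

def precision_at_10(results, relevant):
    """results: ranked list of URLs returned by an engine.
    relevant: set of URLs judged relevant after inspection."""
    top = results[:10]
    return sum(1 for url in top if url in relevant)

# Hypothetical ranked list and judgments.
ranked = [f"http://example.gr/page{i}" for i in range(1, 11)]
judged_relevant = {ranked[0], ranked[2], ranked[3], ranked[7]}
print(precision_at_10(ranked, judged_relevant))  # 4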
Table 1. Sample queries.

No   Queries in Greek                              Queries in English
Q1   Μορφές ρύπανσης περιβάλλοντος                 Environmental pollution forms
Q2   Εθνική πινακοθήκη Αθηνών                      National Art Gallery of Athens
Q3   Προβλήµατα υγείας από τα κινητά τηλέφωνα      Health problems caused by mobile phones
Q4   Συνέδριο πληροφορικής 2005                    Informatics conference 2005
Q5   Τεστ για την πιστοποίηση των εκπαιδευτικών    Tests for educators' certification
Table 2 presents the number of recalled pages for each
query. From table 2 we realize that In and Pathfinder
share the same index and employ exactly the same
ranking procedure. The result set was identical both in
quantity and order. Their only difference was in output
presentation. Altavista and Yahoo had almost the same
number of results, ranked slightly differently though.
Table 2. Recall in lower case queries.

             Q1     Q2     Q3    Q4      Q5
Google       867    3400   805   15500   252
Yahoo        820    933    527   11200   186
MSN          1357   1537   542   6486    272
Altavista    821    939    515   11400   191
In           251    343    67    689     49
Pathfinder   251    343    67    689     49
Phantis      33     63     22    88      6
In all cases the international search engines returned
more results than the native Greek local engines.
However, as seen in table 3, relevance of the first 10
results is almost identical in all cases, except Phantis,
which maintains either a small index or employs a crude
ranking algorithm. Query 4 retrieves so many results
because it contains the number (year) 2005. So,
documents which contain one of the terms and the
number 2005 are retrieved, increasing recall significantly.
Table 3. Precision of the top 10 results.

             Q1   Q2   Q3   Q4   Q5
Google       5    7    9    8    8
Yahoo        5    7    8    7    8
MSN          4    7    8    6    7
Altavista    5    7    8    7    8
In           5    7    8    6    8
Pathfinder   5    7    8    6    8
Phantis      2    2    2    1    0
We confined the relevance judgment to only the first ten results, so as to limit the required time and because the first ten results are those with the highest probability of being
visited. Relevance was judged upon having visited and
inspected each page. The web locations visited had to be
from a different domain. So if two consecutive pages
were on the same server only one of them was visited.
An interesting point to make is that although recall
differs substantially among search engines precision is
almost the same in all cases. Another point of attention is
that the third query shows the maximum precision. This is
because in this case terms are more normalized, compared
to the other queries. This means that they are in the first
singular or plural form which is the usual case in words
appearing in headings or sub-headings. Consequently a
better retrieval performance is exhibited. But, as we will
see in section 2.2.3, it contains stopwords which when
removed precision is positively affected and reaches
10/10.
2.2.1. Term normalization. We then re-ran the same queries, but this time in capital letters with no accent marks. Recall (table 4) was dramatically diminished in most of the worldwide search sites, while it was left unaffected in two of the three domestic ones (In and Pathfinder). Precision was negatively affected as well (table 5), compared to the results presented in table 3.
Table 4. Recall in upper case queries.

             Q1    Q2     Q3   Q4    Q5
Google       22    3400   41   673   252
Yahoo        18    229    2    116   8
MSN          10    233    2    379   10
Altavista    18    239    2    117   9
In           251   343    67   689   49
Pathfinder   251   343    67   689   49
Phantis      4     63     3    14    6
These observations hold for Yahoo, MSN and Altavista. Google and Phantis exhibit a somewhat unusual behavior: in queries 2 and 5, Google and Phantis retrieve the same number of documents in the same order, and therefore have the same precision, while upper-case queries 1, 3 and 4 recall only a few documents compared to the equivalent lower-case queries; the correlation between the results is low and precision differs.
Trying to understand what triggers this inconsistency, we concluded that it relates to the final sigma present in some terms of queries 1, 3 and 4. The Greek capital sigma is Σ, but the lower-case sigma is σ when it appears inside a word and ς at the end of a word. Phantis presents the normalized form of the query along with the result set. Indeed, it turns out that words ending in capital Σ are transformed to words with the wrong form of sigma: e.g. "ΜΟΡΦΕΣ" (forms) should change to "µορφες" but it changes to "µορφεσ".
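A minimal sketch of the kind of normalization these observations call for (our own illustration, not any engine's internal code): lower-case the query term, strip the accent marks, and explicitly enforce the final-sigma rule, so that for example "ΜΟΡΦΕΣ" and "Μορφές" map to the same indexed form.

import re
import unicodedata

def normalize_greek(term):
    """Lower-case, strip accent marks, and use the final sigma at word ends.
    The sigma rule is enforced explicitly, regardless of how the runtime's
    built-in case mapping treats capital sigma."""
    term = term.lower()
    # Strip accent marks (combining characters after NFD decomposition).
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Medial form everywhere first, then the final form at word ends.
    stripped = stripped.replace("ς", "σ")
    return re.sub(r"σ\b", "ς", stripped)

for w in ["ΜΟΡΦΕΣ", "Μορφές"]:
    print(normalize_greek(w))  # both print the same normalized form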
Table 5. Precision of the top 10 results for upper case queries.

             Q1   Q2   Q3   Q4   Q5
Google       4    7    3    10   8
Yahoo        3    8    0    5    7
MSN          3    6    0    7    7
Altavista    3    8    0    5    7
In           5    7    8    6    8
Pathfinder   5    7    8    6    8
Phantis      0    2    0    0    0

These observations are at least worrying. What would happen if a searcher chose to search only in capital letters or without accent marks? Their quest would simply fail in most cases, leading novice users to give up their search. In English searching there is no differentiation between capital and lower-case letters; the result sets are identical in both cases, so the user effort and the required "user Web intelligence" are unquestionably lower.
Wrapping up this experiment, one can argue that in Greek Web searching the same query should be run both in lower and in capital letters so as to improve the performance of the search. Sites that carry no accent marks or contain accentuation errors will not be retrieved unless variations of the query terms are used. The Greek search engines are superior on this point and make information hunting easier and more effective. Among the international search engines only Google has recognized these differences and tries to improve its searching mechanism.
2.2.2. Stemming. Another factor that influences
searching relates to the suffixes of the user request words.
For example the phrases “Εθνική πινακοθήκη Αθηνών”
or “Εθνική πινακοθήκη Αθήνας” or “Εθνική πινακοθήκη
Αθήνα” all mean “National Art Gallery of Athens”. So
while they are different they describe exactly the same information need. Each variation retrieves a quite different number of pages: for example, Google returned 3400, 722
and 5420 web pages respectively. Precision is different in
these three cases as well, and correlation between results
is less than 50% in the first ten results.
One could argue that such a difference is rational and
acceptable as the queries differ. If we consider these
queries solely from a technical point of view then this
argument is right. However if the information needed is
in the center of the discussion then these subtle
differences in queries which merely differ in one ending
should have recalled the same web pages. Stemming is an
important feature of retrieval systems [3] (p. 167) and its
application should be at least studied in spoken languages
which have conjugations of nouns and verbs, like in
Greek. Google partially supports conjugation of English
verbs.
2.2.3. Stopwords. Google and other international search
engines remove English stopwords so as to not influence
retrieval. For instance users are informed that the word of
is an ordinary term and is not used in the query “National
Art Gallery of Athens”. Removal of stopwords [3] (p.
167) is an essential part of typical retrieval systems.
We re-ran queries 3 and 5 in Google after removing the ordinary words. The queries were in lower case and with accent marks, so the results should be compared with tables 2 and 3. Query 3 recalled 839 pages and precision equals 10 over the first 10 ranked documents. Similarly, for the fifth query Google retrieved 275 documents and precision rose from 8 (table 3) to 10. Recall was thus left essentially unaffected, but precision increased by 10% and by 20% respectively, which means that the ranking is affected when stopwords are removed. However, more extensive tests are required to construct a stopword list and to see how retrieval is affected by Greek stopwords.
4. Conclusions
This paper presents a study regarding utilization of
search engines using Greek terms. The issues inspected
were the localization support of international search
engines and the effect of stopword removal, capitalization
and stemming of query terms. Our analysis participants
identified as highly important the adaptation of search
engines to local settings. Most of the international search
engines do not automatically adapt their interface to other
spoken languages than English and some of them do not
even support other spoken languages. At least these are
true for Greek.
In order to get an estimate of the internal features of
search engines that support Greek, we run some sample
queries. International search engines recalled more pages
than the local ones and they had a small positive
difference in precision as well. However they are case
sensitive, apart from Google, hindering retrieval of web
pages which contain the query terms in a slightly different
form to the requested one. Even if the first letter of a
word is a capital letter the results will be different than
when the word is typed entirely in lower case.
Endings and stopwords are not removed automatically, which negatively affects the recall of relevant pages. Stopwords are removed from English queries, making information hunting easier from a user's perspective. Terms are not stemmed, though, even in English. However, in an inflected language like Greek, simple stemming seems to play an important role in retrieval and in assisting end users. In any case, more intensive
tests are needed to realize how endings, stopwords and
capitalization affect retrieval.
Trying to answer the question posed in the article's title, it can definitely be argued that international search sites do not value the Greek language, and possibly other languages with unusual alphabets. Google is the only one that differs from the others and seems to be in the process of adapting to and assimilating these additional characteristics.
5. References
[1] J. Nielsen, R. Molich, C. Snyder, S. Farrel, Search: 29
Design Guidelines for Usable Search http://www.nngroup.com/
reports/ecommerce/search.html,2000.
[2] Carpineto, C. et al., “Evaluating search features in public
administration websites", Euroweb 2001 Conference, 2001, 167-184.
[3] Baeza-Yates, R. and Ribeiro-Neto, B., Modern Information
Retrieval, Addison Wesley, ACM Press, New York, 1999.
[4] D. Sullivan, Nielsen NetRatings: Search Engine Ratings
http://searchenginewatch.com/reports/article.php/2156451,
2005.
[5] Lazarinis, F., “Evaluating user effort in Greek web
searching”, 10th PanHellenic Conference in Informatics,
University of Thessaly, Volos, Greece, 2005 (to appear)
[6] S. E. Robertson, “The Parameter Description of Retrieval
Systems: Overall Measures”, Journal of Documentation, 1969,
25, 93-107.
Use of Kolmogorov distance identification of web page authorship, topic and
domain
David Parry
School of Computer and Information Sciences, Auckland University of Technology, Auckland, New
Zealand
[email protected]
Abstract
Recently there has been an upsurge in interest in the use
of information entropy measures for identification of
similarities and differences between strings. Strings
include text document languages, computer programs
and biological sequences. This work deals with the use of
this technique for author identification in online postings
and the identification of WebPages that are related to
each other. This approach appears to offer benefits in
analysis of web documents without the need for domain
specific parsing or document modeling.
1. Introduction
Kolmogorov distance measurement involves the use
of information entropy calculations to measure the
distance between sequences of characters. Information
retrieval is a potentially fruitful area of use of this
technique, and it has been used for language and
authorship identification [1], plagiarism detection in
computer programs [2] and biological sequences, such as
DNA and amino acids [3].
Authorship, genre identification and measures of
relatedness remain an important issue for the verification
and identification of electronic documents. Related-document searches have been identified as an important tool for users of information retrieval systems [4]. Computers have long been used to try to verify the identity of authors in the humanities [5] and in the field of software forensics [6]. Various techniques
have been used in the past including Bayesian
Inference[7], neural networks[8] and more sophisticated
methods using support vector machines [9]. However,
such approaches tend to be extremely language and
context specific although often very effective.
Briefly, this approach is based around the concept of
the relative information entropy of a document. The
concept of the relative information of a document is
closely related to that of Shannon [10]. One way of
expressing this concept is to view a document as a
message that is being encoded over a communication
channel. A perfect encoding and compression scheme
would produce the minimum length of message. In
general, a document that can undergo a high degree of
shortening by means of a compression algorithm has a
low information entropy – that is there is a large degree of
redundancy, whereas one that changes little in size has a
high degree of information entropy, with little redundant
information. A good compression algorithm should never
increase the size of the “compressed” document. As the
authors of [11] point out, a good zipping algorithm can be
considered as a sort of entropy meter.
The Lempel-Ziv algorithm reduces the size of a file by
replacing repeating strings with codes that represent the
length and content of these strings [12], and has been
shown to be a very effective scheme. To work efficiently,
the Lempel-Ziv algorithm “learns” effective substitutions
as it examines the document sequentially to find repeating
sequences that can be replaced in order to reduce the file
size. This algorithm is the basis of the popular and rapid
zip software in its various incarnations including Gzip,
Pkzip and WinZip. Importantly this method relies on a
sequential examination of the document to be encoded, so
concatenation with other documents can have dramatic
effects on the efficiency of zipping, as rules for encoding
created at the start of the process are found to be useless
at the end.
By adding a document of unknown characteristics to one of known properties (for example language, author, genre, etc.), it is suggested that the combined relative entropy is smallest when the two documents are most similar.
The work of [1] demonstrated that it was possible to
identify the language used in a document by comparison
with known documents. This method is therefore
complementary to other methods that concentrate on the
understanding of the document, much as handwriting or
voice analysis widens the possibilities of author
identification, even if the content is not distinctive [13].
The rest of this paper describes one implementation of
these types of algorithm (section 2), along with a number
of experiments (Section 3). Section 4 discusses the results
and section 5 describes other approaches and draws
conclusions about this approach.
2. Algorithms
The Kolmogorov distance is based on the method of [14], and on earlier work such as [15] which deals with the identification of minimum pattern-length similarities. Using compression algorithms, the following formula for the distance between two objects may be computed. Assume that C(A|B) is the compressed size of A using the compression dictionary obtained when compressing B, and vice versa for C(B|A), while C(A) and C(B) represent the compressed lengths of A and B using their own compression dictionaries. The distance between A and B, D(A,B), is then given by:
D(A, B) = (C(A|B) + C(B|A)) / (C(A) + C(B))
This formula is explicitly derived in [16]. Various methods of compression have been used for this; for this work a method was used that does not need explicit access
to the compression dictionary, so that standard zip
programs could be used. Concatenating files and then
compressing them allows the compression algorithm to
develop its dictionary on the first file and then apply it to
the second. The algorithm used is given by:
Obtain the two files – file1 and file2
Concatenate them in two ways, file1+ file2 = (file12)
and file2+ file1 =(file21)
Calculate the compressed length of:
file1 as zip1
file2 as zip2
file12 as zip12
file21 as zip21
The distance (D) is then given by:
D(file1, file2) = ((zip12 − zip1) + (zip21 − zip2)) / (zip1 + zip2)
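A minimal sketch of this zip-based distance using a standard DEFLATE implementation (our own illustration of the described procedure, not the author's code):

import zlib

def zipped_len(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def kolmogorov_distance(file1: bytes, file2: bytes) -> float:
    zip1 = zipped_len(file1)
    zip2 = zipped_len(file2)
    zip12 = zipped_len(file1 + file2)   # dictionary learned on file1, applied to file2
    zip21 = zipped_len(file2 + file1)   # and vice versa
    return ((zip12 - zip1) + (zip21 - zip2)) / (zip1 + zip2)

a = b"the quick brown fox jumps over the lazy dog " * 50
b = b"the quick brown fox jumps over the lazy cat " * 50
c = b"completely unrelated content about web archives " * 50
print(kolmogorov_distance(a, b))   # smaller: similar texts
print(kolmogorov_distance(a, c))   # larger: dissimilar texts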
This approach depends on the compression algorithm being lossless. Previous work has demonstrated that if file1 is the same as file2 the distance is minimal.
3. Methods
Three experiments were performed to validate the algorithm used. One used author identification, the second used WebPages from different domains, and the third used different topics within a particular web corpus.
3.1 Experiment One
One particularly rich source of testing data is archived newsgroup and list-server postings, which often contain particularly relevant information in a concise format. Newsgroup postings provide a rich corpus of material for studying classification schemes – for example the use of readability or other scores [17] to characterize discussion. Postings from an online teaching system – Business on line [18] – were used. A total of 160 initial messages were used. The Kolmogorov distance (KD) was calculated between each message and 10 other messages, only one of which was by the same author as the first. The message combination with the shortest KD was then noted, and the results are shown in Table 1.

Table 1: Kolmogorov Distance for Messages

Status             Percent shortest KD   Percent in sample
Author1<>Author2   51.88%                90%
Author1=Author2    48.13%                10%

Using Chi-Squared, this result is significant at the p<0.001 level (SPSS 11): χ2(1, N=160)=258, p<0.001. The proportion of messages with common authors having the smallest distance is a great deal higher than expected by chance.
3.2 Experiment Two
4,389 Web pages were downloaded using a web spider from 6 root sites. A similar comparison was done for the website domain-based group, with each of 80 pages compared with one from the same domain and nine from others. The results are shown in Table 2. Again, using Chi-Squared, this result is significant at the p<0.001 level (SPSS 11): χ2(1, N=80)=451, p<0.001. The proportion of websites from common domains having the smallest distance is a great deal higher than expected by chance.

Table 2: Kolmogorov Distance for Domains

Status             Percent lowest KD   Percent in sample
Different Domain   18.75%              90%
Same Domain        81.25%              10%
3.3 Experiment Three
This experiment used the British Medical Journal
(BMJ) Website that includes a large number of pages
grouped by topic. The process began by selecting those
topics that had at least 5 valid pages available for
download. For each of these valid domains (n=133), 5
initial pages were chosen. One page from the same
domain, and nine different pages from other domains
were then selected, in a similar manner to that described
above. Again, the files were selected to be of similar
length, and the pages zipped together, using the
Kolmogorov distance-by-zipping algorithm. Self-comparison (i.e. where file1 = file2) was not permitted. The results are shown in Table 3.
Table 3: Kolmogorov distance for BMJ topics

Source                   Percent with shortest distance   Percent in sample
Different topic domain   17.89%                           90%
Same topic domain        82.11%                           10%
Using Chi-Squared as implemented in SPSS version 11, the results show that the minimal distance is significantly more likely to occur for files from the same domain than for ones of similar length from other domains: χ2(1, N=665)=3839, p<0.001.
4. Discussion and Future Work
The Kolmogorov distance measure approach
demonstrates effective identification of related
documents. This relatedness may be intrinsic to the text,
as in the case of content, authorship or language, or
related to the structure of the webpage, that is the
arrangement of tags or formatting information.
Drawbacks to the practical implementation of this
method centre around two main areas, combinatorial
explosion and confounding similarity.
As stated, this method requires each file to be compared with each other file; thus the number of distance calculations needed for n documents grows quadratically, as n(n−1)/2 pairwise comparisons.
Current work is concentrating on the clustering of
documents using this approach. One approach has been to
find documents that are close in terms of KD, to use these
as cluster centroids, and measure the distance of new
examples from these. This approach, by identifying the
centroid of a cluster in terms of a limited number of documents, would remove the issue of exhaustive pairwise comparisons.
Work by [3], has emphasized the importance of
clustering.
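A sketch of the centroid idea described here (our own illustration, reusing the kolmogorov_distance function sketched above; the documents and labels are hypothetical):

def assign_to_centroid(doc: bytes, centroids: dict) -> str:
    """centroids: mapping of cluster label -> representative document bytes.
    Returns the label of the closest centroid by Kolmogorov distance."""
    return min(centroids, key=lambda label: kolmogorov_distance(doc, centroids[label]))

centroids = {
    "cardiology": b"heart failure treatment trial cardiology " * 40,
    "oncology": b"tumour chemotherapy oncology survival rates " * 40,
}
new_doc = b"randomised trial of heart failure treatment " * 40
print(assign_to_centroid(new_doc, centroids))  # likely "cardiology": shared vocabulary compresses better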
Confounding similarity represents the case where
documents have a great deal of similarity that is unrelated
to their content – for example in the case of documents
converted to HTML by popular editors with supplied
templates or conversion programs. This does not seem to
be an issue in the case of the BMJ topic corpus, but may
become important in other cases. If necessary text
extraction and separation from formatting tags could be
used. The range of document lengths that can be effectively processed in this way should be investigated; it seems reasonable to suppose that extremely long documents or very short documents would not be suitable, because of the likelihood of common repeating motifs in the former case and the absence of repeating motifs in
the latter. Other compression techniques, including those
where the compression dictionary is stored separately,
should be investigated.
It is important to note that this approach is generally complementary to existing ones and has not been compared with other methods, such as comparisons using textual information. This method is attractive in areas where domain-specific parsing is difficult or where there is no knowledge of the document structure.
In terms of open-source implementation, this approach
could easily be added as a plug-in to browser technology,
allowing individual users to compare new documents to
those cached already, or by allowing users to
collaboratively compare documents with a central or
dispersed repository. The decreasing cost of storage
implies that document cache comparison will become
increasingly important, and simple, general comparison
tools will be important in this regard.
5. Conclusion
Comparing electronic documents using the
Kolmogorov technique is easily implemented and is not
constrained by any proprietary technology. This approach
seems particularly useful for short, unstructured
documents such as newsgroup postings and emails. Web
logs (Blogs) are also becoming more popular and this
approach could be used for comparison and validation of
these. Use of this technique, in addition to current
methods may allow improved characterization of
electronic communication and searching of electronic
databases. For search engine technology, such approaches
may allow improved ranking of results. Particular
applications include relatedness and clustering
applications, email filtering, fraud and plagiarism
detection and genre identification. Further research in this
area may increase the value of this approach.
6. References
[1] D. Benedetto, E. Caglioti, and V. Loreto, "Language Trees and Zipping," Physical Review Letters, vol. 88, pp. 048702-1 to 048702-4, 2002.
[2] X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker, "Shared information and program plagiarism detection," IEEE Transactions on Information Theory, vol. 50, pp. 1545-1551, 2004.
[3] R. Cilibrasi and P. M. B. Vitanyi, "Clustering by compression," IEEE Transactions on Information Theory, vol. 51, pp. 1523-1545, 2005.
[4] B. J. Jansen, A. Spink, J. Bateman, and T. Saracevic, "Real life information retrieval: a study of user queries on the Web," SIGIR Forum, vol. 32, pp. 5-17, 1998.
[5] S. Y. Sedelow, "The Computer in the Humanities and Fine Arts," ACM Computing Surveys (CSUR), vol. 2, pp. 89-110, 1970.
[6] P. W. Oman and C. R. Cook, "Programming style authorship analysis," pp. 320-326, 1989.
[7] F. Mosteller and D. Wallace, Applied Bayesian and Classical Inference: the case of the Federalist Papers. Addison-Wesley, 1964.
[8] S. T. Singhe, F.J., "Neural networks and disputed authorship: new challenges," in Artificial Neural Networks, Fourth International Conference on, 1995, pp. 24-28.
[9] O. de Vel, A. Anderson, M. Corney, and G. Mohay, "Mining e-mail content for author identification forensics," ACM SIGMOD Record, vol. 30, pp. 55-64, 2001.
[10] C. Shannon, "A Mathematical Theory of Communication," Bell Systems Technical Journal, 1948.
[11] A. Puglisi, D. Benedetto, E. Caglioti, V. Loreto, and A. Vulpiani, "Data compression and learning in time sequences analysis," Physica D: Nonlinear Phenomena, vol. 180, pp. 92-107, 2003.
[12] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, vol. 23, pp. 337-343, 1977.
[13] S. N. Srihari and S. Lee, "Automatic handwriting recognition and writer matching on anthrax-related handwritten mail," in Eighth International Workshop on Frontiers in Handwriting Recognition, 2002, pp. 280-284.
[14] A. Kolmogorov, "Logical basis for information theory and probability theory," IEEE Transactions on Information Theory, vol. 14, pp. 662-664, 1968.
[15] A. Kolmogorov, "Three Approaches to the quantitative definition of Information," Problems of Information Transmission, vol. 1, pp. 1-17, 1965.
[16] M. Li, X. Chen, X. Li, B. Ma, and P. Vitanyi, "The similarity metric," presented at SODA - Proceedings of the fourteenth annual ACM-SIAM Symposium on Discrete Algorithms, Baltimore, Maryland, 2003.
[17] P. Sallis and D. Kasabova, "Computer-Mediated Communication: Experiments with e-mail readability," Information Sciences, pp. 43-53, 2000.
[18] A. Sallis, G. Carran, and J. Bygrave, "The Development of a Collaborative Learning Environment: Supporting the Traditional Classroom," presented at WWW9, Netherlands, 2000.
Searching Web Archive Collections
Michael Stack
Internet Archive
The Presidio of San Francisco
116 Sheridan Ave. San Francisco, CA 94129
[email protected]
Abstract
Web archive collection search presents the usual set of
technical difficulties searching large collections of
documents. It also introduces new challenges often at
odds with typical search engine usage. This paper
outlines the challenges and describes adaptation of an
open source search engine, Nutch, to Web archive
collection search. Statistics and observations indexing
and searching small to medium-sized collections are
presented. We close with a sketch of how we intend to
tackle the main limitation, scaling archive collection
search above the current ceiling of approximately 100
million documents.
Technically, Nutch provides basic search engine
capability, is extensible, aims to be cost-effective, and is
demonstrated capable of indexing up to 100 million
documents with a convincing development story for how
to scale up to billions [9].
This paper begins with a listing of challenges searching
WACs. This is followed by an overview of Nutch
operation to aid understanding of the next section, a
description of Nutchwax, the open-source Nutch
extensions made to support WAC search. Statistics on
indexing rates, index sizes, and hardware are presented as
well as observations on the general WAC indexing and
search operation. We conclude with a sketch of how we
intend to scale up to index collections of billions of
documents.
1. Introduction
2. Challenges Searching WACs
The Internet Archive (IA)(www.archive.org) is a
501(c)(3) non-profit organization whose mission is to
build a public Internet digital library [1]. Since 1996,
the IA has been busy establishing the largest public Web
archive to date, hosting over 600 terabytes of data.
Currently the only public access to the Web archive has
been by way of the IA Wayback Machine (WM) [2] in
which users enter an URL and the WM displays a list of
all instances of the URL archived, distinguished by
crawl date. Selecting any date begins browsing a site as
it appeared then, and continued navigation pulls up
nearest-matches for linked pages. The WM suffers one
major shortcoming: unless you know beforehand the
exact URL of the page you want to browse, you will not
be able to directly access archived content. Current Web
URLs and published references to historic URLs may
suggest starting points, but offer little help for thorough
or serendipitous exploration of archived sites. URL-only
retrieval also frustrates users who are accustomed to
exhaustive Google-style full text search from a simple
query box. What is missing is a full text search tool
that works over archived content, to better guide users
'wayback' in time.
WACs tend to be large. A WAC usually is an aggregate
of multiple, related focused Web crawls run over a
distinct time period. For example, one WAC, made by
the IA comprises 140 million URLs collected over 34
weekly crawls of sites pertaining to the United States
2004 Presidential election. Another WAC is the complete
IA repository of more than 60 billion URLs. (Although
this number includes exact or near duplicates, the largest
live-Web search engine, Google, only claims to be
"[s]earching 8,058,044,651 web pages" as of this
writing.) WACs are also large because archives tend not
to truncate Web downloads and to fetch all resources
including images and streams, not just text-only
resources.
Nutch [4] was selected as the search engine platform on
which to develop Web Archive Collection (WAC)
search. "Nutch is a complete open-source Web search
engine package that aims to index the World Wide Web
as effectively as commercial search services" [5].
A single URL may appear multiple times in a WAC.
Each instance may differ radically, minimally or not at all
across crawls, but all instances are referred to using the
same URL. Multiple versions complicate search query
and result display: Do we display all versions in search
results? If not, how do we get at each instance in the
collection? Do we suppress duplicates? Or do we display
the latest with a count of known instances in a corner of
the search result summary?
A WAC search engine gets no help from the Web-at-large
serving search results. What we mean by this is that for
WAC searching, after a user clicks on a search result hit,
there is still work to be done. The search result must
refer the user to a viewer or replay utility – a tool like the
IA WM – that knows how to fetch the found page from
the WAC repository and display it as faithfully as
possible. (Since this redisplay is from a server other than
the page's original home, on-the-fly content rewriting is
often required.) While outside of the purview of
collection search, WAC tools that can reassemble the
pages of the past are a critical component in any WAC
search system.
3. Overview Of Nutch Operation
The Nutch search engine indexing process runs in a
stepped, batch mode. With notable exceptions discussed
later, the intent is that each step in the process can be "segmented" and distributed across machines so that no single operation becomes overwhelming as the collection grows. Also, if a particular step fails (machine crash or operator
misconfiguration), that step can be restarted. A custom
database, the Nutch webdb, maintains state between
processing steps and across segments. An assortment of
parse-time, index-time, and query-time plugins allows
amendment of each processing step.
After initial setup and configuration, an operator manually steps through the following indexing cycle (a schematic driver for the cycle is sketched after the list):
1. Ask the Nutch webdb to generate a number of URLs to fetch. The generated list is written to a "segment" directory.
2. Run the built-in Nutch fetcher. During download, an md5 hash of the document content is calculated and parsers extract searchable text. All is saved to the segment directory.
3. Update the Nutch webdb with vitals on the URLs fetched. An internal database analysis step computes all in-link anchor text per URL. When finished, the results of the in-link anchor text analysis are fed back to the segment. Cycle through steps 1-3, writing a new segment per new URL list, until sufficient content has been obtained.
4. Index each segment's extracted page text and in-link anchor text. The index is written into the segment directory.
5. Optionally remove duplicate pages from the index.
6. Optionally merge all segment indices (unless the index is large and needs to be distributed).
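The following is a rough, schematic driver for the batch cycle above, written only to make the step ordering concrete; the launcher path, command names and arguments are placeholders and do not reproduce the exact Nutch command-line interface of any particular release.

    import subprocess

    NUTCH = "bin/nutch"          # placeholder path to the Nutch launcher
    DB, SEGMENTS = "db", "segments"

    def step(*args):
        # each step is a separate batch process; if it fails it can simply be re-run
        subprocess.run([NUTCH, *args], check=True)

    def index_cycle(rounds=3):
        for _ in range(rounds):                 # steps 1-3, one new segment per round
            step("generate", DB, SEGMENTS)      # 1. generate a fetchlist into a new segment
            step("fetch", SEGMENTS)             # 2. fetch, hash and parse the documents
            step("updatedb", DB, SEGMENTS)      # 3. update the webdb, compute in-link anchor text
        step("index", SEGMENTS)                 # 4. index extracted text and anchor text
        step("dedup", SEGMENTS)                 # 5. optionally remove duplicate pages
        step("merge", SEGMENTS)                 # 6. optionally merge the segment indices

    if __name__ == "__main__":
        index_cycle()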
Steps 2 and 4 may be distributed across multiple
machines and run in parallel if multiple segments. Steps
1, 3, and 5 require single process exclusive access to the
webdb. Steps 3 and 6 require that a single process have
exclusive access to all segment data. A step must
complete before the next can begin.
To query, start the Nutch search Web application. Run
multiple instances of the search Web application to
distribute query processing. The queried server
distributes the query by remotely invoking queries
against all query cluster participants. (Each query cluster
participant is responsible for some subset of all
segments.) Queries are run against Nutch indices and
return ranked Google-like search results that include
snippets of text from the pertinent page pulled from the
segment-extracted text.
4. Nutch Adaptation
Upon consideration, WAC search needs to support two
distinct modes of operation. First, WAC search should
function as a Google-like search engine. In this mode,
users are not interested in search results polluted by
multiple duplicate versions of a single page. Phase one
of the Nutch adaptation focused on this mode of
operation.
A second mode becomes important when users want to
study how pages change over time. Here support for
queries of the form, "return all archive versions crawled in
1999 sorted by crawl date" is needed. (Satisfying queries
of this specific type is what the IA WM does using a
sorted flat file index to map URL and date to resource
location.) Phase two added features that allow version- and date-aware querying. (All WAC plugin extensions,
documentation, and scripts are open source hosted at
Sourceforge under the Nutchwax project [8].)
4.1. Phase one
Because the WAC content already exists, previously
harvested by other means, the Nutch fetcher step had to
be recast to pull content from a WAC repository rather
than from the live Web. At IA, harvested content is
stored in the ARC file format [6]: composite log files, each containing many collected URLs. For the IA, an ARC-to-segment tool was written to feed ARCs to Nutch parsers and segment content writers. (Adaptation for formats
other than IA ARC should be trivial.) Upon completion
of phase one, using indices purged of exact duplicates, it
was possible to deploy a basic WAC search that used the
IA WM as the WAC viewer application.
4.2. Phase two
To support explicit date and date range querying using
the IA 14-digit YYYYMMDDHHMMSS timestamp format,
an alternate date query operator implementation replaced
the native Nutch YYYYMMDD format. To support
retrieval of WAC documents by IA WM-like viewer
applications, location information -- collection, arcname
and arcoffset -- was added to search results as well as an
operator to support exact, as opposed to fuzzy, URL
querying (exacturl). Nutch was modified to support
sorting on arbitrary fields and deduplication at query time
(sort, reverse, dedupField, hitsPerDup). Here is the complete list of new query operators; an illustrative use of them follows the list:
• sort: Field to sort results on. Default is no sort.
• reverse: Set to true to reverse sort. Default is false.
• dedupField: Field to deduplicate on. Default is 'site'.
• hitsPerDup: Count of dedupField matches to show in search results. Default 2.
• date: IA 14-digit timestamps. Ranges specified with '-'
delimiter between upper and lower bounds.
• arcname: Name of the ARC file containing the found result.
• arcoffset: Offset into arcname at which result begins.
• collection: The collection the search result belongs to.
• exacturl: Query for an explicit url.
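As a purely illustrative example (the operator syntax shown here is an assumption for illustration, not a specification of the NutchWAX query parser), queries combining the operators above might look as follows:

    # Hypothetical query strings combining the operators listed above.
    queries = [
        # all captures of one page crawled in 1999, ordered by crawl date
        "exacturl:http://www.example.org/ date:19990101000000-19991231235959 sort:date",
        # Google-like search, collapsing multiple captures from the same site
        "presidential election hitsPerDup:1",
    ]
    for q in queries:
        print(q)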
Natively Nutch passes all content for which there is no
explicit parser to the text/html parser. During indexing, logs are
filled with skip messages from the text/html parser as it
passes over audio/*, video/*, and image/* content.
Skipped resources get no mention in the index and so are
not searchable. An alternate parser-default plugin was
created to add at least base metadata of crawl date,
arcname, arcoffset, type, and URL. This allows viewer
applications, which need to render archived pages that
contain images, stylesheets, or audio, to ask the Nutch index for the location of embedded resources. Finally, an
option was added to return results as XML (RSS) [7].
Upon completion of phase two, both modes of operation
were possible using a single non-deduplicated index.
5. Indexing Stats
Discussed below are details of indexing two WACs: one
small, the other medium-sized. All processing was done
on machines of the following profile: single processor
2.80GHz Pentium 4s with 1GB of RAM and 4x400GB
IDE disks running Debian GNU/Linux. During indexing, this
hardware was CPU-bound with light I/O loading. RAM
seemed sufficient (no swapping). All source ARC data
was NFS mounted. Only documents of type text/* or
application/* and HTTP status code 200 were indexed.
5.1. Small Collection
This small collection comprised three crawls. Indexing steps were run in series on one machine using a single disk. The collection comprised 206 ARC files, 37.2GB of uncompressed data. 1.07 million of the collection's total of 1.27 million documents were indexed.
Table 1: MIME Types
MIME Type            Size (MB)   % Size   Incidence
text                 25767.67    79.32%   1052103
text/html            22003.55    67.73%   1044250
application           6719.92    20.68%     20969
application/pdf       4837.89    14.89%     16201
application/msword     487.89     1.50%      3306

Table 2: Timings
Segment   Database   Index    Dedup   Merge
16h32m    2h26m      18h44m   0h01m   02h35m

Indexing took 40.3 hours to complete. The merged index size was 1.1GB, about 3% the size of the source collection. The index plus the cleaned-up segment data -- cleaning involved removal of the (recalculable) segment-level indices made redundant by the index merge -- occupied 1.1GB + 4.9GB, or about 16% the size of the source collection. Uncleaned segments plus index made up about 40% the size of the source collection.
5.2 Medium-sized Collection
The collection was made up of 1054 ARCs, 147.2GB of
uncompressed data. 4.1 million documents were
indexed. Two machines were used to do the segmenting
step. Subsequent steps were all run in series on a single
machine using a single disk.
Table 3: MIME Types
MIME Type            Size (MB)   % Size   Incidence
text                 96882.84    65.32%   3974008
text/html            90319.81    60.81%   3929737
application          50338.40    34.68%    122174
application/pdf      21320.83    14.40%     45427
application/msword    1000.70     0.67%      5468

Table 4: Timings
Segment (2 machines)   Database   Index    Dedup   Merge
12h32m / 7h23m         19h18m     55h07m   0h06m   0h31m
Indexing took 99 hours of processing time (or 86.4 hours
of elapsed time because segmenting was split and run
concurrently on two machines). The merged index size
was 5.2GB, about 4% the size of source collection.
Index plus the cleaned-up segment data occupied 5.2GB
+ 14.5GB, or about 13.5% the size of the source
collection. (Uncleaned segments plus index occupied
about 22% the size of the source collection.)
6. Observations
Indexing big collections is a long-running manual process
that currently requires intervention at each step moving
the process along. The attention required compounds as
the indexing is made more distributed. An early
indexing of a collection of approximately 85 million
documents took more than a week to complete with
segmenting and indexing spread across 4 machines.
Steps had to be restarted as disks overfilled and segments
had to be redistributed. Little science was applied so the
load was suboptimally distributed with synchronizations
waiting on laggard processes. (Others close to the
Nutch project have reported similar experiences [12].) An
automated means of efficiently distributing the parsing,
update, and indexing work across a cluster needs to be
developed. In the way of any such development are at
least the following obstacles:
• Some indexing steps are currently single process.
• As the collection grows, with it grows the central
webdb of page and link content. Eventually it will
grow larger than any available single disk.
We estimate that with the toolset as is, given a vigilant
operator and a week of time plus 4 to 5 machines with
lots of disk, indexing WACs of about 100 million
documents is at the limit of what is currently practical.
Adding to a page its inlink anchor-text when indexing
improves search result quality.
Early indexing
experiments were made without the benefit of the Nutch
link database — our custom fetcher step failed to properly
provide link text for Nutch to exploit. Results were rich
in query terms but were not what was 'expected'. A
subsequent fix made link-text begin to count. Thereafter,
search result quality improved dramatically.
The distributed Nutch query clustering works well in our
experience, at least for low rates of access: ~1 query per
second. (Search access-rates are expected to be lower for
WACs than live-Web search engines.) But caches kept in
the search frontend to speed querying will turn
problematic with regular usage. The base Nutch (Lucene)
query implementation uses one byte per document per
field indexed. Additions made to support query-time
deduplication and sorting share a cache that stores each
search result's document URL. Such a cache of UTF-16
Java strings gets large fast. An alternate smaller
memory-footprint implementation needs to be developed.
7. Future Work
From inception, the Nutch project has set its sights on
operating at the scale of the public web and has been
making steady progress addressing the difficult technical
issues scaling up indexing and search. The Nutch
Distributed File System (NDFS) is modeled on a subset
of the Google File System (GFS) [11] and is "...a set of
software for storing very large stream-oriented files over a
set of commodity computers. Files are replicated across
machines for safety, and load is balanced fairly across the
machine set" [12]. The intent is to use NDFS as
underpinnings for a distributed webdb. (It could also be
used for storing very large segments.) While NDFS
addresses the problem of how to manage large files in a
fault-tolerant way, it does not help with the even
distribution of processing tasks across a search cluster.
To this end, the Nutch project is working on a version of
another Google innovation, MapReduce [13], "a platform
on which to build scalable computing" [9]. In synopsis,
if you can cast the task you wish to run on a cluster into
the MapReduce mold -- think of the Python map function
followed by reduce function -- then the MapReduce
platform will manage the distribution of your task across
the cluster in a fault-tolerant way. As of mid-2005, core
developers of the Nutch project are writing a Java version
of the MapReduce platform to use in a reimplementation
of Nutch as MapReduce tasks [9]. MapReduce and
NDFS combined should make Nutch capable of scaling
its indexing step to billions of documents.
The IA is moving its collections to the Petabox platform:
racks of low-power, high-storage-density, inexpensive
rack-mounted computers [14]. The future of WAC search
development will be harnessing Nutch MapReduce/NDFS
development on Petabox.
8. Acknowledgements
Doug Cutting and all members of the IA Web Team:
Michele Kimpton, Gordon Mohr, Igor Ranitovic, Brad
Tofel, Dan Avery, and Karl Thiessen. The International
Internet Preservation Consortium (IIPC) [3] supported the
development of Nutchwax.
9. References
[1] Internet Archive http://www.archive.org
[2] Wayback Machine http://www.archive.org/web/web.php
[3] International Internet Preservation Consortium
http://netpreserve.org
[4] Nutch http://lucene.apache.org/nutch/
[5] Nutch: A Flexible and Scalable Open-Source Web Search
Engine http://labs.commerce.net/wiki/images/0/06/CN-TR04-04.pdf
[6] ARC File Format
http://www.archive.org/web/researcher/ArcFileFormat.php
[7] A9 Open Search http://opensearch.a9.com/
[8] Nutchwax http://archiveaccess.archive.org/projects/nutch
[9] MapReduce in Nutch, 20 June 2005, Yahoo!, Sunnyvale,
CA, USA http://wiki.apache.org/nutchdata/attachments/Presentations/attachments/mapred.pdf
[10] The Nutch Distributed File System by Michael Cafarella
http://wiki.apache.org/nutch/NutchDistributedFileSystem
[11] Google File System
http://labs.google.com/papers/gfs-sosp2003.pdf
[12] “[nutch-dev] Experience with a big index” by Michael
Cafarella http://www.mail-archive.com/[email protected]/msg02602.html
[13] MapReduce: Simplified Data Processing on Large
Clusters by Jeffrey Dean and Sanjay Ghemawat
http://labs.google.com/papers/mapreduce-osdi04.pdf
[14] Petabox http://www.archive.org/web/petabox.php
XGTagger, an open-source interface dealing with XML contents
Xavier Tannier, Jean-Jacques Girardot and Mihaela Mathieu
Ecole Nationale Supérieure des Mines
158, cours Fauriel
42023 Saint-Etienne FRANCE
tannier, girardot, [email protected]
Abstract
This article presents an open-source interface dealing with XML content and simplifying its analysis. This tool, called XGTagger, allows any existing system developed for plain text to be used, for any purpose. It takes an XML document as input and creates a new one, adding the information brought by the system. We also present the concept of "reading contexts" and show how our tool deals with them.
1. Introduction
XGTagger1 is a generic interface dealing with the text contained in XML documents. It does not perform any analysis by itself, but uses any system S that analyses textual data. It provides S with a text-only input. This input is composed of the textual content of the document, taking reading contexts into account.
A reading context is a part of text, syntactically and semantically self-sufficient, that a person can read in one go, without any interruption [3]. Document-centric XML content does not necessarily reproduce reading contexts in a linear way.
Within this context, we can distinguish three kinds of tags [1]:
• Soft tags identify significant parts of a text (mostly emphasis tags, like bold or italic text) but are transparent when reading the text (they do not interrupt the reading context);
• Jump tags are used to represent particular elements (margin notes, glosses, etc.). They are detached from the surrounding text and create a new reading context inserted into the existing one;
• Finally, hard tags are structural tags: they break the linearity of the text (chapters, paragraphs...).
1 http://www.emse.fr/~tannier/en/xgtagger.html

2. General principle

Figure 1 depicts the general functioning scheme of XGTagger. The input XML document is processed and a text is given to the user's system S. After the execution of S, a post-processing step is performed in order to build a new XML document.

2.1. Input

As shown in figure 1, if a list of soft and jump tags is given by the user, XGTagger recovers the reading contexts, gathers them (separated by dots) and gives the text T to the system S. In the following example, sc (small capitals) and bold are soft tags, while footnote is a jump tag.

(1) <article>
<title>Visit I<sc>stanbul</sc> and
M<sc>armara</sc> region</title>
<par>
This former capital of three
empires<footnote>Istanbul has successively been the capital of Roman, Byzantine and Ottoman empires</footnote>
is now the economic capital of
<bold>Turkey</bold>
</par>
</article>

Considering soft, jump and hard tags allows XGTagger to recognize the terms "Istanbul" and "Marmara", but also to distinguish "empires" and "Istanbul" (not separated by a blank character). The inferred text is:
Visit Istanbul and Marmara region . This former capital of three empires is now the economic capital of Turkey . Istanbul has successively been the capital of Roman, Byzantine and Ottoman empires

It is not necessary to take care of soft and jump tags if the document or the application does not require it. If nothing is specified, all tags are considered as hard (in this example, "I" and "stanbul" would have been separated, as well as "M" and "armara", and the footnote would have stayed in the middle of the paragraph). Nevertheless, in applications like natural language processing or indexing, this classification can be very useful.

Figure 1. XGTagger general functioning scheme. (Blocks: initial XML document and special tag lists; document parsing and reading-context recovery; text-only input to the system S, treated as a black box; text-only output; initial document reconstruction and updating using the user's parameters; stylesheet; final XML document.)

2.2. Output

The output of the system S must contain (among any other information) the repetition of the input text. If we take the example of POS tagging2, with TreeTagger [2] standing for the system S, the first field of the output is the initial text. Considering our example, words are separated:

Visit      VV    visit
Istanbul   NP    Istanbul
and        CC    and
Marmara    NP    Marmara
region     NN    region
.          SENT  .
...        ...   ...

The user describes the S output with parameters3, allowing XGTagger to compose back the initial XML structure and to represent the additional information generated by S with XML attributes. In our running example, the parameters should specify that fields are separated by tabulations, that the first field represents the initial word, the second field stands for the part-of-speech (pos) and the third one is the lemma (lem). XGTagger treats these parameters and the S output and returns the following final XML document:

2 A part-of-speech (POS), or word class, is the role played by a word in the sentence (e.g.: noun, verb, adjective...). POS tagging is the process of marking up words in a text with their corresponding roles.
3 These parameters can be specified either through a configuration file or through Unix or DOS-like options (the program is written in Java).

<article>
<title>
<w id="1" pos="VV" lem="visit">Visit</w>
<w id="2" pos="NP" lem="Istanbul">I</w>
<sc>
<w id="2" pos="NP" lem="Istanbul">stanbul</w>
</sc>
<w id="3" pos="CC" lem="and">and</w>
<w id="4" pos="NP" lem="Marmara">M</w>
<sc>
<w id="4" pos="NP" lem="Marmara">armara</w>
</sc>
<w id="5" pos="NN" lem="region">region</w>
</title>
<par>
<w id="7" pos="DT" lem="this">This</w>
<w id="8" pos="JJ" lem="former">former</w>
<w id="9" pos="NN" lem="capital">capital</w>
<w id="10" pos="IN" lem="of">of</w>
<w id="11" pos="CD" lem="three">three</w>
<w id="12" pos="NNS" lem="empire">empires</w>
<footnote>
<w id="21" pos="NP" lem="Istanbul">Istanbul</w>
<w id="22" pos="VHZ" lem="have">has</w>
<w id="23" pos="RB" lem="successively">successively</w>
...
<w id="32" pos="NP" lem="Ottoman">Ottoman</w>
<w id="33" pos="NNS" lem="empire">empires</w>
</footnote>
<w id="13" pos="VBZ" lem="be">is</w>
<w id="14" pos="RB" lem="now">now</w>
<w id="15" pos="DT" lem="the">the</w>
<w id="16" pos="JJ" lem="economi">economic</w>
<w id="17" pos="NN" lem="capital">capital</w>
<w id="18" pos="IN" lem="of">of</w>
<bold>
<w id="19" pos="NP" lem="Turkey">Turkey</w>
</bold>
</par>
</article>
Note that the identifier id makes it possible to keep the reading contexts (see ids 2 and 4, 12 and 13) without any loss of structural information. The initial XML document can be converted back with a simple stylesheet (except for blank characters that S could have added).
More details about XGTagger use and functioning can be found in [4] and in the user manual [5].
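As a rough illustration of the reading-context recovery described in section 2.1 (and not of the XGTagger implementation itself), the following minimal sketch linearizes an XML fragment using the soft/jump/hard classification of example (1); whitespace handling, in particular the re-gluing of split words such as "I" + "stanbul", is deliberately simplified.

    import xml.etree.ElementTree as ET

    SOFT = {"sc", "bold"}      # transparent when reading
    JUMP = {"footnote"}        # detached, gets its own reading context
    # any other tag is treated here as hard (it breaks the linearity of the text)

    def reading_contexts(elem, current, out):
        if elem.tag in JUMP:
            ctx = []                     # jump tag: new context, inserted next to the current one
            out.append(ctx)
        elif elem.tag in SOFT:
            ctx = current                # soft tag: stay in the current reading context
        else:
            ctx = []                     # hard tag: start a new reading context
            out.append(ctx)
        if elem.text:
            ctx.append(elem.text)
        for child in elem:
            reading_contexts(child, ctx, out)
            if child.tail:
                ctx.append(child.tail)   # text after a child belongs to the enclosing context
        return out

    def linearize(xml_string):
        contexts = reading_contexts(ET.fromstring(xml_string), [], [])
        texts = [" ".join(" ".join(c).split()) for c in contexts]
        return " . ".join(t for t in texts if t)

    doc = ('<article><title>Visit I<sc>stanbul</sc> and M<sc>armara</sc> region</title>'
           '<par>This former capital of three empires<footnote>Istanbul has successively '
           'been the capital of Roman, Byzantine and Ottoman empires</footnote> is now the '
           'economic capital of <bold>Turkey</bold></par></article>')
    print(linearize(doc))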
3. Examples of uses
The first example was part-of-speech tagging, but any kind of treatment can be performed by the system S.
N.B.: Recall that an important constraint of XGTagger is
that at least one field of the user system output must contain
the initial text (blank characters excepted).
3.1. POS tagging upgrading: locution handling
If the system S is able to detect locutions, XGTagger
can deal with that feature, with a special option (called
special separator). With this option the user can
specify that a sequence of characters represents a separation
between words.
• Let's take the following XML element:
<sentence>I did it in order to clarify matters</sentence>
• XGTagger will input the following text into the system:
I did it in order to clarify matters
• With the special separator '///', S can return:
I                PP
did              VVD
it               PP
in///order///to  LOC
clarify          VV
matters          NNS
• With appropriate options, the XGTagger final output is:
<sentence>
<w id="1" pos="PP" t="I">I</w>
<w id="2" pos="VVD" t="do">did</w>
<w id="3" pos="PP" t="it">it</w>
<w id="4" pos="LOC" t="in///order///to">in</w>
<w id="4" pos="LOC" t="in///order///to">order</w>
<w id="4" pos="LOC" t="in///order///to">to</w>
<w id="5" pos="VV" t="clarify">clarify</w>
<w id="6" pos="NNS" t="matter">matters</w>
</sentence>
Note that the three words composing the locution get the same identifier.

3.2. Syntactic analysis

With the same special separator option, a syntactic analysis can be performed. Suppose that S groups together noun phrases of the form "NOUN PREPOSITION NOUN".
• For the following XML element:
<english_sentence>He has a taste<gloss>Taste: preference, a strong liking</gloss> for danger</english_sentence>
• ...XGTagger will give this text to the system (considering that 'gloss' is a jump tag):
He has a taste for danger . Taste: preference, a strong liking .
• S can perform a simple syntactic analysis and return, for example:
He has a taste_for_danger/NP . Taste: preference, a strong liking .
• With the XGTagger options -i -w 1 -2 pos -f "/" -d " " -e "_", the final output is:
<english_sentence>
<w id="1">He</w>
<w id="2">has</w>
<w id="3">a</w>
<w id="4" pos="NP">taste</w>
<gloss>
<w id="6">Taste:</w>
<w id="7">preference,</w>
...
<w id="10">liking</w>
</gloss>
<w id="4" pos="NP">for</w>
<w id="4" pos="NP">danger</w>
</english_sentence>

3.3. Lexical enrichment

The user's system can also return any information about the words. For example, a translation of each noun:
• XML Input:
<sentence>I had a conversation with my brother</sentence>
• S output (suggestion):
I had a conversation/entretien/Gespräch with my brother/frère/Bruder
• Options: second field is French, third field is German; Output:
<sentence>
<w>I</w>
<w>had</w>
<w>a</w>
<w french="entretien" german="Gespräch">conversation</w>
<w>with</w>
<w>my</w>
<w french="frère" german="Bruder">brother</w>
</sentence>

3.4. Reading Contexts finding

Finally, S can just repeat the input text (possibly with a simple separation of punctuation). The result is that words are enclosed between tags, reading contexts are brought together (by ids) and cut words are reassembled. This operation can be particularly interesting for traditional information retrieval; it can represent a first step before indexing XML documents4 or performing searches that take logical proximity [3] into account.
• XML Input:
<title>U<sc>nited</sc> S<sc>tates</sc> E<sc>lections</sc></title>
• S output (same as the input):
United States Elections
• Possible final output:
<title>
<w id="1" rc="United">U</w>
<sc>
<w id="1" rc="United">nited</w>
</sc>
<w id="2" rc="States">S</w>
<sc>
<w id="2" rc="States">tates</w>
</sc>
<w id="3" rc="Elections">E</w>
<sc>
<w id="3" rc="Elections">lections</w>
</sc>
</title>

4 An option of XGTagger adds the path of each element as one of its attributes.
4. Conclusion
We have presented XGTagger, a simple software system
aimed at simplifying the handling of semi-structured XML
documents.
XGTagger allows any tool developed for text-only documents, whether in the domain of information retrieval, natural language processing or any other document engineering field, to be applied to XML documents.
References
[1] L. Lini, D. Lombardini, M. Paoli, D. Colazzo, and C. Sartiani.
XTReSy: A Text Retrieval System for XML documents. In
D. Buzzetti, H. Short, and G. Pancalddella, editors, Augmenting Comprehension: Digital Tools for the History of Ideas.
Office for Humanities Communication Publications, King’s
College, London, 2001.
[2] H. Schmid. Probabilistic Part-of-Speech Tagging Using Decision Trees. In International Conference on New Methods in
Language Processing, Sept. 1994.
[3] X. Tannier. Dealing with XML structure through "Reading
Contexts". Technical Report 2005-400-007, Ecole Nationale
Supérieure des Mines de Saint-Etienne, Apr. 2005.
[4] X. Tannier. XGTagger, a generic interface for analysing XML
content. Technical Report 2005-400-008, Ecole Nationale
Supérieure des Mines de Saint-Etienne, July 2005.
[5] X. Tannier. XGTagger User Manual. http://www.emse.fr/~tannier/XGTagger/Manual/, June 2005.
The lifespan, accessibility and archiving of dynamic documents
Katarzyna Wegrzyn-Wolska
ESIGETEL, Ecole Supérieure d’Ingénieurs en Informatique et Génie des Télécommunications,
77215 Avon-Fontainebleau, France
[email protected]
Abstract
Today most Web documents are created dynamically. These documents do not exist in reality; they are created automatically and they disappear after consultation. This paper surveys the problems related to the lifespan, accessibility and archiving of these pages. It introduces definitions of the different categories of dynamic documents. It also describes the results of our statistical experiments performed to evaluate their lifespan.
1 Introduction
Is a dynamic document a real document, or is it only the temporary presentation of data? Is it any document created automatically, or is it a document created as a response to the user's action? The term "dynamic" can be used with different meanings: for an HTML Web page with some dynamic parts like layers, scripts, etc., but the term is more often used for the pages created on-line by the Web server. This paper deals with the problems related to the documents created on-line.
2 The Lifespan and Age of Dynamic Documents
How can the lifespan of dynamic documents be evaluated? These documents disappear immediately from the computer's memory after their consultation. In this paper we define the lifespan of a dynamic document as the period during which a given demand results in the same given response. This period is the time observed by the user as the document's lifespan. A user surfing the Web in his browser does not know how a document was created, so he does not perceive the difference between a static and a dynamic document.
How can the age of dynamic documents be determined? Can we consider the value of the HTTP headers Last-Modified and Expires, or the value fixed in the HTML file with the META tag Expires, to indicate exactly when the document was changed or when it can be considered to have expired?
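As a minimal illustration of reading these headers (this sketch is not part of the experiments described in this paper; the URL is a placeholder and many servers do not send both headers):

    from urllib.request import urlopen
    from email.utils import parsedate_to_datetime

    def header_dates(url):
        """Return the Last-Modified and Expires dates of a document, when present."""
        with urlopen(url) as response:
            last_modified = response.headers.get("Last-Modified")
            expires = response.headers.get("Expires")
        parse = lambda value: parsedate_to_datetime(value) if value else None
        return parse(last_modified), parse(expires)

    # Example (placeholder URL):
    # print(header_dates("http://www.example.org/"))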
3 Dynamic Documents Categories
We distinguish two kinds of dynamic documents: documents created and modified automatically (news sites, chat
sites, weblogs, etc.) and documents created as an answer
to the user requests (the results pages given by the Search
Engines, the responses obtained by filling in the data form,
etc.).
We will analyse these documents separately in two categories. The first category is represented by the response pages obtained from the Search Engines. The second category contains the pages from the different news sites and the Weblog sites.
3.1 News Published on the Web
There are numerous web sites which publish news. The news sites publish different kinds of information in different presentation forms [3]. News is a very dynamic kind of information, constantly updated. The news sites have to be interrogated frequently so as not to miss any of the news. On the other hand, it is often possible to reach old articles from the archival files available on these sites. The archival life varies between the different sites. The updating frequency and the archival life for some news sites are presented in Table 1. This information, which we evaluated, was confirmed by the site administrators.
3.2 The Weblog Sites
A weblog, web log or simply a blog, is a web application,
which contains periodic posts on a common webpage. It is
a kind of online journal or diary frequently updated [1, 2].
Table 1. Updating news frequency and archival life.
Service news       Update              Archiving
French Google      about 20 min        30 days
Google             about 20 min        30 days
Voila actuality    every day           1 week
Voila news info    instantaneously     1 week
Yahoo!News         instantaneously     1 week
TF1 news           instantaneously     -
News now           5 min               -
CategoryNet        every day           -
CNN                instantaneously     never ending
Company news       about 40 per day    2003, 2004, 2005 archived
Figure 1. Visit Frequency of indexing robots.
3.3 Search Engines

The Search Engines' response pages are dynamic pages created on-line. The lifespan of a given response page (the period during which the Search Engine's answer doesn't change) depends on the data retrieved from the Search Engine's index-database. It is evident that this time is correlated with the updating frequency of the index-database. Table 2 shows examples of the Search Engines' index-database updating frequencies.

Table 2. Index-database updating frequency.
Search Engine   Updating frequency
Google          4 weeks; some pages are updated quasi-daily
Yahoo!          3 weeks
All the Web     varies frequently; publishes the date of the robots' visit; since 2004, index-database together with Yahoo!
AltaVista       since 2004, index-database together with Yahoo!

4 Archiving

Dynamic documents can be printed, saved by the user or put into special caching and archiving systems. There are many Web applications which store the current image of the Web (for example the Wayback Machine developed by The Internet Archive1). These applications try to retrieve and save all of the visible Web [4]. It is evident that this task is very difficult. The WWW is enormous and it changes and grows very fast. An unfortunate side effect of this continual growth and dynamic modification is that it is impossible to save the totality of Web images. We have compared the data from the GoogleNews and BBC archives presented by the Wayback Machine with our statistical data (Table 2, Table 1). This comparison clearly shows that this archive is incomplete.
1 http://www.archive.org/index.php

5 Statistical Evaluation

We have carried out the following statistical evaluation: index-database updating frequency (for Search Engines and Meta Search Engines) and different statistical tests of the News sites and the Weblogs.

5.1 Search Engines

To estimate the updating frequency of index-databases we have analysed different log files and we have calculated the frequency of access to the Search Engines carried out by different indexing robots [5, 6, 7]. Figure 1 shows an example of log data concerning the robots' visits.

5.2 News sites and Web logs

We have carried out some statistical tests to evaluate the updating frequency (lifespan) [7] of News. The results showed the different behaviour of the interrogated sites (Figure 6a, Figure 6b, Table 3). We have analyzed four categories of sites (a rough sketch of one way to measure such lifespans follows the list):
- Sportstrategies, the sport news service,
- News on the site of the French television channel TF1,
- News from the BBC site,
- a Weblog site (Slashdot.org).
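The sketch below only outlines the idea (polling a page and timing content changes); the URL is a placeholder, the interval is arbitrary, and simple MD5-based change detection stands in for the actual test procedure used for the results reported here.

    import hashlib
    import time
    from urllib.request import urlopen

    def observe_lifespan(url, interval=60, rounds=10):
        """Poll a page and print the elapsed time whenever its content changes."""
        previous_hash = None
        last_change = time.time()
        for _ in range(rounds):
            content = urlopen(url).read()
            digest = hashlib.md5(content).hexdigest()
            now = time.time()
            if previous_hash is not None and digest != previous_hash:
                print(f"content changed after {now - last_change:.0f} seconds")
                last_change = now
            previous_hash = digest
            time.sleep(interval)

    # observe_lifespan("http://www.example.org/news")   # placeholder URL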
Figure 2. Sportstrategies: News lifespan.
Figure 3. BBC: lifespan of the news.
Figure 4. TF1 News: lifespan of the news. a) News 24 hours/24; b) News at working hours.

Sportstrategies is an example of a very regular News site, with a constant update time (every hour: Figure 2).
BBC News is diffused online; the lifespan is very irregular because the information is updated instantaneously when present (Figure 3).
TF1 News is updated frequently during the day. On the other hand there are no modifications by night. The lifespan of the News pages is very different in these two cases. We have presented it in two separate graphs (Figure 4a and Figure 4b). The two high peaks at the extremes of the graph in Figure 4a correspond to the long period without any changes during the night.
The Slashdot.org Weblog site represents the last category of site. This collective weblog is one of the more popular blogs oriented towards Open Source. The data changes here very quickly, new articles are published very often and the ongoing discussions continue without any break. The lifespan of these dynamically changed pages is extremely short (Figure 5); the mean lifespan is equal to 77 sec. (Table 3).
Updating frequency values. In the following graphs (Figure 6) and in Table 3, comparative updating frequency values for some tested sites are presented. We have found the maximal and minimal values of the updating frequency and calculated the mean. The results confirm that the content of the news sites changes very often.
6 Conclusion
Dynamic documents do not exist in reality; they disappear from the computer memory directly after consultation. Their real lifespan is very short. The sites can be classified into different categories depending on the news-updating period: very regular, with a constant update time; irregular, with information updated when present. The sites can also be classified into two categories depending on the refresh time: slow, with a refresh time greater than 10 minutes; fast, with information refreshed as often as about every 10 seconds.
Table 3. Updating frequency
Tested site                  Mean lifespan   Min.     Max.
Slashdot.org                 77 sec          10 sec   22 min
BBC News                     8.5 min         1 min    66 min
TF1 news (24/24)             19.5 min        1 min    502 min
TF1 News (working hours)     6.3 min         1 min    49 min
Sportstrategies              56 min          9 min    61 min

Figure 5. Slashdot: lifespan of articles.
Some news sites present periodic activity: e.g., the news site of the French television channel TF1 is updated only during working hours.
On the other hand, dynamic documents can be stored by
special archiving systems and in fact, users can access them
for a long time. Management of the archived dynamic document’s lifespan is identical to that of static documents, because the dynamic documents are stored in the same way as
static ones.
Figure 6. Updating frequency. a) TF1: 24/24; b) TF1: working hours.

References
[1] R. Blood. The Weblog Handbook: Practical Advice on Creating and Maintaining your Blog. 2002.
[2] S. Booth. C’est quoi un Weblog. 2002.
[3] A. Christophe. Chercher dans l’actualite recente ou les
archives d’actualites francaises et internationale, on-line
http://c.asselin.free.fr, 2004.
[4] S. Lawrence. Online or invisible ? Nature, 411(687):521, Jan
2001.
[5] K. Wegrzyn-Wolska. Etude et realisation d’un meta-indexeur
pour la recherche sur le Web de documents produits par
l’administration francaise. PhD thesis, Ecoles Superieures
de Mines de Paris, DEC 2001.
[6] K. Wegrzyn-Wolska. Fim-metaindexer: a meta-search engine purpose-built for the French civil service and the statistical classification and evaluation of the interrogated search engines using fim-metaindexer. In G. J.T.Yao, V.V.Raghvan, editor, The Second International Workshop on Web-based Support Systems, In Conjunction with IEEE WIC ACM WI-IAT'04, pages 163–170. Saint Mary's University, Halifax, Canada, 2004.
[7] K. Wegrzyn-Wolska. Le document numerique: une etoile filante dans l’espace documentaire. Colloque EBSI-ENSSIB;
Montreal 2004, 2004.
SYRANNOT: Information retrieval assistance system on
the Web by semantic annotations re-use
Wiem YAICHE ELLEUCH (1), Lobna JERIBI (2), Abdelmajid BEN HAMADOU (3)
(1, 3) LARIM, ISIMS, Sfax, Tunisia
(2) RIADI GDL, ENSI, Manouba, Tunisia
(1) [email protected]  (2) [email protected]  (3) [email protected]
Abstract:
In this paper, the SYRANNOT system, implemented in Java, is presented. Relevant retrieved documents are given to the current user for his query and adapted to his profile. SYRANNOT is based on the mechanism of Case Based Reasoning (CBR). It memorizes the research sessions (user profile, query, annotation, session date) carried out by users, and re-uses them when a similar research session arises. The first experimental evaluation carried out on SYRANNOT has shown very encouraging results.
1. Introduction
Case Based Reasoning (CBR) is a problem resolution approach based on the re-use, by analogy, of previous experiences called cases [AAM 94][KOL 93][SCH 89]. Several research assistance systems based on CBR have been developed: RADIX [COR 98], CABRI [SMA 99], COSYDOR [JER 01]. Our approach consists in applying CBR to the semantic annotations coming out of the semantic Web domain. CBR has various advantages (information transfer between situations, evolutionary systems, etc.). Nevertheless, its integration presents some difficulties, such as the representation, memorizing, re-use and adaptation of the cases. These four key words constitute the CBR cycle and are the subject of our study. In the following, the SYRANNOT system architecture is presented. It integrates CBR on the semantic annotations. Special attention is given to the knowledge modelling of the reasoning, as well as to the search algorithms and the similarity calculation functions, in each stage of the CBR cycle.
2. SYRANNOT Architecture
We propose two scenarios of SYRANNOT use: the first is related to memorizing the research sessions (cases) carried out by the user in an RDF data base. Research sessions are RDF statements based on ontology models in the OWL language. The second concerns re-using cases by applying search algorithms and similarity functions to collect the cases most similar to the current one, and exploiting them in order to present to the current user relevant retrieved documents for his query, adapted to his profile. In the following, both scenario processes are detailed.
2.1 Cases memorizing scenario
A user having a given profile, memorized in the user
profiles RDF data base in the form of RDF statements
based on user ontology, expresses his need of
information by formulating a query which he submits
to the search engine. It collects and presents to the
user the retrieved documents. When the current user
finds a document which he considers relevant to his
query, he annotates it. The annotation created is
memorized in the RDF data base of the annotations in
the form of RDF statements based on ontology
annotation.
The research session (user profile
identifier, submitted query, annotation identifier,
session date) is memorized in the RDF data base of
cases in the form of RDF statements based on the
cases ontology. Figure 1 presents the scenario
proposed to memorize cases.
Search
engine
Answers
documents
Cases base
Relevant
document
query
annotation
annotations
Base
user
User profiles
base
Figure 1 : Scenario of memorizing case
The case memorizing scenario is illustrated by the interfaces in figures 2, 4, 5 and 6. Figure 2 shows the new user registration interface. The user, having a single identifier (PID) assigned by the system, fills in
his name (yaiche), his first name (wiem), his login
(wiem), his password (****) and a set of interests
which he selects from the domain ontology (Case
based reasoning, annotation). The domain ontology is
organised in a concepts tree. The user profile created
is memorized in RDF data base of user profiles in the
form of RDF statements, based on the user ontology
modelled in OWL (figure 3). The RDF data base of user profiles will be re-used later.

Figure 2: New user registration interface
Figure 3: User ontology diagram

Figure 4 shows the SYRANNOT home page. The enrichment of the cases data base consists in submitting queries (semantic Web) to the google search engine, collecting the retrieved documents, and annotating the documents relevant to the query.

Figure 4: Scenario choice interface

Figure 5 shows the answers collected by google for the submitted query. It is a list of URLs, each one preceded by an icon. When the user finds a document which he considers relevant to his query, he annotates it by clicking on the icon.

Figure 5: Google retrieved documents for a query

Figure 6 shows the interface which permits the user to annotate a document he considers relevant to his query. The annotation consists, on the one hand, in determining the standardized Dublin Core properties such as the URL (http://www.scientificamerican....), the title (the semantic Web), the author (Tim Berners-Lee, James Hendler, Ora Lassila), the date (May 2001) and the language of the document (English), and, on the other hand, in selecting a set of concepts from the domain ontology in order to describe the document according to the user's point of view (semantic Web definition, ontology definition, annotation definition).

Figure 6: Annotation creation interface

The annotation created has a single identifier (AID) assigned by the system, and is memorized in the RDF data base of annotations, in the form of RDF statements based on the annotation ontology (figure 7).
The RDF data base of annotations will be re-used later.

Figure 7: Annotation ontology diagram

The research session (the user identifier, the submitted query, the annotation identifier, the session date) is memorized in the RDF data base of cases, in the form of RDF statements based on the cases ontology (figure 8).

Figure 8: Case ontology diagram

The scenario presented above corresponds to the stages of representation and memorizing of the CBR cycle.

2.2 Cases re-use Scenario

The current user, having a given profile memorized in the RDF data base of user profiles, formulates his query by selecting one or more concepts from the domain ontology. The system scans the RDF data base of annotations and collects those having at least one concept of the query in their annotation terms field. The system filters these annotations by calculating the similarity between the current query and the annotation terms of each annotation, in order to retain the 20 most relevant annotations. Then, the system reclassifies them by calculating the similarity between the current user profile and the profile of the user who created the annotation. Finally, the system extracts and presents to the current user some information about the relevant documents (URL, author, date, etc.). Figure 9 illustrates the cases re-use scenario.

Figure 9: Cases re-use scenario

In figure 4, the link recherche sur SYRANNOT permits the current user to obtain retrieved documents from previous similar experiments (similar profiles, similar queries).
Figure 10 presents the query formulation interface which allows the user to interrogate the memorized cases via SYRANNOT. The user expresses his need of information by selecting concepts from the domain ontology (semantic Web definition). He can also make an advanced search on the author, the date or the language of the document.

Figure 10: Query formulation interface
SYRANNOT scans the RDF data base of annotations
and collects all the annotations containing at least one
element of the query in the annotations terms field.
SYRANNOT then filters these annotations in order to retain the 20 most relevant ones, by using the JENA API [JENA] developed by HP (the objective of JENA is to develop applications for the semantic Web) and by calculating the similarity between the concepts of the current query and the concepts of the terms field of each annotation.
JENA is used to carry out inferences on the ontologies
and on the RDF data bases.
The similarity calculation between two sets of concepts is carried out by using the Wu-Palmer formula:

Sim(A, B) = 1/2 * ( (1/|A|) * Σ_{Ai∈A} max_{Bi∈B} ConSim(Ai, Bi) + (1/|B|) * Σ_{Bi∈B} max_{Ai∈A} ConSim(Ai, Bi) )

with:
A: set of concepts {Ai}, |A| the cardinal of A,
B: set of concepts {Bi}, |B| the cardinal of B,
ConSim(C1, C2): similarity calculation function between two concepts C1 and C2 in a concept tree.
ConSim(C1, C2) = 2 * depth(C) / (depth_C(C1) + depth_C(C2))

where C is the smallest generalization of C1 and C2 in number of arcs, and depth(C) is the number of arcs which separate C from the root.
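The following minimal sketch (an illustration only, not the SYRANNOT code; the concept tree is an invented toy example) implements ConSim and Sim as defined above over a tree given by parent pointers:

    # Toy concept tree, given as child -> parent (None marks the root).
    PARENT = {
        "root": None,
        "information retrieval": "root",
        "semantic Web": "root",
        "annotation": "semantic Web",
        "ontology": "semantic Web",
    }

    def ancestors(c):
        """Path from c up to the root, c included."""
        path = [c]
        while PARENT[c] is not None:
            c = PARENT[c]
            path.append(c)
        return path

    def depth(c):
        """Number of arcs separating c from the root."""
        return len(ancestors(c)) - 1

    def consim(c1, c2):
        # C: the smallest generalization of c1 and c2 in the tree
        anc1 = set(ancestors(c1))
        c = next(a for a in ancestors(c2) if a in anc1)
        return 2 * depth(c) / (depth(c1) + depth(c2))

    def sim(a, b):
        """Wu-Palmer similarity between two sets of concepts (formula above)."""
        best = lambda x, others: max(consim(x, y) for y in others)
        return 0.5 * (sum(best(ai, b) for ai in a) / len(a)
                      + sum(best(bi, a) for bi in b) / len(b))

    # Example: sim({"annotation"}, {"ontology", "information retrieval"})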
The system then reclassifies the 20 relevant
annotations by calculating the similarity between the
current user profile and the profile of the user who has
created the annotation by using JENA and the Wu
Palmer formula. The system extracts from each
annotation and presents to the user the URL, the title,
the author, the language of the document, as well as
the query submitted to google for a possible
reformulation of the user query (figure 11).
Figure 11: SYRANNOT Results
The scenario presented above corresponds to the stages of re-use and adaptation of the CBR cycle.

3. SYRANNOT tests and evaluations

To evaluate the contribution of SYRANNOT, we have initialized the data bases of cases, profiles and annotations by simulating research sessions. Thus, we have built a corpus including a hundred PDF scientific documents annotated using the domain ontology. The first evaluations showed that the fact that the concepts used for the annotations are elements of the current query permits SYRANNOT to provide significant assistance to the current user. Our current research focuses on comparing the performance of SYRANNOT with other existing systems based on annotations.

4. Conclusion
In this paper, we presented the SYRANNOT system
architecture which assists a user in the information
retrieval session by presenting relevant retrieved
documents for his query and adapted to his profile.
SYRANNOT integrates the CBR mechanism in the
semantic annotations coming out of the semantic Web
field. Ontological models were presented, as well as
the research algorithms and the similarity calculation
functions proposed in each stage of the CBR cycle.
Experimental evaluations have shown very
encouraging results in particular when the data base of
cases is large and diversified.
References
[AAM 94] AAMODT, A., PLAZA, E. Case-Based
Reasoning : Foundational Issues, Methodological
Variations and System Approaches. March 1994, AI
Communications, the European Journal on AI, 1994,
Vol 7, N°1, p. 39-59.
[COR 98] CORVAISIER, F., MILLE, A., PINON,
J.M. Radix 2, assistance à la recherche d'information
documentaire sur le web. In IC'98, Ingénierie des
Connaissances, Pont-à-Mousson, France, INRIA-LORIA, Nancy, 1998, p. 153-163.
[JENA] jena.sourceforge.net/
[KOL 93] KOLODNER, J. Case based reasoning. San
Mateo, CA: Morgan Kaufman, 1993.
[JER 01] JÉRIBI, L. Improving Information Retrieval
Performance by Experience Reuse. Fifth International
ICCC/IFIP conference on Electronic Publishing: '2001
in the Digital Publishing Odyssey' ELPUB2001.
Canterbury, United Kingdom, 5-7 July 2001, p.78-92.
[SCH 89] SCHANK, R. C., RIESBECK, C. K. Inside
Case Based Reasoning. Hillsdale, New Jersey, Usa :
Lawrence Erlbaum Associates Publishers, 1989, 423
p.
[SMA 99] SMAÏL, M. Recherche de régularités dans
une mémoire de sessions de recherche d’information
documentaire", InforSID’99, actes des conférences,
XVIIème congrès, La Garde, Toulon, 2-4 juin 1999, p.
289-304.
Search in Peer-to-Peer File-Sharing System:
Like Metasearch Engines, But Not Really
Wai Gen Yee, Dongmei Jia, Linh Thai Nguyen
Information Retrieval Lab
Illinois Institute of Technology
Chicago, IL 60616
Email: {yee, jiadong, nguylin}@iit.edu
Abstract
Peer-to-peer information systems have gained prominence of late with applications such as file sharing systems
and grid computing. However, the information retrieval
component of these systems is still limited because traditional techniques for searching and ranking are not directly
applicable. This work compares search in peer-to-peer information systems to that in metasearch engines, and describes how they are unique. Many works describing advances in peer-to-peer information retrieval are cited.
1 Introduction
File-sharing is a major use of peer-to-peer (P2P) technology. CacheLogic estimates that one-third of all Internet bandwidth is consumed by file-sharing applications [2].
Although much of this bandwidth consumption can be attributed to the large sizes of the shared files (i.e., media
files), the fact that the usage has been consistent suggests its popularity. Increasing system size increases the
importance of search technology that helps rank query results.
The task of effective ranking is exactly the goal of information retrieval (IR). However, traditional IR ranking does
not function effectively in a P2P environment. Some work
in the area of Web IR, however, is similar: the metasearch
engine whose goal is to dispatch queries to other search engines and then rank their results. The goal of this paper is to
explain the shortcomings of traditional IR in the P2P environment as well as the similarities and differences between
metasearch engines and search in the P2P environment.
The impact of improved ranking should be significant
in terms of resource usage as well. The popular Gnutella-based P2P file sharing systems basically flood the network
with queries, which is bandwidth intensive. By improving
ranking effectiveness, fewer queries need to be issued to
find a particular data object. Furthermore, effective ranking
reduces the likelihood that a user will accidentally find and
download interesting, but unrelated data.
2 Peer-to-Peer File Sharing Model
Our model is based on that which exists in common P2P
file sharing systems, such as Gnutella and Kazaa [8]. Peers
of a P2P system collectively share a set of data objects
by maintaining local replicas of them. Each replica1 (of a
data object) is a file (e.g., a music file), which is identified
by a descriptor. A descriptor is a metadata set, which is
composed of terms. Depending on the implementation, a
term may be a single word or a phrase. (A metadata set is
technically a bag of terms, because each term may occur
multiple times.)
A peer acts as a client by initiating a query for a particular data object (as opposed to any one of a category of data
objects). A query is also a metadata set, composed of terms
that a user thinks best describe the desired data object. A
query is routed to all reachable peers, which act as servers.
Query results are references to data objects that fulfill the
matching criterion
D_O ⊇ Q, where Q ≠ ∅,    (1)
where D_O is the descriptor of data object O, and Q is the
query. In other words, by design, the data object’s descriptor must contain all the query terms [14].
A query result contains the data object’s descriptor as
well as the identity of the source server. The descriptor
helps the user distinguish the relevance of the data object
to the query, and the server identity is required to initiate
the data object’s download.
1 We use the term replica and data object interchangeably.
Once the user selects a result (for download), a local
replica of the corresponding data object is made. In addition, the user has the option of manipulating the replica’s
descriptor. He may manipulate it for personal identification
or to better share it in the P2P system.
The set of peers in a P2P file sharing system is connected
in a general graph topology. Generally, peers join the system at arbitrary points, creating a random graph, although
other topologies are possible and may yield performance
benefits [13, 15, 18, 20].
Note that one major variation to the model is in what data
are shared. We assume that data are binary objects, and,
to be effectively shared, need to be identified via metadata
in descriptors. Shared data, however, may be text. In this
case, the data are self-describing, containing text that can
be searched directly. This distinction may be important because the ranking scores of self-describing data objects are
consistent, not being dependent on user-tunable descriptors.
Furthermore, self-describing data objects are easier to rank
because the information contained in descriptors tends to be
more sparse and less consistent.
Another variation is in the way the network and data are
organized. We assume a random graph, but more structure
may be introduced, such as a ring or mesh, as suggested
above. Furthermore, in such systems, data are often restricted in where they can be placed. For example, in the
DHT described in [20], a ring network topology is enforced,
and a data object is placed on a node with a node identifier
that most closely matches the data object’s object identifier.
Consequently, replication of data is not allowed.
3 Similarity to Metasearch Engines

Metasearch engines' main selling points are their ability to search a larger data repository and return results that are ranked better. These features stem from the fact that different data sources (other search engines) may index different data repositories, and, if their data repositories overlap, they can improve overall ranking by corroborating or contradicting each others' rankings.
The main tasks carried out by a metasearch engine include source selection, query dispatching, result selection, and result merging [11]. Source selection is the process of selecting the search engines to query. Query dispatching is the process of translating a query to the search engine's local format, preserving the semantics of the final results. Result selection is the selection of the results returned by a search engine for consideration in the final results. Result merging is the ranking of the selected documents.
These tasks have analogs in P2P file sharing systems because they and metasearch engines both work in an environment where there are many independent, heterogeneous data sources. The difference, as we shall see below, is in the dynamism of the P2P environment.

3.1 Source Selection
Source selection in metasearch engines is done by maintaining statistics on the contents of each search engine. This
is often done by sampling [5]. Terms are extracted from a
pre-defined corpus, and the contents of a search engine are
deduced based on the results.
Source selection in P2P file sharing systems is related to
the task of query routing because of the topology of the network. Because of the size of the network, two peers may be
connected, but only through intermediate peers. The most
general form of source selection, used by Gnutella [8], is
through flooding, where queries are routed to all neighbors
in a breadth-first fashion, until a certain query time-to-live
has expired.
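A rough sketch of TTL-limited flooding over an assumed peer graph is given below; real Gnutella messaging (message identifiers, duplicate suppression, per-message hop counting) is considerably more involved.

    from collections import deque

    # Assumed random peer graph: peer -> neighbours.
    NEIGHBOURS = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}

    def local_results(peer, query):
        return []          # placeholder for the per-peer matching of Section 2

    def flood(start, query, ttl):
        """Forward the query breadth-first until the time-to-live is exhausted."""
        seen = {start}
        frontier = deque([(start, ttl)])
        hits = []
        while frontier:
            peer, remaining = frontier.popleft()
            hits.extend(local_results(peer, query))
            if remaining == 0:
                continue
            for neighbour in NEIGHBOURS[peer]:
                if neighbour not in seen:        # do not revisit peers
                    seen.add(neighbour)
                    frontier.append((neighbour, remaining - 1))
        return hits

    # flood("A", {"beatles", "help"}, ttl=2)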
Alternatives to flooding include the use of previous query
responses and the publication of content signatures to intelligently route queries. A peer may learn how responsive
its neighbors are based on its responses to past queries and
may route future queries accordingly [6, 16, 19]. Another
way a peer can control routing is by looking at signatures
that servers generate to describe their shared content [4, 9].
Queries are routed to servers whose signatures are the best
matches.
Finally, many P2P routing algorithms are based on distributed hash tables [13, 15, 20], which efficiently route single keys to nodes with the closest-matching node identifiers.
The problem of searching for data objects described with multiple terms has been addressed in various ways, such as by
generating a query for each term in the query or by using
unhashed queries and unhashed signatures to describe content [3, 17, 21].
The sampling technique used by metasearch engines
does not in general work for P2P file sharing systems because of the dynamism and topology of the latter. Because
all peers are autonomous, they can leave the network at any
time. This can render any collected statistics obsolete.
3.2 Query Dispatching
Query dispatching in metasearch engines has received
little attention because it is considered straightforward. For
example, to express term weights in a query, certain terms
may have to be repeated in the translated query. A certain
number of results may be desired from each search engine
so that the total number of results returned to the metasearch
engine is fixed; this can be adjusted as well.
Query dispatching in P2P file sharing systems has likewise received little attention in the literature. In general, all peers that have been "selected" are given the same query and are assumed to use the same ranking function. This is
generally the case in practical P2P file sharing systems. Furthermore, their results are not ranked; results basically have to conform to the matching criterion described in Section 2. It therefore makes little sense to modify a query.
Attempts to improve performance by query transformation have been limited. One attempt uses query expansion, building graphs that connect related terms as synonyms [12]; this term graph is generated dynamically from the data stored locally on each peer. We are currently also using a process we call query masking, which grows and shrinks queries at the client or server to tune the results that eventually reach the client [24]. The idea behind query masking is to control the recall and precision of query results by selecting a subset of the terms in the query, as sketched below.
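The sketch below illustrates only the term-subset idea behind query masking; it is not the algorithm of [24]. The rarest-terms-first heuristic and the document-frequency table are assumptions chosen for illustration.

# Illustrative term-subset selection: keep the `keep` most selective
# (lowest document-frequency) terms of a query. Under conjunctive matching,
# fewer terms broaden the result set (higher recall), while more terms
# narrow it (higher precision).
def mask_query(terms, doc_freq, keep):
    ranked = sorted(terms, key=lambda t: doc_freq.get(t, 0))
    return ranked[:max(1, keep)]

query = ["beatles", "live", "mp3", "1965"]
df = {"beatles": 120, "live": 5000, "mp3": 90000, "1965": 800}
print(mask_query(query, df, keep=2))   # ['beatles', '1965']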
Applying traditional query transformation techniques in
a P2P file sharing environment is also made difficult by
the scale of the system. To effectively transform a query,
the client must maintain statistical information about each
server. With the number of potential servers in the millions, maintaining such statistics, and hence applying traditional transformation methods, is impractical.
3.3 Result Selection
Result selection in metasearch engines is performed using knowledge of each search engine's relevance to a particular query. In general, more relevant search engines are asked to return more results, so that the metasearch engine can tune the final number of results returned to the user.
Result selection requires that the search engine rank results so that the top few can be returned. Ranking is generally not supported by servers in P2P file sharing systems. Recent research efforts, however, have incorporated ranking into the servers. In [9], for example, specialized nodes (known as ultrapeers) in the P2P system function as servers for particular content, and all peers that have such content are directly connected to them. All queries are routed to the relevant ultrapeer, which, knowing the contents and ranking function of each of its attached peers (K-L divergence, in this case), can perform effective document selection. In [4], the client locally ranks results from servers, keeping only the top-ranked ones for the final result.
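A minimal sketch of such client-side selection, assuming a stream of result records and a locally defined scoring function (neither is specified at this level of detail in [4]), might look as follows.

# Keep only the k best results on the client, since servers cannot be
# relied on to rank. The scoring function is the client's own (an assumed
# stand-in, e.g. a locally estimated tf-idf).
import heapq

def select_results(result_stream, score, k=10):
    top = []  # min-heap of (score, tiebreaker, result)
    for i, r in enumerate(result_stream):
        heapq.heappush(top, (score(r), i, r))
        if len(top) > k:
            heapq.heappop(top)  # discard the current worst
    return [r for _, _, r in sorted(top, reverse=True)]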
In general, however, metasearch engine result selection
is inapplicable in P2P file sharing environments because it is
difficult to control the servers to which a query is sent, let alone to maintain knowledge of every potential server's contents and ranking functions. Furthermore, it is difficult
to maintain a P2P network that effectively clusters peers
based on their shared content. This complicates the implementation of a hybrid architecture containing ultrapeers.
3.4 Result Merging
Result merging in metasearch engines generally employs
ranking scores returned in the result set of each of the selected search engines. Each score is normalized using knowledge of each search engine's relevance to the query, and a final result set is then created containing the results with the highest normalized scores. Alternatively, if results refer to text documents, all top-ranked
documents from each search engine can be downloaded by
the metasearch engine to perform local ranking. Duplicate
results can be handled by maintaining only the maximum,
the sum, or the average rank score.
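A hedged sketch of this merging scheme, assuming per-engine relevance weights are available and using the maximum normalized score as the duplicate rule (sum or average are drop-in alternatives), is given below.

# Metasearch-style result merging: normalize each engine's scores by an
# assumed per-engine relevance weight, then resolve duplicates by keeping
# the maximum normalized score.
from collections import defaultdict

def merge_results(per_engine_results, engine_weight, top_n=10):
    """per_engine_results: {engine: [(doc_id, raw_score), ...]}
    engine_weight: {engine: float}, an assumed estimate of each engine's
    relevance to the query."""
    merged = defaultdict(float)
    for engine, results in per_engine_results.items():
        if not results:
            continue
        max_raw = max(s for _, s in results) or 1.0  # guard against all-zero scores
        for doc_id, raw in results:
            normalized = engine_weight.get(engine, 1.0) * raw / max_raw
            merged[doc_id] = max(merged[doc_id], normalized)  # duplicate rule: max
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:top_n]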
Result merging in P2P file sharing systems poses two
fundamental problems. First, it assumes that the client has
knowledge of the ranking process of the servers, which is
unlikely, considering the heterogeneity and dynamism of
the system. Second, it assumes that any peer can perform ranking at all, which is not a certainty given the lack of global statistics.
Traditional IR techniques have been adapted to result
merging in P2P environments with reasonable levels of effectiveness [4, 7, 10]. In [4], servers are ranked and then sequentially searched until a server’s result set does not affect
the current top N results. In [10], semi-supervised learning and Kirsch’s algorithm are used for result merging in
ultrapeers. A novel result in [22], however, is that group size (the number of results that refer to the same data object for a given query) is a better ranking metric than tf-idf. Furthermore, [23] shows that different ranking functions can be
effectively used to find data of varying popularity in a P2P
file sharing system.
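The following small sketch illustrates group-size ranking, assuming each result carries an identifier (e.g., a content hash) for the underlying data object; [22] does not prescribe this exact implementation.

# Group results that refer to the same underlying data object and rank the
# objects by group size: objects returned by more peers rank higher.
from collections import Counter

def rank_by_group_size(results):
    """results: iterable of records, each with an 'object_id' identifying
    the underlying data object (an assumed field, e.g. a content hash)."""
    group_sizes = Counter(r["object_id"] for r in results)
    return [obj for obj, _ in group_sizes.most_common()]

hits = [{"object_id": "a"}, {"object_id": "b"}, {"object_id": "a"},
        {"object_id": "a"}, {"object_id": "b"}]
print(rank_by_group_size(hits))   # ['a', 'b']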
4 Conclusion
Information retrieval in P2P file sharing systems is complicated because global statistics are hard to collect, owing to the dynamism of the system in terms of the shared content, the availability of peers, and the topology of the network. A fundamental question, therefore, is whether IR can be performed at all in P2P systems.
The works cited in this paper present solutions that put
various levels of constraints on the system (e.g., from a random to a fixed network topology). These constraints affect the applicability of a P2P system (e.g., a fixed topology
would be appropriate for a grid system, but inappropriate
for today’s file sharing systems). As systems become more
constrained, it seems, traditional IR becomes more applicable, because more global statistics can be harvested. In
unconstrained environments, the work becomes more challenging, as fewer assumptions can be made and less information is available; pre-existing IR techniques lose relevance. In effect, we propose that work be done carefully
considering the parameters of the P2P system, including the
network topology, the autonomy of the peers, the type of data
shared, and the distribution of data. Contributions can still
be made in constrained environments, but fundamental advances, such as link analysis in Web IR [1], can only be
made in unconstrained ones.
References
[1] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. World Wide Web Conf., 1998.
[2] CacheLogic. CacheLogic home page. Web document. www.cachelogic.com.
[3] A. Crainiceanu, P. Linga, J. Gehrke, and J. Shanmugasundaram. Querying peer-to-peer networks using P-trees. In Proc. Wkshp. Web and Databases, Paris, France, 2004.
[4] F. M. Cuenca-Acuna and T. D. Nguyen. Text-based content search and retrieval in ad hoc P2P communities. In Proc. Intl. Wkshp. Peer-to-Peer Comp., May 2002.
[5] P. G. Ipeirotis and L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. In Proc. VLDB, pages 394–405, 2002.
[6] V. Kalogeraki, D. Gunopulos, and D. Zeinalipour-Yazti. A local search mechanism for peer-to-peer networks. In Proc. ACM Conf. on Information and Knowledge Mgt. (CIKM), 2002.
[7] I. A. Klampanos, J. J. Barnes, and J. M. Jose. Evaluating peer-to-peer networking for information retrieval within the context of meta-searching. In Proc. Euro. Conf. on Inf. Ret., pages 528–536, 2003.
[8] T. Klingberg and R. Manfredi. Gnutella protocol 0.6. Web document, 2002. rfcgnutella.sourceforge.net/src/rfc-0 6-draft.html.
[9] J. Lu and J. Callan. Content-based retrieval in hybrid peer-to-peer networks. In Proc. ACM Conf. on Information and Knowledge Mgt. (CIKM), pages 199–206, Nov. 2003.
[10] J. Lu and J. Callan. Federated search of text-based digital libraries in hierarchical peer-to-peer networks. In Proc. Euro. Conf. on Inf. Ret., 2005.
[11] W. Meng, C. Yu, and K.-L. Liu. Building efficient and effective metasearch engines. ACM Comp. Surveys, 34(1):48–84, Mar. 2002.
[12] K. Nakauchi, Y. Ishikawa, H. Morikawa, and T. Aoyama. Peer-to-peer keyword search using keyword relationship. In Proc. Wkshp. Global and Peer-to-Peer Comp. Large Scale Dist. Sys. (GP2PC), pages 359–366, Tokyo, Japan, 2003.
[13] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content-addressable network. In Proc. ACM SIGCOMM, 2001.
[14] C. Rohrs. Keyword matching [in Gnutella]. Technical report, LimeWire, Dec. 2000. www.limewire.org/techdocs/KeywordMatching.htm.
[15] A. Rowstron and P. Druschel. Storage management and caching in PAST, a large-scale, persistent, peer-to-peer storage utility. In Proc. SOSP, 2001.
[16] Y. Shao and R. Wang. BuddyNet: History-based P2P search. In Proc. Euro. Conf. on Inf. Ret., 2005.
[17] S. Shi, G. Yang, D. Wang, J. Yu, S. Qu, and M. Chen. Making peer-to-peer keyword searching feasible using multi-level partitioning. In Intl. Wkshp. on P2P Sys. (IPTPS), 2004.
[18] A. Singla and C. Rohrs. Ultrapeers: Another step towards Gnutella scalability. Technical report, LimeWire, LLC, 2002. rfcgnutella.sourceforge.net/src/Ultrapeers 1.0.html.
[19] K. Sripanidkulchai, B. Maggs, and H. Zhang. Efficient content location using interest-based locality in peer-to-peer systems. In Proc. IEEE INFOCOM, 2003.
[20] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proc. ACM SIGCOMM, 2001.
[21] C. Tang, Z. Xu, and S. Dwarkadas. Peer-to-peer information retrieval using self-organizing semantic overlay networks. In Proc. ACM SIGCOMM, Aug. 2003.
[22] W. G. Yee and O. Frieder. On search in peer-to-peer file sharing systems. In Proc. ACM SAC, Santa Fe, NM, Mar. 2005.
[23] W. G. Yee, D. Jia, and O. Frieder. Finding rare data objects in P2P file-sharing systems. In Proc. IEEE P2P Conf., Constance, Germany, Sept. 2005.
[24] W. G. Yee, L. T. Nguyen, and O. Frieder. Improving search performance in P2P file sharing systems by query masking. Under review, June 2005.