Recommending Articles for an Online Newspaper

HAIT Master Thesis
In partial fulfilment of the requirements for the degree of Master of Arts

J.M.P. (Joost) Kneepkens BICT
June 2009

HAIT Master Thesis Series no. 09-002

Supervisor:
Drs. A.M. (Toine) Bogers, ILK Research Group, Tilburg University

Human Aspects of Information Technology
Communication and Information Sciences
Faculty of Humanities
Tilburg University
Tilburg, The Netherlands

Other exam committee members:
Dr. J.J. (Hans) Paijmans, ILK Research Group, Tilburg University
Drs. J.D. (Jaap) Meijers, Trouw, PCM Uitgevers
Recommending Articles for an Online Newspaper
J.M.P. Kneepkens
HAIT Master Thesis series nr. 09-002
THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF ARTS IN COMMUNICATION AND INFORMATION SCIENCES,
MASTER TRACK HUMAN ASPECTS OF INFORMATION TECHNOLOGY,
AT THE FACULTY OF HUMANITIES
OF TILBURG UNIVERSITY
Thesis committee:
Drs. A.M. Bogers
Dr. J.J. Paijmans
Drs. J.D. Meijers
Tilburg University
Faculty of Humanities
Department of Communication and Information Sciences
Tilburg, The Netherlands
June 2009
Abstract
This research presents an evaluation of a recommender system that
automatically generates recommendations for articles from an online newspaper. A
prototype of a recommender system was built in cooperation with a Dutch
newspaper called “Trouw”. With the data retrieved from Trouw, the system was able to recommend related articles for the news articles published daily on the Trouw website. Each day, Trouw Web editors judge the top 15 recommendations for every online article as correct or incorrect using an online application. During the
research period, we looked at performance differences in combination with article
growth, incorporating temporal information, and incorporating author and section
metadata. The results show that article growth has an influence on the number of
approved recommendations over time, and the MAP and P@n scores get better over
time as well. Incorporating temporal information showed the best results for the
MAP scores when only textual similarity was taken into account. The P@n scores
had the best results when taking both textual similarity and recency in equal
amounts into account. However, the differences in MAP and P@n score between
these two variations were not significant. Finally, incorporating author and section metadata did not lead to better recommendations; in terms of MAP scores it even performed significantly worse than the baseline algorithm.
Table of Contents
1 Introduction .......................................................... 1
1.1 Motivation .......................................................... 1
1.2 Problem statement & research questions .............................. 2
1.3 Research method ..................................................... 2
1.4 Scope and relevance ................................................. 3
1.5 Outline ............................................................. 3
2 Related work .......................................................... 4
2.1 Online newspapers ................................................... 4
2.2 News article recommendation ......................................... 4
2.2.1 Recommender systems ............................................... 5
2.2.2 Information Retrieval ............................................. 7
2.2.3 Information Filtering ............................................. 8
3 The Trouw Recommender architecture .................................... 9
3.1 Trouw usage scenario ................................................ 9
3.2 Data collection ..................................................... 10
3.3 Judgments ........................................................... 13
4 Evaluation ............................................................ 16
4.1 Recall and Precision ................................................ 16
4.2 Precision at rank p and MAP scores .................................. 17
4.3 Subjective evaluation ............................................... 19
5 Article growth ........................................................ 21
5.1 Experimental setup .................................................. 21
5.2 Results ............................................................. 23
5.3 Discussion .......................................................... 31
6 Incorporating temporal information in recommendations ................ 33
6.1 Experimental setup .................................................. 33
6.2 Results ............................................................. 35
6.3 Discussion .......................................................... 39
7 Incorporating other metadata in recommendations ...................... 41
7.1 Experimental setup .................................................. 41
7.2 Results ............................................................. 42
7.3 Discussion .......................................................... 47
8 Conclusion and future work ............................................ 48
8.1 Article growth ...................................................... 48
8.2 Incorporating temporal information in recommendations .............. 48
8.3 Incorporating other metadata in recommendations .................... 49
8.4 Future work ......................................................... 50
References .................................................................................. 52
Appendix A ................................................................................. 54
1 Introduction
1.1 Motivation
Newspapers are as old as Ancient Rome, when announcements were carved
on stone or metal and were posted in public places (Newspaper, 2001). However, the
first ‘recognized’ newspaper was published in 1605 by Johann Carolus. When
printing techniques became more advanced during the Industrial Revolution,
newspapers became a more widely circulated means of communication. Due to the
availability of news via 24-hour television and the Internet, newspapers had to
launch their online variant in order to keep up with their readers.
The number of visitors to online newspapers is still growing, according to Nielsen Online, which conducted a study on behalf of the Newspaper Association of America (Sigmund, 2008). American newspaper websites attracted more
than 66.4 million unique visitors (40.7% of all American Internet users) on average
in the first quarter of 2008. This is a record number that represents a 12.3%
increase over the same period in 2007.
When news articles are published online, there are some advantages
compared to the printed version. One advantage of publishing news articles online is that they can contain hyperlinks. Hyperlinks in an article are used to refer to another section, article, person, or perhaps a location. Links to related articles are sometimes called recommendations and are meant to entice the user to read more about that specific topic. Generating recommendations can be
done manually or automatically with a recommender system. In Chapter 2 we will
discuss different types of recommender systems. Another advantage of publishing
news articles online is personalization. With personalization it is possible to create a
newspaper containing only articles that correspond to the user’s interest. The user
will only read those articles that are interesting to him, just like he would do with a
printed version of a newspaper. All the other uninteresting articles will not be visible
to the user. Personalization will also be discussed in Chapter 2.
This Master’s thesis is about evaluating a recommender system that automatically generates news article recommendations. The system makes recommendations based on a so-called focus article. This article is compared to other articles presented in an index, and the top 15 recommendations returned by the system are judged as correct or incorrect.
1.2 Problem statement & research questions
In this thesis we will evaluate a recommender system for an online
newspaper. A prototype of this system was built for the Dutch newspaper “Trouw”
and was called the Trouw Recommender. This research involves the whole process from setting up the system architecture, choosing the right algorithms, judging the recommendations, and evaluating the algorithms chosen. To evaluate this recommender system, the following research questions are formulated:
1. What kind of influence does article growth have on generating recommendations?
2. What kind of influence does recency have on generating recommendations?
3. What kind of influence does author or section metadata have on generating recommendations?
For the first research question we will look at what article growth does to the relevance of the recommendations returned by the system. We want to find out if article growth has an influence on getting more reliable recommendations from the system. For the second research question we will examine whether recent articles are more relevant than older articles. Finally, for the last research question we will investigate whether metadata about the section and author can be incorporated in the recommendation algorithm and whether this changes the performance of the system.
1.3 Research method
For this thesis, the following methods will be used. A prototype of a
recommender system is set up. It will collect and make recommendations based on
data that come from Trouw. To see these recommendations, the Web editors of
Trouw have to go to an online application, where the recommendations can be
judged as correct or incorrect. Every day, Trouw Web editors judge the
recommendations that are made by the system. Meanwhile, different algorithms for
recommendation are used to collect data for the research questions. The editors will
not be aware of which algorithm is used during the time they are judging, because
they will see the recommended articles presented the same way for each algorithm.
Finally, we will use the results of the judgments to evaluate the predictions of the
system.
1.4 Scope and relevance
Online advertising is one of the most profitable business models for Internet
services to date. Internet advertising revenues in the United States totalled $11.5 billion for the first six months of 2008 (PricewaterhouseCoopers, 2008). Out of
these $11.5 billion, 21% comes from displaying banner ads, where companies show
their advertisement on different websites. The owners of these websites, on which
the advertisements are published, will receive revenue which is related to the
number of times the ad is clicked on (cost-per-click) or how many times the ad was
shown to visitors (cost-per-impression). Therefore, as for many other commercial websites, it is important for online newspapers that visitors stay on their website as long as possible. This is called eyeball time, which refers to the time a user spends on a website. Presenting recommendations alongside articles could increase eyeball time, because it gives readers the possibility to follow these recommendations and read them. The chance that a reader will stay on the website longer is then higher, and so is the chance of clicking on one of the banner ads displayed on the website.
1.5 Outline
In this thesis we will first discuss in Chapter 2 previous studies in information
retrieval, recommender systems, and online news and article recommendation. The
architecture of the prototype of the Trouw Recommender, which will be used during
this research, is described in Chapter 3. In Chapter 4 we will describe how the
evaluation of the system is done. After that we will address the three research questions in Chapters 5, 6, and 7, where each chapter contains the experimental setup, the results, and a discussion of those results. Finally, in Chapter 8, we will draw conclusions and discuss what future work should be focussed on.
2 Related work
2.1 Online newspapers
The rise of online newspapers started in the middle of the 1990s when
McAdams created an online version of The Washington Post (McAdams, 1995). She and her team were the first to set up an online newspaper, which confronted them with many difficulties. They used the newspaper metaphor as their
structural model for the online service. This means that it had to be so user-friendly
that anyone who can read could figure out how this online version should be used.
The newspaper metaphor also uses the front page as the entry point of the system,
a term still seen on many websites nowadays. They came to the conclusion that an
online newspaper cannot be a direct copy of the printed version. Furthermore, it is
hard to figure out what to keep and what to discard. Finally, a whole new team of
editors had to be formed to get the newspaper online. These editors had to think in
two-way communication rather than in a one-way medium, because online
publishing is bi-directional.
As mentioned in the previous chapter, 40% of all American Internet users
visited one or more online newspapers during the first quarter of 2008. Most of
these visitors probably only read articles that they think are interesting to them, like
they would do with printed newspapers. Although printed newspapers tend to be
more portable and easier to manipulate, online newspapers have an argument in
their favour — personalization (Kamba, Bharat, & Albers, 1994). With personalization it is
possible to create a newspaper containing only articles that correspond to the user’s
interest. The system of Kamba et al. made use of personalization without conscious
user involvement, realized by realistic rendering, dynamic control, interactivity, and
implicit feedback. Their system showed articles in a flexible layout, which could be
manipulated by the user with a set of controls. With these controls, users were able
to reorder articles according to their interest.
2.2 News article recommendation
Recommendation is widely used in different commercial systems, where each
system uses its own data and, in most cases, standard algorithms like K-Nearest
Neighbour are used to optimize its recommendations. In this section we will first
describe how recommender systems work and what kind of different types there
are. A more general explanation of Information Retrieval (IR) will be given. Finally,
another technique called Information Filtering (IF) will be explained.
2.2.1 Recommender systems
Earlier work on recommender systems shows two main algorithmic approaches that can be distinguished: collaborative filtering and content-based filtering. Work on recommender systems started in the early 1990s, when Goldberg et al. (1992) developed an experimental system that could filter e-mails. For a specific user, their filter was able to distinguish between interesting and uninteresting e-mails. They used collaborative filtering for their system, which means that people collaborate to help one another perform filtering by recording their reactions to documents they read. Others who used the same system with these filter methods could get access to these reactions and see only those e-mails of interest to them.
In general, two broad classes of collaborative filtering algorithms can be distinguished: memory-based algorithms and model-based algorithms (Breese, Heckerman, & Kadie, 1998). Memory-based collaborative filtering is the more classic approach; it uses statistical techniques to find sets of neighbours and uses these as a source for making recommendations. This method can be used for user-based collaborative filtering. Resnick et al. (1994) developed a system that made use of this user-based approach. Their system, based on Usenet, let users rate articles in accordance with their interests. With those ratings, the system could make predictions for other users and return ranked articles to them. The system compared each user's ratings and made use of the heuristic that people who agreed in the past are likely to agree again. Sarwar et al. (2001) computed the similarity between different items and used a set of items as nearest neighbours to do the recommendation; this is called item-based collaborative filtering.
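To make the memory-based, user-based approach concrete, the sketch below finds the neighbours of a target user with cosine similarity over shared ratings and predicts a vote as their weighted average. This is only an illustrative Python sketch; the toy rating dictionaries, the cosine weighting, and the neighbourhood size of five are assumptions, not the implementation of the systems cited above.

import math

def cosine(u, v):
    """Similarity between two users, computed over their rating vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def predict(target, others, item, k=5):
    """Predict the target user's vote for an item from the k most similar neighbours."""
    scored = [(cosine(target, ratings), ratings) for ratings in others if item in ratings]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = [(w, r) for w, r in scored[:k] if w > 0]
    if not top:
        return None
    return sum(w * r[item] for w, r in top) / sum(w for w, _ in top)

# Ratings are dicts mapping article ids to scores; users who agreed in the past weigh more.
alice = {"a1": 5, "a2": 1}
others = [{"a1": 5, "a2": 2, "a3": 4}, {"a1": 1, "a2": 5, "a3": 1}]
print(round(predict(alice, others, "a3"), 2))   # leans towards the like-minded neighbour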
Collaborative filtering can also be used with model-based algorithms. Where memory-based algorithms operate over the entire user database to make predictions, model-based collaborative filtering uses the user database to estimate or learn a model, which is then used for predictions (Breese, Heckerman, & Kadie, 1998). These predictions can be seen as an expected value of a vote, calculated based on what is known about a user. Two common models for model-based collaborative filtering are cluster models and Bayesian networks (Breese, Heckerman, & Kadie, 1998). Cluster models group like-minded users into clustered classes, and each user's ratings are assumed to be independent given his class membership. Bayesian networks use titles as variables within the network, and the values of those variables are the possible ratings. From these data, the system can learn the structure of the network, used for encoding the dependencies between titles, and the conditional probabilities (Pennock, Horvitz, Lawrence, & Giles, 2000).
Das et al. (2007) made use of collaborative filtering for the personalization of Google News. Because of the scale of their system (Google News receives millions of page views and clicks from millions of users) and the frequency with which the models have to be rebuilt, they found existing recommender systems unsuitable for their needs. A mixture of memory-based and model-based algorithms was used to generate recommendations. For the model-based approach, they made use of PLSI and MinHash clustering, and for the memory-based approach they used item covisitation. The scores of each algorithm were combined, with an option to give more weight to a specific algorithm, to obtain a ranked list of stories. Finally, the top K stories were chosen from this list as recommendations for the user.
There are some problems that can occur while recommending with
collaborative filtering. For example, if a new item appears in the database, it will
never be recommended until more information is obtained by another user either
rating it or specifying which other items it is similar to (Balabanovic & Shoham, 1997).
This is also called the cold start problem. Another problem is when a user’s interest
is very unique and cannot be compared to the rest of the users, which will lead to
poor recommendations (Claypool, Gokhale, Miranda, Murnikov, Netes, & Sartin, 1999).
Next to collaborative filtering, a content-based filtering system selects items
based on a comparison between the contents and the user's preference. In general,
content-based filtering tries to recommend items similar to those a given user has
liked in the past, whereas collaborative filtering identifies users whose tastes are
similar to those of the given user and recommends items they have liked (Balabanovic
& Shoham, 1997). So with content-based filtering, items are recommended based on
information about the item itself rather than on the preferences of the other users
(Mooney & Roy, 2000). For content-based filtering, standard machine learning methods
like naive Bayes classification are commonly used. The naive Bayes assumption
states that the probability of each word event is dependent on the document class
but independent of the word’s context and position.
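As an illustration of this assumption, the sketch below scores a new article under a small multinomial naive Bayes model trained on articles a hypothetical user liked and disliked; the toy training texts and the add-one smoothing are assumptions made purely for the example.

import math
from collections import Counter

def train(docs_by_class):
    """Per-class word counts, class priors, and the shared vocabulary."""
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    counts = {c: Counter(w for doc in docs for w in doc.split())
              for c, docs in docs_by_class.items()}
    priors = {c: len(docs) / total_docs for c, docs in docs_by_class.items()}
    vocab = {w for counter in counts.values() for w in counter}
    return counts, priors, vocab

def log_posterior(text, cls, counts, priors, vocab):
    """log P(class) plus log P(word | class) per word, words treated as independent given the class."""
    total = sum(counts[cls].values())
    score = math.log(priors[cls])
    for w in text.split():
        score += math.log((counts[cls][w] + 1) / (total + len(vocab)))   # add-one smoothing
    return score

liked = ["election debate parliament", "coalition vote parliament"]
disliked = ["football match final", "tennis final score"]
counts, priors, vocab = train({"liked": liked, "disliked": disliked})
new_article = "parliament vote"
print(max(("liked", "disliked"),
          key=lambda c: log_posterior(new_article, c, counts, priors, vocab)))   # liked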
Bogers et al. (2007) compared and evaluated relatively old and simple
retrieval algorithms against newer state-of-the-art approaches such as language
modelling. They developed a news recommender system and compared three
algorithms. The first was the standard tf-idf algorithm used as a baseline, the
second algorithm was the Okapi retrieval function and the last algorithm was the
language-modeling (LM) framework. The tf-idf algorithm performed worst in comparison to the other two algorithms. While there were no significant differences between the Okapi and LM algorithms, the LM variant generated the recommendations 5.5 times faster than the Okapi algorithm.
Another way to use content-based filtering was proposed by Maidel et al. (2008), who used an ontology for ranking items for online newspapers. In their personalized newspaper system, the ePaper, they included a well-known ontology in the news domain called NewsCodes. For both the news items and each user, profiles were built consisting of ontology concepts. These profiles were used to measure similarity by considering the hierarchical distance between concepts in the two profiles' hierarchies. The degree of similarity of an item's profile to a user's profile was based on the number of matching concepts in the two profiles, where matches of three degrees were possible, and on the weights of the concepts in the user's profile (Maidel, Shoval, Shapira, & Taieb-Maimon, 2008).
2.2.2 Information Retrieval
Information Retrieval (IR) has been a field of research since the 1950s, when
it became possible to store large amounts of information on computers and when
finding the useful information became a necessity (Singhal, 2001). The early IR systems were based on Boolean logic, but had shortcomings such as the difficulty of forming good queries and the absence of document ranking. Nowadays users of IR systems expect ranked results; therefore models such as the vector space model, probabilistic models, and language models are used most in IR. The vector space model makes use of vectors that represent documents. One component corresponds to each term in the dictionary, where dictionary terms that do not occur in the document get a weight of zero. All documents in the collection are then viewed as a set of vectors in a vector space, where each axis represents a term. To quantify the similarity between two documents, the cosine of the angle between their vectors is calculated, or the dot product is used as a similarity measure. Probabilistic models estimate the probability of relevance of documents for a query, because true probabilities are not available to an IR system. This estimated probability of relevance is used for ranking the documents, which is the basis of the Probability Ranking Principle (Manning, Raghavan, & Schütze, 2008).
Language models make use of the idea that a document is a good match to a query if the document model is likely to generate the query. This will happen if the query words often occur in the document. For each document a probabilistic language model is built and used to estimate the probability that this model would generate the query. This probability is again used to rank the documents (Manning, Raghavan, & Schütze, 2008).
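A minimal sketch of the vector space idea just described: each document becomes a vector with one weight per dictionary term and similarity is the cosine of the angle between two vectors. The tf-idf weighting used here is a common choice made for the example, not necessarily the weighting of any system discussed in this chapter.

import math
from collections import Counter

def tf_idf_vectors(docs):
    """One component per dictionary term; terms absent from a document keep weight zero."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

docs = ["cabinet debates the budget", "budget deficit grows", "football final tonight"]
vectors = tf_idf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))   # shared terms vs. none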
To implement IR systems using any of these models, the data that will be used need to be indexed first. Before the data are indexed, they usually have to be processed first to add, delete, or modify information in a document, such as removing stop words or extracting information about the author. After processing, the indexing stage builds a searchable data structure, which is called the index. This index contains references to the contents and is used for requesting information through queries. IR systems let a user create a query of keywords describing the information needed. The keywords are used to look up references in the index, which are then displayed to the user. The intention of the system is to return the most relevant set of results, based on the information need of the user.
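To make the indexing stage concrete, the sketch below builds a tiny inverted index after stop word removal and answers a keyword query by intersecting posting lists. The five-word stop list and the purely Boolean intersection are simplifications; a real system would use the full Snowball list and rank the matches.

STOP_WORDS = {"de", "het", "een", "en", "van"}   # tiny stand-in for the Snowball Dutch list

def build_index(articles):
    """Map every remaining term to the set of article ids containing it."""
    index = {}
    for doc_id, text in articles.items():
        for term in text.lower().split():
            if term not in STOP_WORDS:
                index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query):
    """Return the ids of articles that contain every non-stop-word query keyword."""
    postings = [index.get(t, set()) for t in query.lower().split() if t not in STOP_WORDS]
    return set.intersection(*postings) if postings else set()

articles = {"a1": "de verkiezingen en het kabinet",
            "a2": "het kabinet presenteert de begroting"}
index = build_index(articles)
print(search(index, "kabinet begroting"))   # {'a2'}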
2.2.3 Information Filtering
Information filtering (IF) systems focus on filtering information that is based
on a user’s profile. A user’s profile can be created by letting the user specify and
combine interests explicitly, or by letting the system implicitly monitor the user's
behaviour (Hanani, Shapira, & Shoval, 2001). Filtering in IF systems is performed when a user automatically receives the information he needs according to his profile. One of the advantages of IF is its ability to adapt to the user's long-term interests and deliver the information to the user. This information can be given as a notification to the user, or the system can use the information to take action on behalf of the user. Information filtering differs from information retrieval in the way
the interests of a user are presented. Instead of letting the user pull information
using a query, an information filtering system tries to model the user's long-term
interests and push relevant information to the user.
3 The Trouw Recommender architecture
There are six paid national morning newspapers in the Netherlands at the
moment and Trouw is one of them. Trouw, which was founded in 1943 as an illegal
newspaper during the Second World War, distinguishes itself from other quality
newspapers by an explicit focus on news and views from the world of religion and
philosophy (Trouw, 2004). Currently, Trouw has approximately 120 employees
working on the creation of the daily newspaper. Seven out of these 120 employees
are responsible for the website of Trouw1, where anyone can read all the latest
news online.
3.1 Trouw usage scenario
Two types of news articles are published on the website of Trouw; articles
from the ANP2, which are published automatically, and articles from their own
printed version, which are published by the Trouw web editors. For our research we
will only focus on the second type of articles. On a daily basis, between 16 and 24 news articles from the printed version are published on the website of Trouw. When
someone visits the website and reads an article, sometimes there are links to other,
related articles that could be of interest to the reader. These related articles are
called recommendations and in this chapter we are going to explain how these are
generated.
In the past, Web editors from Trouw performed a search task by hand, within the content management system (CMS) of their website, to find articles that were related to a focus article. Not only was this search task very time-consuming, there was also a chance of not retrieving all relevant articles from their CMS, because the editors had to formulate the correct and related keywords. Trouw
wanted to have this search task done automatically by a computer and therefore a
prototype of a recommender system was developed. The idea behind this prototype
was that it should generate recommendations of articles automatically and that the
Web editors would only have to judge the correctness of these recommendations.
1 http://www.trouw.nl
2 The ANP stands for Algemeen Nederlands Persbureau and is the leading news agency of the Netherlands.
3.2 Data collection
For making the recommendations of the articles, an open source toolkit called
Lemur3 was used. Lemur is designed to facilitate research in language modelling
and information retrieval. With this toolkit it is possible to construct basic text
retrieval systems using language modeling methods, as well as traditional methods
such as those based on the vector space model and Okapi.
For the Trouw Recommender, a decision had to be made about which algorithm Lemur should use. As mentioned in section 2.2.1, Bogers et al. (2007) compared three different algorithms for generating recommendations based on news articles. Based on their findings, the simple language model (Kullback-Leibler Divergence) with Jelinek-Mercer smoothing is used for the Trouw Recommender.
A simple LM algorithm creates a language model for each document and estimates the probability of generating the query according to each of these models (Ponte & Croft, 1998). A simple LM for a document d is the maximum likelihood estimator. Kullback-Leibler Divergence (KLD) is applied to measure the divergence between two probability distributions, and can be used as a distance between LMs (Fernández, 2007). Finally, smoothing tries to balance the probability of terms that appear in a document with the ones that are missing. Smoothing discounts the probability mass assigned to the words seen and distributes the extra probability to the unseen terms according to some fallback model. Jelinek-Mercer smoothing involves a linear interpolation of the maximum likelihood model with the collection model, using a coefficient λ (Fernández, 2007).
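The sketch below shows the scoring idea in a few lines of Python: the focus article plays the role of the query, each candidate document gets a smoothed unigram language model, and ranking by this query likelihood is equivalent (up to a query-dependent constant) to ranking by negative KL divergence. This is an illustration of the method, not Lemur's implementation; the tokenization, the value of λ, and the toy collection are assumptions.

import math
from collections import Counter

def lm_score(query_tokens, doc_tokens, coll_tf, coll_len, lam=0.5):
    """Log query likelihood under a Jelinek-Mercer smoothed document model."""
    doc_tf, doc_len = Counter(doc_tokens), len(doc_tokens)
    query_tf, query_len = Counter(query_tokens), len(query_tokens)
    score = 0.0
    for term, qf in query_tf.items():
        p_ml = doc_tf[term] / doc_len if doc_len else 0.0    # maximum likelihood estimate for the document
        p_coll = coll_tf[term] / coll_len                     # fallback collection model
        p = (1 - lam) * p_ml + lam * p_coll                   # linear interpolation with coefficient lambda
        if p > 0:
            score += (qf / query_len) * math.log(p)
    return score

focus = "kabinet presenteert begroting".split()
candidates = {"a1": "de begroting van het kabinet".split(),
              "a2": "voetbal finale vanavond".split()}
coll_tf = Counter(focus) + Counter(t for doc in candidates.values() for t in doc)
coll_len = sum(coll_tf.values())
ranking = sorted(candidates,
                 key=lambda d: lm_score(focus, candidates[d], coll_tf, coll_len),
                 reverse=True)
print(ranking)   # the budget article outranks the football article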
Before the toolkit can make the recommendations, every night at 3 o’clock
articles are collected from the TERA4 database located at Trouw (Figure 3.1). These
are in an XML5 format and contain the full text and metadata of each news article.
3 http://www.lemurproject.org
4 http://www.teradp.com
5 eXtensible Mark-up Language
Figure 3.1: Daily article collection
Because certain types of articles should never be recommended or never require recommendations, some pre-processing is done first. First, each article is formatted as plain text. Articles from the following categories are filtered out: weather reports and TV & radio guide listings. There is also a filter for articles that contain fewer than 80 words, because according to the Trouw Web editors these are also irrelevant for making recommendations. When all this processing has been done, the formatted text is put in a database called “Trouw”. This Trouw database is a MySQL database with four tables, containing all articles, judgments, recommendations, and users.
The next step is converting this information into a format that can be used by Lemur. For the Trouw Recommender it is converted to the common Standard Generalized Mark-up Language (SGML) format used in the TREC6 community. Each article in an SGML file is organized according to Figure 3.2. When all articles are converted to this format, they are indexed. Stop word removal is performed during this indexing, while stemming is not. Stop word removal removes the words that add little value to the document information; a list of Dutch stop words from the Snowball project7 is used for this. After that, all new articles are put in the Lemur index and can be used for recommendation.
6 http://trec.nist.gov/
7 http://snowball.tartarus.org
<DOC>
<DOCNO> Unique article number </DOCNO>
<TEXT>
<TITLE> Article title </TITLE>
<DATE> Article date </DATE>
<SECTION> Section </SECTION>
<ONLINE> Yes or No </ONLINE>
<ITEMLENGTH> Length of the article </ITEMLENGTH>
<AUTHOR> Author of the article </AUTHOR>
<ABSTRACT> Abstract of the article </ABSTRACT>
<CAPTION> Caption of the article </CAPTION>
<BODY> The whole text of the article </BODY>
</TEXT>
</DOC>
Figure 3.2: SGML formatting of an article
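The conversion step can be sketched as below: a dictionary holding one article's metadata and text is written out in the layout of Figure 3.2. The article dictionary and its keys are hypothetical; they only mirror the fields shown in the figure.

def to_trec_sgml(article):
    """Render one article in the TREC-style SGML layout of Figure 3.2."""
    fields = [
        ("TITLE", article["title"]),
        ("DATE", article["date"]),
        ("SECTION", article["section"]),
        ("ONLINE", "Yes" if article["online"] else "No"),
        ("ITEMLENGTH", str(article["word_count"])),
        ("AUTHOR", article["author"]),
        ("ABSTRACT", article["abstract"]),
        ("CAPTION", article["caption"]),
        ("BODY", article["body"]),
    ]
    inner = "\n".join(f"<{tag}> {value} </{tag}>" for tag, value in fields)
    return f"<DOC>\n<DOCNO> {article['id']} </DOCNO>\n<TEXT>\n{inner}\n</TEXT>\n</DOC>\n"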
The next step in this recommendation process is to find articles that are related to the new online articles. Because Trouw thought it was not necessary to make recommendations for articles that were published in the past, recommendations are only made for the new online articles. Therefore, only the new articles are converted into the TREC format (Figure 3.3), again with stop word removal.
Figure 3.3: Daily article recommendation workflow
These new articles are also indexed and Lemur then makes the recommendations. Using the simple language model algorithm mentioned above, Lemur returns 50 related articles for each new article. Because Lemur also recommends the focus article itself, only 49 recommendations can be used effectively. However, the Trouw Web editors wanted to judge only 15 recommendations, so only the top 15 articles are displayed. Each related article gets a relevance score
and the higher the score the more related an article should be. The simple LM
algorithm (KLD with Jelinek-Mercer smoothing) mentioned above is used to
calculate this relevance score. All new recommendations are then stored in the
Trouw database with their relevance scores and are ready for judgment.
3.3 Judgments
Now that the recommendations have been made for all the new articles,
Trouw Web editors have to judge them. It is up to them to judge the top 15
recommendations of each article as correct or incorrect. Before the Web editors can judge the articles, they have to log in to a web application with their own username and password. When they are logged in, the editors see all article titles of the current or most recent date on which articles were recommended (Figure 3.4).
Figure 3.4: Trouw Recommender main window
The left side of the window shows all the days when there were articles being
recommended. When an editor clicks on a date, all articles of that day are shown in
the middle of the window. Furthermore, there are arrows on the left to navigate to
all available dates on which there are recommended articles. In the middle of the
window, all articles of that date are shown by their title. If the editor wants to read
a whole article, the title of that article can be clicked on and the whole article is
displayed in the centre as can be seen in Figure A.1 in the Appendix.
Figure 3.5: Trouw Recommender interface for judging a focus article
To judge the recommendations of a specific article, the editor clicks on the green icon next to that article. In a new window, the focus article is displayed on the left and the top 15 recommendations are displayed on the right. As can be seen in Figure 3.5, the recommended articles are also displayed with their titles. To get an idea of what a recommended article is about, the abstract or, if there is no
abstract available, the first 50 words are displayed. The section, date of publication
and the word count are also shown with the recommended article.
The recommendations are ranked according to a normalized confidence score
from high to low. This normalization was performed using the following formula:
Equation 3.1: Normalization of confidence score

score_norm = (score_original − score_min) / (score_max − score_min)

score_min = lowest relevance score of all 49 recommended articles
score_max = highest relevance score of all 49 recommended articles
To get percentages, the normalized scores are multiplied by 100. By default, recommendations are set as incorrect, but when the normalized score reaches a confidence of 70% or higher, the recommendation is automatically set as correct. This was done in consultation with Trouw, as they wanted to have the top of the recommendations already set as correct. They also wanted buttons to set all recommendations as correct or incorrect with one click. It is then up to the web editor to judge whether the recommended articles are related to the focus article (see Figure 3.5). The editor can set the radio button of each recommended article to either correct or incorrect. When all 15 recommendations have been judged, the editor clicks on a submit button to save his judgments. This is all the work a web editor has to do on a daily basis.
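A small sketch of the normalization of Equation 3.1 together with the 70% auto-approval default; the list of (article id, Lemur score) pairs is a hypothetical input shape, and the guard against equal minimum and maximum scores is an assumption not described in the text.

def normalize(recommendations, threshold=0.70):
    """Normalize Lemur scores to [0, 1] and pre-set the default judgment."""
    scores = [score for _, score in recommendations]
    lo, hi = min(scores), max(scores)
    results = []
    for article_id, score in recommendations:
        norm = (score - lo) / (hi - lo) if hi != lo else 0.0
        results.append({
            "article": article_id,
            "confidence": round(norm * 100),       # shown to the editor as a percentage
            "judged_correct": norm >= threshold,   # default only; the editor can still change it
        })
    return results

top15 = normalize([("a1", -3.0899), ("a2", -3.0932), ("a3", -3.1353)])[:15]
print(top15[0])   # the highest-scoring article always gets 100% confidence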
When a focus article has been judged, the search icon next to the green icon gets coloured and becomes clickable. When a web editor clicks on this icon, all articles that were judged as correct are shown.
Around 11 o’clock each morning, a computer at Trouw automatically collects
the recommendations that were judged by the Web editors of Trouw. From these
collected recommended articles, a maximum of five will be published next to their
focus article. Figure A.2 and Figure A.3 in the appendix show an example of a
recommended focus article and how the recommendations of this article are shown
on the Trouw website.
4 Evaluation
In this chapter, we describe our evaluation of the recommender’s performance.
In search engine evaluation one primary distinction is usually made, the distinction
between effectiveness and efficiency (Croft, Metzler, & Strohman, 2009). Effectiveness
measures the ability to find the right information and efficiency measures just how
quickly this is done. IR research focuses first on improving effectiveness, and once a technique has been established the focus shifts to finding the most efficient method. Because we are looking for a way to improve the Trouw Recommender
system for finding the right information, our evaluation focus will be on
effectiveness.
4.1 Recall and Precision
To measure effectiveness, two measurements are most common, namely
precision and recall (Croft, Metzler, & Strohman, 2009). Precision measures how well the
system rejects non-relevant documents and recall measures how well the system
finds all relevant documents. This presumes that, given a specific query, there is a
set of retrieved and non-retrieved documents and that we know which ones are
relevant and which ones are not. The results of this specific query can be
summarized as shown in Table 4.1, making the assumption that the relevance is
binary.
Table 4.1 Sets of documents defined by a simple search with binary relevance
(Croft, Metzler, & Strohman, 2009)

                     Relevant        Non-Relevant
Retrieved            A ∩ B           Ā ∩ B
Non-Retrieved        A ∩ B̄           Ā ∩ B̄

The set of relevant documents in this table is A, the non-relevant set is Ā, B is the retrieved set, and finally B̄ is the non-retrieved set. The operator ∩ gives the intersection between two sets of documents. With this table we can define the two effectiveness measurements as follows:

precision = |A ∩ B| / |B|        and        recall = |A ∩ B| / |A|
Because the editors of Trouw only judge the top 15 recommendations, only precision will be used as an evaluation measurement. Besides, we do not know the full set of relevant documents, because we cannot expect the editors to judge each and every document for each new article.
4.2 Precision at rank p and MAP scores
Because the relevance judgments are true or false (correct / incorrect), a
binary evaluation method is chosen. For using precision as a measurement, we will
introduce two measurements that are based on precision. The first measurement is
precision at rank n (or P@n), where we will be using 5, 10, and 15 as n (because
the editors only judge the top 15 recommendations). This measurement is typically
used to compare the output at the top of the ranking, which is what we want to find
out during this research. However, a major disadvantage of this measurement is that it does not distinguish between the rankings of relevant documents within the top n results (Croft, Metzler, & Strohman, 2009). Therefore we will also be using a second measurement called Mean uninterpolated Average Precision (MAP). For MAP, the precision is calculated after each relevant article and these values are averaged. MAP gives us a single-figure measure of overall system quality according to relevance levels.
Figure 4.1: Recall and precision values for rankings from two different queries.
(Croft, Metzler, & Strohman, 2009)
Figure 4.1 gives an example of the recall and precision values for two different queries. The P@5 score for both queries is 0.4, while the average precision and MAP for the two queries are:

Average precision query 1 = (1.0 + 0.67 + 0.5 + 0.44 + 0.5)/5 = 0.62
Average precision query 2 = (0.5 + 0.4 + 0.43)/3 = 0.44
Mean average precision = (0.62 + 0.44)/2 = 0.53

The average precision is calculated by taking the sum of the precision values at the ranks where an article was relevant and dividing this by the number of relevant articles. The MAP is calculated by taking the average of all the average precisions.
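The sketch below computes P@n, average precision, and MAP for the two rankings of Figure 4.1; the binary relevance vectors are a reconstruction implied by the precision values quoted above and are used only to reproduce the worked example.

def precision_at(ranking, n):
    """P@n: fraction of the top n results that are relevant."""
    return sum(ranking[:n]) / n

def average_precision(ranking):
    """Mean of the precision values measured at each relevant result."""
    precisions = [sum(ranking[:i + 1]) / (i + 1)
                  for i, relevant in enumerate(ranking) if relevant]
    return sum(precisions) / len(precisions) if precisions else 0.0

# 1 = relevant, 0 = non-relevant; reconstructed from the precision values above.
query1 = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
query2 = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]

print(precision_at(query1, 5), precision_at(query2, 5))          # 0.4 for both queries
ap1, ap2 = average_precision(query1), average_precision(query2)
print(round(ap1, 2), round(ap2, 2), round((ap1 + ap2) / 2, 2))   # 0.62, 0.44 and MAP 0.53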
To obtain these scores, some steps have to be completed. First, the Lemur scores of each recommendation are obtained from the database. Secondly, the scores are recalculated with the algorithm that was used during judging. These algorithms will be described in the following chapters. Third, the newly calculated scores are sorted in descending order. Finally, precision at rank n and MAP scores are calculated with a tool called trec_eval8. To use trec_eval, a file containing all query relevance judgments has to be created. This “qrel” file contains all focus and recommended article combinations that were judged as correct by the editors of Trouw. The qrel file was constructed in the following format, where a tab divides each field:
Focus id             Dummy   Recommended id       Relevant
TR_ART0…0281623.2    0       TR_ART0…0281807.1    1
TR_ART0…0281623.2    0       TR_ART0…0281731      1
TR_ART0…0281623.2    0       TR_ART0…0281897.4    1
The first field is used for the id of the focus article, the second can be used as
a dummy field, which will not be used during the calculations. The third field is for
the id of the recommended article and finally in the last field there is a 1, indicating
that this combination is relevant.
Next to the relevance judgments, a file with an ordered list of all recommendations is needed for trec_eval. In this file the recommendations are ranked in the same order as they were shown to the web editor who judged them. This file was constructed in the following format:
8 http://trec.nist.gov/trec_eval/
Focus id             Q0   Recommended id       Rank   Score     Run
TR_ART0…0281623.2    Q0   TR_ART0…0281807.1    1      -3.0899   RunName
TR_ART0…0281623.2    Q0   TR_ART0…0281731      2      -3.0932   RunName
TR_ART0…0281623.2    Q0   TR_ART0…0281746      3      -3.1142   RunName
TR_ART0…0281623.2    Q0   TR_ART0…0281897.4    4      -3.1353   RunName
This file looks similar to the qrel file, but it has some different fields. The first field is again used for the id of the focus article. The second and fourth fields are ignored by trec_eval. The third field is again for the id of the recommended article. The fifth field is used for the relevance scores that come from Lemur. Finally, the last field is a dummy field, which can be used for the name of the run.
After creating these files, trec_eval can be run with the two files as parameters. All kinds of measurements are then returned by trec_eval, including P@n and MAP. For each research question we used different lists containing the recommendations and judgments for that period. We will describe the results and discuss each research question in Chapters 5, 6, and 7.
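A minimal sketch of this step, assuming the judged and ranked recommendations are already available as Python tuples: the two files are written in the formats shown above and trec_eval (assumed to be on the PATH) is invoked on them. The article ids are placeholders, not real Trouw identifiers.

import subprocess

approved = [("FOCUS_1", "REC_A"), ("FOCUS_1", "REC_B")]                 # judged correct by an editor
ranked = [("FOCUS_1", "REC_A", 1, -3.0899), ("FOCUS_1", "REC_C", 2, -3.0932),
          ("FOCUS_1", "REC_B", 3, -3.1142)]                             # as shown to the editor

with open("judgments.qrel", "w") as f:                                  # qrel: focus, dummy, rec, relevant
    for focus_id, rec_id in approved:
        f.write(f"{focus_id}\t0\t{rec_id}\t1\n")

with open("baseline.run", "w") as f:                                    # run: focus, Q0, rec, rank, score, name
    for focus_id, rec_id, rank, score in ranked:
        f.write(f"{focus_id}\tQ0\t{rec_id}\t{rank}\t{score}\tRunName\n")

# trec_eval prints MAP, P@n and the other standard measures for this run.
subprocess.run(["trec_eval", "judgments.qrel", "baseline.run"], check=True)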
4.3 Subjective evaluation
Because we could not check how the editors of Trouw worked with the
system, some editors were questioned about working with the Trouw Recommender
afterwards. The two editors who judged most of the articles were asked how the system performed in their view and what they thought could be done better.
In general they were very pleased with the results returned by the system.
According to them, most of the time the recommendations were very good.
However, they pointed out that for specific topics like arts and music, or for articles about a specific person, the recommended articles were not very related. There are two things that can cause these bad recommendations. One is that all recommendations were normalized before being displayed to an editor, which always results in one recommended article with 100% confidence for each focus article. This is confusing for the editor judging the article, because the system automatically sets this 100% confidence article as correct. The editor may then think that, because the system sets the first recommendation to 100% while it is not relevant, the system is not very reliable. In reality, the system may have returned low relevance scores for all of these recommendations, but due to the normalization there is always one article with 100% confidence. This is a problem that should be solved in future work. The other thing that can cause bad recommendations is that 15 recommendations are always displayed, even if they are barely related to the focus article. So if even the first recommended article is not related, the other 14 recommended articles are even less related according to the system. However, it was agreed with Trouw to always display the top 15 recommendations, so this behaviour was not changed but had to be explained to the editors beforehand.
Besides these problems, there were also some technical infrastructure problems that could not always be solved rapidly because an external company was involved. Because another company delivers the ICT infrastructure for the website of Trouw, there were some struggles with delivering the correct data to the Trouw Recommender. These problems were in some cases connected to the renewed website of Trouw, which went live in September 2008. Due to this, there were some periods in which no articles were recommended because they did not have the “online” tag set in their XML files. Another problem that occurred after the launch of the new website is that information about the section was no longer included in the XML. This would have made our third research question impossible to answer, because we would not have been able to compare the section of the focus article with the section of the recommended article. Fortunately, we chose the period from 5 June 2008 until 20 August 2008 for the third research question, so we did not have to deal with this problem.
There were also some complaints by the editors about the loading time when opening the judging window (Figure 3.5). It took some time for the PHP script to collect all 49 recommendations, re-score them with the new algorithms, and then show them reordered to the editors. Rewriting the PHP and MySQL code should solve this problem.
5 Article growth
As described in Chapter 2, recommender systems have to deal with a problem
called the “cold-start problem”. This problem occurs when there are not enough
data to make good recommendations. The Trouw Recommender could have the
same problem during the start-up period. On the first day the Trouw Recommender
was launched, only articles from that day were collected and only those articles
could be used as recommendations. For the second day, the number of articles that
could be used for recommendation was approximately doubled. This means that as
the number of articles increases, so does the chance of having related articles in the
database. Therefore we want to find out if the performance of the system gets
better when the number of articles in the index grows over time.
In this chapter we will first describe the experimental setup, then we will show
the results and finally discuss our first research question, “What kind of influence
does article growth have on generating recommendations?”
5.1 Experimental setup
For this experiment we used the scores from the standard algorithm, used by
Lemur as described in Chapter 4, to rank the recommendations. The scores of the
top 15 recommendations were normalized with Equation 3.1. The judgments of the
Web editors were collected during a period of six weeks, from February 5th 2008 till
March 21st 2008. We chose this six-week period, because we wanted to have the
same time-span for each research question. The number of articles added during
this period should be enough to see if article growth has an influence on the recommendations.
Figure 5.1: Recommended articles during first period (179 articles, 23%, with one or more approvals; 120 articles, 16%, with no approvals; 467 articles, 61%, with unjudged recommendations)

Looking at the article growth for this first period of six weeks, we see that every day an average of 76 articles were added during the weekdays Monday till Friday. Out of these average 76 articles, 16 were also used for
publishing online. Recommendations are generated only for these online articles. Every Saturday, an average of 120 articles were added to the MySQL database
of the Trouw Recommender. From these average 120 articles, 24 articles were
meant for the online newspaper. So for both the weekdays and Saturdays, approximately 20% of the newly added articles are meant for online publishing. Figure A.4 in the Appendix shows all days of this period with their number of added articles.
As can be seen in Figure 5.1, during this first period, 766 articles were being
recommended by the system. Out of these 766 recommended articles, the editors
of Trouw managed to judge only 299 articles (39%). We used trec_eval to calculate
the MAP scores from these 299 articles that were judged and discovered that only 179 articles (60% of all articles judged) had one or more approved recommendations. So 40% of all articles judged did not have any good
recommendations during this first period of six weeks. Figure 5.2 shows how these
percentages change over time, for the articles judged that have no approved
recommendations. Finally, over time, around 20% of all recommended articles have
no approved recommendations.
Figure 5.2: No approved recommendations over time
Not only did we collect the judgments of the recommended articles, but also
information about which editor judged them. In the results we will also go into the
differences between the editors. Because each recommended article was judged by
only one editor, it was not possible to measure inter-annotator agreement.
In addition to the six-week period used for this research question, we have
also collected the baseline runs of all judgments during this research project.
Because research question 2 and the period after research question 3 also used the
baseline Lemur algorithm, we can use this to see how performance changes over
the course of the entire project.
Statistical significance of differences between results was determined using
two-sample equal variance t-tests (p < 0.05). Because we wanted to establish
whether there was a significant difference between editors, only the two-tailed test
was used.
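For reference, a two-sample equal-variance t-test with a two-tailed p-value can be computed as sketched below. SciPy and the toy score lists are illustrative assumptions; they are not necessarily the tool or data used for the thesis.

from scipy import stats

# MAP scores per judged article for two editors (hypothetical values).
editor_a = [1.0, 0.8, 0.0, 0.9, 1.0, 0.7]
editor_b = [0.5, 0.0, 0.0, 1.0, 0.3, 0.2]

# Student's t-test assuming equal variances; the returned p-value is two-tailed.
t_stat, p_value = stats.ttest_ind(editor_a, editor_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, significant at 0.05: {p_value < 0.05}")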
5.2 Results
After creating the qrel file with all approved recommendations and 299 lists
for each judged article with its recommendations for this first period, we can look at
the results using trec_eval. Looking at the MAP scores in Figure 5.3, it can be seen
that only the 179 out of the 299 articles that were judged are visible. This is
because trec_eval did not return zero scores if there were no approved recommendations.

Figure 5.3: MAP scores of judged articles over time during first period
As illustrated by the blue line, there is a lot of variance in the MAP scores
over time. However, the most common and also the highest MAP score is 1.0,
occurring 89 out of the 179 times (58.5%). A MAP score of 1.0 can mean two
things: either there is only one approved recommendation, at position 1, or all recommendations that are judged as correct are ranked consecutively from the top, with no incorrect recommendations between them. For example, if there are three correct recommendations, there will only be a MAP score of 1.0 if these three recommendations are on positions 1, 2 and 3. The lowest MAP score is 0.15 and
occurred only two times during this first period.
The red line in Figure 5.3 shows the average MAP scores over time. The average MAP scores of the first 30 articles show a lot of variance and spikes. This is probably because some articles had bad recommendations in the beginning, resulting in low MAP scores. As can be seen, there is a constant upward trend visible after 30 judged articles. This growth stagnates around 90 judged articles, with a highest average MAP score of 0.9211. After that, the average MAP scores show a slight downward trend, ending up with an average MAP score of 0.8956 after 179 judged articles. However, the differences between the MAP scores of articles 1 to 90 and the MAP scores of articles 90 to 179 were not significant.
Because trec_eval did not return MAP scores for the 120 articles with no approved recommendations, the average MAP score of 0.8843 is not right. The corrected average MAP score should be 0.5362, which is the sum of the MAP scores of the 179 judged articles with one or more approved recommendations, divided by 299, the total number of judged articles.
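The correction simply spreads the summed MAP mass of the 179 scored articles over all 299 judged articles, as the small check below (using the average from Table 5.1) shows.

average_map = 0.8956            # mean MAP over the 179 articles with at least one approval
with_approvals, judged_total = 179, 299

corrected_map = average_map * with_approvals / judged_total
print(round(corrected_map, 4))  # 0.5362, the corrected score reported above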
Not only have we calculated the MAP scores of each judged article, but also
the P@n scores. The average and corrected average MAP and P@n scores for this
first period are listed in Table 5.1. When taking all 299 articles into account, all four score types decrease by 40%. That is the percentage of the 120 judged articles with no approved recommendations out of the total of 299 judged articles.
Table 5.1: Average and corrected average MAP and P@n scores during first period

Type score                   MAP      P@5      P@10     P@15
Average scores               0.8956   0.5084   0.3101   0.2235
Corrected average scores     0.5362   0.3034   0.1856   0.1338
Figure 5.4 shows the averages of all three P@n scores and the average
number of approved recommendations. As can be seen, all P@n scores have the
same increase after 80 judged articles. Also, after 155 judged articles, the P@n
scores tend to increase a bit more. It is remarkable to see that all the P@n scores
increase after 80 judged articles, while the MAP scores became more constant after
90 judged articles (Figure 5.3). But as can be seen from the purple line, the average number of approvals shows the same increase at the end. Because P@n scores are calculated by dividing the number of approved articles in the top n by n, the upward trend in the average approvals has an influence on the increase of the P@n scores.
Figure 5.4: P@n scores and average approvals of judged articles over time during first period
The P@n scores are still growing at the end, so it seems that the growth continues beyond 299 judged articles. In Figure 5.9 we will indeed see that this growth continues until 350 judged articles, after which the scores decline to a constant average P@5 of 0.3787, a constant average P@10 score of 0.2352, and a constant average P@15 score of 0.1703.
During this first period, three editors judged the recommendations made by
the Trouw Recommender. One editor judged 195 articles, another judged 90 articles, and a third judged only 14 articles. We have calculated the average MAP score for each editor and the results are shown in Table 5.2.
Table 5.2: Differences in scores and approvals between the editors during the first period

Editor ID   Articles judged   Average MAP score   Corrected average MAP score   Approvals   Average approvals per article
4           90                0.8794              0.8501                        381         4.2
11          14                1                   0.8571                        24          1.7
18          195               0.8976              0.3682                        201         1.0
The two editors who judged the most articles have very similar average MAP scores of 0.8794 and 0.8976. But if we look at the corrected average MAP scores, where the articles with no approvals are also used in the calculation, the MAP score of Editor 18 decreases by almost 59% to 0.3682. This is the percentage of his judged articles with no approved recommendations, as can be seen in Table A.1 in the Appendix. Taking all the MAP scores of both editors, the differences between them are significant. This also applies to Editors 11 and 18; their MAP scores are significantly different too. Only the MAP scores of Editor 4 and Editor 11 are not significantly different.
Looking at the average approvals per article, which is the total number of approved recommendations divided by the number of judged articles for each editor, a similar difference between Editors 4 and 18 can be seen. Editor 18 has an average of 1.0 approvals per article, while Editor 4 has an average of 4.2 approvals per article. As can be seen in Table A.1 in the Appendix, only 3.3% of Editor 4's judged articles have no correct judgments, and therefore his corrected average MAP is relatively high in comparison to Editor 18.
Because in general one editor judged one day, it has to be taken into
account what day, which editor made the judgments. This difference in days is
shown in Figure 5.5, where Editor 18 judged the first nine days, the tenth day is
judged by both Editors 11 and 18, again one day by Editor 18 followed by four days
by Editor 4. Finally, Editor 18 judged another day and at the end Editor 4 judged
another three days.
Figure 5.5: Dates of judgments and number of judged articles for each editor
Figure 5.6 shows the average MAP scores for each editor after each judged
article over time. The three coloured lines are distributed over the days as seen in
Figure 5.5. Again, the average MAP scores are only based on the articles that were
judged with one or more approved recommendations. This means that Editor 4
judged 87 out of 90 articles with one or more approved recommendations; Editor 11
judged 12 out of 14 articles, and Editor 18 judged 80 out of 195 articles.
Figure 5.6: Average MAP scores of judged articles by each editor
Looking at the green line, corresponding to Editor 18, the average MAP score
improves towards 0.9. All 12 articles judged by Editor 11 had a MAP score of 1.0.
Finally, the average MAP scores of Editor 4 show a downward trend, starting from a
MAP score of 1.0 and ending up with an average MAP score of 0.85. Because Editor
18 judged the first ten days, it looks like the recommendations became better over
time. For Editor 4, however, the system seems to get slightly worse over time.
The P@5 scores of each editor in Figure 5.7 show something different from the
MAP scores in Figure 5.6. For all three editors, the P@5 scores show a similar
downward trend in the beginning and become more constant over time. Editor 4
ends up with an average P@5 score of 0.6161, Editor 11 with an average of 0.3667,
and Editor 18 with an average P@5 score of 0.4125.
Figure 5.7: Average P@5 scores of judged articles by each editor
Table 5.3 lists the average and corrected average P@n scores. Corrected
average means that the articles with no approved recommendations are also taken
into account when calculating the averages. Just as with the MAP scores, the
corrected averages drop strongly for Editor 18: all his P@n scores drop by 59%,
because that is the percentage of articles he judged without approved
recommendations.
Table 5.3: Editors with their average and corrected average P@n scores

Editor ID   Average P@5   Corrected average P@5   Average P@10   Corrected average P@10   Average P@15   Corrected average P@15
4           0.6161        0.5956                  0.3897         0.3767                   0.2874         0.2778
11          0.3667        0.3143                  0.2000         0.1714                   0.1333         0.1143
18          0.4125        0.1692                  0.2400         0.0985                   0.1675         0.0687
Only the P@10 and P@15 scores of Editors 11 and 18 were not significantly
different from each other; for all the other editor combinations the scores were
significantly different. This significance was calculated over the scores of all judged
articles, including the zero values of the judged articles with no approved
recommendations.
Because the baseline Lemur algorithm was also used for some recommendations
during the second period and after the third period, we were able to collect
additional data for it after the first period. In total, there were 1765 judged articles
whose recommendations were made using the baseline Lemur algorithm. Out of
these 1765 judged articles, 1370 articles (77.6%) had one or more approved
recommendations. This is 17.6% more than for the 299 judged articles during the
first period, where only 60% of the judged articles had one or more approvals. The
MAP and average MAP scores for all these 1370 articles are plotted in Figure 5.8.
Figure 5.8: MAP scores of all judged articles recommended with the baseline Lemur algorithm
Looking at the MAP scores, a lot of variation over time can be seen, although
most of the MAP scores are between 0.9 and 1.0 (68.8%), as can be seen in Table
5.4. The period around 300 articles has some low scores. The red line,
corresponding to the average MAP scores, shows the same: after 100 articles this
line goes downward until 300 articles and then becomes more constant towards an
average MAP score of 0.88.

Table 5.4: Number of articles for different MAP scores

MAP score    Articles   % of all articles
0.0 - 0.1    2          0.1
0.1 - 0.2    11         0.8
0.2 - 0.3    19         1.4
0.3 - 0.4    34         2.5
0.4 - 0.5    62         4.5
0.5 - 0.6    42         3.1
0.6 - 0.7    55         4.0
0.7 - 0.8    69         5.0
0.8 - 0.9    133        9.7
0.9 - 1.0    943        68.8
The P@n scores of all 1370 articles, shown in Figure 5.9, show an opposite
trend compared to the MAP scores in Figure 5.8. The P@n scores have an upward
trend from 100 articles until about 300 articles. After that there is a downward
trend, which becomes more constant around 1300 articles.
Figure 5.9: P@n scores of all judged articles with baseline Lemur algorithm
As mentioned by the editors of Trouw in Chapter 4, some articles had bad
recommendations due to the subject of the article. Because the subject of an article
is related to its section, we looked at the different sections and their numbers of
articles. The results for the different sections are shown in Table 5.5.
Unfortunately, only the period from 5 February until 5 September 2008 could be
used, because after the implementation of the new website of Trouw, the daily
retrieved XML files from Trouw no longer contained information about the section.
During this period, 512 out of the 1937 judged articles had no approved
recommendations.
Table 5.5: Sections with their total number of articles and number of articles with no approved
recommendations during the period 5 February 2008 until 5 September 2008

Section   Total number of articles   Articles with no approvals   % of all articles from same section   % of all articles with no approvals
GI_ART    4320                       145                          3.4                                   28.3
NI_ECO    2640                       34                           1.3                                   6.6
NI_NED    6645                       96                           1.4                                   18.8
NI_SPO    2130                       29                           1.4                                   5.7
NI_WER    4365                       54                           1.2                                   10.5
VE_LNG    255                        10                           3.9                                   2.0
VE_OVE    3765                       77                           2.0                                   15.0
VE_POD    2400                       28                           1.2                                   5.5
VE_REL    30                         1                            3.3                                   0.2
empty     2490                       38                           1.5                                   7.4
Most articles with no approved recommendations (28.3%) came from the
section GI_ART, which contains articles related to all kinds of recreation, such as
art, music, and other forms of leisure.
5.3 Discussion
One of the most remarkable observations during the first period is that the average
MAP scores show an upward trend in the beginning (Figure 5.3), while the P@n
scores show a more downward trend in the beginning (Figure 5.4). Looking further
into this phenomenon, it turns out to be related to the editor who was judging at
that moment. Editor 18, who judged the first 164 articles of which 97 had no
approved recommendations, approved on average only 1 recommendation per
article for his other 67 judged articles. For the MAP scores this is good, because a
single approval per article results in a MAP score of 1.0 whenever that approval is
ranked first. But for the P@n scores this is bad, because one approval per article
means that P@5 gets a score of at most 0.2. The P@n scores are therefore low at
the beginning, but they improve later on, when Editor 4 made his judgments. This
editor has an average of 4.2 approvals per article, which can result in a P@5 score
of 0.8 if four of those approved articles are in the top 5 recommendations. Due to
these differences in approval behaviour among the editors, we prefer MAP over P@n
as a measurement instrument.
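To make the behaviour of these two measures concrete, the sketch below computes the per-article average precision and P@n from a list of editor judgments. It is written in Python purely for illustration; the helper names are made up for this example and are not part of the Trouw Recommender.

def average_precision(judgments):
    # judgments[i] is True if the recommendation at rank i+1 was approved.
    hits, precision_sum = 0, 0.0
    for rank, approved in enumerate(judgments, start=1):
        if approved:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def precision_at_n(judgments, n):
    # Fraction of approved recommendations in the top n.
    return sum(judgments[:n]) / n

# A single approval at rank 1 gives an average precision of 1.0 but a P@5 of only 0.2.
single = [True] + [False] * 14
print(average_precision(single), precision_at_n(single, 5))       # 1.0 0.2

# Approvals at ranks 1 and 3 give (1/1 + 2/3) / 2 = 0.8333 and P@5 = 0.4 (cf. Section 6.3).
two = [True, False, True] + [False] * 12
print(round(average_precision(two), 4), precision_at_n(two, 5))   # 0.8333 0.4

This mirrors the pattern described above: a single early approval keeps the MAP high while the P@n scores stay low.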
If we look beyond the first period of six weeks, we notice that the proportion
of judged articles with no approved recommendation drops to 22% after 1400
judged articles (Figure 5.2). In the end, 663 out of the 2951 judged articles (22.5%)
had no approved recommendation. This means that the system returned better
recommendations over time, resulting in more approved recommendations by the
editors. A residual share of roughly 22.5% of judged articles without approved
recommendations will probably always remain: there will always be some articles
for which no good recommendations exist, because their topic is unique in
comparison to all other articles in the index. This could also be seen in Table 5.5,
where 28.3% of the judged articles with no approved recommendations came from
the section GI_ART.
Finally, we looked at the different sections and the number of judged articles
without approved recommendations. Most of the judged articles with zero approvals
came from the GI_ART section. Because these articles are in most cases not related
to news, it is probably more difficult to generate good recommendations for them.
This is related to the problem mentioned in Chapter 2: when an article is very
unique, it cannot be compared to the rest of the articles, which leads to poor
recommendations (Claypool, Gokhale, Miranda, Murnikov, Netes, & Sartin, 1999).
6 Incorporating temporal information in recommendations
A very important, if not the most important, aspect of a newspaper website is
recency. Only articles from the same day or a couple of days old are interesting for
visitors of the website. If an article is published a couple of days after the event has
happened, it is no longer news. Because news depends so much on recency, it is
likely that the same holds for the recommendations. Therefore we want to find out
to what extent recency is important for the recommendations made by the system.
In this chapter we will first describe the experimental setup, then the results,
and finally we will discuss the second research question, “What kind of influence
does recency have on generating recommendations?”, which is related to
incorporating temporal information in the recommendations.
6.1 Experimental setup
To incorporate dates in the recommendations we had to combine the
relevance scores returned from Lemur and the difference in days of the focus and
recommended articles. We decided to calculate a linear combination of the
relevance score produced by Lemur and the temporal information in the form of
recency. To be able to combine these two scores, we scaled them both to the [0, 1]
interval. Because we wanted to have a weighted average, combining both the
difference in days and the Lemur score, different formulas were being drafted in
order to find the best proportion between these two values. The formula that was
implemented is:
Equation 6.1: Formula for combining relevance score and recency

new\_score = \lambda \cdot \frac{1}{\sqrt{d + 1}} + (1 - \lambda) \cdot \frac{score - score_{min}}{score_{max} - score_{min}}

where:
d = difference in days between the focus and the recommended article
λ = linear combination factor between 0 and 1
score_min = lowest Lemur score of the 49 recommended articles
score_max = highest Lemur score of the 49 recommended articles
To retrieve the new score, we first take d, the difference in days between the
focus article and the recommended article. To this number we add 1 to prevent a
division by zero when the focus and recommended article are from the same day.
We then divide one by the square root of this difference in days plus 1, because this
moderates the influence of the discount. This recency term is multiplied by a
variable factor λ, which can be changed to assign more weight to recency or to
relevance as desired. We decided to vary this λ between five values: 0, 0.25, 0.5,
0.75, and 1. If λ is 0, the difference in days is not involved in the calculation,
because it is multiplied by 0. If λ is 1, only the difference in days is involved in the
calculation, because the normalized Lemur score is then multiplied by 0.
Accordingly, a λ of 0.25 leans more towards the Lemur score, while a λ of 0.75
leans more towards the difference in days. A λ of 0.5 takes both the Lemur score
and the difference in days into account in equal amounts.
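As an illustration of Equation 6.1, the following Python sketch combines a min-max normalized Lemur score with the recency discount. The function name is made up and the snippet is not the actual implementation.

import math

def combine_score(lemur_score, score_min, score_max, days_apart, lam):
    # Normalize the Lemur relevance score of this recommendation to [0, 1]
    # using the lowest and highest scores of the 49 recommended articles.
    if score_max == score_min:
        normalized = 0.0          # guard against a constant score list
    else:
        normalized = (lemur_score - score_min) / (score_max - score_min)
    # Recency discount: adding 1 avoids division by zero for same-day articles,
    # and the square root moderates the influence of the discount.
    recency = 1.0 / math.sqrt(days_apart + 1)
    # λ weights recency, (1 - λ) weights the normalized relevance score.
    return lam * recency + (1.0 - lam) * normalized

# λ = 0 keeps the pure Lemur ranking, λ = 1 ranks purely by recency.
print(combine_score(-2.1, -8.0, -1.0, days_apart=3, lam=0.5))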
To get results for all five λ-values, we randomly assigned one of the five
predefined λ-values to each article as it was judged by one of the editors.
Unfortunately, the editors of Trouw did not manage to judge the whole period from
21-03-2008 till 05-06-2008. Because of that, we could only use the judgments for
284 articles. These articles were divided into 190 articles with a λ-value of 0, 15
articles with a λ-value of 0.25, 24 articles with λ = 0.5, 26 articles with a λ-value of
0.75, and finally 29 articles with λ = 1.
The proportions of these values are visualised in Figure 6.1, which shows that the
λ-value 0 occurs far too often in comparison to the other λ-values. Due to a bug in
the PHP code that shows the interface for judging the focus article (Figure 3.5), the
randomization function did not lead to an equal distribution of the five different
λ-values.
Figure 6.1: Division by number of articles for each λ-value
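For illustration only: the assignment itself happened in the PHP judging interface, which is not reproduced here. The Python sketch below, with made-up function names, shows the intended uniform random assignment next to a round-robin variant that would have guaranteed a nearly equal distribution of the five λ-values.

import random

LAMBDA_VALUES = [0, 0.25, 0.5, 0.75, 1]

def draw_lambda():
    # Independent uniform draw: every judged article gets one of the five values.
    return random.choice(LAMBDA_VALUES)

def balanced_lambda(article_index):
    # Round-robin alternative: cycles through the values in a fixed order.
    return LAMBDA_VALUES[article_index % len(LAMBDA_VALUES)]

print([draw_lambda() for _ in range(5)])
print([balanced_lambda(i) for i in range(5)])    # [0, 0.25, 0.5, 0.75, 1]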
As can be seen in Figure 6.2, there are 9 full days on which almost no λ-value
other than 0 occurs; therefore only the days on which all five λ-values occur will be
used for evaluation. These days are 22 March, 26 March until 29 March, and 2 April
2008.
Only 117 articles can thus be used, so conclusions have to be drawn from this small
number of results.
Again, lists of ranked judgments had to be made, but this time each λ-value
had to have its own list, and each set of fifteen judgments had to be reordered
according to Equation 6.1 mentioned above. Different variants of these lists were
made, for each λ-value and for each editor. The statistical significance of the results
was again determined using two-tailed, two-sample equal variance t-tests (p < 0.05).
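We do not reproduce the exact test script here; as an illustrative sketch, SciPy's two-sample t-test carries out the kind of comparison used throughout this evaluation (two-tailed, equal variances, p < 0.05). The score lists below are made up.

from scipy import stats

def significantly_different(scores_a, scores_b, alpha=0.05):
    # Two-tailed, two-sample equal-variance t-test on per-article scores,
    # including the zero scores of articles without approved recommendations.
    t_statistic, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=True)
    return p_value < alpha, p_value

map_scores_lambda_0 = [1.0, 0.0, 0.83, 1.0, 0.5]
map_scores_lambda_1 = [0.2, 0.0, 0.33, 0.0, 0.25]
print(significantly_different(map_scores_lambda_0, map_scores_lambda_1))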
Figure 6.2: Number of articles with their λ-value divided in days during this period.
6.2 Results
When we look at the editors who judged during this period, we see that there
are some different editors in comparison to the first period. This time there were six
different editors who judged the recommendations. Figure 6.3 shows how the 284
judged articles were distributed over these editors. How the λ-values were
distributed for each editor can be seen in Table 6.1.
Figure 6.3: Number of judged articles per editor during the second period
Table 6.1: Editors of the second period with their statistics

Editor ID   λ-value   Articles judged   Average MAP score   Approvals   Average approvals per article
4           0         60                0.8278              449         7.5
4           0.25      1                 1.0000              6           6.0
4           0.5       7                 0.7715              36          5.1
4           0.75      4                 0.6088              13          3.3
4           1         1                 1.0000              1           1.0
7           0         7                 0.4285              47          6.7
9           0         11                0.7197              78          7.1
10          0         92                0.5662              346         3.8
17          0         14                0.4964              14          1.0
17          0.25      9                 0.3333              5           0.6
17          0.5       11                0.5290              22          2.0
17          0.75      10                0.4000              20          2.0
17          1         13                0.2248              11          0.8
18          0         7                 0.7321              20          2.9
18          0.25      5                 0.2200              5           1.0
18          0.5       6                 0.6210              18          3.0
18          0.75      12                0.2449              14          1.2
18          1         9                 0.2882              10          1.1
It is remarkable that both Editors 4 and 10 judged a high number of
articles with λ-value 0. Editor 4 judged, next to the articles with λ-value 0, 13
articles with other λ-values, but Editor 10 judged 92 articles with only the λ-value 0.
Table 6.1 shows that only Editors 17 and 18 have a reasonably equal distribution
over all five different λ-values.
Of all 284 articles judged during this second period, 81 articles were judged
with no approved recommendations (28.5%). Taking only the 117 articles judged by
Editors 4, 17, and 18, there were 51 articles judged with no correct recommended
articles; in this selection, 43.6% of the judged articles had no recommended articles
that were related to the focus article. Looking at each λ-value of these 117 articles
in Table 6.2, we see that λ-values 0 and 0.5 have the lowest percentage of articles
with no approved recommendations. The average number of approvals is also the
highest for those two λ-values.
Table 6.2: Number of judged articles and articles with no approved recommendations per λ-value

λ-value   Articles judged   Approvals   Average approvals per article   No approved recommendations   % of no approved recommendations
0         26                62          2.5                             8                             32.0
0.25      15                25          1.7                             9                             60.0
0.5       23                70          3.0                             7                             30.4
0.75      26                43          1.8                             13                            50.0
1         28                47          1.5                             14                            50.0
Looking at all judged articles per editor, as can be seen in Table A.2 in
the Appendix, we see that Editors 10, 17, and 18 judged the most articles with no
approved recommended articles. Editor 10 judged 31.5% of his articles with no
approved recommendations. Editor 17 judged more than half of his articles, 56.1%,
with no approved recommended articles. And finally, Editor 18 judged 48.7%, also
almost half of his articles, with no approved recommended articles. Of the other
three editors, Editor 4 has his highest percentage at six approved recommendations
(15.2%). Editor 7, with only seven articles judged, has his highest percentage at
four approved recommendations (28.6%). And finally, Editor 9 has his highest
percentage of 18.2% at three, seven, and eleven approved recommendations.
For calculating the MAP scores for the second period, only the judgments of
Editors 4, 17, and 18 during the days 22 March, 26 March until 29 March, and 2
April 2008 were used. They were able to judge 117 articles, with the average and
corrected average MAP score of each λ-value listed in Table 6.3.
Table 6.3: MAP scores for each λ-value of Editors 4, 17, and 18

λ-value   Average MAP   Corrected average MAP
0         0.8554        0.5817
0.25      0.7694        0.3078
0.5       0.8177        0.5688
0.75      0.7324        0.3662
1         0.6508        0.3254
The first column of average MAP scores is calculated without taking the
articles with no approved recommendations into account; the second column of
corrected average MAP scores is calculated over all judged articles. As can be seen
from Table 6.3, λ-values 0 and 0.5 suffer least from articles without approved
recommendations, because both have about 30% of their articles judged with no
approved recommendations, as we saw in Table 6.2, while the other three λ-values
all have 50% or more of their articles without approved recommendations. The
lowest corrected average MAP score is for λ-value 0.25, while λ-value 1 has the
lowest uncorrected average MAP score. Figure 6.4 shows the average MAP scores of
the five different λ-values for Editors 4, 17, and 18 during the period of 22 March,
26 March until 29 March, and 2 April 2008.
Figure 6.4: Average MAP scores of all five λ-values judged by Editors 4, 17, and 18.
These are again the average MAP scores without the articles with zero
approved recommendations. λ-values 0 and 0.5 score very similarly, ending up with
average MAP scores of 0.8554 and 0.8177 respectively. λ-value 0.25 does not
perform that badly according to this figure, but it has the disadvantage that 60% of
its articles were judged with no approved recommendations. We looked at the
differences between the MAP scores, but only the differences between λ-values 0
and 1, and between 0.5 and 1, were significant. All the recommendations, including
the ones with zero approvals, were taken into account when calculating these
differences.
Table 6.4: Corrected average P@n scores for each λ-value of Editors 4, 17, and 18

Type of score   λ-value 0   λ-value 0.25   λ-value 0.5   λ-value 0.75   λ-value 1
P@5             0.2960      0.1200         0.3913        0.2231         0.2000
P@10            0.1960      0.0667         0.2565        0.1500         0.1500
P@15            0.1653      0.0533         0.1855        0.1179         0.1024
The P@5 scores in Figure 6.5 show that λ-value 0.5 performed best over
time, although a downward trend is visible. It is remarkable that λ-value 0 scored
low on the P@5 scores, while it scored high on the MAP scores. There are significant
differences in the P@5 scores between λ-values 0 and 0.25, between 0.25 and 0.5,
and between 0.5 and 1. Again, all the recommendations were taken into account
when calculating these differences.
Figure 6.5: Average P@5 scores of all five λ-values judged by Editors 4, 17, and 18.
6.3 Discussion
It is regrettable that only 117 judged articles could be used to evaluate the
performance of the algorithm used during this second period. Not only did the
editors of Trouw manage to judge just 284 articles, but the fact that the
randomization in the PHP script did not function well is also very disappointing. Due
to the small number of articles that could be used for evaluating this second
research question, it is hard to tell whether these results also hold in the long term.
It looks like λ-value 0.5 would be a good alternative for replacing the
baseline Lemur algorithm. Not only were its MAP scores almost the same as for
λ-value 0, it even performed better than λ-value 0 on the P@n scores. It also had
the highest average number of approvals and the smallest number of judged
articles with no approved recommendations. However, the differences between
λ-value 0.5 and λ-value 0 were not significant for either the MAP or the P@5 scores.
Figure 6.5 showed that the P@5 scores of λ-value 0.5 were the highest of all
λ-values, while Figure 6.4 shows that λ-value 0 had the highest MAP scores at the
end. The higher P@5 scores of λ-value 0.5 mean that there are more approved
recommendations in the top 5 recommendations. The P@5 scores of λ-value 0 show
that there were only 2 or fewer approved recommendations in the top 5, because
the P@5 scores were 0.4 or lower. Probably these 2 approved recommendations
were on positions 1 and 2 (MAP score of 1.0), or 1 and 3 (MAP score of 0.8333),
given the high MAP scores for λ-value 0.
When taking only recency into account, which was the case with λ-value 1,
both the MAP and P@n scores were not very good in comparison to the other
λ-values. The average MAP score of λ-value 1 ends up at 0.6508, with a corrected
average MAP score of 0.3254. The average P@5 score of λ-value 1 is also 0.4, but
the corrected average P@5 score is only 0.2. However, λ-value 0.25 scored worst on
both corrected averages, with 0.3078 for MAP and 0.1200 for P@5. This is probably
caused by the small number of judged articles (15) and the high percentage of
judged articles with no approved recommendations (60%).
7 Incorporating other metadata in recommendations
In this chapter we attempt to answer our last research question: “What kind of
influence does author or section metadata have on generating recommendations?”.
It is likely that articles from the same section, for example articles in the section
“Sport”, are more related to each other than articles from different sections. The
same could hold for articles by the same author, who is likely to write different
articles about similar topics. We will first explain how these two types of metadata
are incorporated in the recommender system. After that we will look at the results
of the judgments with these incorporated metadata.
7.1 Experimental setup
Because we already had some experience with combining metadata with
the standard Lemur score from the previous research question, we were able to use
a similar kind of computation for these metadata. For each article, information
about the author and section is stored in the MySQL database. With these data it is
possible to compare the metadata of the focus article with the metadata of its
recommended articles. So we use the author and section data from the database to
reorder the recommendations. We made use of the following calculations to
incorporate the author and section metadata:
Equation 7.1: Formula for combining author and section metadata
with the relevance score from Lemur

Author AND section match:   e^{score} \cdot 2
Author OR section match:    e^{score} \cdot 1.5
Author NOR section match:   e^{score} \cdot 1
The relevance scores from Lemur, as they are stored in the Trouw
database, range from approximately -1 to -8. In fact they represent small positive
numbers, as they are the results of the standard language model (KLD & JL
Smoothing) algorithm described in Chapter 3; however, Lemur returns the natural
logarithm of these small positive numbers, resulting in larger negative numbers. In
order to combine the Lemur score with the metadata, we decided to use the original
small numbers and give a bonus if there was any similarity in metadata. So if both
author and section were the same for the focus and recommended article, the
exponentiated Lemur score was multiplied by two. If only the author or only the
section matched, the exponentiated Lemur score was multiplied by one and a half.
And finally, if neither author nor section matched, the standard exponentiated
Lemur score was used.
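A minimal sketch of Equation 7.1 in Python, purely for illustration; the function and field names are made up and the snippet is not the actual implementation. The logarithmic Lemur score is exponentiated back to a small positive number and then multiplied by the metadata bonus.

import math

def metadata_boosted_score(lemur_log_score, focus, candidate):
    # focus and candidate are assumed to be dicts holding the 'author' and
    # 'section' fields retrieved from the MySQL database.
    same_author = bool(focus.get("author")) and focus.get("author") == candidate.get("author")
    same_section = bool(focus.get("section")) and focus.get("section") == candidate.get("section")
    if same_author and same_section:
        bonus = 2.0       # author AND section match
    elif same_author or same_section:
        bonus = 1.5       # author OR section match
    else:
        bonus = 1.0       # neither matches
    return math.exp(lemur_log_score) * bonus

focus = {"author": "Koos Dijksterhuis", "section": "GI_ART"}
candidate = {"author": "Koos Dijksterhuis", "section": "NI_NED"}
print(metadata_boosted_score(-2.3, focus, candidate))    # exp(-2.3) * 1.5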
For this six-week period from 6 June 2008 till 20 August 2008, we only have
judgments that were reordered according to Equation 7.1 mentioned above. During
this third period, 1350 articles were recommended by the system (Figure 7.1). Out
of these 1350 articles, the editors managed to judge 1090 articles (81%); 224 of
these 1090 judged articles had no correct recommendations (20.6%).
Figure 7.1: Recommended articles during the third period (260 unjudged articles (19%),
866 judged articles with one or more approvals (64%), and 224 judged articles with no approvals (17%))
Due to some misunderstanding, all judged articles
during this period used the same algorithm. Fortunately, the baseline Lemur
algorithm was used again for the articles judged in the period after 20 August 2008,
so we were able to use the same number of judged articles from that period for
comparison. Out of the judged articles in that period, 209 had no correct
recommendations (19.2%). Seven editors judged the recommendations in both
periods, and in addition two other editors judged in only one of the two periods. We
again used two-tailed, two-sample equal variance t-tests (p < 0.05) to determine
the statistical significance of the results.
7.2 Results
We created 2180 lists: 1090 lists were based on the recommendations with
the author and/or section algorithm for this third period, and 1090 lists were based
on the baseline Lemur algorithm after this third period. Now we can see whether the
performance of the system improves when author and section metadata are
incorporated. We calculated the average MAP scores for both periods and plotted
them in Figure 7.2.
Figure 7.2: Average MAP scores of judged articles over time during both periods
The blue line represents the judged articles of the third period, while the red
line shows the average MAP scores of the judged articles after that third period. It is
remarkable that both lines seem to show similar movements up to 300 judged
articles, even though they come from different periods. After these 300 judged
articles, the baseline Lemur scores increase to a top average MAP score of 0.9167,
while the metadata scores decrease to a bottom average MAP score of 0.8262.
Because both periods had judged articles with no approved recommendations, the
corrected average MAP scores are lower: 0.6536 for the metadata algorithm and
0.7142 for the baseline Lemur algorithm. Taking all MAP scores for these two
periods, including those with no approved recommendations, the differences
between them were significant.
The average P@5 scores of the judged articles over time are shown in Figure
7.3. The starting points of the two periods are very different: while the average P@5
score of the metadata algorithm starts high with a maximum of 0.8, the average
P@5 score of the baseline Lemur algorithm starts at a minimum of 0.2. The
recommendations for the first articles thus seem to be better for the metadata
algorithm than for the baseline Lemur algorithm. After 50 judged articles, both
periods are quite similar with an average P@5 score of 0.5, and these P@5 scores
are not significantly different from each other. The corrected average P@5 scores
are 0.3943 for the metadata algorithm and 0.3701 for the baseline Lemur algorithm.
Figure 7.3: Average P@5 scores of judged articles over time during both periods
In each of the two periods, a group of eight editors made the judgments of the
recommendations; Editors 13 and 16 each judged in only one of the two periods.
The number of articles judged, average MAP scores, corrected average MAP scores,
approvals, and average approvals of all editors can be seen in Table 7.1 and Table
7.2.
Table 7.1: Editors who judged during the third period

Editor ID   Articles judged   % Articles with no approved recommendations   Average MAP score   Corrected average MAP score   Approvals   Average approvals
4           91                0.0                                           0.8651              0.8651                        496         5.5
10          47                40.4                                          0.7574              0.4512                        137         2.9
11          40                20.0                                          0.8104              0.6484                        114         2.9
12          40                5.0                                           0.7006              0.6656                        202         5.1
15          30                16.7                                          0.6087              0.5073                        73          2.4
16          463               14.7                                          0.8260              0.7047                        1433        3.1
17          284               31.0                                          0.8603              0.5937                        614         2.2
18          97                36.1                                          0.8199              0.5241                        186         1.9
Editor 4 judged all of his articles with one or more correct recommendations;
therefore his corrected average MAP score of 0.8651 is equal to his average MAP
score. All the other editors judged one or more articles with no approved
recommendations. Editor 10 has the highest percentage of judged articles with no
approved recommendations (40.4%) and consequently the lowest corrected average
MAP score of 0.4512.
Table 7.2: Editors who judged after the third period

Editor ID   Articles judged   % Articles with no approved recommendations   Average MAP score   Corrected average MAP score   Approvals   Average approvals
4           11                0.0                                           0.9197              0.9191                        52          4.7
10          49                26.5                                          0.8723              0.6409                        72          1.5
11          581               17.6                                          0.9375              0.7729                        1236        2.1
12          297               5.7                                           0.7978              0.7521                        918         3.1
13          18                5.6                                           0.8696              0.8213                        104         5.7
15          127               19.7                                          0.8723              0.7006                        254         2.0
17          152               44.7                                          0.9045              0.4999                        202         1.3
18          41                26.8                                          0.9269              0.6783                        122         3.0
As can be seen in Table 7.2, for the period after the third period, Editor 4
again judged all of his articles with one or more correct recommendations, but this
time he judged only one day, with 11 articles. The percentages of articles with no
approved recommendations are very similar for Editors 11, 12, and 15: in both
periods they judged around 20%, 5%, and again 20% of their articles without
approvals, respectively. During this second of the two periods, Editor 17 judged the
most articles with no approved recommendations (44.7%), which means that
almost half of the articles he judged had no correct recommendations.
If we look at the MAP scores of the seven editors who judged in both
periods, for four of them the scores are significantly different. So Editors 10, 11, 15,
and 17 judged significantly differently in the first of these two periods than in the
second. Where Editors 10, 11, and 17 judged the articles with the baseline Lemur
algorithm better, Editor 15 judged the articles with the metadata algorithm better.
We also looked at the authors of the judged articles for which recommendations
were generated during both periods. In the period from 6 June 2008 till 20 August
2008, 340 different authors wrote the 1092 judged articles. As can be seen in Table
7.3, the author with the most articles during this period was just a dash (-).
The numbers two, three, and five were all groups of editors from different sections
(foreign countries, politics, and economy). Only the fourth, Koos Dijksterhuis, is a
real person, who wrote 28 of the judged articles.
Table 7.3: Top five authors of judged focus articles during the third period

Name of the author              Number of articles
-                               119
Van onze redactie buitenland    43
Van onze redactie politiek      31
Koos Dijksterhuis               28
Van onze redactie economie      22
For the comparable period after this third period, with the standard Lemur
algorithm, 324 different authors wrote the 1097 judged articles. The same authors
as in the third period were also the top five authors for the articles after that period;
only the number of articles written differs for all five authors, as can be seen in
Table 7.4.
Table 7.4: Top five authors of judged focus articles after the third period

Name of the author              Number of articles
-                               104
Van onze redactie economie      40
Koos Dijksterhuis               37
Van onze redactie buitenland    36
Van onze redactie politiek      34
So in the top 5 there is almost no difference in authors between the two
periods, only in the number of articles. However, we found that a problem can occur
with authors: when looking at the list of different authors, it was remarkable to see
that sometimes the location of the author was also included in the name. An
example is the author “Frank Kools”, who also appears as “Frank Kools New
Hampshire”, “Frank Kools New York”, “Frank Kools Pikeville, Kentucky”, “Frank Kools
Richmond, Virginia”, “Frank Kools Steelton, Pennsylvania”, and “Frank Kools
Washington”. This author is probably an American columnist, given all the different
American cities added after his name. But all these different names for one author
make a simple comparison very hard.
7.3 Discussion
From the results of the average MAP scores during this third period, it is
obvious that incorporating author and section metadata had a negative influence on
the performance of the system. The MAP scores of the metadata algorithm go
downward, while the MAP scores of the baseline Lemur algorithm show an upward
trend, and these differences were significant.
Although the P@5 scores were slightly higher for the metadata algorithm
during the third period, the differences between the two algorithms were not
significant. This is also the case for the P@10 and P@15 scores, which show the
same trend over time as the P@5 scores.
Looking at the different editors during the period of this third research
question, we saw that seven editors judged articles in both periods. Four of these
seven editors judged the articles with the metadata algorithm significantly
differently from the articles with the baseline Lemur algorithm. In general, the
corrected average MAP scores for the articles judged with the baseline Lemur
algorithm were higher than those for the articles judged with the metadata
algorithm. Only Editor 15 judged the articles with the metadata algorithm higher,
with a corrected average MAP score of 0.5937, while he judged the articles with the
baseline Lemur algorithm with a corrected average MAP score of 0.4999.
Finally, the differences in authors turned out to be harder to handle than we had
anticipated. The most common author in both periods was just a dash, meaning
that the author field was empty during indexing. Besides that, three of the top five
authors were groups of editors covering a specific topic, which can be seen as rough
versions of the section. Also, the problem that one author can appear under
different names, including the location or section he writes for, should be solved in
order to make good comparisons between the focus and recommended articles.
8 Conclusion and future work
Now that all research questions have been described and discussed in the
previous chapters, we will present our conclusions in this chapter. First we will draw
our conclusions for each research question separately. After that, we will discuss
what future work on the Trouw Recommender should be focussed on.
8.1 Article growth
Our first research question was: “What kind of influence does article growth
have on generating recommendations?”. Looking at article growth in relation to the
MAP and P@n measurements, we saw that for both measurements the average
scores get better over time. However, the highest average MAP score was already
reached after 100 judged articles and remained reasonably constant afterwards.
The corrected average P@n scores, however, had their highest averages after 350
judged articles and show a more downward trend afterwards. In general, it can be
said that article growth had less influence on the MAP scores than on the P@n
scores.
Looking at the number of approvals per article, we can say that article growth
had a positive influence on the performance of the system. For the whole period in
which the baseline Lemur algorithm was used, the proportion of judged articles with
no approved recommendations dropped from 75% after 50 judged articles to 20%
after 2900 judged articles. It seems that the system returned better
recommendations over time, resulting in more judged articles with approved
recommendations.
8.2 Incorporating temporal information in recommendations
The second research question of this thesis was: “What kind of influence does
recency have on generating recommendations?”. Using the MAP and P@n
measurements, we saw that λ-value 0 scored best for MAP, while λ-value 0.5 scored
best for P@n. Where λ-value 0 relies only on the Lemur relevance score, λ-value 0.5
takes both recency and the Lemur relevance score into account equally. Besides
that, λ-value 0.5 also scored best after λ-value 0 on the average MAP scores. When
only recency was taken into account, which was the case with λ-value 1, both MAP
and P@n had the lowest averages. However, the corrected averages for both MAP
and P@n were lowest for λ-value 0.25, due to the small
number of judged articles and the high proportion of judged articles without
approved recommendations for that λ-value.
So for the MAP measurement, recency did not lead to better recommendations,
while for P@n, recency in combination with Lemur relevance did result in better
recommendations. In both cases, however, the differences in scores between
λ-value 0 and λ-value 0.5 were not significant.
It can therefore be said that incorporating temporal information did not lead to
better recommendations. Perhaps the recommendations of the baseline Lemur
algorithm were already recent, which would explain the lack of significant
differences between the two algorithms. The small number of judged articles for
each λ-value has probably also contributed to these results. So it is not clear
whether these results also reflect the performance of the Trouw Recommender in
the long term.
8.3 Incorporating other metadata in recommendations
Our third research question was: “What kind of influence does author or
section metadata have on generating recommendations?”. We used two periods to
compare the baseline Lemur algorithm with the metadata algorithm. Looking at the
average MAP scores for these two periods, the baseline Lemur algorithm scored
better than the metadata algorithm, and the difference in MAP scores between the
two periods was significant. The P@n scores, however, were slightly better for the
metadata algorithm, with a corrected average P@5 score of 0.3943 for the metadata
algorithm against 0.3701 for the baseline Lemur algorithm; this difference was not
significant.
Given the significant differences in the MAP scores between the two periods, it
can be said that author and/or section metadata did not lead to better
recommendations. The baseline Lemur algorithm performed better over the same
number of articles.
The author metadata turned out to be too messy to make good comparisons
between the author of the focus article and the authors of the recommended
articles. It is also not certain whether the recommendations were already from the
same section as the focus article; if so, incorporating this metadata would not
change the ordering of the recommendations.
8.4 Future work
Now that we have investigated article growth, incorporating temporal data,
and incorporating metadata for the Trouw Recommender, other options could be
researched. One option, related to incorporating temporal data, is to reorder only
the top 15 recommendations by recency. In that case the best recommendations
according to the baseline Lemur algorithm are used, but they are shown ordered by
date instead of by Lemur relevance score. It would be interesting to see how this
method performs in comparison to Equation 6.1 with λ-value 0.5, as described in
Chapter 6.
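A minimal Python sketch of this proposed variant, with an illustrative data structure that is not taken from the Trouw Recommender: keep the top 15 articles as ranked by the baseline Lemur algorithm, but present them ordered by publication date.

from datetime import date

def reorder_top_by_recency(ranked_recommendations, top_n=15):
    # ranked_recommendations is a Lemur-ranked list of (article_id, publication_date) pairs.
    top = ranked_recommendations[:top_n]
    return sorted(top, key=lambda item: item[1], reverse=True)   # newest first

ranked = [("a1", date(2008, 6, 2)), ("a2", date(2008, 6, 6)), ("a3", date(2008, 5, 30))]
print(reorder_top_by_recency(ranked, top_n=3))    # a2, a1, a3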
Another option that could result in better recommendations for the Trouw
Recommender is relevance feedback. With relevance feedback, the information from
the judgments, whether or not the recommendations were relevant, can be used to
formulate a new query. This should lead to better recommendations when Lemur
uses this new query to generate recommendations again. However, it is
questionable whether the same query will be issued again in the future: because
the Trouw Recommender uses the whole focus article as its query, it is not likely
that the exact same focus article will need recommendations again.
The problem that one author can occur under different names in the MySQL
database should be solved in future work, in order to see whether incorporating
“real” author metadata could lead to better recommendations. One option is to cut
off all the words that follow the actual author name, for example the locations or
sections, and keep only the author's first and last name.
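A possible clean-up could look like the Python sketch below, purely as an illustration. The assumption that an author name consists of the first two tokens will not hold for every byline, such as the “Van onze redactie ...” entries, so a real solution would need a more careful rule or a list of known authors.

import re

def normalize_author(raw_name, name_tokens=2):
    # Keep only the first tokens of the byline and drop trailing locations,
    # e.g. "Frank Kools New York" -> "Frank Kools".
    tokens = re.split(r"\s+", raw_name.strip())
    return " ".join(tokens[:name_tokens])

for raw in ["Frank Kools", "Frank Kools New York", "Frank Kools Steelton, Pennsylvania"]:
    print(normalize_author(raw))    # all three print "Frank Kools"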
During the subjective evaluation, we mentioned that always having one article
with a 100% confidence score was sometimes seen as a problem. The fact that
there is always one article with 100% confidence is a consequence of the
normalization. This should be solved in future work, otherwise the editors could
think that the system is not working properly.
As we noted in the related work, online newspapers have the advantage that
they can make use of personalization. If we look at the possibilities for this in
combination with the Trouw Recommender, we would first need to build up user
profiles. These user profiles can then be used to capture each user's interests, and
those interests can in turn be used to generate personal recommendations. This
would however have
a great influence on the architecture of both the Trouw Recommender and the
website of Trouw. Because the Trouw Recommender would need to generate
personal recommendations for each unique user, it would become more
time-consuming and would probably need more resources for computing the
recommendations. The website of Trouw, on the other hand, would need a login for
its users. Besides that, it would have to collect the click-through data of each user,
which can be used for building up a user profile. A lot of research has to be
conducted before personalization can be used in combination with the Trouw
Recommender.
References
Balabanovic, M., & Shoham, Y. (1997). Fab: content-based, collaborative
recommendation. Communications of the ACM, 40 (3), 66-72.
Bogers, T., & van den Bosch, A. (2007). Comparing and evaluating
information retrieval algorithms for news recommendation. RecSys '07:
Proceedings of the 2007 ACM Conference on Recommender Systems (pp.
141-144). ACM Press.
Breese, J., Heckerman, D., & Kadie, C. (1998). Empirical analysis of
predictive algorithms for collaborative filtering. Proceedings of the Fourteenth
Annual Conference on Uncertainty in Artificial Intelligence, 43-52.
Claypool, M., Gokhale, A., Miranda, T., Murnikov, P., Netes, D., & Sartin, M.
(1999). Combining content-based and collaborative filters in an online
newspaper. Proceedings of ACM SIGIR Workshop on Recommender Systems .
Croft, W. B., Metzler, D., & Strohman, T. (2009). Search Engines:
Information Retrieval in Practice. Addison Wesley.
Das, A., Datar, M., & Garg, A. (2007). Google news personalization: scalable
online collaborative filtering. WWW '07: Proceedings of the 16th international
conference on World Wide Web (pp. 271-280). New York, NY, USA: ACM
Press.
Fernández, R. T. (2007). The Effect Of Smoothing In Language Models For
Novelty Detection. Future Directions in Information Access, FDIA'2007.
Glasgow.
Goldberg, D., Nichols, D., Oki, B., & Terry, D. (1992). Using collaborative
filtering to weave an information tapestry. Communications of the ACM , 35
(12), 61-70.
Hanani, U., Shapira, B., & Shoval, P. (2001). Information filtering: Overview
of issues, research and systems. User Modeling and User-Adapted Interaction
, 11 (3), 203-259.
Kamba, T., Bharat, K., & Albers, M. (1994). The krakatoa chronicle: An
interactive, personalized newspaper on the web. Proceedings of the 4th
International Conference on World Wide Web, (pp. 159–170).
Maidel, V., Shoval, P., Shapira, B., & Taieb-Maimon, M. (2008). Evaluation of
an ontology-content based filtering method for a personalized newspaper.
RecSys '08: Proceedings of the 2008 ACM Conference on Recommender
Systems (pp. 91-98). New York, NY, USA: ACM Press.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to
Information Retrieval. New York, NY, USA: Cambridge University Press.
McAdams, M. (1995). Inventing an online newspaper. Interpersonal
Computing and Technology , 64-90.
Mooney, R. J., & Roy, L. (2000). Content-based book recommending using
learning for text categorization. DL '00: Proceedings of the Fifth ACM
Conference on Digital Libraries (pp. 195-240). New York, NY, USA: ACM
Press.
Newspaper. (2001, 12 28). Retrieved 02 12, 2009 from Wikipedia, the free
encyclopedia: http://en.wikipedia.org/wiki/Newspaper
Pennock, D. M., Horvitz, E., Lawrence, S., & Giles, C. L. (2000). Collaborative
filtering by personality diagnosis: A hybrid memory- and model-based
approach. Proceedings of the Sixteenth Conference on Uncertainty in Artificial
Intelligence (pp. 473-480). San Francisco: Morgan Kaufmann.
Ponte, J., & Croft, W. (1998). A language modeling approach to information
retrieval. SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval (pp. 275-281).
New York, NY, USA: ACM Press.
PricewaterhouseCoopers. (2008, 09 7). IAB Internet Advertising Revenue.
Retrieved 02 11, 2009 from IAB:
http://www.iab.net/media/file/IAB_PWC_2008_6m.pdf
Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., & Riedl, J. (1994).
Grouplens: An open architecture for collaborative filtering of netnews. CSCW
'94: ACM Conference on Computer Supported Cooperative Work (pp. 175-186).
New York, NY, USA: ACM Press.
Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2001). Item-based
collaborative filtering recommendation algorithms. WWW '01: Proceedings of
the 10th International Conference on World Wide Web (pp. 285-295). New
York, NY, USA: ACM Press.
Sigmund, J. (2008, 04 14). Newspaper web sites attract record audiences in
first quarter. Retrieved 05 06, 2008 from Newspaper Association of America:
www.naa.org
Singhal, A. (2001). Modern information retrieval: A brief overview. IEEE Data
Engineering Bulletin , 24 (4), 35-43.
Trouw. (2004, 01 30). Retrieved 10 27, 2008 from Wikipedia, de vrije
encyclopedie: http://nl.wikipedia.org/wiki/Trouw_(krant)
Yang, C., Chen, H., & Hong, K. (2003). Visualization of large category map
for internet browsing. Decision Support Systems , 35 (1), 89 – 102.
Appendix A
Figure A.1: Trouw Recommender full article view
Figure A.2: Trouw Recommender interface for judging article “Conferentie Antillen
lokt demonstratie uit”
Figure A.3: Article “Conferentie Antillen lokt demonstratie uit” on the Trouw website
with the two recommended articles in the grey box on the bottom left
Figure A.4: New articles added during the first period
Table A.1: Approvals and percentages of all judged articles for each editor in the first period
Approvals   Editor 4: articles   Editor 4: % of all   Editor 11: articles   Editor 11: % of all   Editor 18: articles   Editor 18: % of all
0           3     3.3    2    14.3    115   59.0
1           9     10.0   5    35.7    30    15.4
2           16    17.8   6    42.9    23    11.8
3           16    17.8   0    0       15    7.7
4           17    18.9   0    0       4     2.1
5           12    13.3   0    0       2     1.0
6           3     3.3    0    0       2     1.0
7           2     2.2    1    7.1     1     0.5
8           3     3.3    0    0       0     0
9           1     1.1    0    0       1     0.5
10          2     2.2    0    0       0     0
11          1     1.1    0    0       0     0
12          1     1.1    0    0       1     0.5
13          1     1.1    0    0       0     0
14          2     2.2    0    0       1     0.5
15          1     1.1    0    0       0     0
Table A.2: Approvals of all judged articles for each editor in the second period
Approvals   Editor 4: articles   Editor 4: %   Editor 7: articles   Editor 7: %   Editor 9: articles   Editor 9: %
0           0    0      1   16.7   0   0
1           2    2.5    0   0      1   9.1
2           10   12.7   0   0      1   9.1
3           8    10.1   0   0      2   18.2
4           7    8.9    1   16.7   0   0
5           9    11.4   1   16.7   0   0
6           12   15.2   0   0      0   0
7           4    5.1    1   16.7   2   18.2
8           2    2.5    0   0      0   0
9           3    3.8    0   0      1   9.1
10          4    5.1    0   0      2   18.2
11          6    7.6    0   0      1   9.1
12          3    3.8    1   16.7   0   0
13          1    1.3    0   0      0   0
14          1    1.3    0   0      0   0
15          7    8.7    1   16.7   1   9.1

Approvals   Editor 10: articles   Editor 10: %   Editor 17: articles   Editor 17: %   Editor 18: articles   Editor 18: %
0           29   31.5   32   56.1   19   48.7
1           15   16.3   8    14.0   4    10.3
2           12   13.0   7    12.3   7    18.0
3           7    7.6    3    5.3    2    5.1
4           2    2.2    5    8.8    1    2.6
5           1    1.1    0    0      3    7.7
6           3    3.3    0    0      0    0
7           4    4.4    1    1.8    1    2.6
8           3    3.3    0    0      1    2.6
9           1    1.1    0    0      1    2.6
10          2    2.2    0    0      0    0
11          3    3.3    0    0      0    0
12          2    2.2    0    0      0    0
13          0    0      0    0      0    0
14          3    3.3    1    1.8    0    0
15          5    5.4    0    0      0    0