Master’s Thesis
Ranking Blogs based on
Topic Consistency
by
Philipp Berger
Potsdam, September 2012
Supervisor
Prof. Dr. Christoph Meinel
Internet-Technologies and Systems Group
Disclaimer
I certify that the material contained in this master’s thesis is my own work
and does not contain significant portions of unreferenced or unacknowledged
material. I also warrant that the above statement applies to the implementation
of the project and all associated documentation.
Hiermit versichere ich, dass diese Arbeit selbständig verfasst wurde und dass
keine anderen Quellen und Hilfsmittel als die angegebenen benutzt wurden.
Diese Aussage trifft auch für alle Implementierungen und Dokumentationen
im Rahmen dieses Projektes zu.
Potsdam, September 27, 2012
(Philipp Berger)
Kurzfassung
Gängige Blog-Rankings wie PageRank, Technorati Authority und BI-Impact bevorzugen Blogs, die sich mit einer Vielzahl von Themen auseinandersetzen, da diese ein größeres Publikum und damit mehr Besucher, Links und Kommentare anziehen. Ein Beispiel dafür ist der Blog spreeblick.com, der sich mit Themen rund um Politik, Gesellschaft und IT beschäftigt.
Andererseits erreichen Nischenblogs, die sich auf ein Thema konzentrieren, nur wenig Einfluss. Nischenblogs sind Blogs wie telemedicus.info, der nur Artikel über Datenschutz und Urheberrecht veröffentlicht. Dadurch erhalten sie von heutigen Blog-Suchmaschinen nur eine niedrige Bewertung.
Diese Arbeit erörtert, dass die Konsistenz eines Blogs, d.h. wie konzentriert ein Autor ein Thema behandelt, ein Zeichen für Expertenwissen ist. Solche Blogs zu finden ist besonders wichtig für andere Experten, damit sie diesen folgen und in einen aktiven Diskurs treten können.
Um das Auffinden dieser Blogs zu erleichtern, d.h. sie von der Masse der vielseitig interessierten Blogs zu trennen, wird eine Metrik für Blogs vorgestellt, die auf der thematischen Konsistenz basiert. Das Konsistenz-Ranking setzt sich aus der (1) Intra-Post-, (2) Inter-Post-, (3) Intra-Blog- und (4) Inter-Blog-Konsistenz zusammen.
Die vorgestellte Metrik wird auf einem Datensatz von 12.000 gesammelten Blogs ausgewertet und so die Plausibilität dieses Ansatzes demonstriert.
Abstract
Current ranking algorithms, such as PageRank, Technorati Authority, and BI-Impact, favor blogs that report on a diversity of topics, since those attract a larger audience and thus more visitors, links, and comments. One example is the spreeblick.com blog, which offers articles on politics, society, and IT.
On the other hand, niche blogs with a very specific topic attract only a small audience and thus have only a small reach. Niche blogs are blogs like telemedicus.info, which publishes articles only on privacy and copyright. This results in a low ranking from today's blog retrieval systems.
This thesis argues that the consistency of a blog, i.e. how focused an author reports on a single topic, is a sign of expert knowledge. Finding these blogs is particularly important for other domain experts who want to identify blogs to follow and stay in active contact with. To ease the retrieval of expert blogs, i.e. to separate them from the mass of blogs that report on random topics, a metric for blogs based on topic consistency is introduced. The consistency ranking is based on four different aspects: (1) intra-post, (2) inter-post, (3) intra-blog, and (4) inter-blog consistency.
By evaluating the metric with a test data set of 12,000 crawled
blogs, the plausibility of this approach is demonstrated.
Contents

1 Introduction
2 Background
   2.1 Weblogs
   2.2 BlogIntelligence Framework
      2.2.1 Extraction
      2.2.2 Analysis
      2.2.3 Visualization
   2.3 Apache Nutch - the Crawling Framework
   2.4 SAP HANA - the Persistence Layer
   2.5 Clustering and Apache Mahout
3 Related Work
   3.1 General Rankings
   3.2 Blog-Specific Rankings
   3.3 Consistency-Related Rankings
4 Definition of the Topic Consistency Metric
   4.1 Consistency between Posts (Inter-Post)
   4.2 Internal Consistency of Posts (Intra-Post)
   4.3 Consistency between Posts and Classification (Intra-Blog)
   4.4 Consistency of Linking and Linked Blogs (Inter-Blog)
   4.5 Combined Topic Consistency Rank
5 Implementation of Topic Detection
   5.1 Prerequisites
   5.2 Clustering
6 Implementation of the Topic-Consistency Rank
   6.1 Intra-Post Consistency
   6.2 Inter-Post Consistency
   6.3 Intra-Blog Consistency
   6.4 Inter-Blog Consistency
   6.5 BI-Impact Score
7 Evaluation
   7.1 Experimental Setup
   7.2 Clustering
   7.3 Results of the Topic Consistency Sub Ranks
   7.4 Comparison of BI-Impact and Combined Topic Consistency Rank
8 Recommendations for Future Research
   8.1 Enhanced Topic Detection
   8.2 Visualization
   8.3 Full integration with SAP HANA
9 Conclusion
List of Abbreviations
List of Figures
List of Tables
Bibliography
"Blogging is ... to writing what extreme sports are to athletics: more free-form, more accident-prone, less formal, more alive. It is, in many ways, writing out loud."
- Andrew Sullivan, "Why I Blog", The Atlantic
1 Introduction
Weblogs, commonly called blogs, are one of the most popular "social media tools" of the World Wide Web (WWW) [1]. They are specialized, but easy-to-use, content management systems. Blogs focus on frequently updated content, social interaction, and interoperability with other Web authoring systems.
Blogs are part of the rise of social media, i.e. the shift of the internet toward more user participation and freedom of speech [2]. This is driven by their various application areas: beginning with personal diaries and holiday photo collections, extending to knowledge management, educational, scientific research, and corporate platforms, and reaching to forums for traditional journalists and the emerging concept of citizen journalists, who leaped to fame during the Arab Spring [3, 4, 5].
The actual power of blogs emerges through their common superstructure: each blog integrates itself into a huge think tank of millions of interconnected weblogs, called the blogosphere, which creates an enormous and ever-changing archive of open source intelligence [6].
Through the various application areas and the immense number of blogs, the diversity of discussed topics continuously increases. As shown in Fig. 1, the topics range from travel and news to politics and gaming.
Blog readers are not able to access all the information of the blogosphere because they are overwhelmed by the enormous number of blogs and the blogs' diversity. To handle this information overload, the research and application area of blog retrieval evolved [8]. As in traditional information retrieval (IR) and data mining, the target is to ease the understanding of the causal relations in the blogosphere and the retrieval of the blogs most relevant to the user's information need [9].
Facing this unique challenge, the BlogIntelligence (BI) project [10] was initiated with the objective to map, and ultimately reveal, the content-oriented, network-related structures of the blogosphere by using an intelligent crawler and tailor-made analyses for the blogosphere [10]. Besides normal search engine function-
Topics written about (share of blogs):
Personal blogs: 63.5%
Family of friend blogs: 38.9%
Music: 33.1%
News: 29.1%
Opinions on products and brands: 26.6%
Film/TV: 26.4%
Computers: 24.8%
Travel: 22.5%
Technology: 20.8%
Gaming: 18.2%
Sport: 16.7%
Science: 13.6%
Business: 13.5%
Business news: 12.1%
Celebrities: 9.8%
Other: 6.7%
Figure 1: Topics blogged about in 2008 [7].
alities, BlogIntelligence has to consider the specific characteristics of the blogosphere and the social interactions from other social networks, and it has to leverage content mining [11].
This thesis originates from the BlogIntelligence project and presents a ranking approach based on the topical consistency of blogs. This ranking aims to ease the retrieval of expert blogs, which are particularly important for users who want to identify blogs to follow and interact with.
The ranking of documents is a common technique in IR [12]. It aims to assess the relevance of documents to a specific user's information need. Current IR systems mainly calculate the ranking or authority of a blog based on its position in the web graph or social graph [13, 14]. Advanced ranking approaches also consider the up-to-dateness of the content and the level of readers' engagement [11]. In contrast to current approaches, the goal of this thesis is to establish topic consistency as the primary ranking factor.
Topical consistency is defined as the degree to which a blog author focuses on a specific set of topics [15]. If blog authors cover several topics, as in random-interest blogs or diaries, they have a low topic consistency and thus cannot create topical thrust. In contrast, a blog has the highest topic consistency if it continuously concentrates on one topic. It is argued that such a blog develops a sufficiently high expertise in this topic [16]. Thus, the content of this blog author is expected to be more relevant to an information need than the content of a topically versatile, yet influential, author. Analogous to frequently cited experts in the real world, blog readers are expected to be more likely to trust and interact with a blog author who has a high topic consistency.
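To make this notion concrete, one simple illustrative way to quantify how focused an author is uses the normalized entropy of the blog's topic distribution. This is only a sketch with invented post counts; it is not the metric defined in Section 4.

```python
import math

def topic_focus(topic_counts):
    """Illustrative focus score in [0, 1] based on normalized entropy:
    1.0 means every post covers the same topic. NOT the thesis' metric (Sec. 4)."""
    total = sum(topic_counts)
    probs = [c / total for c in topic_counts if c > 0]
    if len(probs) <= 1:
        return 1.0  # all posts on a single topic: maximal focus
    entropy = -sum(p * math.log(p) for p in probs)
    return 1.0 - entropy / math.log(len(probs))

print(topic_focus([30, 0, 0]))    # 1.0 (niche blog)
print(topic_focus([10, 10, 10]))  # ~0.0 (topics evenly spread)
```

A niche blog like telemedicus.info would score near 1, a diary covering many topics evenly near 0.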
To implement the ranking, it is integrated into the BlogIntelligence framework. BlogIntelligence essentially consists of three components. The data extraction component is the basis of the BI framework: it harvests the web, analyzes each web page, extracts blog-specific information, and stores the harvested data in the storage layer. The analysis component provides prototypical implementations for the detection of trending topics and the ranking of blogs. The third component, the visualization, communicates the analysis results to the user.
To implement the topic consistency rank, it is necessary to integrate a topic detection mechanism into the analysis layer and to calculate the actual ranking based on the detected topics and the crawled data. Further, this thesis introduces an extension of the visualization that communicates the topic consistency of a blog to the user.
In order to evaluate the plausibility of a topic consistency ranking, it is formally defined and prototypically implemented in the course of this thesis. Further, it is tested whether a correlation between the topical consistency of a blog and its influence is observable. This evaluation can draw on the BlogIntelligence data set, which currently consists of 12,000 blogs with over 600,000 posts.
The remainder of this thesis is structured as follows. Section 2 outlines the foundations of this thesis. It introduces the reader to the concept of weblogs, to the
layers of BlogIntelligence, and to the technique of data clustering. In Section 3,
related research concerning ranking approaches and topical content analyses is
described. Section 4 presents the formal definition of the topic consistency rank
and its sub ranks. Section 5 outlines the implementation of the underlying topic
detection mechanism. Section 6 describes the implementation of the topic consistency rank and its integration into the BlogIntelligence framework. Section 7
discusses the results and the plausibility of the topic consistency rank. Future
work is introduced in Section 8. Finally, Section 9 presents the conclusion of this
thesis.
2 Background
This Section presents the basic concept of weblogs. The analysis of weblogs is the main goal of the BlogIntelligence framework, which is the foundation of this work. Therefore, the layers of BlogIntelligence are introduced as well. Further, this Section gives an overview of the technologies that are utilized for the topic consistency rank calculation: Apache Nutch, Apache Mahout, and SAP HANA.
2.1 Weblogs
As discussed in Sec. 1, blogs are specialized content management systems (CMS) that enable authors to share content and open discussions.
Blog platforms, like Blogger1, WordPress2, and TypePad3, provide a unified structure for the published content. This structure reflects the requirements of a frequently updated and socially active medium.
Weblog is a made-up word composed of the terms web and log [17]. The entries
of this log are called posts.
Posts are usually displayed in reverse chronological order with the most recent
entry first. These posts can contain texts, images, and videos to express the
author’s opinion. They are the counterpart to traditional newspaper articles.
Each post can be referenced via a URI (Uniform Resource Identifier) in the World Wide Web (WWW). A special kind of URI is the permalink: the durable address of a blog post, which is guaranteed to be reachable and unique during the lifetime of a blog.
In addition, a blog author can categorize his posts based on two classification
mechanisms: categories and tags.
Categories offer a hierarchical structure for classifying a blog's contents, similar to traditional libraries. They are frequently used to emphasize distinct discussion streams within a blog.

1 http://www.blogger.com/
2 http://wordpress.com/
3 http://www.typepad.com/
In contrast, tags are unordered keywords attached to a post and do not offer a
hierarchy. They summarize the content of a post. Readers use tags to navigate a
blog and to find posts related to a very specialized concept [18]. One prominent
application is the tag cloud (see Fig. 2, generated by Wordle4 ).
Figure 2: A generated tag cloud.
A tag cloud visualizes all tags (keywords) of all posts of a blog and gives an impression of the most discussed topics. It has become a popular method to support the navigation and retrieval of posts [19].
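The term frequencies behind such a cloud can be computed directly from the posts' tag lists; a minimal sketch with invented tags:

```python
from collections import Counter

# Tag lists of three posts (invented values).
post_tags = [
    ["privacy", "copyright"],
    ["privacy", "law"],
    ["copyright", "privacy"],
]

# A tag cloud scales each keyword by its frequency across all posts.
frequencies = Counter(tag for tags in post_tags for tag in tags)
print(frequencies.most_common(2))  # [('privacy', 3), ('copyright', 2)]
```

A renderer like Wordle then maps these counts to font sizes.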
The social component of a blog is the readers' ability to write comments [20]. Comments enable blog readers to open an active discussion attached to a post, to communicate their opinions, or to offer help. This enables the users of a blog to communicate in a highly interactive way [21]. Nevertheless, blog comments are manually moderated by the blog author, because the author is responsible for the content published on his blog. The blog author also wants to control the discussion and avoid inappropriate comments.
Further, blogs have special technical features that simplify harvesting and analyzing their posts and comments.
4 http://www.wordle.net/

The most prominent technical feature of blogs is the publishing format feed [22]. Feeds present the content of a blog in standardized, XML-based formats (namely RSS and Atom). A feed is an integrated part of the blog system and is always up-to-date with the blog's content. These feeds ease the machine readability of blogs. They contain all relevant information of a post, such as the publishing date, the author, categories, tags, the title, and a short description.
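The machine readability of feeds can be illustrated with the Python standard library; the RSS snippet below is invented, not taken from a real blog:

```python
import xml.etree.ElementTree as ET

# A minimal illustrative RSS 2.0 feed (invented content).
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <pubDate>Thu, 27 Sep 2012 10:00:00 +0000</pubDate>
      <category>privacy</category>
      <category>copyright</category>
      <description>A short teaser of the post.</description>
    </item>
  </channel>
</rss>"""

def parse_feed(xml_text):
    """Extract the post metadata a crawler typically needs from a feed."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title"),
            "published": item.findtext("pubDate"),
            "categories": [c.text for c in item.findall("category")],
            "description": item.findtext("description"),
        })
    return posts

posts = parse_feed(RSS)
print(posts[0]["title"])       # First post
print(posts[0]["categories"])  # ['privacy', 'copyright']
```

All the fields mentioned above (date, categories, title, description) are available without scraping the blog's HTML.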
Thus, a new kind of application has developed, called aggregators. An aggregator requests a user-selected set of feeds and displays their content to the user in a unified, enriched, and compact view. This way, users no longer request a blog directly. Instead, they are provided with the content of their favorite blogs and do not have to actively retrieve it from the WWW.
Concerning the social interaction of blogs, important technical features are
blogrolls and linkbacks [10].
A blogroll is noticeably placed on a blog's starting page and contains links to other blogs. These blogs are considered followed or friend blogs. Thus, blogrolls form close communities based on mutual linking.
Linkbacks are methods a blog author can use to get notified when other authors
link to his posts. This enables authors of different blogs to bidirectionally link
their discussions.
There are three kinds of linkbacks: refback, trackback, and pingback. A refback is not part of the blogging system; instead, it is part of the HTTP protocol and of today's browsers. A refback occurs when a blog reader follows a link and the receiving blog recognizes the HTTP referrer value of the reader's browser. In contrast, trackback and pingback are automated mechanisms of blog systems based on HTTP POST and XML-RPC. The moment a blog author A references another blog author B, A's blog system sends a notification to the server of B. The server stores this message, which contains all relevant meta information, such as the referencing post's URI and title. Thus, B can display this back reference under his post to lead blog readers to further discussions.
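The pingback specification defines a single XML-RPC method, `pingback.ping(sourceURI, targetURI)`. The sketch below builds the corresponding request body with Python's standard library; the URIs and the endpoint URL are hypothetical:

```python
import xmlrpc.client

# Illustrative URIs: blog author A's post links to blog author B's post.
source = "http://blog-a.example/2012/09/my-reply"
target = "http://blog-b.example/2012/09/original-post"

# dumps() serializes the XML-RPC call body defined by the pingback spec.
body = xmlrpc.client.dumps((source, target), methodname="pingback.ping")
print(body)

# Actually sending it (endpoint URL is hypothetical) would look like:
# proxy = xmlrpc.client.ServerProxy("http://blog-b.example/xmlrpc")
# proxy.pingback.ping(source, target)
```

B's server would store this notification and display the back reference under the referenced post.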
2.2 BlogIntelligence Framework
To exploit the unique features of blogs, the BlogIntelligence project was initiated [10]. This project shows from which perspectives the entirety of weblogs can be analyzed and visualized in order to extract valuable aggregated information.
The visualizations and insights are composed into a web portal5 .
To generate the data for this web portal, BI provides a framework consisting of
three layers: extraction, analysis, and visualization. An illustration of the complete
architecture is shown in Fig. 3.
2.2.1 Extraction
The extraction layer consists of a web harvesting application called crawler.
Web crawlers are computer programs that browse the web in an automatic, methodical manner [23]. They are mainly used to store a copy of each visited page.
Search engines and other services analyze and index these pages to provide fast
search interfaces. The crawler starts with a fixed set of URIs. After visiting and
copying the first pages, the crawler extracts all hyperlinks and continues by visiting the linked pages.
The BI crawler is a tailor-made adaption of Apache Nutch for the special requirements of the blogosphere (see Sec. 2.3). Similar to common crawlers, the BI
crawler traverses the link graph of the web to harvest web pages.
Two parts of the crawler are adapted to the special needs of harvesting blogs. The first part is the URI selection. It is responsible for selecting the next set of URIs to crawl from the queued URIs in the joblist (see Fig. 3). It distinguishes between the special types of links present in blogs; these types reflect the position of a link in a blog. Thereby, the crawler prefers links from blogrolls, posts, and comments, as well as links explicitly marked as feeds.
The second adapted part is the post-processor. This part is responsible for extracting metadata from the downloaded page and attaching it to the persistent data ob-

5 http://www.blog-intelligence.com/
8
[Figure content: the extraction layer (crawlers parsing blogs, news portals, blogrolls, Twitter accounts, and RSS feeds from the WWW, extracting title, author, content, timestamp, category, and links), the analysis layer (network and content data analyzers for trends, communities, "What's up", ranking, personalized search, and information spreading), and the visualization layer (web interface for communities, ranking, information spreading, "What's up?", and trends).]
Figure 3: The BlogIntelligence architecture [10].
ject. The default extraction includes language detection, text content extraction,
meta tag extraction and link extraction. In addition, the BI crawler creates post
and comment objects. Content, description, author, publishing date, language,
tags, and categories of a post are extracted.
The post-processor recognizes the specific blog system, like Blogger6, based on hints in the HTML structure. This way, platform-specific information like trackbacks is extracted as well.
Further, it analyzes the position of links in the content structure of a blog. After
the completion of the post-processor, the crawler stores the enriched web page
into the persistence layer.
2.2.2 Analysis
The second layer of the framework, the analysis layer, runs while the crawler continuously collects new information. The analysis consists of multiple loosely
coupled modules. Each module performs a specific algorithm that delivers data
necessary for the third layer of the framework, the visualization layer. The current BI prototype includes a ranking, a clustering, and a dimension reduction
algorithm.
The ranking algorithm is described by Bross et al. [11]. The authors define a
complex metric called BI-Impact score. The BI-Impact score combines multiple
quality metrics of blogs to one score (see Sec. 3.2).
The current implementation of the prototype runs as an overnight batch job to calculate a new ranking. It only considers the specific link types from blogs, blogrolls, posts, or comments. The analysis elements (see Fig. 3), like trend detection, ranking, and community recognition, are prototypically implemented. The
details of the ranking are discussed in Sec. 3.2.
The integration of the topic consistency rank calculation in this analysis layer is
described in Sec. 6.
6 http://www.blogger.com
2.2.3 Visualization
The visualization layer is based on the results of the data analyses. It allows users to browse the preprocessed information of the data analyzers in an unrestricted, personalized, and intuitive way. The visualization layer consists of three visualizations that give different insights at different abstraction levels (see Fig. 3).
The first visualization is directly integrated into the web portal. It basically
shows the frequently discussed topics (What’s up), and trending terms (Trends)
of the blogosphere.
Figure 4: Screenshot of the BlogConnect visualization [24].
The second visualization, called BlogConnect, is an interactive tool to explore and browse through the network of blogs. Essentially, it displays all blogs as bubbles on a 2D canvas (see Fig. 4) [24]. The position of a blog reflects its topical identity. The size of a blog indicates its position in the ranking. Thus, users can orient themselves in the network and find the most relevant blogs of a topic area, called a community.
The third visualization, called PostConnect [25], serves as a visualization for blog
archives. As shown in Fig. 5, it arranges all posts of a blog in a circle. By activat-
Figure 5: Screenshot of the PostConnect visualization [25].
ing a post, each topically linked post of a blog archive gets highlighted. Hereby,
a post is topically related if it uses the same categories or tags as the activated
post. PostConnect helps users to explore the topical nature of a blog and identify
highly related subsets of posts.
2.3 Apache Nutch - the Crawling Framework
As described by Berger et al. [26], the underlying framework of the BlogIntelligence extraction layer is the open-source web search engine Apache Nutch7 [27].
Apache Nutch provides a transparent alternative to private global scale search
services. It comes with an easily extensible and scalable crawler component.
Following the MapReduce paradigm [28], Apache Nutch defines four phases for crawling that are executed iteratively: generator, fetcher, parser, and updater [29]. The generator job selects the next URIs to fetch. The fetcher job asynchronously downloads the selected pages. Afterwards, the parser job extracts metadata, links, and the actual text content. In addition, the framework offers an extension point for inserting new parsing algorithms; this functionality is used by the topic detection implementation to integrate the term extraction module (see Sec. 5). Finally, the updater job inserts new links and calculates scores for the parsed web pages. These scores are used to select the next URIs to crawl. Each job works on a large number of pages in parallel.

7 http://nutch.apache.org/
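The iterative interplay of the four jobs can be sketched as a single-process loop. This is a simplified stand-in for the real distributed MapReduce jobs, and the tiny "web" below is invented:

```python
# Single-process stand-in for Nutch's generate/fetch/parse/update cycle.
# The real jobs are distributed MapReduce tasks; this "web" is invented.
WEB = {
    "http://seed.example/": ["http://seed.example/post1"],
    "http://seed.example/post1": [],
}

def crawl(seeds, rounds=2):
    crawldb = {uri: "unfetched" for uri in seeds}
    for _ in range(rounds):
        batch = [u for u, s in crawldb.items() if s == "unfetched"]  # generator
        for uri in batch:
            outlinks = WEB.get(uri, [])                              # fetcher + parser
            crawldb[uri] = "fetched"
            for link in outlinks:                                    # updater
                crawldb.setdefault(link, "unfetched")
    return crawldb

crawldb = crawl(["http://seed.example/"])
print(crawldb)
```

Each round corresponds to one generate-fetch-parse-update iteration of the Nutch crawl cycle.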
Apache Nutch is a MapReduce application dedicated to scale-out scenarios, i.e. it runs on a large number of small machines. For example, researchers at Google [28] report using a massive cluster of small machines to crawl the web. Nevertheless, even in scale-up scenarios, i.e. execution on one big machine, MapReduce applications running as scale-out-in-a-box perform more effectively than pure scale-up approaches [30]. This enables us to run the crawler on a large cluster of small machines as well as on a large shared-memory server.
In this context, the Hasso Plattner Institute offers a testing platform, the Future SOC Lab8, which provides researchers access to the latest multi-/many-core hardware. Thus, the crawler implementation currently runs in a scale-up scenario.
2.4 SAP HANA - the Persistence Layer
The persistence layer of the BlogIntelligence framework has a high impact on the performance of the extraction and analysis layers [26]. In addition, the overall target of BI is to provide real-time analytics for the whole blogosphere. Therefore, three different database technologies compete: a row-oriented, disk-based database, a distributed file system, and a column-oriented, in-memory database.
The evaluation considers a traditional row-oriented, disk-based database, namely PostgreSQL9. This database makes the data discoverable and easy to query by offering a SQL query API, which eases the implementation of the analysis layer. However, the query performance of PostgreSQL decreases massively with growing data volumes during the extraction phase [26].
8 http://www.hpi.uni-potsdam.de/forschung/future_soc_lab.html
9 http://www.postgresql.org/
An alternative is the distributed file system HDFS10, the original persistence API of Apache Nutch. HDFS is able to handle and process huge amounts of data using commodity hardware [31], but it does not provide a query API like SQL. Further, HDFS is not able to take full advantage of today's high-end hardware with massive amounts of memory because it requests only minimal hardware resources [32].
Since costs for main memory are decreasing and access to data in main memory is extremely fast, it makes sense to store all data primarily in main memory. Thus, an in-memory database, namely SAP HANA11, is tested. Although SAP HANA targets enterprise applications, the majority of its analysis algorithms also apply to social media. Because of the effective usage of main memory, the versatile analysis capabilities, and the SQL API, the extraction component was adapted to store all collected data in SAP HANA [33].
To integrate the extraction component with the in-memory database, the persistence layer of Apache Nutch is replaced. The Apache Gora12 framework already offers an object-relational mapper (ORM) for traditional SQL databases like PostgreSQL. Because HANA uses a special SQL dialect, this ORM is adapted to also support the SAP HANA database. Hence, the complete extraction component is currently integrated with SAP HANA.
Due to the tight coupling of the persistence layer and the analysis layer, this change implies adapting the whole analysis layer. Thereby, most of the algorithms have to be modified for direct integration into SAP HANA. HANA offers various kinds of programming interfaces to run analytics directly in memory without transferring the data to the application layer.
Besides saving transfer time, the main advantage of the database is its dictionary-encoded, column-oriented in-memory computing, which outruns file-based database solutions [33]. The dictionary encoding saves space and access time for highly redundant tables, like the link or dictionary tables used for the analysis (see Sec. 6). Further, the column orientation performs best on tables with a large number of columns of which only a few are queried. This applies to the main table of BlogIntelligence, called the web page table, which essentially stores all information of a web page, like content, date, author, and many more, in one table.

10 http://hadoop.apache.org/
11 http://www.sap.com/HANA/
12 http://gora.apache.org/
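The idea behind dictionary encoding can be sketched in a few lines. The column values are invented; HANA's actual implementation is of course far more elaborate:

```python
def dictionary_encode(column):
    """Split a column into a sorted dictionary of distinct values and a list of
    small integer value IDs: the basic per-column compression scheme."""
    dictionary = sorted(set(column))
    ids = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [ids[v] for v in column]

# A highly redundant "author" column, as in the web page table (values invented).
authors = ["berger", "meinel", "berger", "berger", "meinel"]
dictionary, encoded = dictionary_encode(authors)
print(dictionary)  # ['berger', 'meinel']
print(encoded)     # [0, 1, 0, 0, 1]
```

The more redundant the column, the smaller the dictionary relative to the value-ID list, which is why link and dictionary tables compress so well.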
However, HANA is still under development, and the transfer of the analysis algorithms is out of scope for this work. As a consequence, the clustering needed by the topic consistency rank is outsourced, as described in the following Section.
2.5 Clustering and Apache Mahout
One major foundation of the algorithms in the analysis layer is clustering. This
also applies for the topic detection mechanism needed for the topic consistency
rank (see Sec. 5).
Clustering is an unsupervised classification technique that partitions data items into groups, called clusters. These clusters contain data items that are similar in meaning. Besides density-based clusterings, the clusterings frequently used in data mining are distance-based [34, 35].
Essentially, a distance-based clustering works as follows. Each data item has a number of numerical features. The feature vector of a data item is the combination of these features; one can think of the feature vector as a position in an n-dimensional space. The clustering defines a distance metric on the feature vectors. The task of the clustering is to group together feature vectors that have a low distance to each other. Thus, all data items that have similar numerical features are grouped together.
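The process above can be sketched as a minimal k-means, the classic distance-based clustering; the 2D feature vectors and the initial centroids are invented for illustration:

```python
import math

def kmeans(points, centroids, iterations=10):
    """Minimal distance-based clustering (k-means) on feature vectors."""
    for _ in range(iterations):
        # Assignment step: put each point into the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Two obvious groups in a 2D feature space (word-frequency-like values, invented).
points = [(0.1, 0.2), (0.0, 0.1), (5.0, 5.1), (4.9, 5.0)]
clusters = kmeans(points, centroids=[(0.0, 0.0), (5.0, 5.0)])
print(clusters)  # [[(0.1, 0.2), (0.0, 0.1)], [(5.0, 5.1), (4.9, 5.0)]]
```

In BlogIntelligence the feature vectors are word frequencies per blog, so nearby points correspond to topically similar blogs.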
The current clustering of BlogIntelligence groups blogs based on the word occurrences in the blogs. Blogs are in one cluster if they contain similar words with similar frequencies and are thus regarded as topically similar. These clusters are visualized in the BlogConnect visualization (see Fig. 4).
The prototypical analysis layer of the BI framework consists of Java implementations for the clustering and other analysis techniques. The logical next step is
to integrate it with a well-established framework for information analysis.
Such an established framework is Apache Mahout13 [36]. Similar to Apache
Nutch, it is based on a MapReduce framework.
It provides various algorithms for clustering, classification, and collaborative
filtering. In order to provide maximal distribution during the execution, these
algorithms are customized for the MapReduce framework.
Mahout is primarily built for batch analyses that are able to handle big data.
This data has to be present on the distributed file system of Hadoop. Hence, the
complete data has to be loaded from the persistence layer.
However, the long-term target is to integrate all needed clustering and classification algorithms directly into the persistence layer to avoid high transfer costs.
Although first clustering algorithms for HANA are under development, their integration is out of the scope of this thesis.
13 http://mahout.apache.org/
3 Related Work
The related work can be divided into three categories of ranking approaches. The first category consists of general rankings that assess web pages and other documents. The second category includes blog-specific rankings that specialize in blogs and other social media channels. The last category comprises consistency-related rankings that incorporate the topic consistency of a document or blog into the ranking.
3.1 General Rankings
PageRank is one of the most frequently used algorithms, e.g. by Google [37],
for ranking traditional web pages based on the web link graph. It was introduced by Page et al. [13] and is based on the random surfer model. A web
page’s PageRank is defined as the probability of a random surfer visiting this
web page. The random surfer traverses the web by choosing repeatedly between
two options: clicking on a random link on the current page or randomly jumping
to another web page.
The second option is necessary to make sure the random surfer also visits pages
that have no incoming links and to make sure that it is possible to escape from
pages that have no outgoing links. The calculation of the PageRank algorithm is
shown in the following equation.
PR(p_i) = (1 − d)/N + d · ∑_{p_j ∈ M(p_i)} PR(p_j) / L(p_j)
The probability of clicking on a random link is determined by the damping factor d. p_j ∈ M(p_i) holds if p_j has a link to p_i. L(p_j) gives the number of outgoing links of p_j, and PR(p_j) is the previous PageRank of p_j. The PageRank algorithm is iterative and converges after a certain number of iterations depending on the implementation used.
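To make the iteration concrete, the following minimal Java sketch computes PageRank on a small link graph. The adjacency representation, the handling of dangling pages, and the fixed iteration count are simplifying assumptions for illustration; this is not the implementation used by Google or BlogIntelligence.

```java
// Minimal PageRank sketch (illustrative only).
// links[j] lists the pages that page j links to.
public class PageRankSketch {
    public static double[] pageRank(int[][] links, double d, int iterations) {
        int n = links.length;
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0 / n);             // uniform start distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1.0 - d) / n); // random-jump term (1 - d)/N
            for (int j = 0; j < n; j++) {
                if (links[j].length == 0) continue;     // dangling pages ignored in this sketch
                double share = d * pr[j] / links[j].length; // d * PR(p_j)/L(p_j)
                for (int i : links[j]) next[i] += share;
            }
            pr = next;
        }
        return pr;
    }
}
```

For a symmetric graph such as a three-page cycle, the scores converge to the uniform distribution, which matches the random surfer intuition.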
A very similar algorithm to PageRank is TrustRank [38]. In contrast to PageRank,
TrustRank is initialized with a fixed set of trustworthy or untrustworthy web
pages. The trust propagates through the web graph analogously to the PageRank algorithm.
Another approach is the Hyperlink-Induced Topic Search (HITS) algorithm by
Kleinberg [39]. It is based on the concept of hubs and authorities. In the traditional view of the web, hubs are link directories and archives that only refer to
information authorities, which actually offer valuable information. The HITS
algorithm operates on a subgraph of the web that is related to a specific input
query. Each page gets an authority score and a hub score. The authority score is
increased based on the hub score of linking web pages and vice versa.
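The mutual reinforcement of hubs and authorities can be sketched in a few lines, assuming the query-specific subgraph has already been extracted. The normalization scheme and the fixed iteration count are illustrative choices.

```java
// Sketch of the HITS hub/authority iteration (illustrative only).
public class HitsSketch {
    // links[j] lists pages that page j links to; returns {hub, authority} scores.
    public static double[][] hits(int[][] links, int iterations) {
        int n = links.length;
        double[] hub = new double[n], auth = new double[n];
        java.util.Arrays.fill(hub, 1.0);
        for (int it = 0; it < iterations; it++) {
            java.util.Arrays.fill(auth, 0.0);
            for (int j = 0; j < n; j++)
                for (int i : links[j]) auth[i] += hub[j]; // authority grows with linking hubs
            java.util.Arrays.fill(hub, 0.0);
            for (int j = 0; j < n; j++)
                for (int i : links[j]) hub[j] += auth[i]; // hub grows with linked authorities
            normalize(auth);
            normalize(hub);
        }
        return new double[][] { hub, auth };
    }
    private static void normalize(double[] v) {
        double s = 0;
        for (double x : v) s += x * x;
        s = Math.sqrt(s);
        if (s > 0) for (int i = 0; i < v.length; i++) v[i] /= s;
    }
}
```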
These traditional ranking algorithms are all based on the web link graph. However, traditional web pages show a different linking behavior than blogs. Blogs
offer different types of links, e.g. trackbacks or blogroll links, with different semantics (see Sec. 2.1). Furthermore, the blog link graph tends to be rather sparse
in comparison to the overall web [40].
3.2 Blog-Specific Rankings
To address the special characteristics of blogs, blog ranking engines and current
research introduce tailor-made ranking algorithms for the blogosphere [11].
The most popular platforms ranking the blogosphere are Technorati14 and
Spinn3r15 [11]. Other services like BlogPulse, PostRank, or BlogScoop went offline
during the last year and were integrated into commercial products. Thus, the free
services of Technorati and Spinn3r are described.
Technorati established the authority score as their unique ranking. It is calculated
based on a blog’s linking behavior, categorization and other associated data over
a small period of time [41]. Furthermore, Technorati calculates its authority score
also for topical segments of the blogosphere to identify topic-specific opinion
leaders.
Although Spinn3r is well known for its crawling service, it also provides a simple
14 http://technorati.com/
15 http://spinn3r.com/
PageRank and a Social Media Rank. The Social Media Rank is an adaption of the
TrustRank algorithm. It incorporates social networks as incoming link providers
and uses a fixed number of initially trusted users to prevent spam.
Besides these platform-specific rankings, current research also discusses blog-specific ranking approaches.
A ranking score, called BlogRank, is introduced by Kritikopoulos et al. [42]. It is
a modified version of the PageRank algorithm. The BlogRank score is based on
the link graph and different similarity characteristics of weblogs. The authors
create an enriched graph of inter-connected weblogs with additional edges and
weights representing the specific features of blogs. Mainly, these features are
shared authorship and topics. For example, the authors create a pseudo link between two posts that share the same topic, identified by category annotations.
Bross et al. [11] propose the BlogIntelligence-Impact-Score (BI-Impact) ranking, a
more complete approach to successfully rank blogs. Their definition is the basis
for the currently implemented scoring algorithm in the BlogIntelligence framework.
Figure 6: Ranking variables of the BI-Impact score [11].
Similar to the above-mentioned rankings, they assign specific weightings to the special link types of the blogosphere. In contrast to BlogRank, their algorithm does
not create new links between blogs. It rather weights the different interaction
types of blog authors like links to comments, posts, and to the start page of a
blog. Like Spinn3r, they also consider links from outside the blogosphere such
as from Twitter16 and news portals.
All used ranking variables are shown in Fig. 6. They distinguish between a post
and a blog ranking.
The post ranking incorporates the different kinds of links between posts like
linkbacks, tweets, and normal links. Further, the content of a post gets rated.
In contrast to consistency-related rankings, the authors do not incorporate the
topics of a post. Instead, the authors focus on the detection of spam keywords
and trend keywords. Trend keywords are terms extracted by a hot topic analyzer,
which is also part of the BI framework.
The blog ranking combines the ranking of all posts with blog-specific characteristics. Among others, these are the publishing frequency and the blogroll links
of a blog.
All these variables are combined into one score for a blog and propagate through
a PageRank-like algorithm to all linked blogs.
The work presented in this thesis introduces a new score that complements the
BI-Impact score to foster the retrieval of topically consistent blogs for hot topics. Thereby, users of the BI framework are able to find niche blogs that discuss
trending and interesting topics.
3.3 Consistency-Related Rankings
Consistency-related rankings are blog rankings that incorporate the topical consistency of a blog. This topical consistency adds to other factors to form one rank
for each blog.
A trend detection system, called Social Media Miner, is presented by Schirru
et al. [43]. This system extracts topics and the corresponding, most relevant
posts. The topics are detected using a clustering on word importance vectors
(see Sec. 2.5).
16 http://twitter.com/
Their approach is rather simple and does not directly reflect consistency. They
cluster topics for a given period, find relevant terms (or labels), and visualize the
term mentions over time as a trend graph. Nevertheless, posts that consistently
handle a specific topic have a constant term frequency of topic terms. Thus,
topically consistent blogs get a good trend graph, at least for trending topics.
Sriphaew et al. [44] discuss how to find blogs that have great content and are worth exploring. They show how to identify these blogs, called cool blogs,
based on three assumptions: cool blogs tend to have definite topics, enough posts,
and a certain level of consistency among their posts. The level of consistency,
called topical consistency, tries to measure whether a blog author focuses on a
solid interest. Thus, it favors blogs with stable topics like reviews on mobile
devices. The authors measure the consistency based on the similarity of topic
probabilities of preceding posts.
Eleven indicators of credibility to improve the effectiveness of topical blog retrieval are introduced by Weerkamp et al. [15]. Besides some syntactic indicators, they also present the timeliness of posts and the consistency of blogs. The timeliness of a post is defined as the temporal distance of a blog post to a news
portal post of the same topic. Their topical consistency represents the blog’s
topical fluctuation. The authors define the consistency as a tf*idf-like score over
all terms of a blog. Although this measure favors blogs that frequently use rare
terms, it does not reflect when a blog author changes the topic from one post to
another. In contrast to other related research, the authors do not use the natural
ordering of posts. Nevertheless, the authors show that their indicators improve
the topical blog retrieval significantly.
The detection of spam blogs (splogs) is a frequently discussed topic in ongoing
research [45, 46, 47]. However, Liuwei et al. [48] describe a spam blog filtering technique that also incorporates the writing consistency of a blog author.
Similar to Weerkamp et al., the consistency on topic level is defined as the average topical similarity of posts. Each post gets compared with its preceding
post. The topical similarity is defined as the distance of the posts’ tf*idf word
vectors. Thereby, blogs with an extremely high topical consistency are expected
to be auto-generated. They integrate their topic consistency into a blog filtering
system.
Another approach for ranking blogs is introduced by Jiyin He et al. [49]. They
define a coherence score to measure the topical consistency of a blog. The authors define a consistent blog as a blog that contains many coherent posts. A
post is coherent to another post if both posts are in the same cluster of the whole
collection. The authors integrate the coherence score into a blog ranking for
boosting the topically relevant and topically consistent blogs.
Chen et al. [50] present a blog-specific filtering system that measures topic concentration and variation. They assess the quality of blogs via two main aspects:
content depth and breadth. In essence, the authors present a score that contains
five criteria. Each criterion is based on an external topic model derived from
Wikipedia17 articles. For example, the completeness of a blog is defined as the
ratio of words used in a blog in comparison to all words assigned to a topic. Further, the authors define the topical consistency of a blog as the mean distance of
used topics in a post. A blog is consistent if it only handles closely related topics. The ordering of posts, which can indicate a topic shift of the author, is not
considered.
In contrast to related work, the topic consistency rank presented in this thesis calculates the consistency of a blog based on multiple aspects. Thereby, it
measures the topical consistency at four different granularities and thus offers a
differentiated view on the blog's consistency.
Further, during the calculation of the score, topics are not considered as probability distributions over words. Instead, a topic is defined as a fixed set of words derived from a prior word clustering, an approach also used by Sriphaew et al. [44].
17 http://www.wikipedia.org/
4 Definition of the Topic Consistency Metric
To evaluate the topical consistency of a blog author, four different facets of consistency are defined.
First, the consistency between posts defines the inter-post consistency. It investigates whether the contents of the latest posts discuss closely related topics.
Next, the internal consistency of a post, called intra-post consistency, is a measure that considers to which extent all paragraphs of a post discuss a similar topic. In contrast to the inter-post consistency, the intra-blog consistency compares the topic space created by each post with the topic space created by the tags and categories of this post. Therefore, it is a measure for the quality of the blog's classification system. The inter-blog consistency measures whether a blog is part of a domain expert community. Hereby, the rank of a blog is increased if blogs handling a similar topic link to it. In addition, a blog is boosted if it links to topically related blogs.
Finally, all four facets get combined into the topic consistency rank.
4.1 Consistency between Posts (Inter-Post)
As a first step, the inter-post consistency is formally defined. The inter-post consistency compares the topical distance of succeeding posts. Each post is represented as a topic vector. Each component of this topic vector gives the probability of the post talking about one topic. The sum of all vector components is one, as usual for a probability distribution.
Fig. 7 shows the assignment of ten example posts to ten topics. Each column
symbolizes a topic vector of a post. The size of a bubble indicates the probability
of a post p to be in topic t.
The transient nature of the blogosphere motivates us to only consider the latest posts, i.e. those that lie outside the outdated post area. There are two approaches to define outdated posts: excluding all posts that exceed a specific time span, or including only a specific number of the latest posts. The latter solution punishes blogs that frequently publish new content by shrinking the observed time window to
Figure 7: Visualization of post-topic-probabilities. (Bubble chart over post number, ordered by time, and topic ID; the size of a bubble indicates the topic probability of a post; the outdated post area and examples of high and low distances between topic vectors are marked.)
a day’s work. The time span variant is beneficial for small blogs because only
a small part of the content is considered. However, the time span variant is
applied because it is assumed that it fits the user’s perception.
Sriphaew et al. [44] calculate the average difference of the topic vectors of posts to the blog's topic centroid. This favors blogs with a central interest, but does not consider the change of a blog's topic over time. As shown in Fig. 7, blogs can have low distances and high distances between posts. Thus, the average difference of the topic vectors of two successive posts serves as an indicator for topic consistency.
In the following, the formal definition of the inter-post consistency is shown.
Before defining the metric, the sets and functions used for the calculation have
to be defined. The set Blog contains all blogs of the used data set. Post is a set that contains all posts. The set Post_b with b ∈ Blog contains all posts of blog b. The function publishedDate(p) with p ∈ Post returns the publishing time and date of a post. The set LatestPosts_{b,d} with b ∈ Blog and d ∈ Date being a
point in time is defined in Eq. 1.
LatestPosts_{b,d} = { p ∈ Post_b | publishedDate(p) ≥ d }    (1)
Term is the set of all terms. The set Topic contains all topics discussed in the
considered subset of the blogosphere. Similar to Eguchi et al. [51], the set TT_tp ⊂ Term is defined as the set of all terms of a topic tp ∈ Topic. All TT_tp are pairwise disjoint.
∀ tp, j ∈ Topic : tp ≠ j ⇒ TT_tp ∩ TT_j = ∅    (2)
PT_p ⊂ Term is the set of all used terms of a post p ∈ Post. The function Prob(p, tp) with p ∈ Post and tp ∈ Topic gives the probability of the post p being about the topic tp.
Prob(p, tp) = ( ∑_{t ∈ TT_tp ∩ PT_p} tf*idf(t, p) ) / ( ∑_{t ∈ PT_p} tf*idf(t, p) )    (3)
Salton et al. [52] give an overview of the components of the tf*idf function and its variants. Essentially, it is the product of a term frequency component tf and a collection frequency component idf.
tf*idf(t, p) = tf(t, p) × idf(t, Post)    (4)
tf is the raw term frequency (the number of times a term occurs in a post). idf is the inverse document frequency. Post_t with t ∈ Term is the set of all posts in which a term is contained.
idf(t, Post) = log ( |Post| / |Post_t| )    (5)
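Equations 4 and 5 can be sketched directly in a few lines of Java; the class and method names below are illustrative and not part of the BI framework.

```java
// tf*idf following Eq. 4 and Eq. 5: raw term frequency times log inverse
// document frequency (illustrative sketch).
import java.util.List;

public class TfIdfSketch {
    // tf(t, p): number of times term t occurs in post p (a post is a list of word stems).
    public static int tf(String t, List<String> post) {
        int c = 0;
        for (String w : post) if (w.equals(t)) c++;
        return c;
    }
    // idf(t, Post) = log(|Post| / |Post_t|)
    public static double idf(String t, List<List<String>> posts) {
        long containing = posts.stream().filter(p -> p.contains(t)).count();
        return Math.log((double) posts.size() / containing);
    }
    public static double tfIdf(String t, List<String> post, List<List<String>> posts) {
        return tf(t, post) * idf(t, posts);
    }
}
```

Note that a term occurring in every post receives an idf of log(1) = 0 and thus a tf*idf weight of zero, which is exactly the intended suppression of ubiquitous words.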
The function topicalDistance(p_i, p_j) with p_i, p_j ∈ Post is defined as the Euclidean distance between the topic vectors of both posts (see Eq. 6). The Euclidean distance is a frequently used distance metric and has proven to apply well for text vector comparison [44].
topicalDistance(p_i, p_j) = √( ∑_{tp ∈ Topic} (Prob(p_i, tp) − Prob(p_j, tp))² )    (6)
The function predecessor(p) ∈ Post returns the direct predecessor of p ∈ Post. Given these definitions, the inter-post distance is formalized as shown in Eq. 7 with b ∈ Blog and d ∈ Date.
interPostDistance(b, d) = ( ∑_{p ∈ LatestPosts_{b,d}} topicalDistance(p, predecessor(p)) ) / |LatestPosts_{b,d}|    (7)
interPostDistance(b, d) is the average topical distance of two succeeding posts among the latest posts of a blog. It returns high values for very inconsistent blogs and low values for very consistent blogs. To give consistent blogs a high inter-post consistency score, the consistency is defined as the inverse of interPostDistance(b, d), as shown in Eq. 8.
interPostConsistency(b, d) = 1 / interPostDistance(b, d)    (8)
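Assuming the topic vectors of Eq. 3 are already computed, the chain of Eq. 6 to Eq. 8 can be sketched as follows. Dividing by the number of post pairs instead of |LatestPosts_{b,d}| is a simplification of this sketch, as is representing the latest posts as a plain array in publishing order.

```java
// Inter-post consistency sketch following Eq. 6-8. posts[i] is the topic vector
// of the i-th latest post in publishing order (assumed precomputed via Eq. 3).
public class InterPostSketch {
    // Euclidean distance between two topic vectors (Eq. 6).
    static double topicalDistance(double[] a, double[] b) {
        double s = 0;
        for (int t = 0; t < a.length; t++) s += (a[t] - b[t]) * (a[t] - b[t]);
        return Math.sqrt(s);
    }
    // Average distance of each post to its predecessor (Eq. 7, divisor simplified
    // here to the number of post pairs).
    static double interPostDistance(double[][] posts) {
        double sum = 0;
        for (int i = 1; i < posts.length; i++)
            sum += topicalDistance(posts[i], posts[i - 1]);
        return sum / (posts.length - 1);
    }
    // Inverse distance gives consistent blogs a high score (Eq. 8).
    static double interPostConsistency(double[][] posts) {
        return 1.0 / interPostDistance(posts);
    }
}
```

A blog whose successive topic vectors barely move scores much higher than one that jumps between disjoint topics from post to post.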
4.2 Internal Consistency of Posts (Intra-Post)
The intra-post consistency focuses on the inner consistency of one post. It is high
if a blog author focuses on one single topic and does not change the subject while
writing one single post. Thus, it favors self-contained and complete posts that
do not cover several topics. A consistent post should handle just a few topics,
but discuss them in more detail.
The intra-post consistency is very similar to the inter-post consistency except
that it operates on the sections of posts. Each post is subdivided into sections
by splitting the post’s content by each occurrence of more than one line break or
HTML separator.
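The splitting rule can be sketched with a regular expression; the exact pattern, and treating the `<hr>` tag as the HTML separator, are assumptions of this illustration, not the pattern used in the BI framework.

```java
// Splits a post's content into sections at runs of more than one line break or
// at an <hr> separator tag (pattern is an illustrative assumption).
public class SectionSplitter {
    public static String[] split(String content) {
        return content.split("(\\r?\\n){2,}|<hr\\s*/?>");
    }
}
```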
Each section gets assigned one topic vector. The components of this topic vector represent the probability that a section is about a specific topic.
Two additional concepts need to be defined before formalizing the intra-post consistency. Firstly, Section is the set of all sections in the data set and Section_p ⊂ Section is the set of all sections of one specific post p ∈ Post. Secondly, predecessor(s) with s ∈ Section is the function that returns the preceding section of a section s.
Further, the function topicalDistance(s_i, s_j) with s_i, s_j ∈ Section is defined in the same manner as Eq. 6.
intraPostDistance(p) = ( ∑_{s ∈ Section_p} topicalDistance(s, predecessor(s)) ) / |Section_p|    (9)
The intra-post distance is also defined for a whole blog. It is the mean of all
distance values of the latest posts.
intraPostDistance(b, d) = ( ∑_{p ∈ LatestPosts_{b,d}} intraPostDistance(p) ) / |LatestPosts_{b,d}|    (10)
Thereby, the intraPostConsistency(b, d) is defined as the inverse intra-post distance to provide consistent blogs with a high score (see Eq. 11).
intraPostConsistency(b, d) = 1 / intraPostDistance(b, d)    (11)
4.3 Consistency between Posts and Classification (Intra-Blog)
The intra-blog consistency serves as a measure for the quality of a blog’s classification. It evaluates to which extent the content of posts is consistent with
tags and categories that form the classification system of a blog. As discussed in
Sec. 2.1, tags and categories are very important for the orientation of a user and
the navigation through the blog. It is crucial that blog authors choose tags and categories wisely and appropriately to their content. In addition, spam blogs tend
to overuse tags and categories to earn a higher rank in blog search engines for a
high number of keywords. These low quality blogs and spam blogs get a very
low intra-blog consistency score.
For a high consistency, tags and categories should span the same topic distribution as the overall content of a blog.
The intra-blog consistency is based on the distance between the topic vector of each post and the topic vector of the post's classification.
Before defining the intra-blog consistency, additional concepts need to be formally defined. Tag is the set of all tags and Category is the set of all categories in the data set. Further, Tag_p and Category_p with p ∈ Post are the sets of tags and categories of one post. The set Classification_p is defined as the union of the categories and tags of one post p.
Classification_p = Tag_p ∪ Category_p    (12)
Given the classification of each post, Classification_p, and the set of all posts in a blog, Post_b, the intra-blog distance is defined as the average topical distance between each post and its classification (see Eq. 13).
intraBlogDistance(b) = ( ∑_{p ∈ Post_b} topicalDistance(Classification_p, p) ) / |Post_b|    (13)
Finally, the intraBlogConsistency(b) is defined as shown in Eq. 14.
intraBlogConsistency(b) = 1 / intraBlogDistance(b)    (14)
A low value of intraBlogConsistency(b) indicates a mismatch between the classification and the actual content. Thus, the quality of the blog is questionable and it should receive a lower rank.
4.4 Consistency of Linking and Linked Blogs (Inter-Blog)
Finally, the inter-blog consistency serves as a context-based consistency metric.
It measures the consistency between the blog’s content and the content of linking and linked blogs. Thus, it measures whether a blog is part of an expert
community. An expert community is a set of blogs that focus on one topic and
discuss this topic interactively. For example, during the Arab Spring, a single blog started the discussion and other blogs built an active discourse around this initial blog [5].
Among other motivations, the followers of blogs pursue two goals: First, they want to spread the word of the referenced blog author to widen the reach of the
message. Second, referencing blog authors want to discuss the message and get
into an active discourse with the referenced blog author. Those discourses are
the essence of the blogosphere. Similar to Wikipedia, blog authors increase the
information quality by evaluating and iterating on each other's posts.
As already discussed for the BI-Impact score, blogs have a set of special link
types, but only a few of them are actual interaction links and not only friendly
links or advertisements (see Sec. 2.1).
Blogroll links, and links that are not located in posts or comments, have no evaluating or commenting nature. In contrast, if a blog author links from a post
directly to a post of another blog author, he indicates a reply or a similar reaction like a reference. Further, comment authors can also link to other posts; this is formally regarded as a linkback. Linkbacks are also indicators for an active
discourse between two blogs. These links, linkbacks and links from posts, are
interaction links.
The inter-blog consistency defines the consistency between a blog and the blogs that link to it or are linked by it via an interaction link.
The post linking post relation (PLP) contains the tuple (p_i, p_j) with p_i, p_j ∈ Post if p_i has an interaction link to p_j. The set IP_{p_i}, incoming posts, with p_i ∈ Post is defined as follows:
IP_{p_i} = { p_j | p_j ∈ Post ∧ (p_j, p_i) ∈ PLP }    (15)
In parallel, the set OP_{p_i}, outgoing posts, with p_i ∈ Post is defined.
OP_{p_i} = { p_j | p_j ∈ Post ∧ (p_i, p_j) ∈ PLP }    (16)
Incoming links cannot be controlled by the blog author. Hence, two constants α, β introduce a weighting for incoming and outgoing posts.
The postContextDistance(p) with p ∈ Post is defined as the weighted sum of the average distance to all incoming and the average distance to all outgoing posts (see Eq. 17).
postContextDistance(p) = α · ( ∑_{j ∈ IP_p} topicalDistance(p, j) ) / |IP_p| + β · ( ∑_{j ∈ OP_p} topicalDistance(p, j) ) / |OP_p|    (17)
A typical weighting is α = 0.6; β = 0.4, slightly emphasizing incoming links due to their unbiased nature.
The interBlogDistance(b, d) with b ∈ Blog and d ∈ Date is defined in Eq. 18. The
inter-blog distance calculation considers only the latest posts due to the transient
nature of the blogosphere.
interBlogDistance(b, d) = ( ∑_{p ∈ LatestPosts_{b,d}} postContextDistance(p) ) / |LatestPosts_{b,d}|    (18)
Analogously to the other three aspects, the interBlogConsistency(b, d) is defined
as the inverse interBlogDistance(b, d) (see Eq. 19).
interBlogConsistency(b, d) = 1 / interBlogDistance(b, d)    (19)
4.5 Combined Topic Consistency Rank
Finally, the topic consistency rank is defined as the combination of all four facets.
All facets are combined by calculating a weighted sum for each blog.
The topicConsistency(b, d) with b ∈ Blog and d ∈ Date is defined in Eq. 20. The four constants χ, δ, ε, and γ give a weighting for each component of the topic consistency rank.
topicConsistency(b, d) = χ · interPostConsistency(b, d) + δ · intraPostConsistency(b, d) + ε · intraBlogConsistency(b) + γ · interBlogConsistency(b, d)    (20)
The weighting can be varied according to the characteristics of the analyzed data set. Owing to the low usage of categories and tags in the BlogIntelligence data set and the high usage of content summaries in posts' content, the weights used in this thesis are: χ = 0.3; δ = 0.2; ε = 0.2; γ = 0.3.
The final topic consistency rank is calculated by normalizing the results of the topicConsistency function over all considered blogs. Through this normalization the values lie in the interval [0, 1], which is a common approach for rank normalizations [53].
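The combination of Eq. 20 and the subsequent normalization can be sketched as follows. Min-max scaling is assumed here, since the text only requires the final values to lie in [0, 1]; the class name and the array layout of the per-blog scores are illustrative.

```java
// Weighted combination (Eq. 20) plus a min-max normalization to [0, 1]
// (min-max scaling is an assumption of this sketch).
public class TopicConsistencyRank {
    static final double CHI = 0.3, DELTA = 0.2, EPSILON = 0.2, GAMMA = 0.3;
    // scores[i] = {interPost, intraPost, intraBlog, interBlog} for blog i.
    public static double[] rank(double[][] scores) {
        int n = scores.length;
        double[] combined = new double[n];
        for (int i = 0; i < n; i++)
            combined[i] = CHI * scores[i][0] + DELTA * scores[i][1]
                        + EPSILON * scores[i][2] + GAMMA * scores[i][3];
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double c : combined) {
            min = Math.min(min, c);
            max = Math.max(max, c);
        }
        for (int i = 0; i < n; i++)  // map the combined scores onto [0, 1]
            combined[i] = (max == min) ? 0.0 : (combined[i] - min) / (max - min);
        return combined;
    }
}
```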
5 Implementation of Topic Detection
As mentioned in Sec. 4.1, all topic consistency metrics depend on topic term sets.
To find topics and assign terms to topic term sets, the topic detection procedure shown in Fig. 8 is implemented.
Figure 8: Flow diagram of the topic detection. (1) Download post, (2) parse content, (3) extract terms [BlogIntelligence crawler]; (4) calculate tf*idf, (5) build word vectors [SAP HANA database]; (6) run k-means, (7) write word clusters [Apache Mahout analyzer].
5.1 Prerequisites
There are several steps necessary before running the actual clustering algorithm,
which creates the topic term sets. The preprocessing covers steps 1-5 of the topic
detection flow (see Fig. 8).
Step 1.
First of all, the BI crawler harvests the blogosphere. It stores all data of
blogs into the SAP HANA database. The crawler traverses the blog link graph
and downloads every blog post. Immediately after downloading, the crawler
parses the downloaded HTML files (see Fig. 8).
Step 2.
The parsing includes the removal of non-textual content like images
and videos. Further, it removes markups like HTML tags. After parsing a web
page, the crawler stores the pure text content as a character large object (CLOB)
into the database.
Step 3.
The Nutch crawling cycle is extended by a new component that allows
a word extraction on the text of posts. During this extraction, the crawler first
segments the text into words. This is done by splitting on non-word characters.
Afterwards, the extraction component removes all stop words from the word
set. Stop words are the most common words of a language, such as the, is, at,
and on. It uses the stop word lists from the Weka18 project. Weka is a collection
of machine learning algorithms for data mining tasks.
The word set is still redundant. It contains inflected or derived words. Thus,
a stemming of words is applied to reduce the words to their stem form. The
extraction component incorporates the stemmers of the Weka framework, which provides stemmer classes for various languages like German.
The preprocessing of the crawler assigns to each post its set of word stems. This set of words is stored in a separate table in the database, called the dictionary table.
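The tokenization and stop-word removal of the extraction component can be sketched as follows. The tiny stop-word list stands in for Weka's lists, and the stemming step is only indicated by a comment, since the actual component delegates it to Weka's stemmer classes.

```java
// Sketch of the word extraction: tokenize on non-word characters and drop
// stop words (illustrative stop-word list; the component uses Weka's lists
// and stemmers instead).
import java.util.*;

public class WordExtractionSketch {
    static final Set<String> STOP_WORDS = Set.of("the", "is", "at", "on");

    public static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token))
                words.add(token);   // a stemming step would follow here
        }
        return words;
    }
}
```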
The word extraction process is actually a common feature among text databases like Apache Lucene19. Although SAP HANA already contains a word count matrix, which would be the dictionary table for the topic detection, this matrix is not accessible via an application programming interface (API).
In contrast, the next two steps are directly performed in the database.
Step 4.
An SQL procedure calculates the tf*idf values for each word. SQL
procedures have the advantage that they can directly access the data in memory without transferring it for processing. The implementation follows Eq. 4.
18 http://www.cs.waikato.ac.nz/ml/weka/
19 http://lucene.apache.org
Step 5.
Further, the database is used to create the word vectors for each post
and the post vectors for each word. The latter are used for the clustering of
words that finally produces the desired topics. The vectors are computed by an
SQL view that directly refers to the basic web page table and the result table of
the tf*idf calculation. An example result of the view is shown in Tab. 1.
post id   word id   tf*idf
p4        w5        tfidf_{4,5}
p7        w8        tfidf_{7,8}
p5        w5        tfidf_{5,5}
p8        w8        tfidf_{8,8}
...       ...       ...
Table 1: Example tf*idf vectors resulting from the SQL view.
With step 5 the preprocessing is completed and all vectors can be loaded into
the HDFS file system of Mahout. This is implemented by a tailor-made class
for the BlogIntelligence analytics. It uses the adapted object relational mapper
(ORM), Apache Gora, to access the tf*idf vector view of HANA and transfer all
vectors to the HDFS file system. These vectors are the word vectors with posts
as dimensions.
Two example vectors are shown in Tab. 2. Mahout uses a sparse vector implementation. Sparse vectors are specially designed for document-word vectors
that are only sparsely filled. Sparsely filled means that most of the vector components are zero because words only appear in a small set of documents compared
to the overall collection.
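The idea behind a sparse vector can be sketched with a map from dimension index to value; Mahout's actual sparse vector implementations are more elaborate, so this class is purely illustrative.

```java
// Simplified sparse vector: only non-zero components are stored, so memory
// usage depends on the number of non-zero entries, not on the dimensionality.
import java.util.HashMap;
import java.util.Map;

public class SparseVectorSketch {
    private final Map<Integer, Double> entries = new HashMap<>();

    public void set(int dim, double value) {
        if (value != 0.0) entries.put(dim, value);
    }
    public double get(int dim) {
        return entries.getOrDefault(dim, 0.0);  // absent dimensions read as zero
    }
    public int nonZeroCount() {
        return entries.size();
    }
}
```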
5.2 Clustering
The two last steps are executed by the adapted Mahout framework (see Sec. 2.5).
Mahout offers various clustering algorithms like mean shift clustering, spectral
clustering, latent Dirichlet allocation, and k-means clustering [36].
      w5           w8
p4    tfidf_{4,5}  0
p5    tfidf_{5,5}  0
p6    0            0
p7    0            tfidf_{7,8}
...   ...          ...
Table 2: Sparse word vectors from HDFS.
The current implementation of SAP HANA does not support a clustering applicable to the high number of dimensions created by the word-post vectors. The total post-word-matrix size for L20 is limited to the maximum integer value. This value is too small for the approximately 1,000,000 by 500,000 word-post matrix. Thus, the L API is not applicable to the clustering task.
Another alternative is the R21 integration of HANA. R is a programming language and software environment with a special focus on statistical computation. Besides clustering algorithms, R supports various techniques like time-series analysis and statistical tests. The road block for R is again the massive amount of data: the database needs to transfer all vectors to an external R component, and this process also fails due to the high transportation cost.
To sum up, until the integration of advanced text analysis algorithms in HANA
is completed, the external analysis framework Apache Mahout is used.
As discussed in Sec. 4.1, the topic consistency rank relies on a 1:n relation between words and topics. This approach simplifies the prototypical implementation, because it does not require a complex clustering technique based on probability distributions. Advanced, more complex clustering techniques are subject
to further research (see Sec. 8).
20 http://wiki.tcl.tk/17068
21 http://www.r-project.org/
Step 6.
k-means is a well-known algorithm for clustering objects that creates pair-wise distinct clusters. All objects need to be represented as a numerical feature vector. In this case, these objects are the words that are grouped into topic
term sets. The components of the feature vector are the tf*idf values of these
words in each crawled post. The k in k-means identifies the user-defined number of clusters that is also input for the algorithm. The feature vector represents
a vector in an n-dimensional space with n being the number of posts.
The algorithm operates as illustrated in Fig. 9.
k-means randomly chooses k points in the n-dimensional space that serve as initial centers of the clusters, called centroids (see Fig. 9 A). In the next phase
each word is assigned to the closest centroid. The closest centroid is the centroid
with the minimal distance to the feature vector of the word (see Fig. 9 B). One
can apply various distance measures depending on the data set to be clustered.
As discussed in Sec. 4.1, the established Euclidean distance serves as distance
measure.
After assigning the words to centroids, each cluster gets a new centroid. These
centroids are calculated by averaging the feature vectors of all words assigned
to one cluster (see Fig. 9 C). This process of assigning words and computing new
centroids is repeated until the convergence of the algorithm. The convergence
can be reached if the centroid movement is below a predefined threshold.
Figure 9: An example iteration of k-means (∆ - centroids; x - points).
A) Random centroids. B) Assign clusters. C) Compute new centroids.
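The iteration described above can be sketched in a few lines of Python. This is only an illustrative in-memory version, not the Mahout implementation used in the prototype:

```python
import math
import random

def kmeans(points, k, max_iter=100, threshold=1e-4, seed=0):
    """k-means as in Fig. 9: random initial centroids (A), assign every
    point to the closest centroid by Euclidean distance (B), recompute
    each centroid as the mean of its cluster (C); stop when no centroid
    moves more than the threshold."""
    rnd = random.Random(seed)
    dim = len(points[0])
    centroids = [[rnd.random() for _ in range(dim)] for _ in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                       # B) assignment step
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        moved = 0.0
        for i, cl in enumerate(clusters):      # C) update step
            if cl:
                new = [sum(xs) / len(cl) for xs in zip(*cl)]
                moved = max(moved, math.dist(new, centroids[i]))
                centroids[i] = new
        if moved < threshold:                  # convergence criterion
            break
    return centroids, clusters

points = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
centroids, clusters = kmeans(points, k=2)
# the two low points and the two high points end up in separate clusters
```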
5 Implementation of Topic Detection
Mahout’s version of k-means is implemented by the KMeansDriver class. Esteves
et al. [54] describe the performance of this implementation. They highlight that
the Mahout implementation scales with increasing data set size and increasing
number of computing nodes.
After each iteration, the KMeansDriver stores the new centroids into the HDFS.
After the completion of all iterations, Mahout runs an extra job that writes the
clustered points, i.e. the word to topic assignment, to the file system.
Step 7.
This assignment can be read via Mahout's cluster writer module. An additional
class, called HANAClusterWriter, is implemented. This class transfers the clustered points to the HANA database. It is not a MapReduce job because it only
sequentially transfers the data from the HDFS to the database.
word id    cluster id
4          1
8          1
2          3
...        ...

Table 3: Resulting cluster table.
An example of the resulting table is shown in Tab. 3. The choice of the feature
vector is crucial for the meaning of the clustering results. By selecting the tf*idf
values in each post for each word, words that frequently appear in the same
posts are grouped together. Thus, words with a similar meaning are assigned to
the same cluster [10].
These word groups are the topic term sets used for the calculation of the topical
distance. The granularity of the topics depends on the user-defined number
of clusters k. As proposed by Abe et al. [55], the aim is to find clusters with
around 100 words per cluster.
In the evaluation (see Sec. 7), different settings for k and the number of iterations
are tested to achieve an average cluster size of 100.
6 Implementation of the Topic-Consistency Rank
This section presents the details of the implementation of the topic-consistency
rank. The rank is completely integrated into the database and relies only on
basic SQL constructs.
The theoretical foundations for each of the underlying partial scores are already
discussed in Sec. 4. Each score implementation consists of a combination of SQL
views and permanent and temporary tables. The combined score for each blog is
the weighted sum of the single scores (see Sec. 4.5).
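A minimal sketch of this combination step; the weight values shown are illustrative placeholders, not the weights defined in Sec. 4.5:

```python
def combined_score(sub_scores, weights):
    """Weighted sum of the four consistency sub scores (see Sec. 4.5).
    The weight values below are illustrative, not those of the thesis."""
    return sum(weights[name] * score for name, score in sub_scores.items())

weights = {"intra_post": 0.25, "inter_post": 0.25,
           "intra_blog": 0.25, "inter_blog": 0.25}
scores = {"intra_post": 0.8, "inter_post": 0.6,
          "intra_blog": 0.4, "inter_blog": 0.2}
print(combined_score(scores, weights))  # 0.5
```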
6.1 Intra-Post Consistency
To calculate the intra-post consistency, an additional tf*idf calculation view is
implemented based on paragraphs. Like the normal tf*idf view (see Sec. 5.1),
this view is also based on the dictionary tables. The dictionary tables are the
result of the word extraction phase of the topic detection. An example dictionary
table is shown in Tab. 4. For each word of a post, a row is created that contains
the word, the post id, and the word position.
word     post      position
hello    postid1   0
world    postid1   1
...      ...       ...

Table 4: The dictionary table maps words to the containing posts and positions.
To create tf*idf values based on paragraphs, all words within a specific window
are regarded as one paragraph. The size of this window is set to 100, based on
the average length of a paragraph, which is 100-150 words [56].
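The windowing itself is straightforward; a Python sketch:

```python
def split_into_sections(words, window=100):
    """Split a post's word list into fixed-size windows that serve as
    paragraphs for the paragraph-level tf*idf view (window = 100, the
    lower bound of the average paragraph length)."""
    return [words[i:i + window] for i in range(0, len(words), window)]

post = ["w%d" % i for i in range(230)]
sections = split_into_sections(post)
print([len(s) for s in sections])  # [100, 100, 30]
```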
The calculation is a direct implementation of the formal definition (see
Sec. 4.2). It creates a join between all succeeding sections. The result of this
join is the tf*idf values for each section and each occurring word. Afterwards,
these tf*idf values are joined with the cluster table. The score for each cluster is
calculated by summing up the tf*idf values per cluster.
Afterwards, the topical differences of the sections are calculated by joining the
sections of each post on the topic cluster. The topical distance of two sections
is the square root of the sum of the squared differences for each cluster. The
intra-post distance on post level is the average of the section distances. Based
on the post-level distance, the blog-level distance is calculated by averaging the
intra-post distance values of all posts. Finally, the intra-post score is computed
by inverting the intra-post distance.
To sum up, the intra-post score calculation is a combination of nine joins and
four aggregations in the database. The mapping from ids to words and URIs
and vice versa introduces the most complexity to this operation. Further, it has
to be mentioned that the intra-post rank is the most detailed rank with respect
to the size of the tf*idf view results.
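The chain of aggregations can be illustrated in Python. The per-cluster vectors are modeled as dicts with equal key sets, and the inversion shown here (1/(1+d)) is an assumption; the exact form is the one defined in Sec. 4.2:

```python
import math

def topical_distance(a, b):
    """Euclidean distance between two per-cluster score vectors
    (dicts mapping cluster id to summed tf*idf, equal key sets)."""
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in a))

def intra_post_distance(section_vectors):
    """Post level: average topical distance between succeeding sections."""
    pairs = list(zip(section_vectors, section_vectors[1:]))
    return sum(topical_distance(a, b) for a, b in pairs) / len(pairs)

def intra_post_score(post_distances):
    """Blog level: average the post-level distances and invert them.
    The inversion 1/(1+d) is an assumption for illustration."""
    avg = sum(post_distances) / len(post_distances)
    return 1.0 / (1.0 + avg)

sections = [{"c1": 1.0, "c2": 0.0}, {"c1": 0.0, "c2": 1.0}]
print(round(intra_post_distance(sections), 3))  # 1.414
```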
6.2 Inter-Post Consistency
The inter-post consistency builds upon the tf*idf view based on posts, called
post-tf*idf, which is also used by the topic clustering (see Sec. 5.1). Posts are
objects in the database and thus do not require an additional segmentation.
To get succeeding posts, each post is joined with the post that has the minimal
next publishing date. After this join, the topic vector differences of each post
and its successor can be computed. By grouping per post, the Euclidean
distances between all succeeding posts are calculated. Afterwards, averaging
all distances yields the inter-post distance and thus the inter-post consistency
score of a blog.
This operation is very similar to the intra-post consistency, except that it is
based on the latest posts. The selection of the latest posts is implemented as
a simple where-condition on the post publishing date.
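The successor pairing corresponds to the following Python sketch of the self-join on the minimal next publishing date:

```python
def successor_pairs(posts):
    """Pair each post with its successor by publishing date, mirroring
    the self-join on the minimal next publishing date."""
    ordered = sorted(posts, key=lambda p: p["published"])
    return list(zip(ordered, ordered[1:]))

posts = [
    {"id": 3, "published": "2012-03-01"},
    {"id": 1, "published": "2012-01-01"},
    {"id": 2, "published": "2012-02-01"},
]
print([(a["id"], b["id"]) for a, b in successor_pairs(posts)])  # [(1, 2), (2, 3)]
```

Each resulting pair then contributes one Euclidean distance between the two posts' topic vectors, and the average over all pairs is the inter-post distance.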
6.3 Intra-Blog Consistency
The intra-blog consistency calculates the distance between the classification of
each post and its content. Therefore, it uses the post-tf*idf view to get the term
importance values for the content. Further, it uses a tf*idf view based on the
classification system, called class-tf*idf. This view returns the importance values
for each term used in tags or categories.
The intra-blog consistency on post level is calculated as the topical distance
between the post's classification vector and the post's content vector. Finally, all
topical distances are combined by averaging them for each blog.
To accelerate the calculation, the tf*idf vectors are persisted as temporary
column tables. Thereby, a join between vectors can be performed as a column
search operation in the SAP HANA database, which is the fastest way of joining [33].
Further, blogs do not get an intra-blog consistency if they do not use tags
or categories. These blogs are regarded as inconsistent with their non-existing
classification system. Thus, they are assigned the minimal score, i.e. zero.
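A simplified in-memory sketch of this score; the vector layout, the restriction of the distance to the content clusters, and the inversion form are assumptions for illustration:

```python
import math

def intra_blog_score(posts):
    """Average topical distance between each post's classification
    vector (tags/categories) and its content vector; blogs without any
    classification get the minimal score, zero.
    Simplification: the distance runs over the content clusters only,
    and the inversion 1/(1+d) is an assumption."""
    if not any(p["class_vec"] for p in posts):
        return 0.0
    dists = [
        math.sqrt(sum((p["class_vec"].get(c, 0.0) - v) ** 2
                      for c, v in p["content_vec"].items()))
        for p in posts
    ]
    avg = sum(dists) / len(dists)
    return 1.0 / (1.0 + avg)

posts = [{"class_vec": {"c1": 1.0}, "content_vec": {"c1": 1.0, "c2": 0.0}}]
print(intra_blog_score(posts))  # 1.0
```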
6.4 Inter-Blog Consistency
The context-based consistency of a blog, called inter-blog consistency, is based
on its linking and linked blogs.
To calculate this score, a join with the biggest table of the data set, the link
table (see Tab. 5), is necessary. This table consists of the linking and linked blog
URIs and the corresponding link type, which represents whether a blog links to
another blog via a post or a comment.
To calculate the topical distance between all outgoing and incoming links, the
blog-topic-probability table is joined with the link table. This is the most costly
operation on the data set because the link table is growing rapidly and currently
contains around 160 million rows.
After the join computation, the post context distances can be calculated. By
linking post          linked post          link type
spreeblick.de?p=22    netzwertig.de?p=31   via post
carta.info?p=12       spreeblick.de?p=26   via comment
promicabana.de?p=76   gesichtet.net?p=3    via post
...                   ...                  ...

Table 5: Example rows of the link table.
grouping per blog, the inter-blog consistency score is computed as defined
in Eq. 19.
6.5 BI-Impact Score
As discussed in Sec. 3.2, BlogIntelligence implements a blog ranking metric called
BI-Impact score as a proof-of-concept prototype. In the course of evaluating the
topic consistency metrics against a blog-specific ranking, the BI-Impact score is
transferred to SAP HANA.
The score contains two components: the blog interaction and the post interaction.
These components are also calculated as SQL views. The calculation requires
numerous joins over the link table to calculate the partial rank for each
distinct link type.
The BI-Impact score is calculated by a recursive algorithm. It needs multiple
iterations until the rank converges. After each iteration, a temporary table stores
the ranks for each blog and serves as input for the next iteration.
The whole calculation spans a complex query tree. It contains about 52 join
operations. Although the majority of tables have a low number of rows, the
usage of the link table introduces a high complexity.
Listing 1 shows the simplified code for one of the basic views for the rank
calculation. This view creates a score for each post based on the scores of all
blogs with incoming links. It differentiates between the various link locations or
link types of the incoming links. The final rank is calculated as the weighted
sum of the different link types [11].
Listing 1: SQL view creates post score per link type

CREATE VIEW postScoreByLinkType AS
SELECT post, linktype, AVG(scoreOfIncomingBlogs) AS score
FROM postByIncomingPostAndLinkType AS inBlog
JOIN normalizedBiImpactScore AS score
  ON score.host = inBlog.host
GROUP BY post, linktype;
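The recursive fixed-point character of the rank can be illustrated with a PageRank-like sketch in Python. The damping factor and the single link type are simplifications; the real BI-Impact combines several weighted link types [11]:

```python
def iterate_rank(links, damping=0.85, iterations=20):
    """Fixed-point iteration sketch: each blog's rank is fed by the
    ranks of the blogs linking to it, repeated for a fixed number of
    iterations (until near convergence). The rank dict plays the role
    of the temporary table: the output of one iteration is the input
    of the next."""
    blogs = {b for pair in links for b in pair}
    rank = {b: 1.0 / len(blogs) for b in blogs}
    out_degree = {b: sum(1 for s, _ in links if s == b) for b in blogs}
    for _ in range(iterations):
        new = {b: (1 - damping) / len(blogs) for b in blogs}
        for src, dst in links:
            new[dst] += damping * rank[src] / out_degree[src]
        rank = new
    return rank

ranks = iterate_rank([("a", "b"), ("b", "a"), ("a", "c")])
```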
7 Evaluation
This section discusses the results and the plausibility of the topic consistency
rank. The evaluation presents the results of the partial ranks and the overall
rank, and compares them to the results of the BI-Impact score.
7.1 Experimental Setup
For the evaluation of this master's thesis, we activated the BlogIntelligence
crawler for one month. The crawler uses an 8-core machine with 24 gigabytes
of RAM running Ubuntu Linux. The harvested data is stored on a separate
database machine with 32 cores and 1 terabyte of RAM running SUSE Linux.
This machine also runs the SQL analytical queries.
The cluster setup for the topic detection consists of 12 machines with 2 cores and
4 gigabytes of RAM each. These machines are grouped into one Hadoop cluster
that is configured to run 50 parallel tasks.
The key data indicators of the data set are shown in Tab. 6.
Indicator                               Value (approx.)
data set size                           500 GB
crawled web pages                       2.5 million
identified blogs                        12,000
identified posts                        600,000
average words per post                  57.5
average number of categories per post   2.6
average number of tags per post         4.2
number of news portals                  1,300

Table 6: State of the BlogIntelligence data set.
7.2 Clustering
The quality of the underlying clustering is crucial for the quality of the topic
consistency rank. In particular, the size of the clusters determines whether blogs
with versatile interests wrongly get a good consistency rank.
The k-means clustering of the Mahout implementation runs on the cluster setup.
The runtime depends on the number of iterations and the number of desired
clusters. It varies between 8 and 20 minutes per iteration. However, the topic
detection only has to be repeated if the number of words changes significantly.
After the term extraction procedure, the data set contains 450 000 words. The
resulting matrix of words and posts consists of 2.7 billion tf*idf values. Most of
the values are zero. Therefore, Mahout uses a sparse vector representation that
results in a matrix size of only 144 megabytes.
For the clustering, four different variants are evaluated. The indicators for the
quality of the clusterings are shown in Tab. 7.
                                 Variant 1   Variant 2   Variant 3   Variant 4
Parameters:
  k                              100         10 000      10 000      20 000
  iteration                      10          10          40          40
Results:
  maximum cluster size           448 546     419 453     187 093     21 234
  minimum cluster size           1           1           1           1
  number of filtered clusters    52          5 398       4 419       18 546
  minimum filtered cluster size  2           2           2           2
  maximum filtered cluster size  37          83          52          383
  average filtered cluster size  8.73        4.55        3.86        10.1

Table 7: Quality of the tested clustering configurations.
The number of filtered clusters is always below the actually calculated number
of clusters of k-means, called k. This is caused by the filtering of too small and
too large clusters. The filtering is conservative: it removes clusters with a size of
one. This avoids expensive and too specific word distance calculations. Further,
clusters with more than 1,000 words are ignored, because the word diversity of
such clusters harms the validity of the topic consistency rank.
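The filtering step itself is a simple size predicate; a sketch with the thresholds described above:

```python
def filter_clusters(clusters, min_size=2, max_size=1000):
    """Keep only clusters whose size is within [min_size, max_size]:
    singleton clusters and clusters over 1,000 words are dropped."""
    return {cid: words for cid, words in clusters.items()
            if min_size <= len(words) <= max_size}

clusters = {1: ["ray"], 2: ["iphone", "apple", "mac"], 3: ["w"] * 1500}
print(sorted(filter_clusters(clusters)))  # [2]
```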
Variant 1 creates 100 clusters with a maximum cluster size of 448 546 words.
These words cannot be considered, because the cluster size is larger than 1,000.
Thereby, only 1 500 words are grouped into meaningful clusters. With an
average cluster size of 8.73, there are enough words per cluster to describe a topic.
Variant 1 creates too few clusters. Therefore, the cluster number is increased to
10 000 in variant 2. Although it creates more than 5 000 filtered clusters, the
average cluster size halves and the number of unused words in the biggest cluster
decreases only negligibly. Hence, variant 3 increases the number of iterations to
get a better word distribution among the clusters.
Unexpectedly, the number of filtered clusters decreases for variant 3. The size
of the maximum cluster decreases, and the average size of the filtered clusters
also decreases. Consequently, variant 3 creates more clusters with a size over
1,000 than variant 2.
To further increase the number and average size of the filtered clusters, variant 4
increases the number of created clusters. Variant 4 gives the best results in the
evaluation. It contains over 18,000 filtered clusters, and the maximum cluster size
decreases to about 20,000. In addition, variant 4 has on average 10 words per
cluster, which is a far more promising distribution than in the other three variants.
As a consequence of the clustering evaluation, the topic consistency rank calculation uses the filtered clusters of variant 4.
7.3 Results of the Topic Consistency Sub Ranks
The ten best blogs for each of the topic consistency sub ranks are calculated. The
BI crawler focuses on crawling the German blogosphere. Therefore, the majority
of all blogs are German, and the top consistency blogs are German, too. For each
of the sub ranks, two highly ranked representatives are introduced in detail.
The top ten blogs for the two post-related sub ranks are shown in Tab. 8.
Rank   Intra-Post              Inter-Post
1      promicabana.de          blog.de.playstation.com
2      dsds2011.info           upload-magazin.de
3      blog.beetlebum.de       blog.studivz.net
4      schockwellenreiter.de   der-postillon.com
5      hornoxe.com             allfacebook.de
6      netbooknews.de          achgut.com
7      iphoneblog.de           gutjahr.biz
8      carta.info              elmastudio.de
9      blog.studivz.net        netzwertig.com
10     seo.at                  lawblog.de

Table 8: The top ten ranked blogs for intra-post and inter-post consistency.
One example of a high intra-post consistency is the dsds2011.info blog. The
intra-post consistency gives the average internal consistency of the posts in a
blog. dsds2011.info is a follower blog of a German TV casting show searching
for a new superstar. This blog is a fan blog. Therefore, each post mostly focuses
on one person, e.g. the current candidate. Further, some posts discuss the
performance of each candidate of a show. As a result, each paragraph of such a
post focuses on a different person but uses the same attributes to describe the
performance.
Another blog with a high intra-post consistency is iphoneblog.de. Obviously,
the topics of its posts are all related to news about Apple's iPhone. Each post of
this blog contains on average five paragraphs, is carefully researched, and
concentrates on one feature, game, or accessory of the iPhone. These special
interests are fully investigated in a post over several paragraphs. As a consequence,
the internal consistency of the posts is high.
A representative of a high inter-post consistency is the blog.de.playstation.com
blog. This blog has a high topical consistency between the latest published
posts. The main focus of this blog is on PlayStation games. It frequently
publishes posts about the latest games, which are discussed regarding
their game play, graphics, and story line. Each post presents a game in a similar
structure and phrasing. Thus, the topical distance between these posts is very
low and the topical consistency is very high.
Another highly ranked blog regarding the consistency between posts is allfacebook.de. It publishes posts about new features of the social network, discussions
about privacy, and the latest news about Facebook. Although this blog handles
these three topics, it usually publishes multiple posts per topic in a row.
This decreases the distance between succeeding posts and boosts its inter-post
consistency.
Rank   Intra-Blog          Inter-Blog
1      readers-edition.de  innenaussen.com
2      iphoneblog.de       shopblog author.de
3      eisy.eu             nachdenkseiten.de
4      karrierebibel.de    helmschrott.de
5      meinungs-blog.de    blog.studivz.net
6      dsds2011.info       fanartisch.de
7      macerkopf.de        achgut.com
8      kwerfeldein.de      internet-law.de
9      events.ccc.de       scienceblogs.de
10     mobiflip.de         events.ccc.de

Table 9: The top ten ranked blogs for intra-blog and inter-blog consistency.
The top ten blogs for the two blog-related sub ranks are shown in Tab. 9.
One example of a high intra-blog consistency rank is again the iphoneblog.de
blog. This blog uses the post classification in an appropriate way. As mentioned
above, the posts of this blog are carefully edited. By investigating the content of
the blog, it is observable that each post contains, besides the common categories,
at least six content-specific tags. This shows that a blog gains a high consistency
ranking for the intra-post and intra-blog consistency by carefully authoring its
posts.
Another example is the macerkopf.de blog. In contrast to iphoneblog.de, the
posts of this blog handle a higher variety of topics and comment more critically.
For example, they frequently compare the iPhone with other mobile phones.
Hereby, a post covers at least two topics. Nevertheless, categories and tags
address each topic of the post, which results in a high quality of the classification
and in a high intra-blog consistency rank.
The inter-blog consistency measures the consistency of a blog with its linking
and linked blogs. The best ranked blog for the inter-blog consistency is the
innenaussen.com blog. This blog writes reviews about diverse beauty products.
The blog link graph indicates that this blog mainly links to other product
reviews, e.g. to reference another opinion on the product. Further, it is observable
that it is also linked by product review blogs on beauty products like the
lipglossladys.com blog.
The scienceblogs.de blog also has a high inter-blog consistency rank. This is
caused by its link-directory nature. It mainly collects and summarizes posts
from other science-related blogs and provides an entry point into a science
community. This blog mainly references the original content. Thereby, its summaries
are very consistent with the linked content.
In addition, comparing all four sub ranks of Tab. 8 and Tab. 9 shows that
blog.studivz.net achieves high consistency ranks for each sub rank except the
intra-blog consistency. This blog writes about topics around a German social
network called studiVZ. It is a typical corporate blog that describes news and
new features of a company and the company's products. Hereby, the blog
has highly consistent posts that discuss a topic over multiple paragraphs. It
constantly posts about activities of the company and is linked by blogs which
spread the news of the company. Nevertheless, the posts of this blog are not
tagged and are only categorized as allgemein (German for miscellaneous), which
is a common standard configuration for blog systems.
By investigating the top ten ranked blogs for each sub rank, two examples per
sub rank are analyzed, and the evaluation shows that the sub ranks create
plausible results.
7.4 Comparison of BI-Impact and Combined Topic Consistency Rank
The weighted combination of all sub ranks is the combined topic consistency
rank. It identifies the topically consistent blogs in the data set. Thereby, it creates
a ranking of experts depending on the consistency of their writing. In contrast,
the BI-Impact aims to identify the most influential blog authors with the highest
reach and fame.
During the evaluation, both ranks are compared against each other to find
possible correlations.
Blog                     Combined topic consistency rank   BI-Impact
helmschrott.de           1                                 85
gedankendeponie.net      2                                 94
yuccatree.de             3                                 104
upload-magazin.de        4                                 96
nachdenkseiten.de        5                                 117
events.ccc.de            6                                 54
telemedicus.info         7                                 118
bei-abriss-aufstand.de   8                                 90
stereopoly.de            9                                 87
annalist.noblogs.org     10                                88

Table 10: Top ten ranked blogs for the combined topic consistency rank with
their BI-Impact rank.
First, the top ten blogs concerning the combined topic consistency rank are
investigated. As shown in Tab. 10, each top ten blog is listed with its ranking
position regarding both rankings.
The two sample blogs, yuccatree.de and telemedicus.info, have high combined
topic consistency ranks. yuccatree.de has a low inter-post consistency value
caused by the diversity of its discussed topics. However, it has a high combined
consistency score because the remaining three consistency sub ranks are all very
high. In contrast, the telemedicus.info blog focuses only on privacy and patent
right discussions. Thus, it has a very high inter-post consistency, which, in
combination with the proper usage of tags, results in a high combined topic
consistency rank.
In contrast, both have a very low BI-Impact score. Thus, both are not identified
as highly influential blogs, because their position in the blog link graph does not
carry enough influence. This can be seen for all other blogs of the top ten as well.
Blog                       Combined topic consistency rank   BI-Impact
fuenf-filmfreunde.de       54                                1
sistrix.de                 97                                2
elektrischer-reporter.de   142                               3
t3n.de                     49                                4
scienceblogs.de            75                                5
fontblog.de                37                                6
de.engadget.com            52                                7
achgut.com                 34                                8
schockwellenreiter.de      77                                9
saschalobo.com             35                                10

Table 11: Top ten ranked blogs for the BI-Impact rank with their combined topic
consistency rank.
Secondly, the top ten blogs regarding the BI-Impact rank are investigated. As
shown in Tab. 11, the blogs are ordered by the BI-Impact rank and listed with
their combined topic consistency rank.
By investigating three sample blogs, namely t3n.de, de.engadget.com, and
saschalobo.com, it is observed that the most influential blogs deal with a high
number of topics. These blogs summarize current events in technology or give
their opinions on diverse political discussions.
Although these blogs contain high quality content, the number of discussed
[Figure: normalized score (y-axis) over rank position 1-100 (x-axis); series: Consistency, BI-Impact]
Figure 10: BI-Impact and topic consistency rank for top 100 blogs ordered by
topic consistency rank.
topics is very high. Further, the inter-blog consistency decreases through the
number of different viewpoints and the wide range of linking blog authors.
The intra-post consistency also decreases due to the usage of summary posts,
which summarize the news of a day.
The exemplary analysis of the top ten implies an inverse relation between the
topic consistency of a blog and its reach. Thus, the expectation is to find a
correlation between the BI-Impact rank and the topic consistency rank.
To evaluate this, an analysis of the top 100 ranked blogs is performed. The
behavior of both ranks is shown in Fig. 10 and Fig. 11.
In Fig. 10, the blogs are ordered by their ranking position in the topic consistency
ranking. The best blog gets rank position one. The topic consistency rank
is monotonically decreasing with the ranking position. Contrary to the
expectation, no correlation is observable between both ranks.
However, an accumulation of higher BI-Impact scores can be identified in the
area of low consistency ranks. It appears that blogs which handle a higher
diversity of topics gain more influence in the blogosphere. In contrast, the BI-Impact
[Figure: normalized score (y-axis) over rank position 1-100 (x-axis); series: Consistency, BI-Impact]
Figure 11: BI-Impact score and topic consistency rank for top 100 blogs ordered
by BI-Impact rank.
score of the most topically consistent blogs is low. Consequently, these blogs have
low impact and a low reach. The assumption is that they form closed expert
communities which are less integrated into the blogosphere.
The same is observable when looking at the behavior of the topic consistency
rank if the blogs are ordered by their BI-Impact score. There is an accumulation
of high topic consistency ranks at the long tail of the BI-Impact score. In addition,
a small accumulation of medium topic consistency ranks at rank positions 3-16 is
observable. However, a correlation between both scores cannot be observed.
8 Recommendations for Future Research
The focus of this thesis is to motivate and define a topic consistency rank for
blogs. The formal definition and implementation especially focus on a
resource-efficient and fast calculation. Therefore, complex algorithms and
dependencies on external resources are avoided. Nevertheless, these should be
the focus of future research.
8.1 Enhanced Topic Detection
The central part of our topic consistency rank is the topic detection. As already
discussed, k-means clustering detects the topics in the introduced implementation. Nevertheless, the central shortcoming of this approach is that it is highly
dependent on the underlying collection. Thus, the rank depends on the crawl
coverage of BlogIntelligence. There are several approaches that can circumvent
this problem.
Wikipedia.
Although the content creation in the blogosphere is highly interactive, it does
not aim to provide reliable knowledge. In contrast, Wikipedia offers a great
source of reviewed content. Wikipedia is fully available for download. The
whole set of articles is available online and covers every imaginable topic. Thus,
it has to be tested whether a word clustering based on this data can provide
more reliable clusters.
Thesauri.
Another solution is the usage of thesauri. A thesaurus is a dictionary-like
database that additionally contains acronyms, synonyms, and hypernyms.
Currently, the most important words are identified by calculating the tf*idf
score for each word. By using thesauri, the collection of common hypernyms
for the most important words of a post becomes possible. These hypernyms
can serve as new clusters with all their subordinated words.
Thesauri are human-made collections, iterated several times by linguistic
researchers. Thereby, the clustering will have a high quality and an intuitive
grouping. One frequently referenced thesaurus is WordNet [57]. WordNet allows
the complete download of its database. This enables the analysis to load the
complete knowledge in-memory and perform a fast matching of words and
hypernyms. Although this process is expected to be slower than the k-means
clustering, the results can be more promising.
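The hypernym grouping can be sketched with a toy in-memory thesaurus; the mapping below is a made-up example, and in practice the word-to-hypernym relation would be loaded from WordNet's downloadable database:

```python
from collections import defaultdict

# Hypothetical toy thesaurus (word -> hypernym); a real system would
# load this mapping from WordNet.
THESAURUS = {
    "iphone": "phone", "android": "phone",
    "apple": "company", "google": "company",
}

def hypernym_clusters(important_words):
    """Group the most important words of a post under their common
    hypernyms; each hypernym becomes a topic cluster."""
    clusters = defaultdict(list)
    for w in important_words:
        clusters[THESAURUS.get(w, w)].append(w)
    return dict(clusters)

print(hypernym_clusters(["iphone", "android", "apple"]))
# {'phone': ['iphone', 'android'], 'company': ['apple']}
```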
Ontologies.
A promising solution is the usage of ontologies.
"An ontology is an explicit, formal specification of a shared conceptualization. The term is borrowed from philosophy, where an
Ontology is a systematic account of existence. For AI systems, what
"exists" is everything that can be represented." [58]
An ontology holds numerous relations between concepts. Among others, an
ontology defines classes of resources and super classes of classes.
To use ontologies, the post's content has to be assigned to the concepts present
in the ontologies. This is a hard problem and frequently discussed in ongoing
research [59, 60, 61]. Hereby, the probability of a word or word group
representing a specific concept is needed. This probability is influenced by the direct
context of the word and by the overall collection.
Although this results in a hard calculation problem, the data is semantically
enriched. These semantics can be used to easily derive clusters with different
granularities. Further, they enable us to make the results machine-readable and
to offer more semantic filtering to users.
Sentiments.
Beside the quality of blog posts, incorporating the opinion of blog authors into
the ranking is a future challenge. For example, the user may want to identify
a blog author that constantly writes positively or negatively about a topic like
Apple. Thereby, BlogIntelligence should provide special insights to identify fans
and haters of products or persons.
Therefore, sentiment analysis should be applied to the posts' content. Sentiment
analysis determines the attitude of a writer [62]. The attitude is the emotional
state of the author.
Probability distributions.
As discussed in Sec. 5.2, a k-means clustering assigns words to topics. Although
this gives promising results, another approach is to view topics as probability
distributions over words. Thus, each word is assigned to a topic with a specific
probability. This probability distribution creates overlapping topic clusters that
represent reality in more detail than a distinct assignment of words to topics.
Hereby, the word ray (light ray) gets assigned to physics, but also, with a smaller
probability, to fishing (ray bones at the fin of a fish).
Multilingual clustering.
The word clustering in this thesis is limited to a German data set. Thereby, the
problem of multilingual clustering is circumvented. Due to the future extension
of BlogIntelligence to the whole blogosphere, the clustering also has to detect
topics across language boundaries. This problem is discussed by Chen et al. [63],
who propose to first cluster each language and afterwards merge the resulting
topic clusters. Future work has to integrate this or a similar approach into the
topic detection to solve the multilingual clustering problem.
8.2 Visualization
The key component of the BlogIntelligence framework is the visualization. It
enables users to understand and use the results of the BI analyses.
The topic consistency rank presented in this thesis is a complex calculation. It
results in a numerical value for each blog. By displaying only this number, the
user is not able to relate it to other blogs or to interpret its meaning. Therefore,
future work will address the creation of an appropriate visualization.
This visualization helps the user to explore and categorize blogs based on their
visual perception. As discussed in Sec. 2.2.3, the BlogConnect visualization of
BlogIntelligence already shows an exploratory overview of the blogosphere. To
integrate the topic consistency rank into this view, another visual dimension is
introduced. This dimension has to symbolize the consistency of a blog. The user
has to be able to perceive the order of blogs regarding their consistency. Thus,
[Figure mockup: blog bubbles arranged around topics (Society, Politics, Health, Tech, Movies, War); controls: Topic Granularity, Minimum BI-Impact Score, Minimum Topic Consistency, Search]
Figure 12: BlogConnect 2.0 with topic consistency represented as color value.
the color value of blog bubbles serves as the indicator for their topic consistency.
The value is hereby the direct mapping from the normalized rank multiplied
with a constant parameter.
The prototypical BlogConnect 2.0 visualization is shown in Fig. 12. As before, the user controls the set of blogs via a search term in the lower right corner of the visualization; only blogs related to the search term are shown. There are essentially three extensions to the current BlogConnect visualization. First, blog bubbles are now arranged around their assigned topics. The topic names have to be derived by a cluster labeling algorithm, which is also subject to future research. The arrangement around the topics is based on a gravitation simulation, where the force is determined by the distance of a blog to the cluster's centroid.
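One step of such a layout could look like the following sketch. The inverse-distance force law, the step size, and the function names are assumptions for illustration, not the simulation used in BlogConnect 2.0.

```python
import math

def attract(position, centroid, distance_in_topic_space, step=0.1):
    """Move a bubble one step towards its topic's screen centroid.
    A smaller topical distance yields a stronger pull."""
    force = 1.0 / (1.0 + distance_in_topic_space)
    dx = centroid[0] - position[0]
    dy = centroid[1] - position[1]
    norm = math.hypot(dx, dy) or 1.0  # avoid division by zero at the centroid
    return (position[0] + step * force * dx / norm,
            position[1] + step * force * dy / norm)

# A bubble at the origin drifts towards a centroid at (1, 0).
pos = attract((0.0, 0.0), (1.0, 0.0), distance_in_topic_space=0.2)
```

Iterating this step for all bubbles settles topically close blogs near their cluster centroid while loosely related blogs stay at the periphery.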
Second, as mentioned above, the color value of a blog bubble represents its degree of topic consistency. As shown in Fig. 12, blogs with a high consistency shine through the cloud of dark, inconsistent blogs. The small light points also help the user to compare less consistent blogs.
Figure 13: BlogConnect with a high minimal topic consistency threshold.
Third is the introduction of an interactive toolbar with three controls. The first control regulates the topic granularity of the visualization; one can see five topics. By raising the granularity, the BlogIntelligence framework calculates a larger number of clusters, which enables the user to explore the blogs in more detail. In addition, the user is able to configure a minimum BI-Impact score: all blogs with a lower score are excluded from the view, leaving only the most important blogs. Similarly, the minimum topic consistency can be controlled, so the user can exclude inconsistent blogs from the overview. As shown in Fig. 13, the higher the topic consistency threshold, the fewer blogs are shown. Even big blogs disappear because of their versatile interests.
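The two threshold controls amount to a simple conjunctive filter. The field names below are assumptions for illustration:

```python
def filter_blogs(blogs, min_impact=0.0, min_consistency=0.0):
    """Keep only blogs meeting both the minimum BI-Impact score and
    the minimum topic consistency threshold."""
    return [b for b in blogs
            if b["impact"] >= min_impact and b["consistency"] >= min_consistency]

blogs = [
    {"name": "spreeblick.com", "impact": 0.9, "consistency": 0.2},
    {"name": "telemedicus.info", "impact": 0.3, "consistency": 0.9},
]
# Raising the consistency threshold hides even big, versatile blogs.
consistent_only = filter_blogs(blogs, min_consistency=0.5)
```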
8.3 Full integration with SAP HANA
The full integration into SAP HANA is one of the main goals for the future of BlogIntelligence. The focus lies on transferring the text analysis foundations into the core of HANA and on creating an API for future text analysis algorithms.
As discussed in Sec. 5, the tf*idf calculation runs inside SAP HANA. Although the SQL procedures run entirely on the database, they use an externally extracted dictionary table instead of the database's own word index.
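For reference, the tf*idf weighting itself is small enough to sketch in a few lines. This is a generic in-memory illustration of the standard formula, not the HANA SQL procedures or their schema:

```python
import math

def tf_idf(term_counts_per_doc):
    """term_counts_per_doc: list of {term: count} dicts, one per document.
    Returns one {term: weight} vector per document."""
    n_docs = len(term_counts_per_doc)
    # Document frequency: in how many documents each term occurs.
    df = {}
    for counts in term_counts_per_doc:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    vectors = []
    for counts in term_counts_per_doc:
        total = sum(counts.values())
        vectors.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return vectors

# A term occurring in every document ("blog") gets weight 0.
docs = [{"blog": 2, "topic": 1}, {"blog": 1, "hana": 3}]
vecs = tf_idf(docs)
```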
Transportation costs can be decreased by implementing the k-means clustering directly in the database. Although the k-means algorithm already runs in a distributed environment, a full in-memory computation could achieve an additional performance boost. However, this expectation has to be tested by integrating the clustering into SAP HANA.
Furthermore, the consistency ranking calculation can be adapted to incrementally update the rank of each blog on the insertion of new posts. With the text analysis algorithms integrated into SAP HANA, the overall aim of BlogIntelligence, providing real-time analytics of the blogosphere, comes within reach.
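As a sketch of what such an incremental update could look like: if a blog's stored rank were an average over its posts' scores, a new post could be folded in with O(1) work instead of recomputing over all posts. The running-mean form is an assumption for illustration; the actual rank combines four aspects.

```python
def update_rank(current_rank, n_posts, new_post_score):
    """Fold one new post score into a stored average rank in O(1)."""
    return (current_rank * n_posts + new_post_score) / (n_posts + 1)

# A less consistent new post lowers the stored rank slightly.
rank = update_rank(0.8, n_posts=4, new_post_score=0.6)
```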
9 Conclusion
This master’s thesis proposed a metric for the topical consistency of a blog with the goal of identifying domain experts in the blogosphere.
It was discussed that current blog ranking approaches focus on finding the most influential blogs, i.e., those that attract a large audience and thus more visitors, links, and comments. Further, it was argued that niche blogs with a very specific topic can only attract a limited audience and thus have only a small reach. For a blog to develop expert knowledge, it should show recurring interest in its topics and therefore concentrate on a small set of topics. Identifying those expert blogs is particularly important for domain experts who want to find blogs they can observe and interact with.
To ease the retrieval of these blogs, four different aspects of topic consistency
were defined: (1) intra-post, (2) inter-post, (3) intra-blog, and (4) inter-blog consistency. These aspects define the consistency of a blog on different granularities:
from the internal consistency of a post’s paragraphs to the global consistency between a blog and its linking and linked blogs. The four aspects are combined
into a joint rank, called topic consistency rank.
The implementation of the topic consistency rank was introduced, and this thesis showed how the rank is integrated into the blog analytics framework BlogIntelligence. The foundation of the topic consistency rank is the topic detection, which automatically assigns words to groups of highly related words; these groups are defined as topics. Building on this topic detection, the implementation of the four aspects and of the final rank was described, with a focus on the specifics of the persistence layer, SAP HANA.
The plausibility of the topic consistency rank was evaluated on a real-world data set of 12,000 blogs collected by the BlogIntelligence crawler. The top ten results of each aspect were analyzed, and two representatives were discussed in detail.
In addition, the correlation between the topic consistency of a blog and its influence was evaluated. This was done by implementing the BI-Impact score, which is
a measure for the reach and the impact of a blog and incorporates blog-specific
characteristics.
The analysis of the top ten blogs appeared to imply an inverse relation between the topic consistency of a blog and its reach, i.e., the more consistent a blog is, the less influence it can gain in the blogosphere. In contrast, analyzing the distribution of ranks among the top hundred blogs revealed no correlation between the influence and the consistency of blogs. Thus, both metrics are considered to be independent.
As a consequence, the topic consistency rank is established as an additional indicator, besides the influence of a blog, to ease blog retrieval for domain experts.
Future work includes enhancing the topic detection to provide more specific and accurate topics that allow words to be part of multiple topics. The influence of this enhancement on the results of the topic consistency rank should be analyzed. In addition, the proposed visualization, BlogConnect 2.0, should be integrated into the BlogIntelligence web portal to offer the results of the topic consistency rank to the user.
List of Abbreviations

API        Application Programming Interface
ATOM       Atom Syndication Format
BI         BlogIntelligence
BI-Impact  BlogIntelligence-Impact-Score
Blog       Weblog
HDFS       Hadoop Distributed File System
HITS       Hyperlink-Induced Topic Search
HTTP       Hypertext Transfer Protocol
IR         Information Retrieval
RAM        Random-Access Memory
RPC        Remote Procedure Call
RSS        Rich Site Summary
Splog      Spam Blog
SQL        Structured Query Language
tf*idf     Term Frequency-Inverse Document Frequency
URI        Uniform Resource Identifier
WWW        World Wide Web
XML        Extensible Markup Language
List of Figures

1   Overview of blog topics.
2   An example tag cloud.
3   BlogIntelligence architecture overview.
4   BlogConnect.
5   PostConnect.
6   Ranking variables of the BI-Impact.
7   Visualization of post-topic-probabilities.
8   Topic detection flow diagram.
9   An example iteration of k-means.
10  BI-Impact and topic consistency ordered by topic consistency rank.
11  BI-Impact and topic consistency ordered by BI-Impact.
12  BlogConnect 2.0 with topic consistency represented as color value.
13  BlogConnect 2.0 with a minimal topic consistency threshold.
List of Tables

1   Example tf*idf vector table.
2   Sparse word vector representation.
3   Resulting cluster table.
4   Example of the dictionary table.
5   Example of the link table.
6   State of the BlogIntelligence data set.
7   Clustering quality results.
8   Top 10 blogs for intra-post and inter-post consistency.
9   Top 10 blogs for intra-blog and inter-blog consistency.
10  Top 10 blogs for combined topic consistency rank with BI-Impact.
11  Top 10 blogs for BI-Impact with combined topic consistency rank.
References
[1] T. Cook and L. Hopkins: Social media or, “how i learned to stop worrying and
love communication”, September 2007.
http://trevorcook.typepad.com/weblog/files/
CookHopkins-SocialMediaWhitePaper-2007.pdf.
[2] R. Ramakrishnan and A. Tomkins: Toward a peopleweb.
Computer, 40(8):63–72, 2007.
[3] H. Kircher: Web 2.0-plattform für innovation.
IT-Information Technology, 49(1):63–65, 2007.
[4] N.J. Thurman: Forums for citizen journalists? adoption of user generated content
initiatives by online news media.
New Media & Society, 10(1):139–157, 2008.
[5] S.D. Reese, L. Rutigliano, K. Hyun, and J. Jeong: Mapping the blogosphere
professional and citizen-based media in the global news arena.
Journalism, 8(3):235–261, 2007.
[6] J. Schmidt: Weblogs: eine kommunikationssoziologische studie.
2006.
[7] Tom Smith: Power to the People: Social Media Tracker Wave 3. Technical report
2008, 2008.
http://www.slideshare.net/Tomuniversal/
wave-3-social-media-tracker-presentation.
[8] J. Arguello, J. Elsas, J. Callan, and J. Carbonell: Document representation and
query expansion models for blog recommendation.
In Proc. of the 2nd Intl. Conf. on Weblogs and Social Media (ICWSM), 2008.
[9] I.H. Witten, E. Frank, and M.A. Hall: Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann, 2011.
[10] J. Bross: Understanding and Leveraging the Social Physics of the Blogosphere.
PhD thesis, Hasso-Plattner-Institute, 2011.
[11] J. Bross, K. Richly, M. Kohnen, and C. Meinel: Identifying the top-dogs of the
blogosphere.
Social Network Analysis and Mining, pages 1–15, 2011.
[12] D.L. Lee, H. Chuang, and K. Seamons: Document ranking and the vector-space model.
Software, IEEE, 14(2):67–75, 1997.
[13] L. Page, S. Brin, R. Motwani, and T. Winograd: The pagerank citation ranking:
Bringing order to the web.
1999.
[14] M. Clements, A.P. de Vries, and M.J.T. Reinders: Optimizing single term
queries using a personalized markov random walk over the social graph.
In Workshop on Exploiting Semantic Annotations in Information Retrieval
(ESAIR), 2008.
[15] W. Weerkamp and M. De Rijke: Credibility improves topical blog post retrieval.
Association for Computational Linguistics (ACL), 2008.
[16] K. Balog, L. Azzopardi, and M. De Rijke: Formal models for expert finding in
enterprise corpora.
In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 43–50. ACM, 2006.
[17] R. Blood: Weblogs: a history and perspective.
Rebecca’s Pocket, 7(9), 2000.
[18] C. Körner, R. Kern, H.P. Grahsl, and M. Strohmaier: Of categorizers and describers: An evaluation of quantitative measures for tagging motivation.
In Proceedings of the 21st ACM conference on Hypertext and hypermedia, pages
157–166. ACM, 2010.
[19] O. Kaser and D. Lemire: Tag-cloud drawing: Algorithms for cloud visualization.
arXiv preprint cs/0703109, 2007.
[20] C. Marlow: Audience, structure and authority in the weblog community.
In International Communication Association Conference, volume 27, 2004.
[21] M. Gumbrecht: Blogs as “protected space”.
In WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis
and Dynamics, volume 2004, 2004.
[22] S. Thies: Content-Interaktionsbeziehungen im Internet: Ausgestaltung und Erfolg.
Springer DE, 2004.
[23] M. Kobayashi and K. Takeda: Information retrieval on the web.
ACM Computing Surveys (CSUR), 32(2):144–173, 2000.
[24] J. Broß, P. Schilf, M. Jenders, and C. Meinel: Visualizing the blogosphere with blogconnect.
In Privacy, Security, Risk and Trust (PASSAT), 2011 IEEE Third International Conference on and 2011 IEEE Third International Conference on Social Computing (SocialCom), pages 651–656. IEEE, 2011.
[25] J. Bross, P. Schilf, and C. Meinel: Visualizing blog archives to explore contentand context-related interdependencies.
In Conf. Web Intelligence and Intelligent Agent Technology, 2010.
[26] P. Berger, P. Hennig, J. Bross, and C. Meinel: Mapping the blogosphere–towards
a universal and scalable blog-crawler.
In 2011 IEEE Third International Conference on Social Computing (SocialCom),
pages 672–677. IEEE, 2011.
[27] M. Cafarella and D. Cutting: Building nutch: Open source search: A case study
in writing an open source search engine.
ACM Queue, 2(2), 2004.
[28] J. Dean and S. Ghemawat: Mapreduce: Simplified data processing on large clusters.
Communications of the ACM, 51(1):107–113, 2008.
[29] R. Khare, D. Cutting, K. Sitaker, and A. Rifkin: Nutch: A flexible and scalable
open-source web search engine.
Oregon State University, 2004.
[30] M. Michael, J.E. Moreira, D. Shiloach, and R.W. Wisniewski: Scale-up x scaleout: A case study using nutch/lucene.
In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1–8. IEEE, 2007.
[31] D. Borthakur: The hadoop distributed file system: Architecture and design.
Hadoop Project Website, 11:21, 2007.
[32] J. Shafer, S. Rixner, and A.L. Cox: The hadoop distributed filesystem: Balancing
portability and performance.
In Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on, pages 122–133. IEEE, 2010.
[33] H. Plattner and A. Zeier: In-memory data management: an inflection point for
enterprise applications.
Springer, 2011.
[34] A.K. Jain, M.N. Murty, and P.J. Flynn: Data clustering: a review.
ACM computing surveys (CSUR), 31(3):264–323, 1999.
[35] J. Han and M. Kamber: Data mining: concepts and techniques.
Morgan Kaufmann, 2006.
[36] S. Owen, R. Anil, T. Dunning, and E. Friedman: Mahout in action.
Online, pages 1–90, 2011.
[37] A.N. Langville, C.D. Meyer, and P. Fernández: Google’s pagerank and beyond:
The science of search engine rankings.
The Mathematical Intelligencer, 30(1):68–69, 2008.
[38] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen: Combating web spam with
trustrank.
In Proceedings of the Thirtieth international conference on Very large data basesVolume 30, pages 576–587. VLDB Endowment, 2004.
[39] Jon Kleinberg: Bursty and hierarchical structure in streams.
In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’02, page 91, New York, New York,
USA, July 2002. ACM Press, ISBN 158113567X.
[40] K. Fujimura, H. Toda, T. Inoue, N. Hiroshima, R. Kataoka, and M. Sugizaki:
Blogranger - a multi-faceted blog search engine.
In Proceedings of the WWW 2006 3rd annual workshop on the weblogging ecosystem: Aggregation, analysis and dynamics, 2006.
[41] Technorati: What is technorati authority?, September 2012.
http://technorati.com/what-is-technorati-authority.
[42] A. Kritikopoulos, M. Sideri, and I. Varlamis: Blogrank: ranking weblogs based
on connectivity and similarity features.
In Proceedings of the 2nd international workshop on Advanced architectures and
algorithms for internet delivery and applications, page 8. ACM, 2006.
[43] R. Schirru, D. Obradović, S. Baumann, and P. Wortmann: Domain-specific
identification of topics and trends in the blogosphere.
Advances in Data Mining. Applications and Theoretical Aspects, pages
490–504, 2010.
[44] K. Sriphaew, H. Takamura, and M. Okumura: Cool blog identification using
topic-based models.
In Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT’08.
IEEE/WIC/ACM International Conference on, volume 1, pages 402–406. IEEE, 2008.
[45] L. Zhu, A. Sun, and B. Choi: Online spam-blog detection through blog search.
In Proceedings of the 17th ACM conference on Information and knowledge management, pages 1347–1348. ACM, 2008.
[46] T. Katayama, T. Utsuro, Y. Sato, T. Yoshinaka, Y. Kawada, and T. Fukuhara:
An empirical study on selective sampling in active learning for splog detection.
In 5th International Workshop on Adversarial Information Retrieval on the Web,
pages 29–36. ACM, 2009.
[47] P. Kolari, A. Java, and T. Finin: Characterizing the splogosphere.
In Proceedings of the 3rd Annual Workshop on Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 15th World Wide Web Conference. University
of Maryland, Baltimore County, 2006.
[48] W. Liu, S. Tan, H. Xu, and L. Wang: Splog filtering based on writing consistency.
In Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT’08.
IEEE/WIC/ACM International Conference on, volume 1, pages 227–233.
IEEE, 2008.
[49] J. He, W. Weerkamp, M. Larson, and M. de Rijke: An effective coherence measure to determine topical consistency in user-generated content.
International journal on document analysis and recognition, 12(3):185–203,
2009.
[50] M. Chen and T. Ohta: Using blog content depth and breadth to access and classify
blogs.
International Journal of Business and Information, 5(1):26–45, 2010.
[51] K. Eguchi, K. Kuriyama, and N. Kando: Sensitivity of ir systems evaluation to
topic difficulty.
In Proceedings of the 3rd International Conference on Language Resources and
Evaluation (LREC 2002), volume 2, pages 585–589. Citeseer, 2002.
[52] G. Salton and C. Buckley: Term-weighting approaches in automatic text retrieval.
Information processing & management, 24(5):513–523, 1988.
[53] M. Fernández, D. Vallet, and P. Castells: Probabilistic score normalization for
rank aggregation.
Advances in Information Retrieval, pages 553–556, 2006.
[54] R.M. Esteves, R. Pais, and C. Rong: K-means clustering in the cloud – a mahout test.
In Advanced Information Networking and Applications (WAINA), 2011 IEEE Workshops of International Conference on, pages 514–519. IEEE, 2011.
[55] Hidenao Abe and Shusaku Tsumoto: Evaluating a temporal pattern detection
method for finding research keys in bibliographical data.
pages 1–17, January 2011.
[56] J.C. Tressler, M.H. Larock, and C.E. Lewis: Mastering Effective English.
The Copp Clark., 1980.
[57] C. Fellbaum: Wordnet.
Theory and Applications of Ontology: Computer Applications, pages 231–
243, 2010.
[58] T.R. Gruber et al.: A translation approach to portable ontology specifications.
Knowledge acquisition, 5(2):199–220, 1993.
[59] A. Hotho, A. Maedche, and S. Staab: Ontology-based text document clustering.
KI, 16(4):48–54, 2002.
[60] L. Jing, L. Zhou, M.K. Ng, and J.Z. Huang: Ontology-based distance measure
for text clustering.
In Proc. of SIAM SDM workshop on text mining, 2006.
[61] Y. Ding and X. Fu: A text document clustering method based on ontology.
Advances in Neural Networks–ISNN 2011, pages 199–206, 2011.
[62] B. Pang and L. Lee: Opinion mining and sentiment analysis.
Now Pub, 2008.
[63] H.H. Chen and C.J. Lin: A multilingual news summarizer.
In Proceedings of the 18th conference on Computational linguistics-Volume 1,
pages 159–165. Association for Computational Linguistics, 2000.