Lecture Notes in Computer Science 6045

Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
John G. Breslin Thomas N. Burg
Hong-Gee Kim Tom Raftery
Jan-Hinrik Schmidt (Eds.)
Recent Trends
and Developments
in Social Software
International Conferences on Social Software
BlogTalk 2008, Cork, Ireland, March 3-4, 2008,
and BlogTalk 2009, Jeju Island, South Korea,
September 15-16, 2009
Revised Selected Papers
Volume Editors
John G. Breslin
National University of Ireland
Engineering and Informatics
Galway, Ireland
E-mail: [email protected]
Thomas N. Burg
Socialware
Vienna, Austria
Hong-Gee Kim
Seoul National University
Biomedical Knowledge Engineering Laboratory
Seoul, Korea
Tom Raftery
Red Monk
Seattle, WA, USA
Jan-Hinrik Schmidt
Hans Bredow Institut
Hamburg, Germany
Library of Congress Control Number: 2010936768
CR Subject Classification (1998): H.3.5, C.2.4, H.4.3, H.5
LNCS Sublibrary: SL 3 – Information Systems and Application, incl. Internet/Web
and HCI
ISSN 0302-9743
ISBN-10 3-642-16580-X Springer Berlin Heidelberg New York
ISBN-13 978-3-642-16580-1 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
springer.com
© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
06/3180
Preface
1 Overview
From its beginnings, the Internet has fostered communication, collaboration and
networking between users. However, the first boom at the turn of the millennium
was mainly driven by a rather one-sided interaction: e-commerce, portal sites and
the broadcast models of mainstream media were introduced to the Web. Over the
last six or seven years, new tools and practices have emerged which emphasise
the social nature of computer-mediated interaction. Commonly (and broadly)
labeled as social software and social media, they encompass applications such as
blogs and microblogs, wikis, social networking sites, real-time chat systems, and
collaborative classification systems (folksonomies). The growth and diffusion of
services like Facebook and Twitter and of systems like WordPress and Drupal has
in part been enabled by certain innovative principles of software development
(e.g. open APIs, open-source projects, etc.), and in part by empowering the
individual user to participate in networks of peers on different scales.
Every year, the International Conference on Social Software (BlogTalk) brings
together different groups of people using and advancing the Internet and its usage: technical and conceptual developers, researchers with interdisciplinary backgrounds, and practitioners alike. It is designed to initiate a dialog between users,
developers, researchers and others who share, analyse and enjoy the benefits of
social software. The focus is on social software as an expression of a culture
that is based on the exchange of information, ideas and knowledge. Moreover,
we understand social software as a new way of relating people to people and to
machines, and vice versa. In the spirit of the free exchange of opinions, links and
thoughts, a wide range of participants can engage in this discourse.
BlogTalk enables participants to connect and discuss the latest trends and
happenings in the world of social software. It consists of a mix of presentations,
panels, face-to-face meetings, open discussions and other exchanges of research,
with attendees sharing their experiences, opinions, software developments and
tools. Developers are invited to discuss technological developments that have
been designed to improve the utilisation of social software, as well as reporting
about the current state of their software and projects. This includes new blog and
wiki applications, content-creation and sharing environments, advanced groupware and tools, client-server designs, GUIs, APIs, content syndication strategies,
devices, applications for microblogging, and much more. Researchers are asked
to focus on their visions and interdisciplinary concepts explaining social software
including, but not limited to, viewpoints from social sciences, cultural studies,
psychology, education, law and natural sciences. Practitioners can talk about
the practical use of social software in professional and private contexts, around
topics such as communication improvements, easy-to-use knowledge management, social software in politics and journalism, blogging as a lifestyle, etc.
2 BlogTalk 2009
The 2009 conference was held on the picturesque Jeju Island in South Korea, and
was coordinated locally by the prominent Korean blogger and researcher Channy
Yun. This was the first BlogTalk to be held in Asia, and given its success, it will
not be the last. The following presentations from BlogTalk 2009 are available in
this volume.
Philip Boulain and colleagues from the University of Southampton detail
their prototype for an open semantic hyperwiki, taking ideas from the hypertext
domain that were never fully realised in the Web and applying them to the
emerging area of semantic wikis (for first-class links, transclusion, and generic
links). Justus Broß and colleagues from the Hasso-Plattner Institute and SAP
study the adoption of WordPress MU as a corporate blogging system for the
distributed SAP organisation, connecting thought leaders at all levels in the
company.
Michel Chalhoub from the Lebanese American University analyses areas
where the development and use of knowledge exchange systems and social software can be effective in supporting business performance (resulting in a measure
for evaluating the benefit of investment in such technologies). Kanghak Kim and
colleagues from KAIST and Daum Communications discuss their study on users’
voting tendencies in social news services, in particular, examining users who are
motivated to vote for news articles based on their journalistic value.
Sang-Kyun Kim and colleagues from the Korea Institute of Oriental Medicine
describe research that connects researchers through an ontology-based system
that represents information on not just people and groups but projects, papers,
interests and other activities. Yon-Soo Lim, Yeungnam University, describes the
use of semantic network analysis to derive structure and classify both style and
content types in media law journalistic texts from both blogs and news sources.
Makoto Okazaki and Yutaka Matsuo from the University of Tokyo perform
an analysis of microblog posts for real-time event notification, focussing on the
construction of an earthquake prediction system that targets Japanese tweets.
Yuki Sato et al. from the University of Tsukuba, NTT and the University of
Tokyo describe a framework for the complementary navigation of news articles
and blog posts, where Wikipedia entries are utilised as a fundamental knowledge
source for linking news and blogs together.
Takayuki Yoshinaka et al. from the Tokyo Denki University and the University of Tokyo describe a method for filtering spam blogs (splogs) based on
a machine-learning technique, along with its evaluation results. Hanmin Jung
and colleagues from KISTI detail a Semantic Web-based method that resolves
author co-references, finds experts on topics, and generates researcher networks,
using a data set of over 450,000 Elsevier journal articles from the information
technology and biomedical domains.
Finally, Jean-Henry Morin from the University of Geneva looks at the privacy
issues regarding the sharing and retention of personal information in social networking interactions, and examines the need to augment this information with
an additional DRM-type set of metadata about its usage and management.
There were three further peer-reviewed talks that are not published here.
Daniele Nascimento and Venkatesh Raghavan from Osaka City University described various trends in the area of social geospatial technologies, in particular,
how free and open-source development is shaping the future of geographic information systems. Myungdae Cho from Sung Kyun Kwan University described
various library applications of social networking and other paradigm shifts regarding information organisation in the library field. David Lee, Zenitum, presented on how governments around the world are muzzling the Social Web.
BlogTalk has attracted prominent keynote speakers in the past, and 2009
was no exception: Yeonho Oh, founder of Ohmynews, spoke about the future
of citizen journalism; and Isaac Mao, Berkman Center for Internet and Society
at Harvard, presented on cloud intelligence. The conference also featured a special Korean Web Track: Jongwook Kim from Daum BloggerNews spoke about
social ranking of articles; Namu Lee from NHN Corporation talked about the
Textyle blogging tool; and Changwon Kim from Google Korea described the
Textcube.com social blogging service.
3 BlogTalk 2008
In 2008, BlogTalk was held in Cork City, Ireland, and was sponsored by BT,
DERI at NUI Galway, eircom and Microsoft. In these proceedings, we also gather
selected papers from the BlogTalk 2008 conference.
Uldis Bojars and colleagues from DERI, NUI Galway describe how the SIOC
semantic framework can be used for the portability of social media contributions.
David Cushman, FasterFuture Consulting, discusses the positives he believes are
associated with the multiple complex identities we are now adopting in various
online communities. Jon Hoem from Bergen University College describes the
Memoz system for spatial web publishing. Hugo Pardo Kuklinski from the University of Vic and Joel Brandt from Stanford University describe the proposed
Campus Móvil project for Education 2.0-type services through mobile and desktop environments.
José Manuel Noguera and Beatriz Correyero from the Catholic University of
Murcia discuss the impact of Politics 2.0 in Spanish social media, by tracking
conversations through the Spanish blogosphere. Antonio Tapiador and colleagues
from Universidad Politecnica de Madrid detail an extended identity architecture
for social networks, attaching profile information to the notion of distributed
user-centric identity. Finally, Mark Bernstein from Eastgate Systems Inc. writes
about the parallels between Victorian and Edwardian sensibilities and modern
blogging behaviours.
Also, but not published here, there were some further interesting presentations at BlogTalk 2008. Joe Lamantia from Keane gave some practical suggestions for handling ethical dilemmas encountered when designing social media.
Anna Rogozinska from Warsaw University spoke about the construction of self
in weblogs about dieting. Paul Miller from Talis described how existing networks
of relationships could be leveraged using semantics to enhance the flow of ideas
and discourse. Jeremy Ruston from Osmosoft at BT presented the latest developments regarding the TiddlyWiki system. Jan Blanchard from Tourist Republic
and colleagues described plans for a trip planning recommender network.
Andera Gadeib from Dialego spoke about the MindVoyager approach to qualitative online research, where consumers and clients come together in an online
co-creation process. Martha Rotter from Microsoft demonstrated how to build
and mashup blogs using Windows Live Services and Popfly. Robert Mao, also
from Microsoft, described how a blog can be turned into a decentralised social
network. Brian O’Donovan and colleagues from IBM and the University of Limerick analysed the emerging role of social software in the IBM company intranet.
Hak-Lae Kim and John Breslin from DERI, NUI Galway presented the int.ere.st
tag-sharing service.
The 2008 conference featured notable keynote speakers from both Silicon
Valley and Europe talking about their Web 2.0 experiences and future plans for
the emerging Web 3.0: Nova Spivack, CEO, Radar Networks, described semantic
social software designed for consumers; Salim Ismail, formerly of Yahoo! Brickhouse, spoke about entrepreneurship and social media; Matt Colebourne, CEO
of coComment, presented on conversation tracking technologies; and Michael
Breidenbrücker, co-founder of Last.fm, talked about the link between advertising and Web 2.0. There were also two discussion panels: the first, on mashups,
microformats and the Mobile Web, featured Sean McGrath, Bill de hÓra, Conor
O’Neill and Ben Ward; the second panel, describing the move from blog-style
commentary to conversational social media, included Stephanie Booth, Bernard
Goldbach, Donncha O Caoimh and Jan Schmidt.
4 Conclusion
We hope that you find the papers presented in this volume to be both stimulating
and useful. One of the main motivations for running BlogTalk every year is for
attendees to be able to connect with a diverse set of people that are fascinated by
and work in the online digital world of social software. Therefore, we encourage
you to attend and participate during future events in this conference series. The
next BlogTalk conference is being organised for Galway, Ireland, and will be held
in autumn 2010.
February 2010
John Breslin
Thomas Burg
Hong-Gee Kim
Tom Raftery
Jan Schmidt
Organization
BlogTalk 2009 was organised by the Biomedical Knowledge Engineering Lab,
Seoul National University. BlogTalk 2008 was organised by the Digital Enterprise
Research Institute, National University of Ireland, Galway.
2009 Executive Committee
Conference Chair
John Breslin (NUI Galway)
Thomas Burg (Socialware)
Hong-Gee Kim (Seoul National University)
Organising Chair
Channy Yun (Seoul National University)
Event Coordinator
Hyun Namgung (Seoul National University)
2009 Programme Committee
Gabriela Avram (University of Limerick)
Anne Bartlett-Bragg (Headshift)
Mark Bernstein (Eastgate Systems Inc.)
Stephanie Booth (Climb to the Stars)
Rob Cawte (eSynapse)
Josephine Griffith (NUI Galway)
Steve Han (KAIST)
Conor Hayes (DERI, NUI Galway)
Jin-Ho Hur (NeoWiz)
Ajit Jaokar (FutureText Publishing)
Alexandre Passant (DERI, NUI Galway)
Robert Sanzalone (pacificIT)
Jan Schmidt (Hans Bredow Institute)
Hideaki Takeda (National Institute of Informatics)
2008 Executive Committee
Conference Chair
John Breslin, NUI Galway
Thomas Burg, Socialware
Tom Raftery, Tom Raftery IT
Jan Schmidt, Hans Bredow Institute
2008 Programme Committee
Gabriela Avram (University of Limerick)
Stowe Boyd (/Message)
Dan Brickley (Friend-of-a-Friend Project)
David Burden (Daden Ltd.)
Jyri Engeström (Jaiku, Google)
Jennifer Golbeck (University of Maryland)
Conor Hayes (DERI, NUI Galway)
Ajit Jaokar (FutureText Publishing)
Eugene Eric Kim (Blue Oxen Associates)
Kevin Marks (Google)
Sean McGrath (Propylon)
Peter Mika (Yahoo! Research)
José Luis Orihuela (Universidad de Navarra)
Martha Rotter (Microsoft)
Jeremy Ruston (Osmosoft, BT)
Rashmi Sinha (SlideShare, Uzanto)
Paolo Valdemarin (evectors, Broadband Mechanics)
David Weinberger (Harvard Berkman Institute)
Sponsoring Institutions
BT
DERI, NUI Galway
eircom
Microsoft
Table of Contents
A Model for Open Semantic Hyperwikis
    Philip Boulain, Nigel Shadbolt, and Nicholas Gibbins
Implementing a Corporate Weblog for SAP
    Justus Broß, Matthias Quasthoff, Sean MacNiven, Jürgen Zimmermann, and Christoph Meinel
Effect of Knowledge Management on Organizational Performance: Enabling Thought Leadership and Social Capital through Technology Management
    Michel S. Chalhoub
Finding Elite Voters in Daum View: Using Media Credibility Measures
    Kanghak Kim, Hyunwoo Park, Joonseong Ko, Young-rin Kim, and Sangki Steve Han
A Social Network System Based on an Ontology in the Korea Institute of Oriental Medicine
    Sang-Kyun Kim, Jeong-Min Han, and Mi-Young Song
Semantic Web and Contextual Information: Semantic Network Analysis of Online Journalistic Texts
    Yon Soo Lim
Semantic Twitter: Analyzing Tweets for Real-Time Event Notification
    Makoto Okazaki and Yutaka Matsuo
Linking Topics of News and Blogs with Wikipedia for Complementary Navigation
    Yuki Sato, Daisuke Yokomoto, Hiroyuki Nakasaki, Mariko Kawaba, Takehito Utsuro, and Tomohiro Fukuhara
A User-Oriented Splog Filtering Based on a Machine Learning
    Takayuki Yoshinaka, Soichi Ishii, Tomohiro Fukuhara, Hidetaka Masuda, and Hiroshi Nakagawa
Generating Researcher Networks with Identified Persons on a Semantic Service Platform
    Hanmin Jung, Mikyoung Lee, Pyung Kim, and Seungwoo Lee
Towards Socially-Responsible Management of Personal Information in Social Networks
    Jean-Henry Morin
Porting Social Media Contributions with SIOC
    Uldis Bojars, John G. Breslin, and Stefan Decker
Reed’s Law and How Multiple Identities Make the Long Tail Just That Little Bit Longer
    David Cushman
Memoz – Spatial Weblogging
    Jon Hoem
Campus Móvil: Designing a Mobile Web 2.0 Startup for Higher Education Uses
    Hugo Pardo Kuklinski and Joel Brandt
The Impact of Politics 2.0 in the Spanish Social Media: Tracking the Conversations around the Audiovisual Political Wars
    José M. Noguera and Beatriz Correyero
Extended Identity for Social Networks
    Antonio Tapiador, Antonio Fumero, and Joaquín Salvachúa
NeoVictorian, Nobitic, and Narrative: Ancient Anticipations and the Meaning of Weblogs
    Mark Bernstein
Author Index
A Model for Open Semantic Hyperwikis
Philip Boulain, Nigel Shadbolt, and Nicholas Gibbins
IAM Group, School of Electronics and Computer Science,
University of Southampton, University Road,
Southampton SO17 1BJ, United Kingdom
{prb,nrs,nmg}@ecs.soton.ac.uk
http://users.ecs.soton.ac.uk/
Abstract. Wiki systems have developed over the past years as lightweight, community-editable, web-based hypertext systems. With the
emergence of semantic wikis such as Semantic MediaWiki [6], these collections of interlinked documents have also gained a dual role as ad-hoc
RDF [7] graphs. However, their roots lie in the limited hypertext capabilities of the World Wide Web [1]: embedded links, without support
for features like composite objects or transclusion. Collaborative editing
on wikis has been hampered by redundancy; much of the effort spent
on Wikipedia goes into keeping content synchronised and organised [3]. We
have developed a model for a system, which we have prototyped and are
evaluating, that reintroduces ideas from the field of hypertext to help
alleviate this burden.
In this paper, we present a model for what we term an ‘open semantic
hyperwiki’ system, drawing from both past hypermedia models, and the
informal model of modern semantic wiki systems. An ‘open semantic
hyperwiki’ is a reformulation of the popular semantic wiki technology
in terms of the long-standing field of hypermedia, which then highlights
and resolves the omissions of hypermedia technology made by the World
Wide Web and the applications built around its ideas. In particular, our
model supports first-class linking, where links are managed separately
from nodes. This is then enhanced by the system’s ability to embed links
into other nodes and separate them out again, allowing for a user editing
experience similar to HTML-style embedded links, while still gaining the
advantages of separate links. We add to this transclusion, which allows
for content sharing by including the content of one node into another,
and edit-time transclusion, which allows users to edit pages containing
shared content without the need to follow a sequence of indirections
to find the actual text they wish to modify. Our model supports more
advanced linking mechanisms, such as generic links, which allow words
in the wiki to be used as link endpoints.
The development of this model has been driven by our prior experimental work on the limitations of existing wikis and user interaction. We
have produced a prototype implementation which provides first-class
links, transclusion, and generic links.
Keywords: Open Hypermedia, Semantic Web, Wiki.
J.G. Breslin et al. (Eds.): BlogTalk 2008/2009, LNCS 6045, pp. 1–14, 2010.
© Springer-Verlag Berlin Heidelberg 2010
1 Introduction
Hypermedia is a long-standing field of research into the ways in which documents
can expand beyond the limitations of paper, generally in terms of greater cross-referencing and composition (reuse) capability. Bush’s As We May Think [4]
introduces a hypothetical early hypertext machine, the ‘memex’, and defines
the “essential feature” of it as “the process of tying two items together”. This
linking between documents is the common feature of hypertext systems, upon
which other improvements are built.
As well as simple binary (two endpoint) links, hypertext systems have been
developed with features including n-ary links (multiple documents linked to multiple others), typed links (links which indicate something about why or how
documents are related), generic links (links whose endpoints are determined by
matching criteria of the document content, such as particular words), and composite documents (formed by combining a set of other documents).
Open Hypermedia extends this with first-class links (and anchors) which
are held external to the documents they connect. These allow links to be
made to immutable documents, and to be added and removed in sets, often
termed ‘linkbases’. One of the earliest projects attempting to implement globally-distributed hypertext was Xanadu [8], whose design had the distinctive feature
of transclusion: including (sections of) a document into another by reference.
The World Wide Web, while undeniably successful, only implements a very
small subset of these features—binary, embedded links—with more complicated
standards such as XLink failing to gain mainstream traction. Since then, applications have been built using the web as an interface, and following its same,
limited capabilities. One of these classes of applications is the semantic wiki,
an extension of the community-edited-website concept to cover typed nodes and
links, such that the wiki graph structure maps to a meaningful RDF graph.
In this paper, we present a model which extends such systems to cover a
greater breadth of hypermedia functionality, while maintaining this basic principle of useful graph mapping. We introduce the Open Weerkat system and
describe the details of its implementation which relate to the model.
2 Open Weerkat
Open Weerkat is a model and system to provide a richer hypertext wiki. This
implementation is built upon our previous Weerkat extensible wiki system [2].
From our experimental work [3], we have identified the need for transclusion
for content-sharing, better support for instance property editing, and generic
linking, which requires first-class links.
At the same time, we must not prohibit the ‘non-strict’ nature of wikis, as
dangling links are used as part of the authoring process. We also wish to preserve
the defining feature of semantic wikis: that there is a simple mapping between
nodes and links in the wiki, and RDF resources and statements.
In the rest of this section, we look at the core components of the model, and
some of the implementation considerations.
2.1 Atomic Nodes
The core type of the model is that class which is fundamental to all wiki designs:
the individual document, page, article, or component. We use here the general
and non-domain-specific term node.
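[Figure: a node box labelled with the node title; a content area containing a DOM tree text node, a transclude → native transclusion element, a DOM element with its element contents, and a link; and attribute "value" pairs at the bottom.]
Fig. 1. Model diagram legend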
Components. As we have said, our model draws from both hypermedia and
semantic wiki. In particular, we preserve the notion that wiki nodes are parallel
to semantic web resources. Because these resources are atomic (RDF cannot
perform within-component addressing on them, as that is only meaningful for
a representation of a resource), we have carefully designed our wiki model to
not rely on link endpoint specifications which go beyond what can be reasonably
expressed in application-generic RDF. Anything about which one wishes to make
statements, or to which one wishes to link, must have a unique identity in the
form of an URI, rather than some form of (URI, within-specifier) pairing.
Figure 1 shows the components of a node in the model. (We use this diagram
format, which draws inspiration from UML class diagrams, to depict example
hypertexts throughout this paper.)
Every node has a title which serves as an identifier. Titles are namespaced with
full stops (‘.’), which is useful for creating identities for content which nominally
belongs within another node.
Node content is either a DOM tree of wiki markup, or an atomic object (e.g.
an image binary). A notable element in the DOM tree is the ‘native transclusion’,
which indicates that another node’s content should be inserted into the tree at
that point. This is necessary to support the linking behaviour described below,
and is distinct from user-level transclusion using normal links.
The bottom of the format shows the attribute-value pairs for the node. The
domain of attributes is other nodes, and the domain of values is literals and
other nodes. These are effectively very primitive embedded, typed links, and are
used to provide a base representation from which to describe first-class links.
Identity, Meta-nodes, and RDF. It is a design goal of the model that the
hyperstructure of the wiki is isomorphic to a useful RDF graph. That is, typed
links between pages are expressible as an RDF statement relating the pages,
and attributes on a page are statements relating that page to the associated
value. The link in figure 5 should be presented (via RDF export, a SPARQL
endpoint, or such) as the triple (Phil, Likes, Perl), with appropriate URIs. (Note
that the anchor is not the subject—we follow the native transclusion back into
the owning Phil node.) The attribute of the Perl node in the same figure should
be presented as the triple (Perl, syntax, elegant). (For ‘fat’ links with more than
two endpoints, there is a triple for each pairing of sources and targets.)
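The mapping from links and attributes to triples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, plain strings stand in for URIs, and link sources are assumed to have already been resolved from anchors back to their owning nodes. The names Nick and Python in the usage lines are invented for the example.

```python
from itertools import product


def link_to_triples(sources, link_type, targets):
    """Return one (subject, predicate, object) triple per source/target pairing."""
    return [(s, link_type, t) for s, t in product(sources, targets)]


def attributes_to_triples(node_title, attributes):
    """Return one triple per attribute value held on a node."""
    return [
        (node_title, attribute, value)
        for attribute, values in attributes.items()
        for value in values
    ]


# A binary link gives a single triple; a 'fat' link gives one per pairing.
binary = link_to_triples(["Phil"], "Likes", ["Perl"])
fat = link_to_triples(["Phil", "Nick"], "Likes", ["Perl", "Python"])
perl_attributes = attributes_to_triples("Perl", {"syntax": ["elegant"]})
```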
For this, we grant each node a URI, namespaced within the wiki. However,
‘the node Perl’ and ‘Perl the programming language’ are separate resources. For
example, the node Perl may have a URI of http://wiki.example.org/node/Perl.
Yet a typed link from Perl is clearly a statement about Perl itself, not
a node about Perl. The statements are about the associated resource
http://wiki.example.org/resource/Perl. (This may be owl:sameAs some external
URI representing Perl.)
In order to describe the node itself (e.g. to express that it is in need of
copy-editing), a Perl.meta node represents http://wiki.example.org/node/Perl.
This meta node itself has a URI, http://wiki.example.org/node/Perl.meta,
and could theoretically have a ‘meta meta’ node. Effectively, there
is an ‘offset’ of naming, where the wiki identifier Perl is referring to Perl itself semantically, and the Perl node navigationally; the identifier Perl.meta is referring
to the Perl node semantically, and the Perl meta-node navigationally.
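The naming ‘offset’ can be made concrete with a small Python sketch. It assumes a single `node/` path prefix for nodes and a `resource/` prefix for the things nodes are about, following the example URIs; the function names are hypothetical and version numbers are ignored here.

```python
BASE = "http://wiki.example.org"


def navigational_uri(identifier: str) -> str:
    """URI of the wiki node that the identifier navigates to."""
    return f"{BASE}/node/{identifier}"


def semantic_uri(identifier: str) -> str:
    """URI of the resource that statements using the identifier are about."""
    suffix = ".meta"
    if identifier.endswith(suffix):
        # A meta-node is about the node one level below it.
        return navigational_uri(identifier[: -len(suffix)])
    # A plain node is about the thing itself, not about the node.
    return f"{BASE}/resource/{identifier}"
```

For instance, `semantic_uri("Perl.meta")` and `navigational_uri("Perl")` give the same URI, which is the offset described above.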
Versions. We must give consideration to the identity of versions, or ‘revisions’
of a node. We wish to support navigational linking (including transclusion) to old
versions of a node. However, we must also consider our semantic/navigational
offset: we may wish to write version 3 of the Perl node, but we do not mean
to assert things about a third revision of Perl itself. Likewise, a typed link to
version 3 of the Perl node is not a statement about version 3 of Perl: it is a
statement about Perl which happens to be directed to a third revision of some
content about it.
We desire three properties from an identifier scheme for old versions:
Semantic consistency. Considers the version of the content about a resource
irrelevant to its semantic identity. All revisions of the Perl node are still
about the same Perl.
Navigational identity. Each revision of a node (including meta-nodes) should
have distinct identity within the wiki, so that it may be linked to. Intuitively,
despite the above, version 3 of the Perl node is Perl₃, not Perl.meta₃.
Semantic identity. Each revision of a node (including meta-nodes) should
have a distinct URI, such that people may make statements about them.
(Perl₃.meta, writtenBy, Phil) should express that Phil wrote version 3 of the
content for the Perl node.
We can achieve this by allowing version number specification both on the node
and any meta-levels, and dropping the version specification of the last component
to generate the RDF URI. Should somebody wish to make statements about
version 4 of the text about version 3 of the text about Perl, they could use
the URI Perl;3/meta;4/meta. This is consistent with the ‘node is resource
itself; meta-node is node’ approach to converting typed links into statements.
Additionally, we have no need to attempt to express that Perl and Perl₃ are the
same semantic resource, as this mechanism allocates them the same URI.
It should be stressed that namespacing components of the node identifier
cannot have versions attached, as any versioned namespace content does not
affect the node content. For example, Languages₂.Perl₃ is not a valid identifier,
and would be isomorphic to Languages.Perl₃ if it were.
Representations. Each node URI should have a representation predicate which
specifies a retrievable URL for a representation. (We do not claim to have any authority to provide a representation of Perl itself, merely our node about Perl.) For
example, (wiki:node/Perl, representation, http://wiki.example.org/content/Perl.html).
There may be more than one, in which case each should have a
different MIME type. Multiple representations are derived from the content: for
example, rendering a DOM tree of markup to an XHTML fragment. Hence, the
range of MIME types is a feature of the rendering components available in the
wiki software to convert from the node’s content.
Should an HTTP client request the wiki:node/Perl resource itself, HTTP
content negotiation should be used to redirect to the best-matching representation. In the spirit of the ‘303 convention’ [9], if the HTTP client requests RDF,
they should be redirected to data about the requested URI: i.e. one meta-level
higher. This inconsistency is unfortunately a result of the way the convention
assumes that all use of RDF must necessarily be ‘meta’ in nature, but we have
considered it preferable to be consistent with convention than to unexpectedly
return RDF data, rather than RDF metadata, in what is now an ambiguous case. Clients
which wish to actually request the Perl node’s content itself in an RDF format,
should such exist, must find the correct URI for it (e.g. wiki:content/Perl.ttl) via
the representation statements.
Requests to resource URIs (e.g. wiki:resource/Perl) are only meaningful in terms of the 303 convention, redirecting RDF requests to data about
wiki:node/Perl. There are no representations available in the wiki for these
base resources, only for nodes about them, so a request for any non-RDF type must be answered with ‘Not Found’.
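The request handling described above can be summarised as a small decision function. The following Python sketch is illustrative only: it treats the Accept header as a single media type (no q-value parsing), and the set of RDF media types, the wiki:data/... redirect target, the use of 303 for ordinary representations and the 406 fallback are all assumptions made for the example rather than details given in the text.

```python
# Media types treated as "RDF" here; the exact set is an assumption.
RDF_TYPES = {"application/rdf+xml", "text/turtle"}


def respond(uri, accept, representations):
    """Return (status, location) for a request, following the rules above.

    `representations` maps a node URI to {MIME type: URL}, mirroring the
    (node, representation, URL) statements. The 'wiki:data/...' location and
    the 406 fallback are hypothetical choices for this sketch.
    """
    prefix, _, name = uri.partition("/")
    if prefix not in ("wiki:node", "wiki:resource"):
        return 404, None
    if accept in RDF_TYPES:
        # 303 convention: RDF requests are redirected to data about the node.
        return 303, f"wiki:data/node/{name}"
    if prefix == "wiki:node":
        # Non-RDF request for a node: pick the matching representation.
        url = representations.get(uri, {}).get(accept)
        return (303, url) if url else (406, None)
    # Base resources have no representations of their own.
    return 404, None
```

A client that wants the node's own content in an RDF format would, as stated above, have to look up the representation URL directly instead of relying on this negotiation.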
2.2 Complex Nodes
We can build upon this base to model parametric nodes, whose content may be
affected by some input state.
Node identity. MediaWiki’s form of transclusion, ‘templates’, also provides for
arguments to be passed to the template, which can then be substituted in. This
is in keeping with the general MediaWiki paradigm that templates are solely for
macro processing of pages.
We propose a generalisation, whereby pages may be instantiated with arbitrary key/value pairs. The range of our links is the set of node identifiers, so we consider
these parameters as part of the identity of an instantiation in a (likely infinite)
multi-dimensional space of instances. Figure 3 shows a subset of the instance
space for the node in figure 2, which has parameters topic and virtue. There is assumed to be an instance at any value of these parameters, although evidently
all such instances are ‘virtual’, with their content generated from evaluating the
parametric Template.GoodNode node.
[Figure: the node Template.GoodNode, with content “This node is featured in topic ⟨param topic⟩ in particular because of its ⟨param virtue⟩”]
Fig. 2. Exemplary parametric node
[Figure: grid of instances over the axes topic and virtue: Template.GoodNode {topic→science, virtue→citations}; Template.GoodNode {topic→science, virtue→grammar}; Template.GoodNode {topic→art, virtue→citations}; Template.GoodNode {topic→art, virtue→grammar}]
Fig. 3. Instance-space of a parametric node
We do not use an (identifier, parameters) pair, as this does not fit the
Semantic Web model that any resource worth making statements about should
have identity. Granting instances in-system identity is useful, as it encapsulates
all necessary context into one handle.
To guarantee that all isomorphic instantiations of a page use the same identifier, parameters must be sorted by key in the identifier. Note that this is orthogonal to user interface concerns—the restriction is upon the identity used by
links to refer to ‘this instance of this node with these parameters’, not upon the
display of these parameters when editing such a link. As with revision specifiers,
parameters upon namespace components of the identifier are meaningless and
forbidden.
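As a minimal sketch of the sorting rule, the following Python function builds an instance identifier in the notation used in this paper (for example Foo {bar→baz}). The function name instance_id is hypothetical; only the requirement that parameters are ordered by key comes from the text.

```python
def instance_id(node: str, params: dict[str, str]) -> str:
    """Build a canonical identifier for a parametric node instance."""
    if not params:
        return node
    # Sorting by key makes the identifier independent of the order in which
    # the parameters were written.
    pairs = ", ".join(f"{key}→{value}" for key, value in sorted(params.items()))
    return f"{node} {{{pairs}}}"
```

Two links that supply the same parameters in a different order therefore resolve to the same identifier, which is what allows the instance to be treated as a single link target.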
Within the node’s content, parameters may be used to fill in placeholders in
the DOM tree. These placeholders may have a default value, used should the parameter not be provided; where no default is given, the fallback behaviour is to flag an error. For
example, a parameter may be used to fill in a word or two of text, or as the
target of a link. User interface operations upon Foo {bar→baz}’s content, such
as viewing the history, and editing, should map through to Foo, as the instance
has no content of its own to operate upon.
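Placeholder filling with defaults might look as follows. This is a sketch under simplifying assumptions: a flat Python list stands in for the DOM tree, the names Param and instantiate are hypothetical, and the default value "quality" is invented for the example.

```python
from dataclasses import dataclass

_NO_DEFAULT = object()  # sentinel: the placeholder declares no default


@dataclass
class Param:
    """A placeholder in the content; `default` is optional."""
    name: str
    default: object = _NO_DEFAULT


def instantiate(content, params):
    """Fill each Param in `content` (a flat list standing in for a DOM tree)."""
    out = []
    for part in content:
        if not isinstance(part, Param):
            out.append(part)
        elif part.name in params:
            out.append(params[part.name])
        elif part.default is not _NO_DEFAULT:
            out.append(part.default)
        else:
            # No value and no default: flag an error.
            raise KeyError(f"missing parameter: {part.name}")
    return "".join(out)


GOOD_NODE = [
    "This node is featured in topic ", Param("topic"),
    ", in particular because of its ", Param("virtue", default="quality"), ".",
]
```

Calling instantiate(GOOD_NODE, {"topic": "art"}) falls back to the default for virtue, while omitting topic raises an error because that placeholder declares no default.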
Because we model parameterised nodes as a set of static objects with first-class identity which are simply instantiations of a general node, identifiers which
do not map to a valid instantiation of a node could be considered non-existent
targets: for example, an identifier which specifies a non-existent parameter.
Resource identity. We must consider whether such instances are separate
Semantic Web resources to each other, and to the parametric node from which
their content is derived. As with version specifiers, parameters affect the content
of a node, not the resource which it describes. Because the Perl node represents
Perl itself, it follows that Perl {bar→baz} still represents Perl. However, as with
version specifiers, these node instances still have distinct identity as nodes. As
Perl.meta represents the Perl node, so does Perl {bar→baz}.meta represent the
Perl {bar→baz} node. Therefore, we can form a URI for a parametric node
instance in exactly the same way we form URIs for specific revisions, defined in
section 2.1. In brief, the final set of parameters is dropped.
RDF expressions of the hyperstructure should specify that parametric node
instances, where used, are derivations of other nodes. For example, remembering
that we are making a statement about Perl nodes, not about Perl itself, (Perl
{bar→baz}.meta, templatedFrom, Perl.meta).
Eager vs. lazy evaluation. The infinite space of non-link parametric node
instances can be considered to not exist until they are specified as a link target, as
their existence or non-existence in the absence of explicit reference is irrelevant.
However, if we also allow parameter values to substitute into the attributes
of a node, we can create parametric links. Parametric node instances which
are links have the ability to affect parts of the hyperdocument outside of their
own content and relations: this is the nature of first-class links. Hence we must
consider whether parametric node instantiation, at least for link nodes, is eager
(all possible instances are considered to always exist) or lazy (instances only
exist if they are explicitly referred to).
[Figure: the node Template.FancyLink, with attributes type: Link; source: param(from); target: param(to); decoration: fancy]
Fig. 4. Free-variable parametric link
Figure 4 highlights a case where this distinction is particularly significant.
With lazy evaluation, this template could be used as a macro, in a ‘classical’
wiki style, to create links. One would have to create links to instances of this
link, which would then cause that particular instance to exist and take effect,
linking its from and to parameters.
An eager approach to evaluation would, however, treat parametric links as
free-variable rules to satisfy. All possible values of from and to would be matched,
and linked between. In this case, every node in the hyperdocument would be
linked to every other node.
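The difference between the two strategies can be made concrete with a small Python sketch. The function names and the three-node hyperdocument are hypothetical; links are reduced to (from, to) pairs for the purpose of the illustration.

```python
from itertools import permutations


def lazy_links(references):
    """Lazy: one link per explicitly referenced (from, to) instance."""
    return {(src, dst) for src, dst in references}


def eager_links(nodes):
    """Eager: from/to are free variables, matched against every node pair."""
    return set(permutations(nodes, 2))


nodes = ["Phil", "Perl", "Scheme"]
```

With one explicit reference, lazy_links yields a single link, whereas eager_links(nodes) yields a link for every ordered pair of distinct nodes, so the number of links grows quadratically with the size of the hyperdocument.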
Logically, eager evaluation is more consistent, and potentially more useful:
free-variable links are of little utility if one has to explicitly provide them with
possible values. It would be better to manually link the nodes, with a type of
FancyLink which is then defined to be fancy. If there were some content provided
by the Template.FancyLink template, it could still be used, but would simply
display this content rather than actually functioning as a link.
This is contrary to common practice on Semantic MediaWiki, which has
evolved from practice on Wikipedia, where the templating system works via
macro evaluation. We argue that this leads to bad ontology modelling, as class
definitions end up embedded within display-oriented templates, such as ‘infoboxes’. For example, the common Semantic MediaWiki practice to provide
the node about Brazil with a relational link to its capital Brasília would be to
include a template in the Brazil node with the parameter capital→Brasília. The
template would then contain markup to display a panel of information containing
an embedded link of type has capital to the value of the capital parameter.1
The problem is that stating that templates have capitals is clearly not correct, and only results in correct information when they are macro-expanded into
place. Statements about the template itself must be ignored as they are likely intended to be about whichever nodes use that template. In addition, what could
be a statement about the class of countries—that they are the domain of a has
capital property—is entangled with the display of this information.
A better approach would be to simply assert the capital on the Brazil page,
and then transclude a template whose only role is to typeset the information
panel, using only the name of the transcluding page as an implicit parameter.
This approach emphasises the use of correct semantics, and using these to inform
useful display, rather than ‘hijacking’ useful display to try to add semantics.
Templating. Templating can be achieved through the use of parametric nodes
and transclusion. Simple macroing functionality, as in contemporary wiki systems, is possible by transcluding a particular instance of a parametric node which
specifies the desired parameter values.
It should be stressed that parametric nodes are not, however, a macro preprocessing system. As covered in section 2.2, parametric links are eagerly evaluated:
i.e. they are treated as rules, rather than macros which must be manually ‘activated’ by using them in combination with an existing node. In general, use
of macroing for linking and relations is discouraged, as it is better expressed
through classes of relation.
1 This example is closely based upon a real case: http://www.semanticweb.org/wiki/Template:Infobox_Country
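[Figure: the Phil node, whose content “Likes → Phil.anchor.1, an [em] elegant language.” natively transcludes the anchor node Phil.anchor.1 (content: “Perl”); the Perl node (interesting facts; attribute syntax: elegant); and the link node Phil.link.Perl.1, with attributes type: Likes; source: Phil.anchor.1; target: Perl]
Fig. 5. Linking in the model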
2.3 Links
Open Weerkat is an open hypermedia system, so links are first-class: all links
are nodes. Nodes which have linking attributes are links. To maintain a normal
wiki interface, we present links in an embedded form.
Embedded. Figure 5 shows user-level linking. As presented to the user in an
example plaintext markup, the source for the Phil node would be:
Likes [link type=Likes to=Perl Perl], an [em elegant] language.
We use edit-time transclusion, where transcluded text is displayed in-line even
during the editing of a node, to present the user with the familiar and direct
model of embedded linking, but map this into an open hypermedia model. The link
element, when written, separates out the link text as a separate, ‘anchor’ node,
and is replaced with native transclusion. A first-class link is then created from
the anchor node to the link target. The identity of this link is largely arbitrary,
so long as it is unique.
Native transclusion is here an optimisation for creating a named, empty anchor
in the DOM, then maintaining a link which transcludes in the node of the same
name. It is also considered meronymous: a link involving an anchor is considered
to relate the node to which that anchor belongs. Because native transclusion is
entirely implicit, only the owning node can natively transclude its anchors.
When editing the node again, the anchor is transcluded back into the node,
and converted into a link element with targets from all links from it. (Depending
on the exact markup language used to express the DOM for editing, this may
require multiple, nested link elements.)
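The separation step can be sketched as follows for the example markup shown earlier. This is an illustration rather than the Weerkat implementation: the regular expression only handles the simple [link type=… to=… text] form, the [transclude …] marker standing in for native transclusion is hypothetical, and the link is named after the pattern in figure 5 even though its identity is stated to be arbitrary.

```python
import re

# Matches the simple embedded form: [link type=T to=N text]
LINK = re.compile(r"\[link type=(\S+) to=(\S+) ([^\]]*)\]")


def separate_links(node, source):
    """Split embedded links in `source` into anchor and link nodes."""
    nodes = {}
    counter = 0

    def replace(match):
        nonlocal counter
        counter += 1
        link_type, target, text = match.groups()
        anchor = f"{node}.anchor.{counter}"
        # The link text becomes the content of a separate anchor node.
        nodes[anchor] = {"content": text}
        # A first-class link node runs from the anchor to the target.
        nodes[f"{node}.link.{target}.{counter}"] = {
            "type": link_type, "source": anchor, "target": target,
        }
        # Hypothetical marker standing in for native transclusion.
        return f"[transclude {anchor}]"

    nodes[node] = {"content": LINK.sub(replace, source)}
    return nodes
```

Applied to the Phil example, this produces the three nodes of figure 5: the rewritten Phil node, the anchor Phil.anchor.1 and the link Phil.link.Perl.1. Embedding for editing would reverse the mapping.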
This guarantees that each anchor has a full identity (a node title) in the
system. It does not, however, immediately provide a solution to ‘the editing
problem’—a longstanding issue in hypertext research [5], where changes to a
document invalidate external pointers into that document. The anchor names
are not here used in the plaintext markup, so ambiguity can arise when they are
edited. It should thus be possible to specify the anchor name (as a member of
the Node.anchors namespace) for complicated edits:
Likes [link anchor=1 type=Likes to=Scheme Scheme]...
A graphical editor could treat the link elements as objects in the document
content which store hidden anchor identities, providing this in all cases.
Note that the link’s properties in figure 5 are stored as attributes. Theoretically, in the RDF mapping described in section 2.1, an attribute-value pair
(source, Phil.anchor.1) in the node Phil.link.Perl.1 is identical to a link of type
source from the link to the anchor. However, such an approach would become
infinitely recursive, as the source link’s source could again be described in terms
of links. The attribute-value pairs thus provide a base case with which we can
record basic properties needed to describe first-class links.
Transclusive. Transclusive links can be used to construct composite nodes. A
link is transclusive if its type is a specialisation of Transclusion. A transclusive
link replaces the display of its source anchor contents with its target contents.
Unlike the ‘native transclusion’ in section 2.1, user-level transclusive links do not
imply a part-of relation. This is because any part-of relation would be between
the representations of the nodes, not the resources that the nodes represent. To
extend the Brazil example in section 2.2, a country information box is not part
of Brazil; instead the Infobox Country node is part of the Brazil node.
Edit-time transclusion is user-interface specific, although quite similar to the
issues already covered in section 2.3 with the native transclusion performed by
embedded anchors. For a simple, text serialisation interface, such as a web form,
it is possible to serialise the transcluded content in-place with a small amount of
surrounding markup; if the returned text differs, this is an edit of the transcluded
node. Again, richer, graphical editors can replace this markup with subtler cues.
Open. To realise first-class links while retaining a standard wiki embedded-style
editing interface, we have modified Weerkat to work upon document trees, into
which links can be embedded, and from which links can be separated. These
embedding and separation routines rewrite documents into multiple nodes as is
necessary. Transclusion, be it presented at edit-time, or for viewing, is possible
via the same mechanism: including the target content within the link’s now-embedded anchor.
To embed a link back into a document, including in order to create an XHTML
representation of it for display and web navigation, it must be determined which
links are applicable to the document being processed. For this, we have defined
a new type of module in the system: a link matcher. Link matchers inspect the
endpoints of links and determine if the document matches the endpoint criteria.
For straightforward, literal links, this is a simple case of identity equality between
the endpoint’s named document, and the current document.
As part of the storage adaptation for first-class linking, we have introduced
an attribute cache, which is fundamentally a triple store whose contents are
entirely derived from the attributes of each node. As well as eventually being a
useful way to interact with the semantic content of the wiki, this allows us to
implement link matching in an efficient way, by querying upon the store.
For example, in the literal endpoint case, assuming suitable prefixes and subtype inference, we can find such links with a set of simple SPARQL queries,
selecting ?l where:
1. { ?l type link . ?l source Scheme . }
2. { ?l type link . ?l source Scheme_5 . }
3. { ?l type link . ?l target Scheme . }
4. { ?l type link . ?l target Scheme_5 . }
The first two queries find links where this node is a source; the latter two, where
it is a target. We must also find links from or to the specific version of the current
node, which is provided by queries two and four.
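The effect of these four queries can be reproduced over an in-memory set of triples. The sketch below is an assumption-laden stand-in for the attribute cache: it uses plain Python tuples instead of a triple store, omits subtype inference, and writes a versioned node as name_version as in the queries above; the function name matching_links is hypothetical.

```python
def matching_links(triples, node, version):
    """Find links whose source or target is `node` or a specific version of it.

    `triples` is an iterable of (subject, predicate, object) tuples standing in
    for the attribute cache. Subtype inference is omitted: only links typed
    exactly 'link' are considered.
    """
    triples = set(triples)
    links = {s for s, p, o in triples if p == "type" and o == "link"}
    # The unversioned node and the specific version are both valid endpoints.
    endpoints = {node, f"{node}_{version}"}
    as_source = {s for s, p, o in triples
                 if s in links and p == "source" and o in endpoints}
    as_target = {s for s, p, o in triples
                 if s in links and p == "target" and o in endpoints}
    return as_source, as_target
```

The first returned set corresponds to queries one and two, the second to queries three and four.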
This approach can be extended to deal with endpoints which are not literal,
which we consider ‘computed’.
Query. Query endpoints are handled as SPARQL queries, where the union of
all values of the selected variables is the set of matched pages. For example, a
query endpoint of SELECT ?n WHERE { ?n Paradigm Functional . } would link
from or to all functional programming languages. This kind of endpoint can be
tested for a specific node via a SPARQL term constraint:
SELECT ?n WHERE { ?n Paradigm Functional .
FILTER ( ?n = Scheme ) }
If multiple variables are selected, the filter should combine each with the logical
or operator, so as to retrieve any logically-sound solution to the query, even if
some of the variables involved are not the node we are interested in linking with.
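Constructing the constrained query is a matter of string assembly. The following Python sketch builds the FILTER for one or more selected variables, joining the tests with the logical or operator; the function name constrain is hypothetical, and prefixes are assumed to be declared elsewhere, as in the queries above.

```python
def constrain(variables, where, node):
    """Restrict a query endpoint to solutions involving `node`.

    `variables` are the selected variable names (without '?'), `where` is the
    body of the WHERE clause, and `node` is the term for the current node.
    """
    if not variables:
        raise ValueError("a query endpoint must select at least one variable")
    # One equality test per selected variable, combined with logical or.
    tests = " || ".join(f"?{v} = {node}" for v in variables)
    select = " ".join(f"?{v}" for v in variables)
    return f"SELECT {select} WHERE {{ {where} FILTER ( {tests} ) }}"
```

For the single-variable example, constrain(["n"], "?n Paradigm Functional .", "Scheme") reproduces the query shown above on one line.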
Generic. Generic endpoints can be implemented as a filtering step on query
endpoints.2 We define a postcondition CONTAINS ( ?n, "term" ) to filter the
solutions by those where the node n contains the given term. This postcondition
can be implemented efficiently by means of a lexicon cache, from each term used
by any generic link, to a set of the nodes using that term. Changes to generic
links add or remove items from the lexicon, and changes to any node update the
sets for any terms they share with the lexicon. If CONTAINS is used alone, n is
implied to be the universal set of nodes, so matching is a simple lexicon lookup.
2 An alternative approach may be to assert triples of the form (Scheme, containsTerm, term), but this would put a great load on the triplestore for each content edit.
To be useful for generic linking, CONTAINS implies an anchor at the point of
the term when it is used as a source endpoint. For example, CONTAINS ( ?n,
"Scheme" ) matches the Scheme node, but should link not from the entire node,
but from the text “Scheme” within it. For user interface reasons, it is desirable
to restrict this only to the first occurrence of the term for non-transclusive links,
so that the embedded-link document is not peppered with repeated links. For
transclusive links, however, it is more consistent and useful to match all occurrences. While transclusive generic links are a slightly unusual concept, it is
possible that users will find innovative applications for them. For example, if it
is not possible to filter document sources at a node store level for some reason,
a generic, transclusive link could be used to censor certain profane terms.
Multiple CONTAINS constraints can be allowed, which require that a node
contains all of the terms. Any of the terms are candidates for implicit anchors:
i.e. whichever occurs first will be linked, or all will be replaced by transclusion.
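The lexicon cache and the implicit anchor rule can be sketched together. The Python class below is a simplified illustration: it matches terms by plain substring search, keeps node text in memory so that newly added terms can be indexed, and uses hypothetical names (Lexicon, first_anchor) throughout.

```python
class Lexicon:
    """Maps each term used by some generic link to the nodes containing it."""

    def __init__(self):
        self.nodes_by_term = {}   # term -> set of node names
        self.text_by_node = {}    # node name -> current content

    def add_term(self, term):
        """Called when a generic link starts using `term`."""
        self.nodes_by_term[term] = {
            name for name, text in self.text_by_node.items() if term in text
        }

    def remove_term(self, term):
        """Called when no generic link uses `term` any more."""
        self.nodes_by_term.pop(term, None)

    def update_node(self, name, text):
        """Called on every node edit; only terms in the lexicon are checked."""
        self.text_by_node[name] = text
        for term, nodes in self.nodes_by_term.items():
            if term in text:
                nodes.add(name)
            else:
                nodes.discard(name)

    def contains(self, term, *more):
        """Nodes containing all of the given terms (each must be in the lexicon)."""
        sets = [self.nodes_by_term[t] for t in (term, *more)]
        return set.intersection(*sets)


def first_anchor(text, terms):
    """Offset and term of the earliest occurrence of any term, or None."""
    hits = [(text.find(t), t) for t in terms if t in text]
    return min(hits) if hits else None
```

A lone CONTAINS is then a single dictionary lookup, several CONTAINS constraints are a set intersection, and first_anchor picks the occurrence that a non-transclusive link would be attached to.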
Parametric. We can use SPARQL variables for parametric links. Every
SPARQL variable is bound to the parameter element in the node’s DOM tree
with the same name: variables and parameters are considered to be in the same
namespace. This allows the content to reflect the query result which matched
this link. If the query allows OPTIONAL clauses which can result in unbound
variables, then they could potentially have values provided by defaults from the
parameter definitions in the DOM. Default values are meaningless for parameters which appear as compulsory variables in the query, as the query engine will
either provide values, or will not match the link.
Parametric links may have interdependent sources and targets, in which case
they are simple functional links (the source can be a function of the target,
and the target an inverse function of the source). Link matching is performed
pairwise for all source and target combinations. For example, consider a link
with these select endpoints:
source: ?thing WHERE { ?thing Colour Red . }
target: ?img WHERE { ?img Depicts ?thing . }
target: ?img WHERE { ?img Describes ?thing . }
This would create links from all nodes about things which are red, to all nodes
which depict or describe those red things. To perform this match, we union
each pair of the clauses into a larger query:
SELECT ?thing ?img WHERE {
?thing Colour Red . ?img Depicts ?thing .
FILTER ( ?thing = Scheme || ?img = Scheme ) }
A similar query would also be performed for Describes. Note that we may receive
values for the variables used as source or target which are not the current node
if it matches in the opposite direction. We must still check that any given result
for the endpoint direction we are interested in actually binds the variable to the
current node. In this example, current node Scheme is not Red, so the query will
not match, and no link will be created.
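The pairwise construction can be sketched as follows. The Python functions below assemble one query per source/target pair and check the binding of the endpoint variable afterwards; the names pairwise_queries and binds_current and the representation of an endpoint as a (variable, clause) tuple are assumptions made for the example. The generated SELECT clause lists its variables separated by whitespace, as standard SPARQL requires.

```python
from itertools import product


def pairwise_queries(sources, targets, node):
    """Build one SPARQL query per (source, target) endpoint pair.

    Each endpoint is a (variable, where_clause) tuple, e.g.
    ("thing", "?thing Colour Red ."). Returns (source_var, target_var, query).
    """
    queries = []
    for (s_var, s_where), (t_var, t_where) in product(sources, targets):
        query = (
            f"SELECT ?{s_var} ?{t_var} WHERE {{ {s_where} {t_where} "
            f"FILTER ( ?{s_var} = {node} || ?{t_var} = {node} ) }}"
        )
        queries.append((s_var, t_var, query))
    return queries


def binds_current(solution, variable, node):
    """Check that a solution binds the endpoint variable to the current node."""
    return solution.get(variable) == node
```

For the Colour/Depicts/Describes example this yields two queries, one per target endpoint; each solution must still pass binds_current for the direction being processed.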
The pairwise matching is to be consistent with the RDF representation presented in section 2.1, and the ‘or’ nature of matching with static endpoints: a
link must only match the current node to be used, and other endpoints may
be dangling. An alternative approach would be to create a ‘grand union’ of all
sources and targets, such that all are required to be satisfied. Neither approach
is more expressive at an overall level: with a pairwise approach, a single target
endpoint can include multiple WHERE constraints to require that all are matched;
with a union approach, independent targets can be achieved through use of multiple links (although they would no longer share the same identity). The union
approach is more consistent with regard to the interdependence of variables; with
the pairwise approach, one matching pair of source/target endpoints may have
a different variable binding for a variable of the same name to another. However, it loses the RDF and static endpoint consistency. Ultimately, the decision
is whether the set of targets is a function of the set of sources (and vice versa
with the inverse), or if it is the mapping of a function over each source. In lieu
of strong use cases for n-ary, interdependent, parametric links (most are better
modelled as separate links), we choose the former for its greater consistency, and
ability for a single link to provide both behaviours.
Functional. We also give consideration to arbitrarily-functional links. These are
computationally expensive to match in reverse (i.e. for target-end linking and
backlinks) unless the functions have inverses. We do not currently propose the
ability for users to write their own Turing-complete functions, as the complexity
and performance implications are widespread.
However, we can potentially provide a small library of ‘safe’ functions: those
with guaranteed characteristics, such as prompt termination. One such example
which would be of use is a ‘concatenate’ function:
source: SELECT ?n WHERE { ?n type ProgLang . }
target: CONCAT( "Discuss.", ?n )
This would be a link from any programming language to a namespaced node
for discussing it. However, it highlights the reversibility problem: the inverse of
CONCAT has multiple solutions. For example, “ABC” could have been the result of
CONCAT(“A”, “BC”), CONCAT(“AB”, “C”), or a permutation with blank
strings. Hence, while it is easy to match the source, and then determine the
target, it is not practical to start with the target and determine the source.
We suggest that any endpoint which is an arbitrary function of others in
this manner must therefore only ever be derived. Matching is performed against
all other endpoints, and then the functional endpoints are calculated based on
the results. A link from CONCAT( ?n, ".meta") to CONTAINS( ?n, "lambda" )
would only ever match as a backlink: showing that any node containing ‘lambda’
would have been linked from its meta-node, without actually showing that link
on the meta-node itself. A link with only arbitrarily functional endpoints will
never match and is effectively inert.
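The asymmetry between deriving a functional endpoint and inverting it is easy to demonstrate. In the Python sketch below, concat models the forward direction and inverse_concat enumerates the two-argument inputs that could have produced a given result; both function names are hypothetical.

```python
def concat(*parts):
    """Forward direction: a derived endpoint is computed from known values."""
    return "".join(parts)


def inverse_concat(result):
    """All two-argument inputs that CONCAT could have been given."""
    return [(result[:i], result[i:]) for i in range(len(result) + 1)]
```

Deriving the target for Perl gives the single value Discuss.Perl, whereas inverting “ABC” already yields four candidate argument pairs, which is why such endpoints are only ever derived.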
3 Conclusions
In this paper, we have approached the perceived requirement for a more advanced
communally-editable hypertext system. We have presented a solution to this
as a model for a ‘semantic open hyperwiki’ system, which blends semantic wiki
technology with open hypertext features such as first-class linking. We also offer
an approach to implementing the more advanced link types with a mind towards
practicality and computational feasibility.
Providing users with stronger linking and transclusion capabilities should help
improve their efficiency when editing wikis such as Wikipedia. Inter-document linking forms a major component of current editing effort, which we
hope to help automate with generic links. Content re-use is complicated by
surrounding context, but even in cases where texts could be shared, technical usability obstacles with current macro-based mechanisms discourage editors
from doing so. We address this with the concept of edit-time transclusion, made
possible by the wiki dealing with programmatically manipulable tree structures.
Beyond this, we wish to address other user-study-driven design goals, such as
improving versioning support that allows for branching.
References
1. Berners-Lee, T., Cailliau, R., Groff, J.-F., Pollermann, B.: World-Wide Web: The
Information Universe. Electronic Networking: Research, Applications and Policy 1(2), 74–82 (1992)
2. Boulain, P., Parker, M., Millard, D., Wills, G.: Weerkat: An extensible semantic wiki.
In: Proceedings of 8th Annual Conference on WWW Applications, Bloemfontein,
Free State Province, South Africa (2006)
3. Boulain, P., Shadbolt, N., Gibbins, N.: Studies on Editing Patterns in Large-scale
Wikis. In: Weaving Services, Location, and People on the WWW, pp. 325–349.
Springer, Heidelberg (2009) (in publication)
4. Bush, V.: As We May Think. The Atlantic Monthly 176, 101–108 (1945)
5. Davis, H.: Data Integrity Problems in an Open Hypermedia Link Service. PhD
thesis, ECS, University of Southampton (1995)
6. Krötzsch, M., Vrandečić, D., Völkel, M.: Wikipedia and the semantic web - the
missing links. In: Proceedings of the WikiMania 2005 (2005), http://www.aifb.uni-karlsruhe.de/WBS/mak/pub/wikimania.pdf
7. Manola, F., Miller, E.: RDF Primer. Technical report, W3C (February 2004)
8. Nelson, T.: Literary Machines, 1st edn. Mindful Press, Sausalito (1993)
9. Sauermann, L., Cyganiak, R., Völkel, M.: Cool URIs for the Semantic Web. Technical Report TM-07-01, DFKI (February 2007)
Implementing a Corporate Weblog for SAP
Justus Broß1, Matthias Quasthoff1, Sean MacNiven2,
Jürgen Zimmermann2, and Christoph Meinel1
1 Hasso-Plattner-Institut, Prof.-Dr.-Helmert-Strasse 2-3, 14482 Potsdam, Germany
{Justus.Bross,Matthias.Quasthoff,Office-Meinel}@hpi.uni-potsdam.de
2 SAP AG, Hasso-Plattner-Ring 7, 69190 Walldorf
{Sean.MacNiven,Juergen.Zimmermann}@sap.com
Abstract. After web 2.0 technologies experienced a phenomenal expansion and
high acceptance among private users, attention is now turning to whether they can be equally applicable, beneficially employed and meaningfully implemented in an entrepreneurial context. The fast-paced rise of social
software like weblogs or wikis, and the resulting new form of communication via
the Internet, is however viewed with ambivalence in the corporate environment. This
is why the particular choice of the platform or technology to be implemented in
this field is strongly dependent on its future business case and field of deployment and should therefore be carefully considered beforehand, as this paper
strongly suggests.
Keywords: Social Software, Corporate Blogging, SAP.
1 Introduction
The traditional form of a controllable mass-medial and uni-directional communication
is increasingly replaced by a highly participative and bi-directional communication in
the virtual world, which proves to be essentially harder to direct or control [2][13].
For a considerable share of companies this turns out to be hard to tolerate. The use-case of a highly configured standard version of an open source multi-user weblog
system for SAP – the market and technology leader in enterprise software – will form
the basis for the paper outlined here. SAP requested the Hasso Plattner Institute (HPI)
to realize such a weblog to support its global internal communications activities. In
the current economic environment and with the changes in the SAP leadership, an
open and direct exchange between employees and executive board was perceived as
being critical to provide utmost transparency into the decisions taken and guidance for
the way forward. Recent discussions about fundamental and structural changes within
SAP have clearly shown that need for direct interaction. SAP and HPI therefore
agreed to share research, implementation and configuration investments necessary
for this project – hereafter referred to as “Point of View”, or POV for short. The platform went online in June 2009, and is at this moment beginning to gain first acceptance among all SAP employees worldwide. To leverage the experiences and expert
knowledge gained in the course of the project, we will elaborate upon the following
research question from an ex post perspective:
“What are critical key success factors for the realization of a corporate weblog in
an environment comparable to the setting of SAP’s Point of View?”
This paper will start with a short treatise about social software in general and weblogs in particular in section II, followed by a more elaborate and thorough analysis in
section III about the capabilities and challenges of weblogs in the corporate environment, including their forms of deployment (e.g. CEO blog, PR blog, internal communications tool), success factors (e.g. topicality, blogging policies, social software
strategies), risks (e.g. media conformity, institutional backup, resource planning, security- and image-related issues) and best-practice examples. Section IV is dedicated to
the use-case of the POV project, beginning with an introduction to SAP’s
motivation to have such a platform developed and the overall scope of the project.
The subsequent paragraph will provide an overview of all technical development,
implementation and configuration efforts undertaken in the course of the project. In
doing so, it will elaborate upon the precondition of SAP’s works council to have an anonymous rating functionality, the prerequisite to integrate the standard blogging software
with the authentication systems in place at SAP (LDAP, SSO, etc.) and the
precondition to realize the blog on the basis of a multi-user version of the standard
blogging software Wordpress. Design issues like the integration of the blog into the
SAP web portal as well as the CI/CD guidelines of SAP are also mentioned. A conclusion and the obligatory reference list complete the paper.
2 What Is a Weblog?
A weblog – a made-up word that is composed of the terms “web” and “log” – is no
more than a specific website, whose entries, also known as “posts”, are usually displayed in reverse chronological order with the most recent entry first. Initially,
it was meant to be an online diary. Nowadays, there are countless weblogs around,
each covering a different range of topics. Single blog posts combine textual parts
with images and other multimedia data, and can be directly addressed and referenced
via a URL (Uniform Resource Locator) in the World Wide Web. Readers of a blog
post can publish their personal opinion about the topic
covered in a highly interactive manner by commenting on the post. These comments can however be subject to moderation by the owner of the blog.
2.1 Social Software
While the first blogs around were simple websites that were regularly updated with
new posts (or comments), we witnessed the emergence of so-called “Blog Hosting
Services” by the end of the ’90s. Service providers such as Wordpress, Serendipity, MovableType, Blogspot or Textpattern1 offered a user-friendly and ready-made blog service that even allowed non-expert users to generate and publish
content accessible to all Internet users. Everybody capable of using a simple
text-editor program could thus actively take part in the unconfined exchange of opinions over the web [35].
1 www.wordpress.org; www.s9y.org; www.movabletype.org; www.textpattern.com; www.blogger.com
Nowadays, weblogging systems are more specialized but still easy-to-use Content
Management Systems (CMS) with a strong focus on updatable content, social interaction,
and interoperability with other Web authoring systems. The technical solutions agreed
upon among developers of weblogging systems are a fine example of how new, innovative
conventions and best practices can be developed on top of existing standards set by the
World Wide Web Consortium and the community.
Applications like these, which offer a simplified mode of participation in today's
Internet compared to earlier, traditional web applications, were subsequently described
as “Web 2.0 applications”. The concurrently developing “Participation Internet” is to
this day referred to as the “Web 2.0” [25].
This increasingly “social” character of the Internet stands in contrast to traditional
mass media such as the printing press, television or radio, which only offer a
unidirectional form of communication. The Internet in turn offers all its users real
interaction, communication and discussion. This is also why blogs, next to podcasts, are
referred to as the most frequently used “social media tools” [26].
2.2 Features
One prominent feature of weblogging systems is the so-called feed, an up-to-date table
of contents of a weblog. Feeds are exchanged in standardized, XML-based formats such as
RSS or Atom, and are intended to be used by other computer programs rather than read by
humans directly.
Such machine-readable tables of contents opened a whole new way for users to consume
content from various web sites. Rather than having to check different web sites
frequently for updates, users can subscribe to feeds in so-called aggregators, i.e.
software that automatically notifies subscribers of content updates. Feeds from
different sources can even be mixed, resulting in a highly customized subscription to web
content [1]. Such syndicated content can then be consumed as a push medium on top of the
pull-oriented World Wide Web architecture. One example of a popular extension of feed
formats is the podcast, which has additional media files, such as audio or video
broadcasts, attached.
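The mixing of feeds described above can be illustrated with a minimal Python sketch. It is not part of the POV implementation; the two RSS documents, the example URLs and the function names parse_items and aggregate are assumptions made purely for illustration, and a real aggregator would fetch the feeds over HTTP and handle Atom as well.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Two made-up RSS 2.0 documents standing in for feeds fetched over HTTP.
FEED_A = """<rss version="2.0"><channel><title>Blog A</title>
<item><title>A1</title><link>http://a.example/1</link>
<pubDate>Mon, 02 Mar 2009 10:00:00 GMT</pubDate></item>
<item><title>A2</title><link>http://a.example/2</link>
<pubDate>Wed, 04 Mar 2009 09:30:00 GMT</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Blog B</title>
<item><title>B1</title><link>http://b.example/1</link>
<pubDate>Tue, 03 Mar 2009 12:00:00 GMT</pubDate></item>
</channel></rss>"""


def parse_items(rss_text):
    """Return one dict per <item> of an RSS 2.0 document."""
    channel = ET.fromstring(rss_text).find("channel")
    source = channel.findtext("title")
    return [
        {
            "source": source,
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            # RSS uses RFC 822 style dates, which email.utils can parse.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        }
        for item in channel.findall("item")
    ]


def aggregate(*feeds):
    """Mix several feeds into one list, most recent entry first."""
    entries = [entry for feed in feeds for entry in parse_items(feed)]
    return sorted(entries, key=lambda e: e["published"], reverse=True)


merged = aggregate(FEED_A, FEED_B)
titles = [entry["title"] for entry in merged]
```

The merged list interleaves entries from both sources by publication date, which is the "customized subscription" an aggregator presents to its user.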
Social interaction is another important aspect of weblogging systems, which form a
notable part of the so-called Social Web. The most visible method of social interaction
is inviting readers to comment on and discuss postings directly within the weblogging
system. Weblogs also introduced more subtle means of interaction. To overcome the
limitation that HTTP-based systems are only aware of outbound hyperlinks, different
types of link-backs have been developed. These automatically detect incoming hypertext
links from one weblog posting to another, and insert a link from the original link
target back to its source, hence making hypertext links symmetrical. Such links can be
detected, e.g., using the often disregarded Referer [sic] header of an HTTP request, or
by the linking weblog actively notifying the link target about the reference. Making
hyperlinks symmetrical significantly helps to weave a true social web between weblog
authors and thus ultimately creates the interconnectivity of the blogosphere.
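A minimal sketch of the Referer-based variant is given below. It assumes that the weblog sees the request headers as a plain dictionary; the host name blog.example and the function names are hypothetical, and a production system would additionally verify that the referring page really contains the link before publishing a back-link.

```python
from urllib.parse import urlsplit

# Maps a local posting path to the set of external pages that linked to it.
backlinks = {}


def detect_linkback(headers, own_host):
    """Return the referring URL if the request came from another site."""
    referer = headers.get("Referer", "")
    parts = urlsplit(referer)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None  # header missing or not a web URL
    if parts.hostname.lower() == own_host.lower():
        return None  # internal navigation, not an incoming link
    return referer


def record_visit(target_path, headers, own_host="blog.example"):
    """Remember the external source of a visit to one of our postings."""
    source = detect_linkback(headers, own_host)
    if source:
        backlinks.setdefault(target_path, set()).add(source)


record_visit("/post/1", {"Referer": "http://other.example/entry"})
record_visit("/post/1", {"Referer": "http://blog.example/archive"})
record_visit("/post/1", {})
```

After these three requests only the external page is recorded as a back-link for /post/1.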
J. Broß et al.
The latter example of weblog systems actively notifying each other illustrates how
interoperable weblogging systems are. Many of these systems have an XML-RPC interface, a
technology for controlling web services from non-browser clients [36]. This interface
can be used to signal incoming links (so-called pingbacks), but also to author and manage
content within the weblogging system, e.g. using mobile phone software. Other promising
means of interoperability are emerging technologies based on Semantic Web standards,
such as RDF and SIOC. Using these standards, the structure of a weblog's content and its
role in the blogosphere can be expressed and published in a standardized,
machine-readable way that is even more flexible than today's feeds and XML-RPC
interfaces [16].
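To make the pingback notification more concrete, the following sketch builds and parses the XML-RPC message for the pingback.ping method using Python's standard xmlrpc.client module. The URLs and the two helper function names are assumptions for illustration; no network request is made here.

```python
import xmlrpc.client


def build_pingback_request(source_url, target_url):
    """Serialize a pingback.ping(source, target) call as XML-RPC."""
    return xmlrpc.client.dumps((source_url, target_url),
                               methodname="pingback.ping")


def parse_pingback_request(xml_body):
    """Decode an incoming XML-RPC body and return (source, target)."""
    params, method = xmlrpc.client.loads(xml_body)
    if method != "pingback.ping" or len(params) != 2:
        raise ValueError("not a pingback.ping call")
    return params


body = build_pingback_request("http://a.example/my-post",
                              "http://b.example/cited-post")
source, target = parse_pingback_request(body)
```

In a live setting the linking weblog would send this body to the XML-RPC endpoint advertised by the link target, for instance via xmlrpc.client.ServerProxy; the receiving weblog would then check the source page before displaying the back-link.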
3 Corporate Weblogs – Capabilities and Challenges
Successful enterprises attribute part of their success to effective internal
communication, which most employees would describe as direct and open communication with
their management. Such open internal channels of communication create an atmosphere of
respect in which co-worker and manager-employee relationships can flourish, keep
employees excited about their jobs, circulate vital information as quickly as possible
and connect employees with the company's goals and vision [5][7][22].
3.1 The Corporate Internal Communications Perspective
While most people consider face-to-face communication the most effective communication
tool, it is often too time-consuming, too difficult or too expensive over greater
distances of time or space. Print is no option here either, since it is too slow and
requires filing as well as complex storage and retrieval systems. Advances in
Information and Communication Technologies (ICT) could finally overcome these
disadvantages while still allowing for direct and personal interaction.
An increasing number of enterprises therefore started to employ weblogs as a
complementary tool for their external or internal communications [6][11]. Blogs,
however, turned out to be a far more effective tool within the internal corporate
environment. Applied in intranets, i.e. enclosed network segments that are owned,
operated, controlled and protected by a company, they allow the company to keep track of
information and communication more quickly and effectively [23]. Inside the company
walls, blogs can furthermore replace an enormous amount of email, spread news more
quickly, serve as a knowledge database or create a forum for collaboration and the
exchange of ideas [7][11][12]. Chances are high that companies will become more
innovative, transparent, faster and more creative with such instruments [5]. The rather
traditional big businesses in particular found the uncontrollable world of the
blogosphere hard to tolerate, since its (unwritten) rules, codes of conduct and pitfalls
differed fundamentally from what they had been used to [13][17]. Even traditional
hierarchies and models of authority were sometimes questioned when social software
projects were initiated [5]. This imbalance and radical dissimilarity often resulted in
worst-case scenarios for the public relations departments of major companies that simply
did not know how to deal with this new communication tool [2].
However, unlike their equivalents on the Internet, internal weblogs can be customized to help a company succeed on both the individual and the organizational level.
3.2 Deployment of Corporate Blogs
While the first pragmatic efforts to systematize corporate weblogs [20][34] provided a
coherent overview of the whole field but lacked a conceptual foundation, Zerfaß and
Bölter [33] provided a more applicable reference framework with two dimensions in which
the distinct forms of corporate weblogs can be located. On the one hand, blogs differ
regarding their field of application: internal communications, market communications or
PR. On the other hand, blogs can support distinct aims of communication, which can be
divided into an informative procedural method, a persuasive method and processes of
argumentation (see Fig. 1).
Fig. 1. Deployment possibilities for corporate blogs (adapted on the basis of [6])
Since this paper focuses on corporate weblogs applied to internal communications, we
leave the other two fields of application, market communications and PR, aside at this
point.
Knowledge blogs can support a company's knowledge management because expertise and
know-how can be shared with fellow employees on that platform [12]. A successful
Collaboration blog such as IBM's “Innovation Jam”, for instance, brought together IBM's
own employees with those of its worldwide strategic partners and contractors to spur
software innovation [38].
CEOs of major companies such as Sun Microsystems, General Motors and Daimler² or dotcoms
such as Xing³ are increasingly making use of CEO blogs to address matters of strategic
interest and importance to their company's stakeholders [2][4]. While a sustained
commitment is highly important for these kinds of blogs, Campaigning blogs are limited
in time and rather suited to highly dramaturgical processes of communication. Topic
blogs can, like Campaigning blogs, be located within multiple dimensions of Zerfaß'
reference framework (see Fig. 1). They are used to demonstrate a company's competence in
relevant fields of its industry. The graphical position of our use case POV within
Zerfaß' framework (see Fig. 1) indicates a profound distinctiveness compared to the
other types of weblogs. It spans the entire horizontal reach of the framework while
being restricted to the “vertical” internal dimension of communication. POV's mandate
is, however, biased towards a horizontal coverage similar to that of CEO blogs.
3.3 Success Factors
Even if a blog is professional or oriented towards the company, it is still a fairly
loose form of self-expression, since it is the personal character of weblogs that makes
them so effective in the entrepreneurial context [1][7]. Corporate weblogs offer a
bottom-up approach that stresses the individual and provides a forum for the seamless
and perpetual exchange of ideas [37]. They allow employees to feel more involved in the
company. As with any killer application, however, there is a downside. Before a
corporate blog is established within a company, whether in an internal or external
communications context, the people responsible must address several strategic issues in
order to decide on the practicability and meaningfulness of the tool. It is first of all
essential to assess whether a blog would be a good fit for the company's values, its
corporate culture and its image. The management responsible for any social software
project within a company should therefore fully understand the form of communication
they are planning to introduce [17]. It is advisable that the persons in charge gather
first-hand experience by running their own weblog, or at least have other employees test
it beforehand. Long-term blog monitoring to systematically observe the formation of
opinion in this medium might be of great help here [14].
Furthermore, employees might not enjoy blogging as much as their managers or the
communications staff do. To keep internal communications via weblogs as effective as
possible, it is essential that all stakeholders commit time and effort to updating their
blogs, keeping them interesting, and encouraging other employees to use them. Social
software, after all, only lives through the commitment of the whole collective. Full and
continuing institutional and managerial backing that cannot be shaken by unexpected or
unwanted occurrences is essential for a successful corporate weblog.
However, even if a company decides against a corporate blog, the topic should stay on
its agenda at all times [9]. Ignoring the new communications arena of the blogosphere
might put an organization at a risk that will grow with the rising importance of the
medium [2][3][21]. This holds especially true if direct competitors are working harder
in this direction [19].
² http://blogs.sun.com/jonathan/; http://fastlane.gmblogs.com/; http://blog.daimler.de/
³ http://blog.xing.com
Blogs can be impossible to control if they are not regulated by certain limits, codes of
conduct and ethics [7][15]. This balancing act needs to be performed very carefully,
since weblogs tend to be a medium allergic to any kind of regulation or control. But by
opening a pipeline to employee comments without any restrictions, one can reach
information overload very quickly, essentially defeating the purpose of the tool [19].
IBM, for instance, was one of the first big businesses to successfully establish a
simple and meaningful guideline for the proper use of its internal blogs, known as the
“IBM blogging policy”, which was quickly accepted by its employees [24].
Encouraging and guiding employees to use internal blogs may be the most important issue
a firm has to address when implementing a blog for its internal communications [8].
4 Case Study: Point of View Platform
Especially in a time of crisis, generating open dialogue is paramount to managing fear
and wild speculation, and yet traditional corporate communication remains a largely
unidirectional affair. The transition to a new CEO, the global economic and financial
crisis and the first lay-offs in the history of the company had generated an atmosphere
of uncertainty within SAP. While conversation happened in corridors and coffee corners,
there was no way for employees to engage with executives transparently across the
company, to share their ideas and concerns or to offer suggestions on topics of global
relevance, and there was no consolidated way for executives to gain detailed insight
into employee sentiment.
But reaching the silent majority and making results statistically relevant requires
more than offering the ability to comment on a topic. Knowing the general dynamics of
lurkers versus contributors, especially in a risk-averse culture, SAP and the HPI worked
together on a customized rating system that would guarantee the anonymity of those
participants not yet bold enough to comment under their name, but still encourage them
to contribute to the overall direction of the discussion by rating not only the topic
itself, but also peer comments.
4.1 POV: Scope, Motivation, Vision
To set appropriate expectations, the blog was launched as an online discussion forum
rather than as a personal weblog, and was published as a platform for discussion between executives and employees, without placing too much pressure on any one executive to engage. Launched with the topic of “purpose and values” and following as
part of the wave of activities around the onboarding of SAP’s new CEO, the new
platform POV has signaled a fundamental shift towards a culture of calculated risk
and a culture of dialogue [18].
This culture shift has extended now well beyond the initial launch of Point of
View, with internal blogging becoming one of the hottest topics among executives
who want to reach out to their people and identify areas for improvement. Another
result of the collaboration has been a fundamental rethinking of the way news is
created and published, with the traditional approach to spreading information via
HTML e-mail newsletters being challenged by the rollout of SAP’s first truly bidirectional Newslogs. As employees become more and more acquainted with RSS and
aggregation of feeds, the opportunity to reclaim e-mail for the tasks it was originally
designed for is tangibly near.
Point of View has been the first step toward ubiquitous dialogue throughout the
company, and the approach to facilitating open, transparent dialogue is arguably the
single most pivotal enabler of internal cultural transformation at SAP. SAP thus follows
the general trend among internationally operating big businesses in Germany, which
increasingly employ weblogs in their enterprises (41% of companies with more than 5000
employees [10]).
4.2 Configuration of the Standard Software to Fit Corporate Requirements
The WordPress MU weblogging system chosen as SAP's corporate weblogging system needed,
in spite of its long list of features and configuration options, quite a few
modifications to fit the requirements set by the company's plans and corporate policies.
Of course, the central blogging functionality is already provided by WordPress.
Posts and comments can be created and moderated, and permissions can be restricted per
user role. Multimedia files can also be embedded in postings. Postings and comments can
by default be assigned a permanent, human-readable URI. Furthermore, WordPress already
provides basic usage statistics for readers and moderators.
One benefit of using a popular weblogging system like WordPress MU, rather than
developing a customized system from scratch or using a general-purpose CMS, is that
large parts of the customizations needed can be achieved using extensions, or plug-ins,
to the weblogging system.
Using such plug-ins, some of SAP's more specialized requirements could, at least partly,
be addressed. One group of plug-ins helped to meet SAP's display-related requirements,
e.g. to list comments and replies to comments in a nested (threaded) view. Other
plug-ins make it possible to edit postings and comments even after they have been
published, and to easily enable or disable discussions for individual postings. Another
set of plug-ins was required to highlight specific comments in a dedicated part of the
website (see “nested comments” in fig. 2) and to ensure anonymous voting as demanded by
the works council. The last group of plug-ins focused on notifying users of new postings
or comments, e.g. via e-mail, and on enhancing WordPress MU's default searching and
browsing functionality for postings, comments and tag keywords.
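The nested (threaded) comment view mentioned above can be sketched as follows. This is an illustration in Python rather than the plug-in actually used; the comment fields id, parent and text as well as the function name build_thread are assumed for the example.

```python
from collections import defaultdict


def build_thread(comments):
    """Turn a flat comment list into (depth, comment) pairs in display order.

    Each comment is a dict with an 'id', a 'parent' id (None for a
    top-level comment) and a 'text'. Replies follow their parent directly.
    """
    children = defaultdict(list)
    for comment in comments:
        children[comment["parent"]].append(comment)

    ordered = []

    def walk(parent_id, depth):
        for comment in children[parent_id]:
            ordered.append((depth, comment))
            walk(comment["id"], depth + 1)

    walk(None, 0)
    return ordered


flat = [
    {"id": 1, "parent": None, "text": "First comment"},
    {"id": 2, "parent": None, "text": "Second comment"},
    {"id": 3, "parent": 1, "text": "Reply to first"},
    {"id": 4, "parent": 3, "text": "Reply to the reply"},
]
lines = ["  " * depth + c["text"] for depth, c in build_thread(flat)]
```

Each reply is indented one level below the comment it answers, which is the kind of view a threaded-comments plug-in renders.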
SAP's dual-language policy of offering intranet web content in both English and German
proved a bigger challenge during development, as all content, i.e. postings, comments,
category names and tags, as well as the general layout of the CMS, was required to be
kept separate by language. The most feasible solution was found to be setting up a
completely independent weblog for each language within one shared WordPress MU
installation, at the cost of having independent discussions for the different languages.
Another big issue, which required thorough software development, was fulfilling
privacy-related requirements. Understandably, because users are potentially identifiable
in a controlled corporate environment, such requirements play a much bigger role than on
a publicly available weblogging platform, whose terms of use often reflect technical
possibilities rather than corporate policies. Hence, much of the rating and statistics
functionality needed adjustments to ensure privacy. Not only were moderators not allowed
to see certain figures; it also had to be ensured that such figures were not stored in
the database systems at all. This required some changes to the internal logic of
otherwise ready-to-use voting and statistics enhancements.
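One way to satisfy the requirement that per-user figures never reach the database is to store running totals only. The sketch below illustrates this idea with an in-memory SQLite database; the table layout, the 1 to 5 rating scale and the function names are our assumptions and do not describe the actual POV plug-ins. Preventing repeated votes by the same person is deliberately left out, since it would require keeping some per-user state.

```python
import sqlite3


def open_store():
    """Create a ratings store that has no column for the voter at all."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE rating_totals ("
        "item_id TEXT PRIMARY KEY, "
        "votes INTEGER NOT NULL, "
        "score_sum INTEGER NOT NULL)"
    )
    return db


def rate(db, item_id, score):
    """Add one anonymous rating (assumed scale: 1 to 5) to the totals."""
    if score not in range(1, 6):
        raise ValueError("score must be between 1 and 5")
    db.execute("INSERT OR IGNORE INTO rating_totals VALUES (?, 0, 0)",
               (item_id,))
    db.execute(
        "UPDATE rating_totals SET votes = votes + 1, "
        "score_sum = score_sum + ? WHERE item_id = ?",
        (score, item_id),
    )
    db.commit()


def average(db, item_id):
    """Return the mean rating of an item, or None if it has no votes."""
    row = db.execute(
        "SELECT votes, score_sum FROM rating_totals WHERE item_id = ?",
        (item_id,),
    ).fetchone()
    if row is None or row[0] == 0:
        return None
    return row[1] / row[0]


db = open_store()
rate(db, "comment-17", 4)
rate(db, "comment-17", 5)
mean = average(db, "comment-17")
```

Because only the vote count and the score sum are persisted, neither moderators nor database administrators can reconstruct who gave which rating.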
Fig. 2. Seamless Integration of POV in SAP’s internal corporate webportal
4.3 Who Are You Really?
Nowhere is it easier to fake one's identity than in the public space of the Internet,
or, as Peter Steiner put it in the caption of a cartoon in The New Yorker: “On the
Internet, nobody knows you're a dog” [27]. This holds especially true for posts and
comments in a blog. Usually, only a valid email address and an arbitrary pseudonym are
requested from authors of new posts or comments for identification. Verification of the
email address is, however, limited to its syntax: as long as the specified address is
well-formed, it is accepted by the system irrespective of the content posted with it.
Another customary security mechanism is to ask the author of a comment to enter a
graphically distorted string of characters, a “captcha”, which prevents so-called web
robots from automatically disseminating large quantities of content on other websites
or forums.
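The weakness of a purely syntactic check can be shown in a few lines. The regular expression below is a deliberately simplified pattern chosen for this example and is not the check used by WordPress; the addresses are invented.

```python
import re

# Simplified pattern: local part, "@", and a dotted domain name.
EMAIL_SYNTAX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")


def looks_like_email(address):
    """Check only the form of the address, not who controls it."""
    return EMAIL_SYNTAX.fullmatch(address) is not None


# A well-formed address passes even if the commenter does not own it.
accepted = looks_like_email("board.member@company.example")
rejected = looks_like_email("not-an-address")
```

The first address is accepted although nothing proves that the commenter can actually receive mail there, which is why such a check does not authenticate anybody.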
In cases where relevant political, corporate or societal content is published in
weblogs and is therefore potentially available to everybody online, it should not be
possible to fake, alter or change the identity of authors. This holds true not only for
identities of general public interest, but sometimes also for the identity of
participants in any given content-related discussion [28].
A useful security mechanism in this regard is the digital signature, which can be used
either for user authentication or for verifying the integrity of a blog or blog post,
thus ensuring the absence of manipulation and alteration [29]. Digital signatures serve
a purpose similar to that of handwritten signatures in everyday life. By signing a
specific document, we express our consent to the content of that document and
consequently authorize it. Since every signature has an individual and unique
characteristic, it can be attributed to the respective individual. A digital signature
has a similar individual characteristic, owing to unique cryptographic keys that link a
signed document with the identity of the signer. Neither the content of the signed
document nor the identity of the signer can be changed without invalidating the digital
signature. Finally, a trusted third party (a so-called “certification authority”)
vouches for the link between the signer's identity and the key used to create the
signature, so that the document, its author and the corresponding signature can be
verified.
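The sign-and-verify principle can be demonstrated with a toy example. The sketch uses textbook RSA over a SHA-256 digest with two small, publicly known Mersenne primes and no padding; it is far from secure and serves only to show that any change to the message or the signature makes verification fail. All names and parameters are chosen for illustration, and real systems rely on vetted cryptographic libraries and certified keys.

```python
import hashlib

# Toy key material (NOT secure): two well-known Mersenne primes.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                            # public modulus
E = 65537                            # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (Python 3.8+)


def digest(message: bytes) -> int:
    """Hash the message and reduce it into the range of the modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Create a signature with the private exponent."""
    return pow(digest(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Check a signature using only the public values N and E."""
    return pow(signature, E, N) == digest(message)


post = b"Our purpose and values, posted by the CEO"
signature = sign(post)
genuine = verify(post, signature)
tampered = verify(post + b" (edited)", signature)
```

Only the holder of the private exponent can produce a signature that verifies, while anyone holding the public values can check it.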
For an internal corporate weblog like POV, fully functional user authentication
likewise had to be implemented in order to truly overcome the traditionally
unidirectional corporate communication and to generate a real open dialogue and the
trust necessary to manage fear and wild speculation among the workforce at SAP.
Every stakeholder active on the POV platform thus needed the guarantee that every
article or comment on the platform was written by exactly the author specified. In the
specific case of POV, it was furthermore imperative not only to identify individual
users, but also to clearly mark their affiliation to the major interest groups on the
platform: the top management and board of directors on the one hand, and SAP's 60,000
employees and their works council on the other.
The WordPress and WordPress MU weblogging systems by default provide their own identity
management solution, which requires authors to register with their personal data and can
optionally validate e-mail addresses or require new accounts to be activated by
administrators or moderators of the system. As mentioned before, this only partially
enables user authentication. As SAP already has a corporate identity management system
in place, it was decided to reuse this infrastructure and allow users to authenticate
with the weblog system without any username or password, using only their corporate
X.509 client certificate [32] together with the Lightweight Directory Access Protocol
(LDAP) directories already in place. No ready-to-use extension existed for integrating
WordPress identity management with X.509. Hence, the required functionality had to be
developed from scratch and was packaged as a separate WordPress plug-in.
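The general idea of certificate-based login can be sketched as follows. The sketch is written in Python for illustration, whereas the actual plug-in is a WordPress extension. It assumes that the web server has already verified the client certificate and exposes the result through environment variables as Apache's mod_ssl does (SSL_CLIENT_VERIFY and SSL_CLIENT_S_DN_CN); the dictionary standing in for the LDAP directory, the user IDs and the group names are hypothetical.

```python
# Hypothetical stand-in for a lookup in the corporate LDAP directory.
DIRECTORY = {
    "d000001": {"display_name": "Example Employee", "group": "employee"},
    "d000002": {"display_name": "Example Board Member", "group": "board"},
}


def authenticate(environ, directory=DIRECTORY):
    """Map a verified TLS client certificate to a weblog user.

    Returns None unless the server reports a successfully verified
    certificate whose common name is known to the directory.
    """
    if environ.get("SSL_CLIENT_VERIFY") != "SUCCESS":
        return None
    login = environ.get("SSL_CLIENT_S_DN_CN", "").strip().lower()
    entry = directory.get(login)
    if entry is None:
        return None
    # The group lets the weblog mark the author's affiliation.
    return {"login": login, **entry}


user = authenticate({"SSL_CLIENT_VERIFY": "SUCCESS",
                     "SSL_CLIENT_S_DN_CN": "D000002"})
anonymous = authenticate({"SSL_CLIENT_VERIFY": "NONE"})
```

No username or password is involved: the identity comes from the certificate, and the directory supplies the display name and the interest group shown next to each post or comment.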
Given that user authentication had to be implemented, it was also imperative to give
employees easy and quick access [31]. This was achieved through Single Sign-On (SSO), a
property of access control across multiple related but independent software systems:
SAP's employees log in once to the well-established internal portal and consequently
gain access to all other systems (including the blog) without being prompted to log in
again at each of them [30].
This plug-in takes the identity information conveyed in the users' TLS client
certificates and provides it to the WordPress identity management system. As a
consequence, the SAP weblog could only be accessed over HTTPS connections once users
were authenticated. This required some additional rewriting of hyperlinks within the
system itself in order to avoid disturbing warning messages in the users' web browsers.
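The kind of hyperlink rewriting referred to here might look like the following sketch, which upgrades links and embedded resources that point at the weblog's own host from http to https. The host name pov.example and the function name are assumptions; the source does not name the browser warnings, which we read as warnings about insecure content on a secure page.

```python
import re


def force_https(html, own_host):
    """Rewrite href/src attributes that point at own_host to use HTTPS."""
    pattern = re.compile(
        r"((?:href|src)\s*=\s*[\"'])http://"
        + re.escape(own_host)
        + r"(?=[/\"':?#])",          # host must end here, not continue
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: m.group(1) + "https://" + own_host, html)


page = (
    '<a href="http://pov.example/post/1">post</a>'
    '<img src="http://pov.example/logo.png">'
    '<a href="http://other.example/">elsewhere</a>'
)
rewritten = force_https(page, "pov.example")
```

Links to other hosts are left untouched, so only resources served by the weblog itself are forced onto the secure connection.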
4.4 Seamless Integration
SAP employees, like most information workers, prefer a one-stop-shop approach to
information discovery, acquisition and retention over site-hopping (see “SSO” in
section 4.3 and fig. 1). To improve adoption of the new platform, tight integration into
SAP's intranet, and the illusion of the platform being a native component of the
intranet, were required. The design was discussed with Global Communications and SAP IT,
and then implemented by HPI to meet the standards of the SAP Corporate Portal style
guides (see “CI/CD” in fig. 1). Feedback has shown this to be so effective that
employees have requested rating functionality for their own pages without even realizing
that the entire application is a separate entity (see “integration” in fig. 1). Seamless
integration has also made it possible to install several instances of the same
discussion in multiple languages, so that employees can be automatically directed to
their default language version based on their personal settings (see “Bilingual” in
fig. 1).
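Directing employees to their language version amounts to a simple lookup, sketched below. The blog paths, the fallback to English and the function name are assumptions for illustration; the source only states that one independent weblog exists per language and that the choice follows the user's personal settings.

```python
# Hypothetical paths of the two independent weblogs, one per language.
BLOGS_BY_LANGUAGE = {"en": "/pov/en/", "de": "/pov/de/"}
DEFAULT_LANGUAGE = "en"  # assumed fallback for all other settings


def blog_path_for(preferred_language):
    """Pick the weblog instance that matches a user's language setting."""
    code = (preferred_language or "").split("-")[0].lower()
    return BLOGS_BY_LANGUAGE.get(code, BLOGS_BY_LANGUAGE[DEFAULT_LANGUAGE])


german = blog_path_for("de-DE")
other = blog_path_for("fr")
unset = blog_path_for(None)
```

A portal link to Point of View can thus resolve to the German or the English instance without the employee having to choose.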
Because SAP is an equal-opportunity employer, accessibility is a mandatory
consideration for new platforms; Point of View was therefore tested for compatibility
with external screen readers, screen inversion, and standard Windows accessibility
functions.
4.5 Meeting Enterprise Standards
Especially in the corporate (non-private) context, safeguarding the blog platform
against any kind of failure and guaranteeing general system stability at all times
should be regarded as a central aspect of the project. For an internal communications
platform with no intended customer interaction but many thousands of potential users
from the workforce, it could become a considerable embarrassment for a company if the
platform were not available as planned. Especially in the case of the POV project, which
was announced within the company [30] as a central point of communication between SAP's
board and its employees, there was no room for (temporary) failure.
This is why the development phase of POV was carried out on separate server hardware.
Once the blog was fully functional and free of known bugs, it was moved onto two
identical physical machines that provide redundancy for POV's lifetime. In the event of
a system crash on the production server currently running, traffic can immediately be
redirected to the redundant stand-by server. Posts and comments already published can
quickly be restored from a database backup.
System stability through hardware redundancy should, however, always be accompanied by
software stability tests. Since POV was built upon the open-source blogging software
WordPress, which is mainly used in private and small-scale contexts, and since its code
was furthermore heavily adapted to fit additional requirements, the system's scalability
had to be thoroughly tested for deployment in the corporate context with potentially up
to 60,000 users.
Table 1. POV Load Test: Transaction Summary
Transaction Name   Min.    Avg.    Max.     Std. Dev.   90 %    Pass     Fail   Stop
Logon              1.175   2.735   26.972   2.098       5.031   14,194   58     2
ReadArchive        0       0.063   20.924   0.237       0.13    14,100   0      0
ReadComments       0.409   1.231   13.094   1.081       2.5     13,933   18     1
RecentlyAdded      0.445   1.241   14.16    1.092       2.5     13,660   26     3
Search             0.458   1.248   22.704   1.12        2.406   13,806   15     2
The IT department of SAP therefore conducted load tests with 1000 concurrent users
performing automated read scenarios with a random think time of 30-90 seconds, and 10
concurrent users carrying out heavy write transactions. The number of concurrent users
was determined against benchmarks from similar platforms already in use at SAP, such as
forums and wikis, and scaled to ensure sufficient stability for best-case employee
engagement (see Table 1 for the transaction summary). The test generated 16 transactions
per second and 50 comments within 15 minutes, with an overall logon, navigation and read
transaction response time of less than 3 seconds. This result was comparable to similar
systems such as internal forums and wikis, and no major errors were encountered. Of
almost 70,000 transactions executed in the test, less than 2% failed or were stopped.
The server CPU sustained between 70 and 90% utilization, and RAM consumption was around
500 MB. To date, CPU load in the production system has not exceeded 15%. Performance
lags in the Americas and Asia Pacific have since been remedied, resulting in similar
response times around the world.
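The share of unsuccessful transactions can be recomputed directly from the pass, fail and stop counts in Table 1, as the following short calculation shows. The variable names are ours; the figures are copied from the table.

```python
# (pass, fail, stop) counts per transaction type, copied from Table 1.
TABLE_1 = {
    "Logon": (14194, 58, 2),
    "ReadArchive": (14100, 0, 0),
    "ReadComments": (13933, 18, 1),
    "RecentlyAdded": (13660, 26, 3),
    "Search": (13806, 15, 2),
}

passed = sum(p for p, _, _ in TABLE_1.values())
failed = sum(f for _, f, _ in TABLE_1.values())
stopped = sum(s for _, _, s in TABLE_1.values())

total = passed + failed + stopped
unsuccessful = failed + stopped
failure_rate = unsuccessful / total
```

This yields 69,818 transactions in total, of which 125 failed or were stopped, i.e. about 0.18%, which is consistent with the "almost 70,000" transactions and the bound of less than 2% stated above.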
5 Conclusion
Point of View was launched to the company during the launch of SAP's Purpose and Values
by the CEO. Initially, participation was slow, as employees waited to see how the
channel would develop. Following a couple of critical statements, more people felt
encouraged to participate, and the platform has begun to take on a life of its own, with
128 comments and counting on the first post alone, even two months after it was posted.
Around 19,000 employees have visited the platform, and it has clocked up 55,000 page
views. This far exceeds initial expectations and shows that the need for feedback was
indeed very present. An increase in access to the blog via tags has also been
identified, a trend expected to grow as more content becomes available. We conclude that
a weblog is a highly dynamic online communication tool that, if implemented correctly,
has the potential to make a company's internal communications more cohesive and vibrant.
However, it should also be mentioned that any social software project, and weblogs in
particular, can wreak havoc if the basic success factors discussed above are not fully
adhered to. Nonetheless, weblogs inherently incorporate respect for individual
self-expression and thus provide an excellent forum for the free exchange and
development of ideas, which can make employees feel more involved in a company and more
closely connected to the corporate vision, even in times of crisis. Even though weblogs
do not offer a solution to every problem of corporate communications departments, they
can free the minds of the people who make up an organization and make internal
communications more effective.
References
1. Ludewig, M., Röttgers, J.: Jedem sein Megaphon – Blogs zwischen Ego-Platform, Nischenjournalismus und Kommerz. C’t, Heft 25, Report | Web 2.0: Blogs, 162–165 (2007)
2. Jacobsen, N.: Corporate Blogging – Kommunikation 2.0, Manager Magazin,
http://www.manager-magazin.de/it/artikel/0,2828,518180,00html
3. Klostermeier, J.: Web 2.0: Verlieren Sie nicht den Anschluss, Manager Magazin,
http://www.manager-magazin.de/it/ciospezial/0,2828,517537,00.html
4. Tiedge, A.: Webtagebücher: Wenn der Chef bloggt, Manager Magazin,
http://www.manager-magazin.de/it/artikel/0,2828,513244,00.html
5. Hamburg-Media.Net: Enterprise 2.0 - Start in eine neue Galaxie. Always on, Ausgabe 9
(February 2009)
6. Koch, M., Richter, A.: Enterprise 2.0: Planung, Einführung und erfolgreicher Einsatz von
Social Software in Unternehmen. Oldenbourg Wissenschaftsverlag, München (2008)
7. Cowen, J., George, A.: An Eternal Conversation within a Corporation: Using weblogs as
an Internal Communications Tool. In: Proceedings of the 2005 Association for Business
Communication Annual Convention (2005)
8. Langham, M.: Social Software goes Enterprise. Linux Enterprise (Weblogs, Wikis and
RSS Special) 1, 53–56 (2005)
9. Heng, S.: Blogs: The new magic formula for corporate communications? Deutsche Bank
Research, Digital Economy (Economics) (53), 1–8 (2005)
10. Leibhammer, J., Weber, M.: Enterprise 2.0 – Analyse zu Stand und Perspektiven in der
deutschen Wirtschaft, BITKOM (2008)
11. BerleCon Research: Enterprise 2.0 in Deutschland – Verbreitung, Chancen und Herausforderungen, BerleCon Research im Auftrag der CoreMedia (2007)
12. IEEE: Web Collaboration in Unternehmen. In: Proceedings of first IEEE EMS Workshop
about Web Collaboration in Enterprises, September 28, Munich (2007)
13. Sawhney, M.S.: Angriff aus der Blogosphäre. Manager Magazin 2 (2005),
https://www.manager-magazin.de/harvard/0,2828,343644,00.html
14. Zerfaß, A.: Corporate Blogs: Einsatzmöglichkeiten und Herausforderungen, p.6 ff (2005),
http://www.bloginitiativegermany.de
15. Scheffler, M.: Bloggers beware! Tipps fürs sichere Bloggen im Unternehmensumfeld.
Bedrohung Web 2.0, SecuMedia Verlags-Gmbh (2007)
16. Wood, L.: Blogs & Wikis: Technologies for Enterprise Applications? The Gilbane Report 12(10), 1–9 (2005)
17. Heuer, S.: Skandal in Echtzeit. BrandEins Schwerpunkt: Kommunikation Blog vs. Konzern 02/09, 76–79 (2009)
18. Washkuch, F.: Leadership transition comms requires broader strategy in current economy,
July 2009, p.10 (2009), http://PRWEEKUS.com
19. Baker, S., Green, H.: Blogs will change your business. Business Week, May 2, 57–67 (2005),
http://www.businessweek.com/magazine/content/05_18/b3931001_
mz001.htm
20. Röll, M.: Business Weblogs – a pragmatic approach to introducing weblogs in medium
and large enterprises. In: Burg, T.N. (Hrsg.), BlogTalks, Wien 2004, pp. 32–50 (2004)
21. Eck, K.: Substantial reputational risks, PR Blogger,
http://klauseck.typepad.com/prblogger/2005/02/pr_auf_der_zus
c.html
22. Argenti, P.A.: Corporate Communications. McGraw-Hill/Irwin, New York (2003)
23. O’Shea, W.: Blogs in the workplace, New York Times. July 7 (2003),
http://www.nytimes.com/2003/07/07/technology/07NECO.html?ex=
1112846400&en=813ac9fbe3866642&ei=5070
24. Snell, J.: Blogging@IBM (2005), http://www-128.ibm.com/developerworks/
blogs/dw_blog.jspa?blog=351&roll=-2#81328
25. O’Reilly, T.: Web 2.0 Compact Definition: Trying again (2006),
http://radar.oreilly.com/archives/2006/12/web_20_compact.html
26. Cook, T., Hopkins, L.: Social Media or, How I learned to stop worrying and love communication (2007),
http://trevorcook.typepad.com/weblog/files/CookHopkinsSocialMediaWhitePaper-2007.pdf
27. Steiner, P.: Cartoon. The New Yorker 69(20) (1993),
http://www.unc.edu/depts/jomc/academics/dri/idog.html
28. Bross, J., Sack, H., Meinel, C.: Kommunikation, Partizipation und Wirkungen im Social
Web. In: Zerfaß, A., Welker, M., Schmidt, J. (eds.) Kommunikation, Partizipation und
Wirkungen im Social Web, Band 2 der Neuen Schriften zur Online-Forschung, Deutsche
Gesellschaft für Online-Forschung (Hrsg.), pp. 265–280. Herbert van Halem Verlag, Köln
(2008)
29. Meinel, C., Sack, H.: WWW – Kommunikation, Internetworking, Webtechnologien.
Springer, Heidelberg (2003)
30. Varma, Y.: SSO with SAP enterprise portal, ArchitectSAP Solutions,
http://architectsap.wordpress.com/2008/07/14/sso-with-sapenterprise-portal/
31. Secude, How to Improve Business Results through Secure SSO to SAP,
http://www.secude.com/fileadmin/files/pdfs/WPs/SECUDE_WhiteP
aper_BusinessResultsSSOforSAP_EN_090521.pdf
32. The Internet Engineering Task Force (IETF), Internet X.509 Public Key Infrastructure
Certificate and CRL Profile, http://www.ietf.org/rfc/rfc2459.txt
33. Zerfaß, A., Boelter, D.: Die neuen Meinungsmacher - Weblogs als Herausforderung für
Kampagnen, Marketing, PR und Medien. Nausner & Nausner Verlag, Graz (2005)
34. Berlecon Research: Weblogs in Marketing and PR (Kurzstudie), Berlin (2004)
35. Leisegang C., Mintert S.: Blogging Software, iX (July 2008)
36. Scripting News, XML-RPC Home Page, http://www.xmlrpc.com/
37. Cronin-Lukas, A.: Intranet, blog, and value, The big blog company,
http://www.bigblogcompany.net/index.php/weblog/category/C45/
38. Kircher, H.: Web 2.0 - Plattform für Innovation. IT-Information Technology 49(1), 63–65
(2007)
Effect of Knowledge Management on Organizational
Performance: Enabling Thought Leadership and Social
Capital through Technology Management
Michel S. Chalhoub
Lebanese American University, Lebanon
[email protected]
Abstract. The present paper studies the relationship between social networks
enabled by technological advances in social software, and overall business performance. With the booming popularity of online communication and the rise
of knowledge communities, businesses are faced with a challenge as well as an
opportunity – should they monitor the use of social software or encourage it and
learn from it? We introduce the concept of user-autonomy and user-fun, which
go beyond the traditional user-friendly requirement of existing information
technologies. We identified 120 entities out of a sample of 164 from Mediterranean countries and the Gulf region, to focus on the effect of social exchange
information systems in thought leadership.
Keywords: Social capital, social software, human networks, knowledge management, business performance, communities of practice.
1 Introduction
The present paper studies the relationship between social networks enabled by technological advances in social software, and overall business performance. With the
booming popularity of online communication and the rise of knowledge communities,
businesses are faced with a challenge as well as an opportunity – should they monitor
the use of social software or encourage it and learn from it? We introduce the concept
of user-autonomy and user-fun, which go beyond the traditional user-friendly requirement of existing information technologies. We identified 120 entities out of a
sample of 164 from Mediterranean countries and the Gulf region, to focus on the
effect of social exchange information systems in thought leadership.
During our exploratory research phase, we put forward that for a company to practice thought leadership, its human resources are expected to contribute continuously to systems that support the development of social capital. The majority of our respondents confirmed that although classical business packages such as enterprise resource planning (ERP) systems have come a long way in supporting business performance, they remain distant from the fast-changing challenges that employees face in their daily lives. Respondents favored the use of social software (blogs, wikis, text chats, internet forums, Facebook and the like) to open and conduct discussions that are both intellectual and fun, get advice, share experiences, and connect with communities of similar interests. ERPs would continue to focus on business processing while leaving room for social software in building communities of practice.
In a second phase, we identified six dimensions where knowledge systems could
be effective in supporting business performance. Those dimensions are (X1) training
and on-the-job application of social software, (X2) encouraging participative decision-making, (X3) spurring thought leadership and new product development (NPD)
systems, (X4) fostering a culture of early technology adoption, (X5) supporting customer-centered practices through social software, and (X6) using search systems and
external knowledge management to support thought leadership. We performed linear
regression analysis and found that (X1) is a learning mechanism that is positively
correlated with company performance. (X2), which represents participative decision-making, gives rise to informed decisions and is positively and significantly related to
company performance. (X3), or the use of social software to support thought leadership,
is positively and significantly related to company performance. Most employees indicated that they increasingly shifted to social participation, innovation and long term
relationships with fellow employees and partners. (X4), relevant to how social software fosters self-improvement and technology adoption is found to be statistically
insignificant, but this may be due to the company setting related to sampling and
requires further research. (X5) which corresponds to supporting customer-centered
practices through social software was found positively and significantly related to
company performance. (X6), representing the role of social software and advanced
search systems that support thought leadership through external knowledge management is statistically insignificant in relation to company performance.
Although this last result requires further research, it is consistent with our early
findings that most respondents in the geographic region surveyed rely on direct social
interaction rather than information technology applications. In sum, we recommend
that social networks and their enabling information systems be integrated in business
application rather than being looked upon by senior management as a distraction from
work. Social software grew out of basic human needs to communicate, and is
deployed through highly ergonomic tools. It lends itself to integration in business
applications.
1.1 Research Rationale
With competitive pressures in globalizing markets, the management of technology
and innovation has become a prerequisite for business performance. New ways to
communicate, organize tasks, design processes, and manage people have evolved.
Despite the competitive wave of the 1990s which pushed firms to lower costs and
adopt new business models, pressure remained to justify investments in management
systems [1]. This justification was further challenged by a lack of measures for
knowledge management systems as the latter is tacit and mobile [2]. Throughout the
last decade, increased demand by users to interact in communities of interests – both
intellectual and fun – gave rise to social software.
We identified six dimensions along which managers could be pro-active in
harnessing social software to enable employees to perform better collectively. These
dimensions comprise:
(1) technology training and on-the-job application of social networks,
(2) using enterprise communication mechanisms to increase employee participation in decision-making,
(3) spurring employees towards thought leadership and new product development (NPD) systems. Thought leadership is our terminology referring to the
ability to create, develop, and disseminate knowledge in a brainstorming
mode, whereby the group of knowledge workers or team plays a leading role
in originating ideas and building intellectual capital,
(4) fostering a culture of technological advancement and pride of being affiliated
with the organization,
(5) supporting customer-centered practices and customer relationship management (CRM) systems, and
(6) using social software to support thought leadership through external knowledge management.
The six dimensions above represent initiatives directed at the development of intellectual capital. They call for a competitive performance at an organizational level, while
contributing to self-improvement at an individual level. They are all geared towards
building a culture that appreciates early adoption of technology to remain up-to-date in
the fast-evolving field of information and management systems. Gathering, processing,
and using external knowledge is not only about the supporting technology, but mostly
about an entire process whereby the employee develops skills in performing research.
2 Organizational and Managerial Initiatives That Support
Intellectual Capital Development
2.1 Technology Training and On-the-Job Application
It is common knowledge that employee growth contributes to overall company performance. Several researchers suggest measurement techniques to link individual
growth to company results [3]. Strategic planning can no longer be performed without
accounting for technology investment and what the firm needs to address in terms of
employee training on new tools and techniques [4], [5], [6], [7].
Technology training has made career paths more complex, characterized by lateral
moves into, and out of, technical jobs. But at the same time, it provides room for user-autonomy. It was found that technology training provides intellectual stimulation and
encourages employees to apply newly learned techniques on the job, realizing almost
immediate application of what they were trained for. In particular, technology training
allowed employees to develop on their own and with minimal investments personalized community systems powered by social software [8], [9].
2.2 Enterprise Communication Mechanisms Applied to Participative Decision-Making
Employees strive to be involved in decision-making as this relates to self-improvement. Over the last few decades, concepts and applications of enterprise-wide
collaboration have evolved to show that participation leads to sounder decisions [10],
[11]. Most inventions and innovations that have been celebrated at the turn of the
century demonstrated the power of collaborative efforts across one or more organizations [12]. Communities that are socially networked thrive on knowledge diffusion and
relationships that foster innovation. The concept behind ERP type of applications is to
move decisions from manual hard copy reports to a suite of integrated software modules with common databases to achieve higher business performance [13]. The database collects data from different entities of the enterprise, and from a large number of
processes such as manufacturing, financial, sales, and distribution. This mode of operation increases efficiency, and allows management to tap into the system at any time
and make informed decisions [9]. But for employees to participate in decisions, they
must be equipped with the relevant processes and technology systems to help capture,
validate, organize, update, and use data related to both internal and external stakeholders. This approach requires a system that supports social interaction to allow for
brainstorming with minimal constraints, allowing employees to enjoy their daily interaction [14], [15]. Modern social systems seek to go beyond the cliché of user-friendly
features and more into user-fun features.
2.3 Supporting Thought Leadership through NPD Systems and Innovation
Novelty challenges the employee to innovate [16]. However, it is important to seek
relevant technology within an organizational context, as opposed to chasing any new
technology because it is in vogue. It has been argued that new product management
is an essential part of the enterprise not only for the sake of external competitive
moves, but also to bolster company culture and employee team-orientation - the closer
to natural social interaction, the better [17]. In that regard, processes in R&D become
necessary to drive innovation systematically rather than rely on accidental findings.
Nevertheless, product development is cross-functional by nature and requires a multifaceted free flow of ideas that is best supported by social software [18], [19].
2.4 Fostering a Culture of Technological Advancement and Pride of
Organizational Affiliation
Technological advancements have profound effects on company culture [19]. For
example, before the internet, the entire supply chain coordination was hindered by the
steep challenges of exchanging information smoothly among internal supply chain
systems such as manufacturing, purchasing, and distribution, and with external supply
chain partners such as suppliers and distributors. Today, enterprise systems provide
compatibility and ease of collaboration, while social software facilitates the development of a culture of exchange and sharing. Many respondents expressed “pride” in
belonging to an organization that built a technology-enabled collaborative culture and
enhanced their professional maturity. This effect has been proven over the last two
decades in that the adoption of modern technology is a source of pride to many employees [20], [21].
2.5 Supporting Customer-Centered Practices through CRM Types of Systems
Thought leadership has been illustrated by many examples of firms that drove industrial innovation such as Cisco, Intel, and Microsoft [22]. The internal enterprise systems have undergone serious development over the last few decades to integrate seamlessly with customer relationship systems, such as CRM type of applications [23].
While ERPs help achieve operational excellence, CRM helps build intimacy with the
customer. That intimacy with the customer was identified in past research, and in our
exploratory research as an important part of the employee’s satisfaction on the job
[24]. During our interviews, many expressed that the best part of their job is to accomplish something that is praised by the customer. Social software is now making
strides in assisting with customer intimacy including the use of blogs and wikis [25].
2.6 Using Advanced Systems to Support Thought Leadership through External
Knowledge Management
Knowledge formation has evolved into a global process through the widespread adoption of
web technologies and knowledge dissemination [26], [27]. Over the last few decades, firms have
grown into more decentralized configurations, and many researchers argued that it
would be no longer feasible to operate with knowledge and decisions centralized in a
single location [28]. The development of integrated technology management processes became critical to business performance, as they link external and internal
knowledge creation, validation, and sharing [29]. Potential business partners are increasingly required to combine competition and cooperation to assist a new generation of managers in configuring alliances and maximizing business opportunities [30],
[31]. A range of social software applications are making their way into corporate
practice including internal processes and external supply chain management to build
and sustain partnerships [32], [33].
3 Research Hypotheses
Based on the managerial initiatives above, we state our six hypotheses:
• H1: Practicing technology training with direct on-the-job application is positively
correlated with company performance.
• H2: Using enterprise communication mechanisms to apply participative decision-making is positively correlated with company performance.
• H3: Supporting thought leadership through new product development information
systems is positively correlated with company performance.
• H4: Fostering a culture of technological advancement and pride of being affiliated
with the organization is positively correlated with company performance.
• H5: Supporting customer-centered practices through customer relationship management types of systems is positively correlated with company performance.
• H6: Investing in advanced information technology systems to support external
knowledge management is positively correlated with company performance.
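Hypotheses of this form are commonly tested by fitting an ordinary least squares model with one coefficient per dimension. The sketch below only illustrates that fitting step on synthetic, noise-free data; it does not use the survey responses, and the 1-to-5 score range, the generating coefficients, and all variable names are assumptions made for illustration.

```python
import numpy as np

# Synthetic illustration only: these are NOT the survey responses.
rng = np.random.default_rng(0)
n_obs, n_vars = 118, 6                            # sizes chosen for illustration
X = rng.uniform(1.0, 5.0, size=(n_obs, n_vars))   # hypothetical scores for X1..X6

true_intercept = 1.0                              # arbitrary values used to generate y
true_betas = np.array([0.2, 0.1, 0.15, 0.05, 0.2, 0.0])
y = true_intercept + X @ true_betas               # noise-free, so the fit recovers them

# Ordinary least squares: prepend a column of ones for the intercept.
design = np.column_stack([np.ones(n_obs), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

intercept_hat, betas_hat = coef[0], coef[1:]
print(intercept_hat, betas_hat)
```

With real, noisy responses the same fit would be followed by a significance test on each coefficient, which requires a statistics package and is not shown here.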
3.1 Results of Empirical Analysis
The linear regression analysis provided a reliable test with an R of 0.602 (R² = 0.362), with beta coefficients β1, β2, …, β6 and their relative significance assessed through the p-values. We used a 95% confidence level. We find that X1, X2, X3, and X5 are significant at the 95% confidence level, but that X4 and X6 are insignificant in relation to company performance. The hypotheses H1, H2, …, H6 were tested using the regression equation. The regression results are as follows:
Y = β0 + β1·X1 + β2·X2 + β3·X3 + β4·X4 + β5·X5 + β6·X6
Y = 1.314 + 0.185·X1 + 0.109·X2 + 0.127·X3 + 0.067·X4 + 0.181·X5 − 0.022·X6
Table 1. Independent variables representing the use of technology in enabling human resources in knowledge management and thought leadership, beta coefficients, and significance levels in relation to company performance
Variable  Beta    Sig.   Description
Constant  1.314   0.000
X1        0.185   0.003  Technology training and on-the-job application of social software (& user-autonomy)
X2        0.109   0.016  Enterprise technology systems for participative decision-making through social networks (& user-fun)
X3        0.127   0.005  Technological thought leadership in innovation and product development
X4        0.067   0.145  Pride in culture and technological advancement
X5        0.181   0.006  Customer relationship management systems for service support leveraging social software
X6        -0.022  0.672  Advanced technology applications for partnerships and external knowledge management
With R = 0.602 (R² = 0.362), the regression is significant at F = 10.5 (Sig. = 0.000, below the significance level of 0.05; n = 118).
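The summary statistics reported with Table 1 can be cross-checked against one another. The sketch below assumes the standard overall F statistic for an ordinary least squares model with an intercept, F = (R²/k) / ((1 − R²)/(n − k − 1)), with k = 6 predictors and n = 118 observations as stated in the table note; the function name is ours.

```python
def overall_f(r_squared, n_obs, n_predictors):
    """Overall F statistic of an OLS model with an intercept."""
    explained = r_squared / n_predictors
    unexplained = (1.0 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained

R = 0.602            # multiple correlation coefficient reported in the text
n_obs, k = 118, 6    # sample size from the table note, six predictors X1..X6

r_squared = R ** 2
f_stat = overall_f(r_squared, n_obs, k)
print(round(r_squared, 3), round(f_stat, 1))
```

Under these assumptions the computed values agree with the reported R² of 0.362 and F of 10.5 to the precision given.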
Summary results are presented in Table 1. At the 5% significance level, we found that X1, X2, X3, and X5 are positively and significantly correlated with company performance, while X4 and X6 are not significant.
We accept H1, H2, H3, and H5 as there is positive correlation and statistical significance. We cannot accept H4 and H6 as the relationship in the regression was found insignificant.
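The decision rule above, together with the fitted equation, can be written out as a short sketch. The coefficients and significance values are those reported in Table 1; the input scores are hypothetical, since the measurement scale of X1 to X6 is not stated here, and the function names are ours.

```python
# Coefficients and significance values as reported in Table 1.
INTERCEPT = 1.314
BETAS = {"X1": 0.185, "X2": 0.109, "X3": 0.127,
         "X4": 0.067, "X5": 0.181, "X6": -0.022}
P_VALUES = {"X1": 0.003, "X2": 0.016, "X3": 0.005,
            "X4": 0.145, "X5": 0.006, "X6": 0.672}
ALPHA = 0.05  # significance level used in the paper

def predict_performance(scores):
    """Evaluate the fitted regression equation for one set of X1..X6 scores."""
    return INTERCEPT + sum(BETAS[name] * scores[name] for name in BETAS)

def accepted_hypotheses():
    """Accept Hi when the coefficient of Xi is positive and significant at ALPHA."""
    return ["H" + name[1:] for name in BETAS
            if BETAS[name] > 0 and P_VALUES[name] < ALPHA]

# Hypothetical input: every dimension scored 1.0 (the scale is an assumption).
example_scores = {name: 1.0 for name in BETAS}
predicted = predict_performance(example_scores)
accepted = accepted_hypotheses()
print(predicted, accepted)
```

The rule reproduces the outcome stated in the text: H1, H2, H3, and H5 are accepted, while H4 and H6 are not.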
4 Conclusions and Recommendations
The use of social software in the development of communities of practice sharing
common intellectual interests and pro-actively managing knowledge fosters thought
leadership. Our research shows that people on the job are increasingly searching for technology that goes beyond the traditional user-friendly promise and moves toward user-autonomy and user-fun. We also found that individual autonomy and partaking in idea generation while having fun are positively correlated with company performance, as evidenced by the regression analysis of primary data. Decisions about investment in new technologies need to be based on relevance to the employees’ work environment rather than excitement about novelty.
We propose a framework built on six technology management initiatives – that we
used as evaluation dimensions - and argue that if included in the company’s strategic
planning, they result in competitive advantage. The six areas provide measurable and
manageable variables that could be well used as performance indicators. Our empirical model uses primary data collected from a subset of 120 companies, of a sample of
164 Mediterranean and Gulf entities. The dependent variable is an index of growth,
profitability, and customer service quality.
The empirical analysis showed that training on technology and its application on-the-job in social networks, the use of enterprise systems for participative decision-making, fostering thought leadership and product development using social software,
and the use of relationship systems for customer intimacy are all positively and significantly related to company performance. The cultural component represented by
pride in being part of an organization that promotes the use of modern technologies
was not found significant. The use of technology to support business partnership and
apply external knowledge management was not found significant either. The two
latter results do not indicate that these items are not important, but rather need to be
revisited in more detail. This is especially true in Middle Eastern companies where
company cultures are heavily influenced by owners, and employees are early technology adopters on their own. In such cases, social software is still perceived as
somewhat out of scope at work, or, better put, designed for fun and not for business.
Nevertheless, this perception is changing as business managers are becoming increasingly aware of the business value of social software. Further, external knowledge
management is still practiced through face to face interactions and events rather than
through technology tools and techniques.
Future research could focus on other regions where market data is available for
publicly traded companies. The study would then explore the relationship between
technology management initiatives and company value on open markets.
References
[1] Tushman, M., Anderson, P. (eds.): Managing Strategic Innovation and Change: A Collection of Readings. Oxford University Press, New York (1997)
[2] Gumpert, D.E.: U.S. Programmers at overseas prices. Business Week Online (December
3, 2003)
[3] Kaplan, R.S., Norton, D.P.: The Strategy-Focused Organization: How Balanced Scorecard Companies Thrive in the New Business Environment. Harvard Business
School Publishing Corporation, Cambridge (2001)
[4] Training Industry Inc.: Training Industry Research Report on Training Effectiveness
(1999)
[5] Kim, W., Mauborgne, R.: Strategy, value innovation, and the knowledge economy. Sloan
Management Review, 41–54 (Spring 1999)
[6] Kim, W., Mauborgne, R.: Charting your company’s future. Harvard Business Review,
76–83 (June 2002)
[7] Jun, H., King, W.R.: The Role of User Participation In Information Systems Development: Implications from a Meta-Analysis. Journal of Management Information
Systems 25(1) (2008)
[8] James, W.: Best HR practices for today’s innovation management. Research Technology
Management 45(1), 57–60 (2002)
[9] Liang, H., Sharaf, N., Hu, Q., Xue, Y.: Assimilation of enterprise systems: The effect of
institutional pressures and the mediating role of top management. MIS Quarterly 31(1)
(March 2007)
[10] Miles, R., Snow, C.: Organizations: New concepts for new forms. California Management Review 28(3), 62–73 (1986)
[11] Vanston, J.: Better forecasts, better plans, better results. Research Technology Management 46(1), 47–58 (2003)
[12] Stone, F.: Deconstructing silos and supporting collaboration. Employment Relations Today 31(1), 11–18 (2004)
[13] Ferrer, J., Karlberg, J., Hintlian, J.: Integration: The key to global success. Supply Chain
Management Review (March 2007)
[14] Chalhoub, M.S.: Knowledge: The Timeless Asset That Drives Individual Decision-Making and Organizational Performance. Journal of Knowledge Management – Cap
Gemini (1999)
[15] Xue, Y., Liang, H., Boulton, W.R.: Information Technology Governance In Information
Technology Investment Decision Processes: The Impact of Investment Characteristics,
External Environment and Internal Context. MIS Quarterly 32(1) (2008)
[16] Andriopoulos, C., Lowe, A.: Enhancing organizational creativity: The process of perpetual challenging. Management Decision 38(10), 734–749 (2000)
[17] Crawford, C., DiBenedetto, C.: New Products Management, 7th edn. McGraw-Hill,
Philadelphia (2003)
[18] Alboher, M.: Blogging’s a Low-Cost, High-Return Marketing Tool. The New York
Times. December 27 (2007)
[19] Laudon, K.C., Traver, C.G.: E-Commerce: Business, Technology, Society, 5th edn. Prentice-Hall, Upper Saddle River (2009)
[20] Bennis, W., Mische, M.: The 21st Century Organization. Jossey-Bass, San Francisco
(1995)
[21] Hof, R.D.: Why tech will bloom again. BusinessWeek, 64–70 (August 25, 2003)
[22] Gawer, A., Cusumano, M.: Platform Leadership: How Intel, Microsoft, and Cisco Drive
Industry Innovation. Harvard Business School Press, Cambridge (2002)
[23] Goodhue, D.L., Wixom, B.H., Watson, H.J.: Realizing business benefits through CRM:
Hitting the right target in the right way. MIS Quarterly Executive 1(2) (June 2002)
[24] Gosain, S., Malhotra, A., ElSawy, O.A.: Coordinating flexibility in e-business supply
chains. Journal of Management Information Systems 21(3) (Winter 2005)
[25] Wagner, C., Majchrzak, A.: Enabling Customer-Centricity Using Wikis and the Wiki
Way. Journal of Management Information Systems 23(3) (2007)
[26] Sartain, J.: Opinion: Using MySpace and Facebook as Business Tools. Computerworld
(May 23, 2008)
[27] Murtha, T., Lenway, S., Hart, J.: Managing New Industry Creation: Global Knowledge
Formation and Entrepreneurship in High Technology. Stanford University Press, Palo Alto
(2002)
[28] Rubenstein, A.: Managing technology in the decentralized firm. Wiley, New York (1989)
[29] Farrukh, C., Fraser, P., Hadjidakis, D., Phaal, R., Probert, D., Tainsh, D.: Developing an
integrated technology management process. Research Technology Management, 39–46
(July-August 2004)
[30] Chalhoub, M.S.: A Framework in Strategy and Competition Using Alliances: Application
to the Automotive Industry. International Journal of Organization Theory and Behavior 10(2), 151–183 (2007)
[31] Cone, E.: The Facebook Generation Goes to Work. CIO Insight (October 2007)
[32] Kleinberg, J.: The Convergence of Social and Technological Networks. Communications
of the ACM 51(11) (November 2008)
[33] Malhotra, A., Gosain, S., ElSawy, O.A.: Absorptive capacity configurations in supply
chains: Gearing for partner-enabled market knowledge creation. MIS Quarterly 29(1)
(March 2005)
Finding Elite Voters in Daum View:
Using Media Credibility Measures
Kanghak Kim¹, Hyunwoo Park¹, Joonseong Ko², Young-rin Kim²,
and Sangki Steve Han¹
¹ Graduate School of Culture Technology, KAIST
335 Daejeon, South Korea
{fruitful_kh,shineall,stevehan}@kaist.ac.kr
² Daum Communications Corp
Jeju, South Korea
{pheony,ddanggle}@daumcorp.com
Abstract. As news media have been expected to provide valuable news contents to readers, credibility of each medium depends on what news contents it
has created and delivered. In traditional news media, staff editors look into
news articles and arrange news contents to enhance their media credibility. By
contrast, in social news services, general users play an important role in selecting news contents through voting behavior as it is practically impossible for
staff editors to go through thousands of articles sent to the services. However,
although social news services have strived to develop news ranking systems
that select valuable news contents utilizing users’ participation, they still
represent popularity rather than credibility, or give users too much burden.
In this paper, we examined whether there is a group of elite users who vote for articles whose journalistic values are higher than others. To do this, we first assessed the journalistic values of 100 social news contents with a survey. Then, we extracted a group of elite users based on what articles they had voted for. To test whether the elite group shows a tendency to vote for journalistically valuable news contents, we analyzed their voting behavior in another news pool. Finally, we concluded with a promising result: news contents voted for by the elite users show significantly higher credibility scores than other news stories do, while the number of votes from general users is not significantly correlated with the scores.
Keywords: News Ranking System, Media Credibility, Collaborative Filtering,
Social Media, Social News Service.
1 Introduction
Since the web emerged, the ecosystem of journalism has gone through huge changes.
Given the web, people have become able to publish whatever they want without any
cost, and the barrier between professional journalists and general people is no longer
clear. The We Media report (2003) from the Media Center named this phenomenon ‘participatory journalism.’ Participatory journalism is defined as the act of a citizen, or group of citizens, playing an important role in collecting, reporting, analyzing and disseminating news and information. Social news media like Digg or Reddit help this phenomenon happen. People collect, recommend, and read news contents in social news media. According to the Pew Research Center for the People & the Press, 40 percent of Americans keep up with news about national and international issues through the internet, and the percentage has been rapidly increasing.
[Figure: line chart titled “Where Do You Get Most of Your National and International News?”; vertical axis Percent (%), horizontal axis 2004 to 2008; labeled values: TV 70, Internet 40, Newspaper 35.]
Fig. 1. Sources of News Consumption in the US
For news media, selecting valuable contents has always been considered essential since it determines their credibility as information providers. In traditional media, therefore, staff editors look into articles, select some of them, and arrange the selected stories. In contrast, most social news services use automated news ranking systems that draw on users’ participation to select credible items for their first page, as it is practically impossible for a small number of staff editors to screen numerous articles from a large number of writers. For instance, Slashdot, a representative social news service, adopted meta-moderation to enhance its news moderators’ credibility, while recently launched services such as NewsCred and NewsTrust aim to disseminate credible and trustworthy news contents by providing users with different voting methods. However, the problem is that these systems often select popular contents rather than credible ones, or otherwise place too much burden on users.
This study examines whether there is a group of people who tend to
vote for journalistically valuable news content. If there is, their votes will be not only
powerful but also efficient in selecting valuable content. To this end, we first review
research on media credibility measures and ranking systems in section 2. We then
assess the value of news articles based on media credibility and extract
the people who voted for journalistically valuable news content in section 3, and
finally analyze their voting behavior toward another news pool in section 4. The
results indicate that there is a group of elite users in terms of voting credibility. This
is promising because their voting behavior can be used to enhance the credibility of
the selected news content.
2 Related Work
2.1 Research on Journalistic Media Credibility
Research on journalistic media credibility has mainly focused on identifying components with which to assess perceived media credibility. Related research started in the early
1950s. Hovland and Weiss suggested trustworthiness and expertise as source credibility factors. Infante (1980) added dynamism to the previous research. Meyer (1988)
presented measuring components categorized into 2 dimensions, “social concern” and
“credibility of paper”, adopting Gaziano and McGrath’s (1986) well-known 12 factors (see footnote 1). Rimmer and Weaver (1987) suggested another 12 elements, including concern
for community well-being and factual foundations of information published.
Researchers then started focusing on finding common or different measuring components for online news media. Ognianova (1998) used 9 semantic differential elements (see footnote 2),
while Kiousis (1999) conducted a survey with 4 elements and concluded that
online news is more credible than television. The Berkman Center for Internet and Society
at Harvard University organized a conference titled “Blogging, Journalism & Credibility: Battleground and Common Ground” in 2005 and discussed which subjects can be
better dealt with in online journalism and what components should be considered to
measure credibility. It shows how controversial it is to differentiate online news credibility from traditional credibility. Cliff Lampe and R. Kelly Garrett classified measuring components into 2 groups (see footnote 3), normative and descriptive review elements, and
examined which of 4 review instruments (normative, descriptive, full, and mini
review) performs well in terms of accuracy, discriminating credible news from other news,
and relieving user burden.
Thus, numerous studies have been conducted to measure perceived credibility.
Although these studies provide good criteria for measuring credibility, they
cannot be applied directly to news ranking systems because they are not about forecasting
credibility.
2.2 News Ranking Systems
User-participation based news ranking systems used in representative social news
services can be categorized into three groups: simple voting, weighted voting, and
rating-based voting.
Digg and Reddit’s ranking systems are examples of the simple voting method.
They offer Digg / Bury and Upvote / Downvote features to users, and once a news
article earns a critical mass of Diggs or Upvotes, it is promoted to the front page.
NewsCred’s method is similar to that of Digg and Reddit, except that NewsCred asked
users to vote for articles when they find them credible. The simple voting method is powerful in that it can stimulate users’ participation. However, it can cause well-known group voting problems such as the Digg Mafia, Digg Bury Brigade, or Reddit
Downmod Squad. More fundamentally, it does not represent credibility but popularity. Slashdot is an example of the weighted voting method. Each news moderator has a
different weight in selecting news articles, depending on the evaluation from meta-moderators. NewsTrust has tried a rating-based voting system. It encourages users
to evaluate news content with a rating instrument involving several rating components such as accuracy or informativeness. Although it turns out to be quite reliable in
assessing the journalistic value of news content, it can lower users’ participation because of its complexity. Daum View adopted a system that utilizes votes from elite
users called Open Editors, but the performance of the system cannot be accurately
evaluated due to the lack of criteria for selecting credible news content and elite users.
On the other hand, Techmeme relied on a method based on structure analysis. It analyzed how news sites link to each other, and considered anything gathering many
inbound links as “news”. However, Techmeme gave up on its fully automated algorithm and started allowing manual news selection because of its poor performance. This
shows how complex it is to account for credibility with structure analysis.
1 These were fairness, bias, telling the whole story, accuracy, respect for privacy, watching out
after people’s interest, concern for community, separation of fact and opinion, trustworthiness, concern for public interest, factuality, and level of reporter training.
2 The 9 semantic differential elements include factual-opinionated, unfair-fair, accurate-inaccurate,
untrustworthy-trustworthy, balanced-unbalanced, biased-unbiased, reliable-unreliable, thorough-not thorough, and informative-not informative.
3 Their components include accuracy, credibility, fairness, informativeness, originality,
balance, clarity, context, diversity, evidence, objectivity, and transparency.
3 Methods
3.1 Daum View
Daum Communications (Daum) is an internet portal company in Korea. It launched
a social news service named Daum View in February 2005, which has become the
most representative social news service in Korea. Currently, the service has more than
100 million page views per month and approximately 150,000 enrolled news bloggers. It categorizes articles into 5 groups: current, everyday lives, culture/entertainment, science/technology, and sports. In this research, we focus on the
current news category because it is the most likely to be subject to news credibility.
3.2 Assessing Journalistic Credibility of News Set 1
We collected the top 100 popular news articles published from August 26 to September
2 in the Daum View service. To assess the journalistic credibility of these articles, we conducted a survey over the web. Respondents were asked to assess the
credibility of the articles using a review instrument named ‘normative review’,
adopted from C. Lampe and R. Kelly Garrett (2007), since that instrument shows the best
performance in that its results are similar to those from journalism experts. The normative review involves accuracy, credibility, fairness, informativeness, and originality,
which are widely used in traditional credibility measures.
The survey was conducted during September 2 to 9, 2009. A total of 369
people participated and assessed the 100 social news articles on a Likert-type scale.
Besides evaluating journalistic credibility, we asked the subjects to rate the
importance of each survey component on a Likert-type scale, so that the characteristics of social news content are reflected in the value estimation. Then, we rescaled the weights so
that news credibility scores range from 1 to 5, and calculated the credibility scores
for the sample articles using the weight for each component. The results
show that people perceive credibility and accuracy as the most important requirements (0.227 and 0.225 respectively), while originality is considered the least important
factor (0.148).
Table 1. Weights of Credibility Components for Social News Contents
Component         Weight
Accuracy          0.225
Credibility       0.227
Fairness          0.198
Informativeness   0.202
Originality       0.148
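The text does not spell out how the weights are combined with the survey ratings. One plausible reading, sketched below as an assumption rather than the authors' exact procedure, is a weighted sum of the mean 1-to-5 rating per component; because the Table 1 weights sum to 1, the result stays on the 1-to-5 scale. The function names and the "neutral" label are illustrative only.

```python
# Weights taken from Table 1; they sum to 1.0.
WEIGHTS = {
    "accuracy": 0.225,
    "credibility": 0.227,
    "fairness": 0.198,
    "informativeness": 0.202,
    "originality": 0.148,
}


def credibility_score(mean_ratings):
    """Weighted sum of the mean 1-5 Likert rating for each component (assumed formula)."""
    return sum(WEIGHTS[name] * mean_ratings[name] for name in WEIGHTS)


def label(score, good_above=3.5, bad_below=3.0):
    """Thresholds from section 3.2: > 3.5 is "good", < 3.0 is "bad"."""
    if score > good_above:
        return "good"
    if score < bad_below:
        return "bad"
    return "neutral"  # hypothetical name for the middle band
```

An article rated 4 on every component would score about 4.0 under this reading and fall in the "good" band.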
Finally, we calculated credibility scores for the 100 news articles. Among them,
about 20 percent (22 articles) were considered “good” articles in
light of the meaning of the Likert scale (Credibility Score > 3.5), while
another 20 percent (23 articles) were considered “bad” (Credibility Score <
3.0). Below are examples of news credibility scores.
Table 2. Samples of Journalistic Credibility Scores for the News Contents and the Number of
Votes for Them
Previous Ranking   Credibility Score
89                 4.109859
93                 3.996898
76                 3.882244
45                 3.856029
22                 3.807791
47                 3.777746
67                 3.705755
URL: http://v.daum.net/link/3918133
3.3 Collecting User Voting Data
We collected user voting data from Daum View. A total of 73,917 votes from 41,698
users were cast for the 100 news articles. The data show that the number of votes
per user follows a power law, with 32% of active users making 80% of the votes. We
also identified users’ malicious votes, defining a malicious vote as one made so soon
after the user’s previous vote that the user is unlikely to have read the whole
article he or she voted for.
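The malicious-vote definition can be sketched as a time-gap check between consecutive votes by the same user. The paper does not report the time threshold, so `min_read_seconds` below is a hypothetical parameter, and the record layout is assumed for illustration.

```python
def flag_malicious(votes, min_read_seconds=30):
    """Flag votes cast too soon after the same user's previous vote.

    votes: iterable of (user_id, article_id, timestamp_in_seconds).
    min_read_seconds: hypothetical threshold; the paper does not state its value.
    Returns a set of (user_id, article_id) pairs considered malicious.
    """
    flagged = set()
    last_vote_time = {}
    for user, article, ts in sorted(votes, key=lambda v: v[2]):
        previous = last_vote_time.get(user)
        if previous is not None and ts - previous < min_read_seconds:
            flagged.add((user, article))
        last_vote_time[user] = ts
    return flagged
```

A more faithful version would probably scale the threshold with article length, since the definition refers to the time needed to read the whole article.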
3.4 Valid Voting Rate
With each article’s credibility score in hand, we were able to calculate each user’s
valid voting rate (VVR), which is the number of valid votes divided by the
total number of votes the user has made. Valid votes are defined as a user’s votes
for articles whose credibility scores are over 3.5, considering the meaning of the 5-point
Likert scale. In this process, we counted malicious votes for credible articles as
invalid votes, and we also excluded users who made fewer than 3 votes because they can
obtain a high valid voting rate by chance. 36,284 users were excluded in this process
and 5,414 remained.
Fig. 2. Distribution
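As a minimal sketch of the valid voting rate under the definitions above (credibility threshold 3.5, malicious votes counted as invalid, users with fewer than 3 votes excluded), the data structures below are assumed for illustration and are not taken from the study's implementation.

```python
from collections import defaultdict


def valid_voting_rates(votes, scores, malicious=frozenset(), threshold=3.5, min_votes=3):
    """Compute VVR = valid votes / total votes for each user.

    votes: iterable of (user_id, article_id) pairs.
    scores: dict mapping article_id to its credibility score.
    malicious: set of (user_id, article_id) pairs to count as invalid.
    Users with fewer than min_votes votes are left out of the result.
    """
    total = defaultdict(int)
    valid = defaultdict(int)
    for user, article in votes:
        total[user] += 1
        if scores[article] > threshold and (user, article) not in malicious:
            valid[user] += 1
    return {user: valid[user] / n for user, n in total.items() if n >= min_votes}
```

Note that a malicious vote still counts toward the user's total, so it lowers the rate rather than being ignored.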
3.5 Assessing Journalistic Credibility of News Set 2
We collected another 50 top popular current news articles published from September
9 to 16 in Daum View, along with the number of votes each article gathered. Then, we again
assessed the credibility scores of this second news set. The survey method is the same as
for news set 1, except that the number of sample articles is smaller than
in the first news set. The reason is that approximately 50 percent of the news articles
took about 75 percent of the votes, and that the lower an article’s rank, the higher
the ratio of malicious votes it tended to gather. We therefore considered that news
articles with low rankings have little chance of being voted for even if their credibility scores are high enough. In total, 22,488 votes from 14,205 users were gathered.
Fig. 3. (a) Ratio of Malicious Votes to Total Votes per Article. (b) Number of Votes per
Article.
3.6 Evaluating Elite Voters’ Performance
The Pearson correlation coefficient is used to assess the relation between news credibility
scores and votes from elite voters, as well as that between the scores and votes from
general users. In addition, to compare performance among elite user groups, we
defined elite user groups with 3 different criteria: (1) elite user group 1 (VVR >
0.5), (2) elite user group 2 (VVR > 0.4), and (3) elite user group 3 (VVR > 0.3).
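The evaluation step can be sketched as follows: count, per article, the votes cast by users whose VVR exceeds a group's cutoff, then correlate those counts with the articles' credibility scores. This is a hedged illustration with assumed data shapes; the study itself reports results from a statistics package, and significance testing is omitted here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def elite_vote_counts(votes, vvr, min_vvr, articles):
    """Votes per article (in the order of `articles`) from users with VVR > min_vvr."""
    elite = {user for user, rate in vvr.items() if rate > min_vvr}
    counts = {article: 0 for article in articles}
    for user, article in votes:
        if user in elite and article in counts:
            counts[article] += 1
    return [counts[article] for article in articles]
```

Calling `elite_vote_counts` with cutoffs 0.5, 0.4, and 0.3 and passing each result to `pearson` together with the credibility scores would reproduce the three-group comparison in outline.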
4 Results
As we assumed, there was a significant correlation between the number of votes from
elite user groups and news credibility scores. Among them, elite user group 2 showed
the highest level of correlation (Pearson correlation coefficient 0.326), while
elite user groups 1 and 3 showed slightly lower Pearson correlation coefficients (0.288
and 0.287 respectively). However, the number of votes from general users turned out
not to have any significant correlation with the credibility scores.
Table 3. Pearson Correlation Coefficient of General Users and Elite Users
(correlation with News Credibility Score)
Group           # voters   Pearson Correlation   Sig. (2-tailed)
General Users   ?          -.016                 .872
User Group 1    273        .302*                 .043
User Group 2    620        .328*                 .021
User Group 3    914        .287*                 .043
5 Discussion
A majority of social news media adopt user-participation based ranking systems.
That is not only because of the difficulty of measuring credibility through content
analysis, but also because of the social aspect of the web. However, current user-participation based news ranking systems do not show satisfying ranking results in
selecting credible news content. Moreover, they cause other problems such as group
voting. James Surowiecki (2005) also claims that the wisdom of crowds does not emerge
naturally, but requires a proper aggregation methodology.
Wikipedia, a representative example of the wisdom of crowds, dealt with the credibility problem by differentiating users’ power in the system. Its editing model allows
administrators, who are considered trustworthy by the Wikipedia community, to have
more access to restricted technical tools, including protecting or deleting pages. We
assumed that applying this model to social news services can be a good way to enhance credibility without placing too much burden on users. So, we first derived criteria for
evaluating users’ performance in the system from research on media credibility measures, and then selected elite users. As a result, votes from the selected user groups showed a
significant correlation with credibility scores. Although the correlation coefficient
was not very high, it is still promising because the number of votes from general
users did not show any significant correlation with credibility scores, which supports the view that
the fundamental problem of previous ranking systems is that they reflect popularity rather than credibility.
This study is an initial work characterizing users’ particular voting tendencies, and
it does not propose an elaborate news ranking model. Further research is needed on designing a model
that enhances the correlation between users’ votes and credibility.
References
1. Bowman, S., Willis, C.: We Media: How Audiences are Shaping the Future of News and
Information, p. 9. The Media Center at The American Press Institute (2003)
2. The Pew Research Center for the People & the Press, http://peoplepress.org/reports/pdf/479.pdf
3. Slashdot’s meta moderation, http://slashdot.org/moderation.shtml
4. Hovland, C.I., Weiss, W.: The Influence of Source Credibility on Communication Effectiveness. In: Public Opinion Quarterly, vol. 15, pp. 635–650. Oxford University Press,
Oxford (1951)
5. Infante, D.A.: The Construct Validity of Semantic Differential Scales for the Measurement
of Source Credibility. Communication Quarterly 28(2), 19–26 (1980)
6. Gaziano, C., McGrath, K.: Measuring the Concept of Credibility. Journalism and Mass
Communication Quarterly 63(3), 451–462 (1986)
7. Rimmer, T., Weaver, D.: Different Questions, Different Answers? Media Use and Media
Credibility. Journalism Quarterly 64, 28–44 (1987)
8. Ognianova, E.: The Value of Journalistic Identity on the World Wide Web. Paper presented to the Mass Communication and Society Division, Association for Education
in Journalism and Mass Communication, Baltimore (1998)
9. Kiousis, S.: Public Trust or Mistrust? Perceptions of Media Credibility in the Information
Age. Paper presented to the Mass Communication and Society Division, Association
for Education in Journalism and Mass Communication, New Orleans (1999)
10. Lampe, C., Garrett, R.K.: It’s All News to Me: The Effect of Instruments on Rating Provision. Paper presented to the Hawaii International Conference on System Science, Waikoloa, Hawaii (2007)
11. Ko, J.S., Kim, K., Kweon, O., Kim, J., Kim, Y., Han, S.: Open Editing Algorithm: A Collaborative News Promotion Algorithm Based on Users’ Voting History. In: International
Conference on Computational Science and Engineering, pp. 653–658 (2009)
12. Surowiecki, J.: The Wisdom of Crowds. Anchor Books, New York (2005)
A Social Network System Based on an Ontology
in the Korea Institute of Oriental Medicine
Sang-Kyun Kim, Jeong-Min Han, and Mi-Young Song
Information Research Center, TKM Information Research Division,
Korea Institute of Oriental Medicine, South Korea
{skkim,goal,smyoung}@kiom.re.kr
Abstract. In this paper we propose a social network based on an ontology for the Korea
Institute of Oriental Medicine (KIOM). By using the social network, researchers
can find collaborators and share research results with others, so that studies in
Korean Medicine fields can be stimulated. For this purpose, first, personal profiles, scholarships, careers, licenses, academic activities, research results, and
personal connections of all researchers in KIOM are collected. After the relationships and hierarchy among ontology classes and the attributes of classes are defined
by analyzing the collected information, a social network ontology is
constructed using FOAF and OWL. This ontology can be easily interconnected
with other social networks through FOAF and supports reasoning based on the OWL
ontology. In the future, we will construct a search and reasoning system using the
ontology. Moreover, once the social network is active, we will open it to the whole
Korean Medicine field.
1 Introduction
Recently, Social Network Services (abbreviated as SNS)[1] have been
developing at a rapid rate throughout the world. As a result, numerous SNSs have been created, and people
with various purposes are being connected through them. However, with so many
SNSs in existence, the problem of linkage among the various SNSs arises. Facebook
unveiled a social platform called F8, and Google devised a platform named
OpenSocial. These efforts were made in order to standardize the applications offered
by SNSs, but sharing is only possible between users of the particular
platform. Lately, in order to solve this problem, semantic
social networks[1][2] have been suggested, built on networks between people and objects. Projects
that support semantic social networks include, to name a few, FOAF (Friend of a
Friend)[3] and SIOC (Semantically-Interlinked Online Communities)[4]. In fact, MySpace
and Facebook are currently using FOAF.
This paper constructs a social network using an ontology for the Korea Institute of Oriental Medicine (abbreviated as KIOM) as a case of a semantic social network. The
purpose of this paper is to stimulate research on oriental medicine by allowing researchers in KIOM to search for researchers who could aid their research
and by enabling KIOM researchers to easily share their research information.
The KIOM social network constructed in this study has the characteristics described below:
First, our ontology was modeled using OWL[5], which is a semantic web ontology
language. In particular, for the information regarding people and personal contacts, we
used FOAF. These methods allow mutual linkage with other social networks and
ontologies and provide, through OWL inference, much more intelligent
searches than existing systems.
Second, we created a closed social network that can be used only within KIOM,
so that it can be put to actual use after being populated with as much
information as possible. Because it is usable only within the Institute, security can
be maintained. The advantage of this is that researchers can share personal information and private research content that they cannot post on the internet. This internal system can provide a foundation for expanding the network throughout the oriental medicine
community. In fact, Facebook was initially an SNS made for use only at Harvard
University in the U.S.A., and was later opened to the public, expanded, and further developed.
2 Social Network in KIOM
2.1 Construction of the Ontology
The relationships between classes in the social network ontology are shown in the
figure below. The figure does not include all classes, only those that have relationships between objects.
Fig. 1. Class relationship of social network ontology
The focal concepts in the ontology are Institute Personnel and External Personnel. The reason for this is that the constructed ontology is used only internally
within the Institute, and thus these two classes have partially different properties. In
addition, the institute personnel and the external personnel are both connected to the
Organization class, but the institute personnel are connected to the KIOM instance.
The external personnel can belong to diverse organizations such as institutions,
schools, or enterprises. This diversity is distinguished in the Organization class by an organization type property.
Moreover, the institute personnel can have links to research information such as papers, patents, and reports, and to academics, experiences, attainments, academic meetings, and personal contacts through foaf:knows. In particular, paper and patent information
is linked through rdf:Seq. The order of the authors’ or inventors’ names is
important in papers and patents. However, because RDF imposes no order
among instances, the order must be expressed explicitly through rdf:Seq.
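To illustrate why rdf:Seq is needed, the sketch below models RDF triples as plain Python tuples. An rdf:Seq records order through the numbered membership properties rdf:_1, rdf:_2, and so on, so the author order survives even if the triples themselves are stored in arbitrary order. The `ex:` identifiers and the `ex:authors` property are hypothetical and are not taken from the KIOM ontology.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def seq_triples(subject, predicate, seq_node, members):
    """Build triples that attach an ordered rdf:Seq of members to a subject."""
    triples = [
        (subject, predicate, seq_node),
        (seq_node, RDF + "type", RDF + "Seq"),
    ]
    for position, member in enumerate(members, start=1):
        # rdf:_1, rdf:_2, ... carry the position of each member.
        triples.append((seq_node, f"{RDF}_{position}", member))
    return triples


def ordered_members(triples, seq_node):
    """Recover the member order from rdf:_n predicates, whatever the triple order."""
    prefix = RDF + "_"
    numbered = [
        (int(p[len(prefix):]), o)
        for s, p, o in triples
        if s == seq_node and p.startswith(prefix) and p[len(prefix):].isdigit()
    ]
    return [member for _, member in sorted(numbered)]
```

For example, building a sequence for a paper with three authors and then shuffling the triples still yields the authors in their original order from `ordered_members`.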
The Class Hierarchy of the Ontology
The class structure of the social network ontology is shown in the figure below, as
displayed in TopBraid Composer[6].
Fig. 2. Class hierarchy of social network ontology
Under the top-level Entity class, there are the Abstract and Physical classes. This
follows the structure of the top-level ontology called the Suggested Upper Merged
Ontology (SUMO)[7]. Entities that have a location in time and space are regarded
as Physical, and those that are not Physical are regarded as Abstract. Thus,
Abstract contains experience, achievements, and academic information, whereas
Physical contains classes for the instances that lower classes refer to.
In the Physical class, there are the Agent class and the Object class. Agent signifies entities that bring about a change or perform work by themselves, and Object contains all
of the other types.
2.2 The Analysis of the Relationship of the Ontology
This section analyzes the relationships between people and objects in the social network ontology. The information inferred through this analysis is new knowledge
that is not stated in the current ontology, and it can be used in the future by an
ontology-based inference and search system.
Expert Relationship
Expert Relationship refers to finding the experts in a related field. To do
this in the KIOM social network ontology, one can use information on papers
(title of the paper, keywords), patents (title of the patent), graduation dissertation (title
of the dissertation), major, assigned work, and field of interest.
For example, an expert on Ginseng would have done much research on Ginseng,
and thus he or she would have written many papers and hold numerous patents related to
it, and would most likely have majored in, or written a graduation dissertation on, a topic related to Ginseng. In addition, his or her assigned work and field of interest could
be related to Ginseng. In the ontology, projects are
excluded from research information because, although projects exist, participating researchers generally do not take
part in the actual research.
Furthermore, research topics tend to change with trends as time passes. Even
for the same topic of Ginseng, interest in old research decreases.
Therefore, papers, graduation dissertations, and majors need to
be sorted by publication date or graduation date.
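A minimal sketch of the expert search described above: each researcher carries a list of evidence items (papers, patents, dissertation, major, assigned work, field of interest), a topic is matched against their text, and candidates are ordered by number of matches and then by how recent the latest dated match is. The record layout and the ranking rule are assumptions made for illustration; the paper only states which fields are used and that dated items should be sorted by date.

```python
def find_experts(researchers, topic):
    """Rank researchers whose evidence mentions the topic.

    researchers: list of dicts with "name" and "evidence", where evidence is a
    list of (kind, text, year) tuples; year is None for undated items.
    Returns names ordered by match count, then by most recent matching year.
    """
    topic = topic.lower()
    ranked = []
    for person in researchers:
        hits = [item for item in person["evidence"] if topic in item[1].lower()]
        if hits:
            latest = max((year for _, _, year in hits if year is not None), default=0)
            ranked.append((len(hits), latest, person["name"]))
    ranked.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return [name for _, _, name in ranked]
```

Substring matching is the simplest possible choice here; an ontology-based system would more likely match on keyword or concept URIs.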
Mentor Relationship
Mentor Relationship refers to people among the experts who will be useful in a person’s research. In other words, mentors are experts
who can give help or become partners in a research project. Among the
experts, these are people who 1) have a doctorate, or experience or a position at or above
the level of team leader or professor, or 2) are the first author, a co-author, or a corresponding
author of an SCI paper. In the first case, these mentors will be helpful in carrying out
and managing the research, and in the latter case, they will be able to
provide help on technical aspects of the research.
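The two mentor conditions can be written as a small rule check over an expert's record. The field names, the position labels, and the "management" and "technical" tags below are hypothetical; they only mirror the two cases described in the text.

```python
# Assumed labels; positions ranked above these would have to be added to the set.
SENIOR_POSITIONS = {"team leader", "professor"}
LEAD_ROLES = {"first", "co-author", "corresponding"}


def mentor_kinds(expert):
    """Return the kinds of help an expert could offer as a mentor (possibly empty).

    expert: dict with optional "degree", "position", and "papers", where each
    paper is a dict with a boolean "sci" flag and the expert's author "role".
    """
    kinds = set()
    if expert.get("degree") == "doctorate" or expert.get("position") in SENIOR_POSITIONS:
        kinds.add("management")  # case 1: helps carry out and manage research
    if any(p["sci"] and p["role"] in LEAD_ROLES for p in expert.get("papers", [])):
        kinds.add("technical")  # case 2: helps with technical aspects
    return kinds
```

An expert for whom the function returns an empty set remains an expert but would not be suggested as a mentor under these rules.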
In addition to these two cases, if we divide the mentors into Internal Mentors and
External Mentors, we can also infer the relationships below.
Internal mentors refer to mentor-and-mentee relationships that exist within the Institute. In the case of projects, the research director generally becomes the mentor for
participating researchers. In the case of papers, the first author, the co-author, and the
corresponding author carry out the research with the other authors, but because they take
more responsibility than the other authors (excluding appointed positions), they can become
mentors for the relevant papers.
In external mentor relationships, the mentors are those outside of the Institute
and the mentees are the researchers in the Institute. The relationship between
a researcher and his or her academic adviser tends to continue
as a mentor relationship after graduation. Moreover, in the case of papers, the authors may have received
help in writing them from external personnel. Therefore, we can infer that such external
personnel are external mentors.
Relationship of Personal Contact
The inference for personal contact not only searches the linkage information of people
in the social network but also tries to find implicit relationships, or how closely
people are in contact with each other. The Expert and Mentor relationships
are also inferences of implicit relationships. This section, however, discusses other
implicit relationships aside from the Expert and Mentor relationships.
Fig. 3. Example of personal contact relationship
• Senior and Junior, or the Same School Class Relationship
- If A and B’s academic advisers are the same, we can infer that A and B are in “a
relationship of senior and junior, or of the same school class”.
• Intimate Relationship
- If B, a person who is not a part of the Institute, is among the authors
of A’s paper or patent, we can infer that A and B have a “close relationship”.
- Within a certain project, if A and B are each a research director, a detail subject
director, a co-operating research director, or a commissioned research director,
we can infer that they have a “close relationship”.
• Personal Contact Relationship
- If A and B have both worked in the same company or were
educated in the same school under the same major, we can infer that a “personal
contact exists” between these two people.
- If A has worked in a company or has graduated from a
school in a major, we can infer that “personal contact exists” between A and
people who are currently working in the company or who are related to the major in the school.
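Three of the rules above (shared adviser, external co-author, shared company or school and major) can be sketched as pairwise checks over researcher records. The record layout is assumed for illustration, and the project-director rule is omitted to keep the example short; an OWL-based system would express these as inference rules over the ontology rather than as Python code.

```python
from itertools import combinations


def infer_relations(people):
    """Infer implicit relations between pairs of people.

    people: dict mapping a name to a dict with keys
      "adviser" (str or None), "internal" (bool),
      "coauthors" (set of names), "companies" (set of str),
      "education" (set of (school, major) tuples).
    Returns a set of (name_a, name_b, relation) with name_a < name_b.
    """
    relations = set()
    for a, b in combinations(sorted(people), 2):
        pa, pb = people[a], people[b]
        if pa["adviser"] and pa["adviser"] == pb["adviser"]:
            relations.add((a, b, "senior-junior or same class"))
        # One person is inside the Institute, the other outside, and they co-authored.
        if pa["internal"] != pb["internal"] and (b in pa["coauthors"] or a in pb["coauthors"]):
            relations.add((a, b, "close relationship"))
        if pa["companies"] & pb["companies"] or pa["education"] & pb["education"]:
            relations.add((a, b, "personal contact"))
    return relations
```

Because the checks are independent, a single pair can receive more than one relation, for instance both "close relationship" and "personal contact".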
3 Conclusion
In this study, we constructed a social network ontology for the Korea Institute of Oriental
Medicine. For this construction, we collected personal information,
academic information/experiences/attainments/academic meetings, research information, and personal contact information of all the researchers. With this as a foundation,
we used FOAF and OWL to construct the social network ontology. The resulting ontology
can link to other social networks and supports ontology
inference based on OWL. To prepare for ontology inference, this study analyzed the
relationships in the ontology and deduced new relationships that are not stated in the
ontology itself. These new relationships can be used in the future to build an inference system and ontology-based search.
The social network in this study is closed, being used internally
within KIOM only. Therefore, it has the advantage that it can share much more useful
information than ordinary social networks. However, it has the problem that it
contains outgoing links only, to people the researchers already know of, and no
information on incoming links.
In the future, we plan to build a search and inference system based on the
constructed ontology, and to make this social network public to the field of oriental
medicine once it is firmly established, in order to solve
the above problem.
References
[1] Boyd, D.M., Ellison, N.B.: Social Network Sites: Definitions, History, and Scholarship.
Journal of Computer-Mediated Communication 13(1) (2007)
[2] Breslin, J., Decker, S.: The Future of Social Networks on the Internet. IEEE Internet
Computing, 84–88 (2007)
[3] http://www.foaf-project.org/
[4] http://sioc-project.org/
[5] http://www.w3.org/TR/owl-features
[6] http://www.topquadrant.com/products/TB_Composer.html
[7] http://www.ontologyportal.org/
Semantic Web and Contextual Information:
Semantic Network Analysis of Online Journalistic Texts
Yon Soo Lim
WCU Webometrics Institute, Yeungnam University
214-1 Dae-dong, Gyeongsan, Gyeongbuk, 712-749, South Korea
[email protected]
Abstract. This study examines why contextual information is important to actualize the idea of the semantic web, based on a case study of a socio-political issue
in South Korea. For this study, semantic network analyses were conducted on 62 English-language blog posts and 101 news stories on the web.
The results indicated differences in the meaning structures between
blog posts and professional journalism as well as between conservative journalism and progressive journalism. From the results, this study ascertains the empirical
validity of current concerns about the practical application of the new web
technology, and discusses how the semantic web should be developed.
Keywords: Semantic Web, Semantic Network Analysis, Online Journalism.
1 Introduction
The semantic web [1] is expected to mark a new epoch in the development of internet
technology. The key property of the semantic web is to provide more useful information by automatically searching the meaning structure of web content. The new web
technology focuses on the creation of collective human knowledge rather than a simple collection of web data. The semantic web is not only a technological revolution,
but also a sign of social change.
However, many researchers and practitioners are skeptical about the practicability
of the semantic web. They doubt whether the new web technology can transform
complex and unstructured web information into well-defined and structured data. Also, the technological limitation may yield irrelevant or fragmentary data if contextual information is not considered. Further, McCool [2] asserted that the semantic web
will fail if it ignores the diverse contexts of web information. Although there is a lot
of criticism and skepticism, the negative perspectives are rarely based on empirical
studies. Therefore, this study aims to ascertain why contextual information is important to actualize the idea of the semantic web, based on an empirical case study of a
socio-political issue in South Korea.
This study investigates the feasibility of semantic web technology using a semantic network analysis of online journalistic texts. Specifically, it examines whether
there are differences in the semantic structures among online texts containing different contextual information. Further, this study discusses how the semantic
web should be developed.
2 Method
2.1 Background
On July 22, 2009, South Korea's National Assembly passed a contentious media reform
bill that allows newspaper publishers and conglomerates to own stakes in broadcasting networks. The political event generated a heated controversy in South Korea, and
differing opinions and information appeared on the web. It seemed to be a global
issue because international blog posts and online news stories about the event,
written in English, could easily be found on the web.
Also, major Korean newspaper publishers provide English-language news
stories for global audiences via the internet. Their in-depth news stories could be sufficient cause for promoting global bloggers’ discussions about a nation-state's event
on the web. For this reason, although the research topic is a specific social phenomenon, it may represent the complexity of online information.
This study examines the semantic structure of English-language online texts
regarding Korea's media reform bill. Semantic network analysis is used to identify the
differences in the semantic structures between blog posts and professional journalism
as well as between conservative journalism and progressive journalism.
2.2 Data
To identify the main concepts of online journalistic texts regarding the Korean media
reform bill, 62 blog posts and 101 news stories were gathered from the Google news and
blog search engines using the following words: Korea, media, law, bill, reform, revision, regulation. The time period was June 1 to August 31, 2009. Of the 101 news
stories, 24 were produced by conservative Korean news publishers, such as Chosun,
Joongang, and DongA, and 22 were published by a progressive Korean
newspaper, Hankyoreh. The unit of analysis is the individual blog post or news story.
2.3 Semantic Network Analysis
Semantic network analysis is a systematic technique of content analysis to identify the
meaning structure of symbols or concepts in a set of documents, including communication message content by using network analysis [3, 4]. The semantic network
represents the associations of neurons responding to symbols or concepts that are socially constructed in human brains. That is, it is a relationship of shared understanding
of cultural products among members in a social system [3]. In this study, the semantic
network analysis of online journalistic texts was conducted using CATPAC [5, 6]. It
embodies semantic network analysis in "a self-organizing artificial neural network
optimized for reading text" [6]. The program identifies the most frequently occurring
words in a set of texts and explores the pattern of interconnections based on their co-occurrence in a neural network [6, 7]. Many studies have used the program to analyze
diverse types of texts, such as news articles, journals, web content, and conference
papers [8-11].
Y.S. Lim
2.4 CATPAC Analysis Procedure
In CATPAC, a scanning window reads through fully computerized texts. The window
size represents the limited memory capacity associated with reading texts. The default
size of the window covers seven words at a time on the basis of Miller's [12] argument that people's working memory can hold seven meaningful units at a time. After
first reading words 1 through 7, the window slides one word further and reads words
2 through 8 and so on. Whenever given words are presented in the window, artificial
neurons representing each word are activated in a simulated neural network [5, 6].
The connection between neurons is strengthened when the number of times that they
are simultaneously active increases. Conversely, their connections are weakened as
the likelihood of their co-occurrence decreases.
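The sliding-window reading described above can be sketched in a few lines. This is only an illustration under stated assumptions: it counts raw window co-occurrences, whereas CATPAC itself updates connection weights in a simulated neural network, and the function name `window_cooccurrence` and the sample sentence are hypothetical.

```python
from collections import Counter
from itertools import combinations


def window_cooccurrence(tokens, window_size=7):
    """Count how often two distinct words fall inside the same sliding window."""
    counts = Counter()
    # Slide one word at a time: words 1-7, then words 2-8, and so on.
    n_windows = max(len(tokens) - window_size + 1, 1)
    for start in range(n_windows):
        window = set(tokens[start:start + window_size])
        # Sort so that each unordered pair is stored under a single key.
        for a, b in combinations(sorted(window), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical ten-word text, already stripped of function words.
tokens = ("media bill passed national assembly "
          "opposition party protest media bill").split()
pairs = window_cooccurrence(tokens, window_size=7)
print(pairs[("bill", "media")])         # windows containing both words
print(pairs[("assembly", "national")])
```

Pairs that appear together in many windows would correspond to strongly connected neurons; pairs that rarely share a window would end up weakly connected.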
The program creates a matrix based on the probability of the co-occurrence between neurons representing words or symbols. From the matrix, CATPAC identifies
the pattern of their interrelationships by using cluster analysis. In this study, the cluster analysis uses the Ward method [13] to optimize the minimum variance within
clusters. This method provides a grouping of words that have the greatest similarity in
the co-occurrence matrix, where each cell shows the likelihood that the occurrence of
a word will indicate the occurrence of another. Through the cluster analysis, CATPAC produces a "dendrogram," a graphical representation of the resultant clusters
within the analyzed texts [5, 6].
Alongside the cluster analysis, the multidimensional scaling (MDS) technique facilitates the
understanding of the interrelationships among words and clusters in the semantic
neural network. The co-occurrence matrix can be transformed into a coordinate matrix for spatial representation through the MDS algorithm [14]. The position of each
word in a multidimensional space is determined by the similarities between words,
based on the likelihood of their co-occurrence. That is, words having strong connections would be close to each other, whereas words having weak relationships would
be far apart. Thus, through MDS, the pattern of the semantic network in a given set of
texts can be visually identified. For this analysis, this study used UCINET-VI [15], a
program designed to analyze network data.
3 Results
In the semantic network analysis, a list of meaningless words, including articles, prepositions, conjunctions, and transitive verbs, was excluded. Also, any problematic
words that might distort the analysis were eliminated by the researcher. In addition,
similar words were combined into single words to facilitate the analysis. To clarify
the major concepts of the online journalistic texts, this study focused on the most frequently
occurring words, those accounting for more than 1% of the total frequency in each set of texts.
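The columns reported in Tables 1 to 4 (frequency, frequency share, number of cases, case share) can be derived as in the following sketch. It assumes that "total frequency" means the total number of retained word tokens in a set of texts, which the paper does not state explicitly; the function name `frequency_table` and the three toy documents are hypothetical.

```python
from collections import Counter


def frequency_table(documents, threshold=0.01):
    """Return (word, freq, freq %, cases, case %) rows for words above the threshold."""
    word_freq = Counter()  # total occurrences of each word
    case_freq = Counter()  # number of documents (cases) containing each word
    for tokens in documents:
        word_freq.update(tokens)
        case_freq.update(set(tokens))
    total = sum(word_freq.values())
    rows = []
    for word, freq in word_freq.most_common():
        share = freq / total
        if share > threshold:  # keep only words above 1% of the total frequency
            cases = case_freq[word]
            rows.append((word, freq, round(100 * share, 1),
                         cases, round(100 * cases / len(documents), 1)))
    return rows


# Three hypothetical, already-cleaned documents.
docs = [["media", "bill", "media"], ["bill", "party"], ["media", "vote"]]
rows = frequency_table(docs)
for row in rows:
    print(row)
```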
3.1 Blog vs. Newspaper
As shown in Table 1, regarding blog posts, the most frequently occurring word was
media, which occurred 124 times in 33 (53.2%) posts. Other frequently occurring words
were bill, 98 times (26, 41.9%); party, 84 times (23, 37.1%); parliament, 63 times
(24, 38.7%); Korea, 57 times (27, 43.5%); opposition, 55 times (18, 29.0%); Korean, 50
times (23, 37.1%); ruling, 49 times (19, 30.6%); law, 48 times (12, 19.4%); lawmaker,
45 times (14, 22.6%); and GNP (Grand National Party), 43 times (14, 22.6%).

Table 1. List of the most frequently mentioned words in 62 blogs

WORD             Freq.  Freq.(%)  Case  Case(%)
MEDIA              124       9.8    33     53.2
BILL                98       7.8    26     41.9
PARTY               84       6.7    23     37.1
PARLIAMENT          63       5.0    24     38.7
KOREA               57       4.5    27     43.5
OPPOSITION          55       4.4    18     29.0
KOREAN              50       4.0    23     37.1
RULING              49       3.9    19     30.6
LAW                 48       3.8    12     19.4
LAWMAKER            45       3.6    14     22.6
GNP                 43       3.4    14     22.6
GOVERNMENT          39       3.1    19     30.6
NEWSPAPER           39       3.1    13     21.0
BROADCAST           36       2.9    13     21.0
VOTE                35       2.8    23     37.1
OWNERSHIP           29       2.3    16     25.8
ASSEMBLY            27       2.1    10     16.1
COMPANY             25       2.0     7     11.3
NATIONAL            25       2.0    10     16.1
PASS                25       2.0    15     24.2
PUBLIC              24       1.9     7     11.3
DP                  23       1.8    11     17.7
NETWORK             23       1.8    13     21.0
MB                  21       1.7    13     21.0
PEOPLE              21       1.7    15     24.2
REFORM              21       1.7    15     24.2
FIGHT               16       1.3    10     16.1
CHANGE              15       1.2     7     11.3
CONTROL             15       1.2     7     11.3
VIOLENCE            15       1.2    10     16.1
BRAWL               14       1.1    10     16.1
MEMBERS             14       1.1    10     16.1
POLITICIANS         14       1.1     8     12.9
PRESIDENT           14       1.1    10     16.1
SPEAKER             14       1.1     7     11.3
Table 2 shows the list of the most frequent words in news articles. In terms of
newspapers, the most frequent word was bill, which occurred 658 times in 100 (99.0%)
news stories. Others were media, 587 times (98, 97.0%); GNP, 481 times (90,
89.1%); party, 470 times (88, 87.1%); DP (Democratic Party), 365 times (76, 75.2%);
assembly, 350 times (90, 89.1%); lawmaker, 303 times (77, 76.2%); opposition, 296
times (87, 86.1%); national, 294 times (89, 88.1%); broadcast, 253 times (68, 67.3%);
vote, 250 times (71, 70.3%); and law, 209 times (61, 60.4%).
Based on the co-occurrence matrix representing the semantic network focusing on
the most frequently occurring words, a cluster analysis was conducted to further examine the underlying concepts. From the cluster analysis, the groupings of words that
have a tendency to co-occur in the online journalistic texts were identified. Figure 1
presents the co-occurring clusters of blog posts and news stories.
The co-occurring clusters of blog posts were fragmentary, even though one group of
words included the most frequently occurring words. Conversely, the dendrogram of
news stories showed one large cluster and several small clusters. Most words of high
frequency were strongly connected to each other.
MDS was conducted to investigate the interrelationships between words and the
clusters. Figure 2 presents the semantic networks in the two-dimensional space.
Table 2. List of the most frequently mentioned words in 101 news stories

WORD             Freq.  Freq.(%)  Case  Case(%)
BILL               658       9.7   100     99.0
MEDIA              587       8.7    98     97.0
GNP                481       7.1    90     89.1
PARTY              470       6.9    88     87.1
DP                 365       5.4    76     75.2
ASSEMBLY           350       5.2    90     89.1
LAWMAKER           303       4.5    77     76.2
OPPOSITION         296       4.4    87     86.1
NATIONAL           294       4.3    89     88.1
BROADCAST          253       3.7    68     67.3
VOTE               250       3.7    71     70.3
LAW                209       3.1    61     60.4
RULING             199       2.9    75     74.3
NEWSPAPER          170       2.5    68     67.3
SPEAKER            159       2.3    62     61.4
SESSION            150       2.2    55     54.5
PARLIAMENT         136       2.0    54     53.5
PUBLIC             133       2.0    50     49.5
PASS               127       1.9    69     68.3
PASSAGE            117       1.7    50     49.5
REFORM             113       1.7    59     58.4
MB                 112       1.7    53     52.5
INDUSTRY           111       1.6    55     54.5
KOREA              109       1.6    60     59.4
MEMBER              93       1.4    55     54.5
AGAINST             82       1.2    53     52.5
PRESIDENT           81       1.2    48     47.5
FLOOR               78       1.2    42     41.6
COMPANY             75       1.1    37     36.6
PEOPLE              74       1.1    39     38.6
LEGISLATION         71       1.0    25     24.8
LEADER              69       1.0    41     40.6
Fig. 1. Co-occurring clusters of blog posts and news stories: (a) blog, (b) news
The centralization of the blog semantic network was 19.6%. A larger cluster included 20 of 35 words. The strongly connected words were media, bill, Korea, parliament, newspaper, broadcast, law, public, and pass. In addition, 11 words were isolated.
Conversely, the centralization of the news semantic network was 44.2%, and 23 of
32 words were included in a large cluster. They were also tightly associated with
each other.
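The paper does not say which centralization index was computed. As one plausible reading, the sketch below implements Freeman's degree centralization for an undirected, unweighted network, which is 1.0 (100%) for a star and 0.0 for a ring. The function name and the two toy networks are hypothetical, and the reported percentages may rest on a different index or on weighted ties.

```python
def degree_centralization(nodes, edges):
    """Freeman degree centralization of an undirected, unweighted network."""
    degree = {node: 0 for node in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n = len(nodes)
    max_degree = max(degree.values())
    # Sum of differences from the most central node, divided by the
    # maximum possible sum, which is reached by a star network.
    return sum(max_degree - d for d in degree.values()) / ((n - 1) * (n - 2))


words = ["media", "bill", "party", "law", "vote"]
star = [("media", w) for w in words[1:]]                       # one hub word
ring = [(words[i], words[(i + 1) % 5]) for i in range(5)]      # no hub at all
print(degree_centralization(words, star))
print(degree_centralization(words, ring))
```

Under this reading, a higher percentage indicates a network organized around a few hub words, and a lower one indicates ties spread more evenly or fragmented across clusters.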
Fig. 2. Semantic networks of blog posts and news stories: (a) blog, (b) news
3.2 Conservative Newspaper vs. Progressive Newspaper
In the same way as in the previous analysis, 24 news stories published by conservative
newspapers and 22 news articles published by a progressive newspaper were examined.
As shown in Table 3, regarding conservative newspapers, the most frequently occurring word was bill, which occurred 162 times in 23 (95.8%) news stories. Other
words of high frequency were media, 135 times (22, 91.7%); assembly, 107 times
(21, 87.5%); GNP, 107 times (17, 70.8%); party, 99 times (20, 83.3%); national, 97
times (21, 87.5%); DP, 94 times (17, 70.8%); lawmaker, 85 times (18, 75.0%); vote,
71 times (15, 62.5%); and opposition, 64 times (19, 79.2%).

Table 3. List of the most frequently mentioned words in 24 conservative news stories

WORD             Freq.  Freq.(%)  Case  Case(%)
BILL               162       9.3    23     95.8
MEDIA              135       7.8    22     91.7
ASSEMBLY           107       6.2    21     87.5
GNP                107       6.2    17     70.8
PARTY               99       5.7    20     83.3
NATIONAL            97       5.6    21     87.5
DP                  94       5.4    17     70.8
LAWMAKER            85       4.9    18     75.0
VOTE                71       4.1    15     62.5
OPPOSITION          64       3.7    19     79.2
LAW                 56       3.2    16     66.7
SESSION             56       3.2    13     54.2
BROADCAST           46       2.7    12     50.0
RULING              46       2.7    15     62.5
INDUSTRY            45       2.6    17     70.8
NEWSPAPER           40       2.3    12     50.0
BROADCASTER         38       2.2    10     41.7
REFORM              36       2.1    15     62.5
SPEAKER             35       2.0    12     50.0
PASS                29       1.7    15     62.5
COMPANY             27       1.6    12     50.0
WORKERS             25       1.4     9     37.5
KOREA               24       1.4    15     62.5
PUBLIC              23       1.3    13     54.2
FLOOR               21       1.2    11     45.8
LEADER              20       1.2    10     41.7
MB                  20       1.2     7     29.2
MEMBER              19       1.1    15     62.5
PEOPLE              19       1.1    10     41.7
TIME                19       1.1    12     50.0
END                 18       1.0    11     45.8
LEGISLATIVE         18       1.0     7     29.2
MBC                 17       1.0     7     29.2
PARLIAMENT          17       1.0     6     25.0

Table 4. List of the most frequently mentioned words in 22 progressive news stories

WORD             Freq.  Freq.(%)  Case  Case(%)
GNP                147       8.0    21     95.5
MEDIA              144       7.8    20     90.9
BROADCAST          128       6.9    20     90.9
BILL               125       6.8    21     95.5
LAW                109       5.9    20     90.9
PARTY               86       4.7    19     86.4
ASSEMBLY            84       4.6    21     95.5
DP                  78       4.2    15     68.2
PUBLIC              75       4.1    15     68.2
NATIONAL            73       4.0    20     90.9
LAWMAKER            60       3.3    14     63.6
LEGISLATION         51       2.8    11     50.0
OPPOSITION          46       2.5    17     77.3
PASSAGE             44       2.4    16     72.7
REVISION            41       2.2    14     63.6
PASS                39       2.1    17     77.3
KOREA               38       2.1    18     81.8
PEOPLE              38       2.1    14     63.6
MB                  37       2.0    14     63.6
VOTE                36       2.0    14     63.6
OPINION             35       1.9    13     59.1
ADMINISTRATION      34       1.8     9     40.9
NEWSPAPER           33       1.8    13     59.1
RULING              31       1.7    16     72.7
AGAINST             30       1.6    12     54.5
SPEAKER             29       1.6    12     54.5
POLITICAL           24       1.3    13     59.1
TERRESTRIAL         24       1.3     5     22.7
SESSION             22       1.2    11     50.0
STRIKE              22       1.2     7     31.8
COMMENTS            21       1.1    20     90.9
PRESIDENT           20       1.1     8     36.4
QUESTIONS           20       1.1    20     90.9
RESPONDENTS         20       1.1     5     22.7
On the other hand, as presented in Table 4, in a progressive newspaper, the most
frequent word was GNP, which occurred 147 times in 21 (95.5%) news stories. Other frequently occurring words were media, 144 times (20, 90.9%); broadcast, 128 times
(20, 90.9%); bill, 125 times (21, 95.5%); law, 109 times (20, 90.9%); party, 86 times
(19, 86.4%); assembly, 84 times (21, 95.5%); DP, 78 times (15, 68.2%); public, 75
times (15, 68.2%); and national, 73 times (20, 90.9%).
A cluster analysis was conducted. Figure 3 presents the dendrograms of conservative newspapers and a progressive newspaper.
Fig. 3. Co-occurring clusters of conservative and progressive newspapers: (a) conservative, (b) progressive
The two newspaper dendrograms appeared similar, in that a larger cluster
included the majority of the high-frequency words. Most of the high-frequency words in the
large group were also identical. However, the words included in minor clusters
differed between the newspapers. In conservative newspapers, the words were MB
(the initials of the Korean president), workers, MBC (one of the Korean public broadcasting
networks), parliament, people, public, and reform. Conversely, in a progressive newspaper, they were administration, MB, president, against, strike, people, legislation,
and respondents.
As shown in Figure 4, the visualized semantic networks present the differences between the two kinds of newspapers. The network centralization of conservative newspapers was
46.6%. There was only one large cluster, including 27 of 34 words, and the words were
tightly connected to each other. In contrast, for a progressive newspaper,
the centralization of the semantic network was 20.0%. There were a larger cluster and
a small group of words. The larger group included relatively neutral concepts, but the
small group contained several negative words, such as against, strike, and questions.
Fig. 4. Semantic networks of conservative and progressive newspapers: (a) conservative, (b) progressive
4 Discussion
This study focused on online journalistic texts concerning a specific socio-political issue in South Korea. From a global perspective, the issue itself has a very limited context,
bounded by a single nation-state. However, the results revealed differences among
online texts with different contextual information, such as journalistic style and tone.
The semantic network analyses showed that the semantic structures of blog posts
and news stories were different. The semantic network of blogs was relatively sparse
compared with that of newspapers. The results also reveal that bloggers discussed
diverse issues derived from the main event, such as politicians' fights and violence. Conversely, professional journalists focused on the main event and
reported the facts in a straightforward manner, for example that the media bill had passed.
The semantic networks of conservative journalism and progressive journalism
were also different. In this study, a progressive newspaper, Hankyoreh, focused on
negative issues, such as people's strikes against the MB administration, even though it mainly reported the facts of the main event. In contrast, conservative newspapers, such
as Chosun, Joongang, and DongA, paid little attention to the negative aspects.
Instead, they focused more on the main event.
Additionally, as shown in Table 5, among the main words from the four types of
semantic networks, 20 words were used in common, whereas the other 37 words appeared in only some of the networks.
Table 5. List of common words and different words
Common Words (N=20)
ASSEMBLY, BILL, BROADCAST, DP, GNP, KOREA, LAW, LAWMAKER, MB, MEDIA,
NATIONAL, NEWSPAPER, OPPOSITION, PARTY, PASS, PEOPLE, PUBLIC, RULING,
SPEAKER, VOTE
Different Words (N=37)
ADMINISTRATION, AGAINST, BRAWL, BROADCASTER, CHANGE, COMMENTS,
COMPANY, CONTROL, END, FIGHT, FLOOR, GOVERNMENT, INDUSTRY, KOREAN,
LEADER, LEGISLATION, LEGISLATIVE, MBC, MEMBER, NETWORK, OPINION,
OWNERSHIP, PARLIAMENT, PASSAGE, POLITICAL, POLITICIANS, PRESIDENT,
QUESTIONS, REFORM, RESPONDENTS, REVISION, SESSION, STRIKE, TERRESTRIAL,
TIME, VIOLENCE, WORKERS
In this case, if semantic web technology considers only the common words regardless of other contextual information, a great deal of useful information would
be hidden or lost in the web system. Consequently, the new web system would provide only fragmentary data, which is far from the idea of the semantic web. In this respect, this
study empirically supports McCool's [2] admonition that the semantic web will
fail if it neglects diverse contextual information on the web.
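The split in Table 5 can be reproduced with plain set operations on the word lists of Tables 1 to 4, as in the sketch below. One assumption is needed to arrive at 37 different words: MEMBERS (Table 1) and MEMBER (Tables 2 and 3) are treated as the same word. The variable and function names are hypothetical.

```python
# Word lists transcribed from Tables 1-4.
BLOGS = """MEDIA BILL PARTY PARLIAMENT KOREA OPPOSITION KOREAN RULING LAW
LAWMAKER GNP GOVERNMENT NEWSPAPER BROADCAST VOTE OWNERSHIP ASSEMBLY COMPANY
NATIONAL PASS PUBLIC DP NETWORK MB PEOPLE REFORM FIGHT CHANGE CONTROL
VIOLENCE BRAWL MEMBERS POLITICIANS PRESIDENT SPEAKER""".split()
NEWS = """BILL MEDIA GNP PARTY DP ASSEMBLY LAWMAKER OPPOSITION NATIONAL
BROADCAST VOTE LAW RULING NEWSPAPER SPEAKER SESSION PARLIAMENT PUBLIC PASS
PASSAGE REFORM MB INDUSTRY KOREA MEMBER AGAINST PRESIDENT FLOOR COMPANY
PEOPLE LEGISLATION LEADER""".split()
CONSERVATIVE = """BILL MEDIA ASSEMBLY GNP PARTY NATIONAL DP LAWMAKER VOTE
OPPOSITION LAW SESSION BROADCAST RULING INDUSTRY NEWSPAPER BROADCASTER
REFORM SPEAKER PASS COMPANY WORKERS KOREA PUBLIC FLOOR LEADER MB MEMBER
PEOPLE TIME END LEGISLATIVE MBC PARLIAMENT""".split()
PROGRESSIVE = """GNP MEDIA BROADCAST BILL LAW PARTY ASSEMBLY DP PUBLIC
NATIONAL LAWMAKER LEGISLATION OPPOSITION PASSAGE REVISION PASS KOREA PEOPLE
MB VOTE OPINION ADMINISTRATION NEWSPAPER RULING AGAINST SPEAKER POLITICAL
TERRESTRIAL SESSION STRIKE COMMENTS PRESIDENT QUESTIONS RESPONDENTS""".split()

# Assumption: the plural in Table 1 counts as the same word as MEMBER.
NORMALIZE = {"MEMBERS": "MEMBER"}


def split_common_different(*word_lists):
    """Return the words shared by every list and the words missing from at least one."""
    sets = [{NORMALIZE.get(word, word) for word in words} for words in word_lists]
    common = set.intersection(*sets)
    different = set.union(*sets) - common
    return common, different


common, different = split_common_different(BLOGS, NEWS, CONSERVATIVE, PROGRESSIVE)
print(len(common), len(different))
```

A system that indexed only the `common` set would discard every word in `different`, including context-bearing terms such as STRIKE, BRAWL, and VIOLENCE.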
To realize the idea of the semantic web, namely the creation of collective human
knowledge, web ontologies should be defined more carefully, taking the complexity of web information into account. The semantic web is not only a technological issue,
but also a social issue. The semantic structure of web information changes and
develops on the basis of social interaction among internet users. Computer scientists
have led the discussion of the semantic web, and their arguments have usually focused on programming and database structure. As a result, the essentials of web
information can be overlooked. Social scientists, in turn, can provide crucial
ideas for identifying how social web content is constructed and developed. Thus,
collaborative multi-disciplinary approaches are required for the practical embodiment of the semantic web.
5 Conclusion
The findings of this study can serve as a starting point for future research. Although this
study focused on a specific socio-political issue in South Korea, the semantic structures of online texts differed according to the contextual
information they contained. The results represent the complexity of web information. To obtain
a better understanding of the semantic structure of massive online content, subsequent
research with multi-disciplinary collaboration is required.
References
1. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 284, 34–43 (2001)
2. McCool, R.: Rethinking the semantic web, part 1. IEEE Internet Computing, 85–87 (2005)
3. Monge, P.R., Contractor, N.S.: Theories of communication networks. Oxford University
Press, New York (2003)
4. Monge, P.R., Eisenberg, E.M.: Emergent communication networks. In: Jablin, F.M., Putnam, L.L., Roberts, K.H., Porter, L.W. (eds.) Handbook of organizational communication,
pp. 304–342. Sage, Newbury Park (1987)
5. Woelfel, J.: Artificial neural networks in policy research: A current assessment. Journal of
Communication 43, 63–80 (1993)
6. Woelfel, J.: CATPAC II user’s manual (1998),
http://www.galileoco.com/Manuals/CATPAC3.pdf
7. Doerfel, M.L., Barnett, G.A.: A semantic network analysis of the International Communication Association. Human Communication Research 25, 589–603 (1999)
8. Choi, S., Lehto, X.Y., Morrison, A.M.: Destination image representation on the web: Content analysis of Macau travel related websites. Tourism Management 28, 118–129 (2007)
9. Doerfel, M.L., Marsh, P.S.: Candidate-issue positioning in the context of presidential debates. Journal of Applied Communication Research 31, 212–237 (2003)
10. Kim, J.H., Su, T.-Y., Hong, J.: The influence of geopolitics and foreign policy on the U.S.
and Canadian media: An analysis of newspaper coverage of Sudan’s Darfur conflict. Harvard International Journal of Press/Politics 12, 87–95 (2007)
11. Rosen, D., Woelfel, J., Krikorian, D., Barnett, G.A.: Procedures for analyses of online
communities. Journal of Computer-Mediated Communication 8 (2003)
12. Miller, G.A.: The magical number seven, plus or minus two: Some limits on our capacity
for processing information. Psychological Review 63, 81–97 (1956)
13. Ward, J.H.: Hierarchical Grouping to optimize an objective function. Journal of American
Statistical Association 58, 236–244 (1963)
14. Torgerson, W.S.: Theory and methods of scaling. John Wiley & Sons, New York (1958)
15. Borgatti, S.P., Everett, M.G., Freeman, L.C.: Ucinet 6 for Windows. Analytic Technologies, Harvard (2002)
Semantic Twitter: Analyzing Tweets
for Real-Time Event Notification
Makoto Okazaki and Yutaka Matsuo
The University of Tokyo
2-11-16 Yayoi, Bunkyo-ku
Tokyo 113-8656, Japan
Abstract. Twitter, a popular microblog service, has received much attention recently. An important characteristic of Twitter is its real-time nature. However,
to date, integration of semantic processing and the real-time nature of Twitter
has not been well studied. As described herein, we propose an event notification
system that monitors tweets (Twitter messages) and delivers semantically relevant
tweets if they meet a user’s information needs. As an example, we construct an
earthquake prediction system targeting Japanese tweets. Because of numerous
earthquakes in Japan and because of the vast number of Twitter users throughout
the country, it is sometimes possible to detect an earthquake by monitoring tweets
before an earthquake actually arrives. (An earthquake is transmitted through the
earth’s crust at about 3–7 km/s. Consequently, a person has about 20 s before its
arrival at a point that is 100 km distant.) Other examples are detection of rainbows in the sky, and detection of traffic jams in cities. We first prepare training
data and apply a support vector machine to classify a tweet into positive and negative classes, which corresponds to the detection of a target event. Features for
the classification are constructed using the keywords in a tweet, the number of
words, the context of event words, and so on. In the evaluation, we demonstrate
that every recent large earthquake has been detected by our system. Actually, notification is delivered much faster than the announcements broadcast by the Japan
Meteorological Agency.
1 Introduction
Twitter, a popular microblogging service, has received much attention recently. Users of
Twitter can post a tweet, a short message of 140 characters or fewer. A
user can follow other users (unless she chooses a privacy setting), and her followers can
read her tweets. Since its launch in October 2006, the number of Twitter users has increased rapidly.
Twitter users are currently estimated at 44.5 million worldwide¹.
An important characteristic of Twitter is its real-time nature. Although blog users
typically update their blogs once every several days, Twitter users write tweets several
times in a single day. Because users can learn how other users are doing, and often what they
are thinking at the moment, they repeatedly come back to the site to see what other
people are doing.
¹ http://www.techcrunch.com/2009/08/03/twitter-reaches-44.5-million-people-worldwide-in-june-comscore/
Fig. 1. Twitter screenshot
In Japan, more than half a million Twitter users exist, and the number is growing rapidly.
The Japanese version of Twitter was launched on 23 April 2008. In February 2008,
Japan was the No. 2 country with respect to Twitter traffic². At the time of this writing,
Japan has the 11th largest number of users in the world. Figure 1 presents a screenshot of
the Japanese version of Twitter. Every function is the same as in the original English-language interface, but the user interface is in Japanese.
Some studies have investigated Twitter: Java et al. analyzed Twitter as early as 2007.
They described the social network of Twitter users and investigated the motivation of
Twitter users [1]. Huberman et al. analyzed more than 300 thousand users. They
discovered that the relation between friends (defined as a person to whom a user has
directed posts using an "@" symbol) is the key to understanding interaction in Twitter
[2]. Recently, boyd et al. investigated retweet activity, which is the Twitter-equivalent
of e-mail forwarding, where users post messages originally posted by others [3].
On the other hand, many works have investigated Semantic Web technology (or semantic technology in a broader sense). Recently, many works have examined how to
integrate linked data on the web [4]. Automatic extraction of semantic data is another
approach that many studies have used. For example, extracting relations among entities
from web pages [5] uses natural language processing and
web mining to obtain Semantic Web data. Extracting events is also an important means
of obtaining knowledge from web data.
To date, means of integrating semantic processing and the real-time nature of Twitter
have not been well studied. Combining these two directions, we can develop various
algorithms to process Twitter data semantically. Because we can assess numerous texts
(and social relations among users) in mere seconds, if we were able to extract some tweets
automatically, then we would be able to provide real-time event notification services.
² http://blog.twitter.com/2008/02/twitter-web-traffic-aroundworld.html
Fig. 2. Twitter user map
Fig. 3. Earthquake map
As described in this paper, we propose an event notification system that monitors
tweets and delivers them if they are semantically relevant to users' information needs. As an example, we develop an earthquake reporting system using Japanese
tweets. Because of the numerous earthquakes in Japan and the numerous and geographically dispersed Twitter users throughout the country, it is sometimes possible to detect
an earthquake by monitoring tweets. In other words, many earthquake events occur
in Japan, and many sensors are distributed throughout the country. Figure 2 portrays a map
of Twitter users worldwide (obtained from the UMBC eBiquity Research Group); Fig. 3
depicts a map of earthquake occurrences worldwide (using data from the Japan Meteorological Agency (JMA)). It is apparent that the only intersection of the two maps, that is,
the only region with both many earthquakes and many Twitter users, is Japan. (Other regions
such as Indonesia, Turkey, Iran, Italy, and Pacific US cities such as Los Angeles and
San Francisco also roughly intersect, although the density is much lower than in Japan.)
Our system detects an earthquake occurrence and sends an e-mail, possibly before the
earthquake actually arrives at a certain location: an earthquake propagates at about 3–7
km/s. For that reason, a person who is 100 km distant from an earthquake has about 20
s before the arrival of the earthquake wave. In fact, a blogger has already written
about the tweet phenomenon in relation to earthquakes in Japan³:
Japan Earthquake Shakes Twitter Users ... And Beyonce: Earthquakes are
one thing you can bet on being covered on Twitter (Twitter) first, because, quite
frankly, if the ground is shaking, you're going to tweet about it before it even
registers with the USGS and long before it gets reported by the media. That
seems to be the case again today, as the third earthquake in a week has hit
Japan and its surrounding islands, about an hour ago. The first user we can
find that tweeted about it was Ricardo Duran of Scottsdale, AZ, who, judging
from his Twitter feed, has been traveling the world, arriving in Japan yesterday.
³ http://mashable.com/2009/08/12/japan-earthquake/
Another example of an event that can be captured using Twitter is rainbows. Sometimes
people tweet about beautiful rainbows in the sky. To detect such target events, we first
prepare training data and apply a support vector machine to classify a tweet as either
belonging to a positive or negative class, which corresponds to the detection of a target
event. Features for such a classification are constructed using keywords in a tweet, the
number of words, the context of event words, and so on. In the evaluation, we can
send an earthquake notification in less than a minute, which is much faster than the
announcements broadcast by the Japan Meteorological Agency.
The contributions of the paper can be summarized as follows:
– The paper provides an example of semantic technology application on Twitter, and
presents potential uses for Twitter data.
– For earthquake prediction, many studies have been done from a geological science
perspective. This paper presents an innovative social approach, which has not been
reported before in the literature.
This paper is organized as follows: In the next section, we explain the concept of our
system and show system details. In Section 3, we explain the experiments. Section 4 is
devoted to related works and discussion. Finally, we conclude the paper.
2 System Architecture
In Fig. 4, we present the concept of our system. Generally speaking, the classical mass
media provide standardized information to the masses, whereas social media provide real-time information in which each piece of information is useful for only a few people. Using
semantic technology, we can create an advanced social medium of a new kind; we can
provide useful and real-time information to some users.
We pick up earthquake information as an example because Japan has many earthquakes (as is true also of Korea, our conference venue). Moreover, earthquake information is much more valuable if given in real time. We can turn off a stove or heater in
our house and hide ourselves under a desk or table if we have several seconds before an
earthquake actually hits. For that very reason, the Japanese government has allocated
a considerable amount of its budget to develop earthquake alert systems. We take a
different approach to classical earthquake prediction. By gathering information about
earthquakes from Twitter, we can provide useful and real-time information to many
people.
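The "several seconds" of warning follow directly from the propagation speeds quoted in the Introduction. The short calculation below only illustrates that arithmetic: it assumes a constant wave speed and straight-line distance, and it ignores the time needed to post, collect, and classify tweets. The function name is hypothetical.

```python
def warning_time_s(distance_km, wave_speed_km_s):
    """Seconds until a seismic wave covers the given distance at constant speed."""
    return distance_km / wave_speed_km_s


# Speeds spanning the 3-7 km/s range quoted in the text, at 100 km distance.
for speed in (3.0, 5.0, 7.0):
    print(speed, "km/s ->", round(warning_time_s(100.0, speed), 1), "s")
```

At 100 km the travel time ranges from roughly 14 s to 33 s, so the figure of about 20 s corresponds to a mid-range speed of 5 km/s.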
Fig. 4. Use of semantic technology for social media
Fig. 5. System architecture
Figure 5 presents the system architecture. We first search Twitter for tweets TQ that include the
query string Q every s seconds. We use a search API⁴ to search tweets.
In our case, we set Q = {"earthquake" and "shakes"}⁵. In fact, TQ is a set of tweets
including the query words. We set s to be 5 s.
⁴ search.twitter.com or http://pcod.no-ip.org/yats/search
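The polling step described above might look like the following sketch. The search call is passed in as a function because the actual API request is not given in the paper: `fake_search`, the tweet dictionaries with `id` and `text` keys, and the `max_polls` parameter are all hypothetical stand-ins. Treating the two query words as separate searches is likewise an assumption.

```python
import time


def poll_tweets(search, queries, interval_s=5.0, max_polls=None, sleep=time.sleep):
    """Yield each new tweet matching any query, polling every interval_s seconds."""
    seen_ids = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for query in queries:
            for tweet in search(query):
                if tweet["id"] not in seen_ids:  # skip tweets already returned
                    seen_ids.add(tweet["id"])
                    yield tweet
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_s)


def fake_search(query):
    # Stand-in for a real search API call; returns canned results.
    sample = [
        {"id": 1, "text": "earthquake!"},
        {"id": 2, "text": "the building shakes"},
    ]
    return [tweet for tweet in sample if query in tweet["text"]]


# Two polls with no real waiting, so the example finishes immediately.
found = list(poll_tweets(fake_search, ["earthquake", "shakes"],
                         max_polls=2, sleep=lambda seconds: None))
print([tweet["id"] for tweet in found])
```

Each yielded tweet would then be passed to the classifier described next.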
The obtained set of tweets TQ sometimes includes tweets that do not mention an
actual earthquake occurring. For example, a user might see that someone is "shaking"
hands, or might say that the people in the apartment upstairs are like an "earthquake". Therefore, we
must confirm that the tweet t ∈ TQ really refers to an actual earthquake occurring
(at least in the sense that the user believes so).
To classify a tweet into a positive class (i.e., an earthquake occurs) or a negative class
(i.e., an earthquake does not occur), we build a classifier using a support vector machine
(SVM) [6], which is a popular machine-learning algorithm. By preparing 597 examples
as a training set, we can obtain a model to classify tweets into positive and negative
categories automatically.
⁵ Actually, we set Q as ”nk” and ”h” in Japanese.
Table 1. Performance of classification

(i) earthquake query:
Features  Recall   Precision  F-value
A         87.50%   63.64%     73.69%
B         87.50%   38.89%     53.85%
C         50.00%   66.67%     57.14%
All       87.50%   63.64%     73.69%

(ii) shaking query:
Features  Recall   Precision  F-value
A         66.67%   68.57%     67.61%
B         86.11%   57.41%     68.89%
C         52.78%   86.36%     68.20%
All       80.56%   65.91%     72.50%
The features of a tweet are the following, categorized into three groups. Morphological
analysis is conducted using Mecab⁶, which separates sentences into a set of words.
Group A: simple statistical features. The number of words in a tweet, and the position
of the query word in a tweet.
Group B: keyword features. The words in a tweet.
Group C: context word features. The words before and after the query word.
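The three feature groups can be illustrated with a small extractor. This is a sketch under assumptions, not the authors' implementation: tokens come from whitespace splitting of an English example rather than from MeCab on Japanese text, the context window is one word on each side, and the feature naming scheme is hypothetical.

```python
def extract_features(tokens, query_word, context_size=1):
    """Build a feature dictionary covering groups A, B, and C for one tweet."""
    position = tokens.index(query_word)  # raises ValueError if the query word is absent
    features = {
        "A:num_words": len(tokens),       # Group A: number of words in the tweet
        "A:query_position": position,     # Group A: position of the query word
    }
    for word in tokens:                   # Group B: every word in the tweet
        features["B:word=" + word] = 1
    before = tokens[max(position - context_size, 0):position]
    after = tokens[position + 1:position + 1 + context_size]
    for word in before:                   # Group C: words before the query word
        features["C:before=" + word] = 1
    for word in after:                    # Group C: words after the query word
        features["C:after=" + word] = 1
    return features


tokens = "is this an earthquake or a truck passing".split()
feats = extract_features(tokens, "earthquake")
print(feats["A:num_words"], feats["A:query_position"])
print(sorted(key for key in feats if key.startswith("C:")))
```

A dictionary of this kind can be turned into a sparse vector and passed to a linear classifier such as an SVM.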
The classification performance is presented in Table 1. We use two query words,
earthquake and shaking; the performance using either query is shown. We used a linear
kernel for the SVM. We obtain the highest F-value when we use feature A and when we use all features.
Surprisingly, feature B and feature C do not contribute much to the classification performance. When an earthquake occurs, a user becomes surprised and might produce a
very short tweet. It is apparent that the recall is not so high as precision. It is attributable
to the usage of query words in a different context than we intend. Sometimes it is difficult even for humans to judge whether a tweet is reporting an actual earthquake or not.
Some examples are that a user might write ”Is this an earthquake or a truck passing?”
Overall, the classification performance is good considering that we can use multiple
sensor readings as evidence for event detection.
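The F-values in Table 1 are consistent with the usual harmonic mean of precision and recall. The following sketch recomputes one of them; the function name `f_value` is chosen here for illustration only.

```python
def f_value(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Feature group A, earthquake query (Table 1): precision 63.64%, recall 87.50%.
f_a = f_value(0.6364, 0.8750)
print(f"{f_a:.2%}")
```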
After making a classification and obtaining a positive example, the system quickly
sends an e-mail (usually mobile e-mail) to the registered users. It is hoped that the
e-mail is received by a user shortly before the earthquake actually arrives.
3 Experiments
We have operated a system called Toretter7 since August 8. A screenshot of the system is
shown in Fig. 6. Users can view past earthquake detections and can register their
e-mail addresses to receive notices of future earthquake detections. To date, about 20
test users have registered to use the system.
6 http://mecab.sourceforge.net/
7 It means “we have taken it” in Japanese.
Fig. 6. Screenshot of Toretter: Earthquake notification system
Table 2. Facts about earthquake detection
Date     Magnitude   Location      Time       First tweet detected   #Tweets within 10 min   Announce of JMA
Aug 18   4.5         Tochigi       6:58:55    7:00:30                35                      07:08
Aug 18   3.1         Suruga-wan    19:22:48   19:23:14               17                      19:28
Aug 21   4.1         Chiba         8:51:16    8:51:35                52                      8:56
Aug 25   4.3         Uraga-oki     2:22:49    2:23:21                23                      02:27
Aug 25   3.5         Fukushima     22:21:16   22:22:29               13                      22:26
Aug 27   3.9         Wakayama      17:47:30   17:48:11               16                      17:53
Aug 27   2.8         Suruga-wan    20:26:23   20:26:45               14                      20:31
Aug 31   4.5         Fukushima     00:45:54   00:46:24               32                      00:51
Sep 2    3.3         Suruga-wan    13:04:45   13:05:04               18                      13:10
Sep 2    3.6         Bungo-suido   17:37:53   17:38:27               3                       17:43
Table 2 presents some facts about earthquake detection by our system. We investigated
10 earthquakes from 18 August to 2 September, all of which were detected by our system.
The first tweet about an earthquake typically appears within a minute or so.
The delay can result from the time a user takes to post a tweet, the time to index
the post, and the time for our system to issue queries. Every earthquake elicited more
than 10 tweets within 10 min, except the one in Bungo-suido, which is the sea between
two big islands: Kyushu and Shikoku. Our system sent e-mails mostly within a minute,
sometimes within 20 s. This delivery is far earlier than the rapid announcements of the
Japan Meteorological Agency (JMA), which are widely broadcast on TV; on average, a JMA
announcement is broadcast 6 min after an earthquake occurs. Statistically, we detected
96% of earthquakes of JMA seismic intensity scale9 3 or more, as shown in Table 3.

Table 3. Earthquake detection performance for two months from August 2009
JMA intensity scale   2 or more    3 or more    4 or more
Num. of earthquakes   78           25           3
Detected              70 (89.7%)   24 (96.0%)   3 (100.0%)
Promptly detected8    53 (67.9%)   20 (80.0%)   3 (100.0%)

Fig. 7. The locations of the tweets on the earthquake
Fig. 8. Number of tweets related to earthquakes
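The delay between each earthquake and the first detected tweet can be recomputed from Table 2. The sketch below assumes that the "Time" column is the time the earthquake occurred and that both timestamps fall on the same day; the names `EVENTS` and `delay_seconds` are illustrative.

```python
from datetime import datetime

# (earthquake time, first detected tweet) pairs copied from Table 2.
EVENTS = [
    ("6:58:55", "7:00:30"),
    ("19:22:48", "19:23:14"),
    ("8:51:16", "8:51:35"),
    ("2:22:49", "2:23:21"),
    ("22:21:16", "22:22:29"),
    ("17:47:30", "17:48:11"),
    ("20:26:23", "20:26:45"),
    ("00:45:54", "00:46:24"),
    ("13:04:45", "13:05:04"),
    ("17:37:53", "17:38:27"),
]


def delay_seconds(quake: str, tweet: str) -> int:
    """Seconds between the earthquake and the first detected tweet (same day assumed)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(tweet, fmt) - datetime.strptime(quake, fmt)
    return int(delta.total_seconds())


delays = [delay_seconds(quake, tweet) for quake, tweet in EVENTS]
mean_delay = sum(delays) / len(delays)
print(delays, mean_delay)
```

Under these assumptions, nine of the ten first tweets arrive less than 75 s after the earthquake, which matches the "within a minute or so" description.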
Figure 8 shows the number of tweets mentioning earthquakes. Some spikes are apparent when the earthquake occurs; the number gradually decreases. Statistically, we
detected 53% of earthquakes larger than magnitude 1.0 using our system.
Figure 7 shows the locations of the tweets on the earthquake. The color of each balloon
indicates the passage of time: red represents early tweets; blue represents later tweets.
The red cross shows the earthquake center.
9 The JMA seismic intensity scale is a measure used in Japan and Taiwan to indicate earthquake
strength. Unlike the Richter magnitude scale, the JMA scale describes the degree of shaking
at a point on the earth’s surface. For example, the JMA scale 3 is, by definition, one which is
“felt by most people in the building. Some people are frightened”. It is similar to the Modified
Mercalli scale IV, which is used along with the Richter scale in the US.
Dear Alice,
We have just detected an earthquake
around Chiba. Please take care.
Best,
Toretter Alert System
Fig. 9. Sample alert e-mail
A sample e-mail is presented in Fig. 9. It alerts users and urges them to prepare for
the earthquake. The location is taken from the location registered in the user profile;
it might be wrong because the user might have registered a different place, or the user
might be traveling somewhere. Precise location estimation from previous tweets is
a subject for our future work.
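As an illustration of the notification step, the message in Fig. 9 could be assembled as follows. This is only a sketch using Python's standard `email` package: the sender address, the subject line, and the function name `build_alert` are hypothetical, and actually sending the message (for example over SMTP) is left out.

```python
from email.message import EmailMessage


def build_alert(name: str, address: str, location: str) -> EmailMessage:
    """Compose an alert e-mail following the wording of Fig. 9."""
    msg = EmailMessage()
    msg["To"] = address
    msg["From"] = "alert@example.org"  # hypothetical sender address
    msg["Subject"] = f"Earthquake detected around {location}"  # hypothetical subject
    msg.set_content(
        f"Dear {name},\n"
        "We have just detected an earthquake\n"
        f"around {location}. Please take care.\n"
        "Best,\n"
        "Toretter Alert System\n"
    )
    return msg


alert = build_alert("Alice", "alice@example.org", "Chiba")
print(alert.get_content())
```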
4 Related Work
Twitter is an interesting example of the most recent social media, and numerous studies have
investigated it. Aside from the studies introduced in Section 1, several others have
been done. Grosseck et al. investigated indicators such as the influence and trust related
to Twitter [7]. Krishnamurthy et al. crawled nearly 100,000 Twitter users and examined
the number of users each user follows, in addition to the number of users following
them. Naaman et al. analyzed contents of messages from more than 350 Twitter users
and manually classified messages into nine categories [8]. The most numerous categories are
“Me now” and “Statements and Random Thoughts”; statements about current events
correspond to the latter category.
Some studies attempt to show applications of Twitter: Borau et al. tried to use Twitter
to teach English to English-language learners [9]. Ebner et al. investigated the
applicability of Twitter for educational purposes, i.e. mobile learning [10]. The integration of
the Semantic Web and microblogging was described in a previous study [11] in which
a distributed architecture is proposed and the contents are aggregated. Jansen et al.
analyzed more than 150 thousand tweets, particularly those mentioning brands in corporate
accounts [12].
In contrast to the small number of academic studies of Twitter, many Twitter
applications exist. Some are used for analyses of Twitter data. For example, Tweettronics10
provides an analysis of tweets related to brands and products for marketing purposes.
It can classify positive and negative tweets, and can identify influential users. The
classification of tweets might be done similarly to our algorithm. Web2express Digest11
is a website that auto-discovers information from Twitter streaming data to find
interesting real-time conversations. It also uses natural language processing and sentiment
analysis to discover interesting topics, as we do in our study.
10 http://www.tweettronics.com
11 http://web2express.org
Various studies have analyzed web data other than Twitter,
particularly addressing the spatial aspect. The study most relevant to ours is one by
Backstrom et al. [13]. They use queries with location (obtained from IP addresses), and
develop a probabilistic framework for quantifying spatial variation. The model is based
on a decomposition of the surface of the earth into small grid cells; they assume that
for each grid cell x, there is a probability px that a random search from this cell will
be equal to the query under consideration. The framework finds a query’s geographic
center and spatial dispersion. Examples include baseball teams, newspapers, universities,
and typhoons. Although the motivation is very similar, the events to be detected differ.
For example, people might not issue the search query earthquake when they
experience an earthquake. Therefore, our approach complements their work. Similarly
to our work, Mei et al. targeted blogs and analyzed their spatiotemporal patterns [14].
They presented examples for Hurricane Katrina, Hurricane Rita, and iPod Nano. The
motivation of that study is similar to ours, but Twitter data are more time-sensitive; our
study examines even more time-critical events, e.g. earthquakes.
Some works have targeted collaborative bookmarking data, such as those of Flickr, from a
spatiotemporal perspective: Serdyukov et al. investigated generic methods for placing
photographs on Flickr on the world map [15]. They used a language model to place
photos, and showed that they can effectively estimate the language model through
analyses of annotations by users. Rattenbury et al. [16] specifically examined the problem
of extracting place and event semantics for tags that are assigned to photographs on
Flickr. They proposed scale-structure identification, which is a burst-detection method
based on scaled spatial and temporal segments.
5 Discussion
We plan to expand our system to detect events of various kinds from Twitter. We
developed another prototype, which detects rainbow information. A rainbow might be visible
somewhere in the world, and someone might be tweeting about it. Our system
can find the rainbow tweets using an approach similar to that used for detecting
earthquakes. One difference is that the rainbow case is not as time-sensitive as
the earthquake case. Another is that rainbows can be found in various regions simultaneously,
whereas two or more earthquakes usually do not occur together. Therefore, we can make
a “world rainbow map”. No agency is reporting rainbow information as far as we know.
Therefore, such a rainbow map is producible only through Twitter.
Other plans, which are not yet developed, include reporting sightings of
celebrities. Because people sometimes tweet when they see celebrities in town, by
aggregating these tweets, we can produce a map of celebrities found in cities. (Here
we specifically examine the potential uses of the technology. Of course, we should be
careful about privacy issues when using such features.)
Such real-time reporting offers many possible advantages, as we described herein.
By processing tweets using machine learning and semantic technology, we can produce
an advanced social medium of a new type.
Finally, we mention some related works: Although few academic studies exist for
Twitter, many Twitter applications exist. Some of them are used for analyses of Twitter data. For example, Tweettronics12 provides an analysis of tweets about brands and
products for marketing purposes. It can classify positive and negative tweets, and can
identify influential users. The classification of tweets might be done similarly to our
algorithm. Web2express Digest13 is a website which auto-discovers information from
Twitter streaming data to find real time interesting conversations. It also uses natural
language processing and sentiment analysis to discover interesting topics, as we do in
our study.
6 Conclusion
In this paper, we described an earthquake prediction system targeting
Japanese tweets. Strictly speaking, the system does not predict an earthquake but rather
informs users very promptly. The search API is integrated with semantic technology.
Consequently, the system might be designated as semantic Twitter.
This report presented several examples in which our system can produce alerts, and
described the potential expansion of our system. Twitter provides social data of a new
type. We can develop an advanced social medium integrating semantic technology.
12 http://www.tweettronics.com
13 http://web2express.org
References
1. Java, A., Song, X., Finin, T., Tseng, B.: Why we twitter: Understanding microblogging usage
and communities. In: Proc. Joint 9th WEBKDD and 1st SNA-KDD Workshop (2007)
2. Huberman, B., Romero, D., Wu, F.: Social networks that matter: Twitter under the
microscope. First Monday 14 (2009)
3. boyd, d., Golder, S., Lotan, G.: Tweet, tweet, retweet: Conversational aspects of retweeting
on twitter. In: Proc. HICSS-43 (2010)
4. Bizer, C., Heath, T., Idehen, K., Berners-Lee, T.: Linked data on the web. In: Proc. WWW
2008, pp. 1265–1266 (2008)
5. Matsuo, Y., Mori, J., Hamasaki, M., Nishimura, T., Takeda, H., Hasida, K., Ishizuka, M.:
Polyphonet: An advanced social network extraction system from the web. Journal of Web
Semantics 5(4) (2007)
6. Joachims, T.: Text categorization with support vector machines. In: Nédellec, C., Rouveirol,
C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)
7. Grosseck, G., Holotescu, C.: Analysis indicators for communities on microblogging
platforms. In: Proc. eLSE Conference (2009)
8. Naaman, M., Boase, J., Lai, C.: Is it really about me? Message content in social awareness
streams. In: Proc. CSCW 2009 (2009)
9. Borau, K., Ullrich, C., Feng, J., Shen, R.: Microblogging for language learning: Using twitter
to train communicative and cultural competence. In: Spaniol, M., Li, Q., Klamma, R., Lau,
R.W.H. (eds.) Advances in Web Based Learning – ICWL 2009. LNCS, vol. 5686, pp. 78–87.
Springer, Heidelberg (2009)
10. Ebner, M., Schiefner, M.: Microblogging - more than fun? In: Proc. IADIS Mobile Learning
Conference (2008)
11. Passant, A., Hastrup, T., Bojars, U., Breslin, J.: Microblogging: A semantic and distributed
approach. In: Proc. SFSW 2008 (2008)
12. Jansen, B., Zhang, M., Sobel, K., Chowdury, A.: Twitter power: Tweets as electronic word of
mouth. Journal of the American Society for Information Science and Technology (2009)
13. Backstrom, L., Kleinberg, J., Kumar, R., Novak, J.: Spatial variation in search engine queries.
In: Proc. WWW 2008 (2008)
14. Mei, Q., Liu, C., Su, H., Zhai, C.: A probabilistic approach to spatiotemporal theme pattern
mining on weblogs. In: Proc. WWW 2006 (2006)
15. Serdyukov, P., Murdock, V., van Zwol, R.: Placing flickr photos on a map. In: Proc. SIGIR
2009 (2009)
16. Rattenbury, T., Good, N., Naaman, M.: Towards automatic extraction of event and place
semantics from flickr tags. In: Proc. SIGIR 2007 (2007)
Linking Topics of News and Blogs
with Wikipedia for Complementary Navigation
Yuki Sato1, Daisuke Yokomoto1, Hiroyuki Nakasaki2, Mariko Kawaba3,
Takehito Utsuro1, and Tomohiro Fukuhara4
1 Graduate School of Systems and Information Engineering,
University of Tsukuba, Tsukuba, 305-8573, Japan
2 NTT DATA CORPORATION, Tokyo 135-6033, Japan
3 NTT Cyber Space Laboratories, NTT Corporation,
Yokosuka, Kanagawa, 239-0847, Japan
4 Center for Service Research,
National Institute of Advanced Industrial Science and Technology,
Tokyo 315-0064, Japan
Abstract. We study complementary navigation of news and blogs, where
Wikipedia entries are utilized as a fundamental knowledge source for linking
news articles and blog feeds/posts. In the proposed framework, given
a topic as the title of a Wikipedia entry, its Wikipedia entry body text
is analyzed as a fundamental knowledge source for the given topic, and
terms strongly related to the given topic are extracted. Those terms are
then used for ranking news articles and blog posts. In the scenario of
complementary navigation from a news article to closely related blog
posts, Japanese Wikipedia entries are ranked according to the number
of strongly related terms shared by the given news article and each
Wikipedia entry. Then, the 10 top-ranked entries are regarded as indices
for further retrieving closely related blog posts. The retrieved blog posts
are finally ranked all together and shown
to users as blogs of personal opinions and experiences that are closely
related to the given news article. In our preliminary evaluation, through
an interface for manually selecting relevant Wikipedia entries, the rate
of successfully retrieving relevant blog posts improved.
Keywords: IR, Wikipedia, news, blog, topic analysis.
1 Introduction
We study complementary navigation of news and blogs, where Wikipedia entries
are utilized as a fundamental knowledge source for linking news articles and
blog feeds/posts. In previous work, Wikipedia, news, and blogs have been intensively
studied in a wide variety of research activities. In the area of IR, Wikipedia
has been studied as a rich knowledge source for improving the performance of
text classification [1,2] as well as text clustering [3,4,5]. In the area of NLP, it
has been studied as a language resource for improving the performance of named
entity recognition [6,7], translation knowledge acquisition [8], word sense
disambiguation [9], and lexical knowledge acquisition [10]. In previous work on news
aggregation such as Newsblaster [11], NewsInEssence1 [12], and Google News2,
techniques for linking closely related news articles were intensively studied. In
addition to this previous work on the use and analysis of Wikipedia and news, blog
analysis services have also become popular. Blogs are considered to be
personal journals or market or product commentaries. While traditional search
engines continue to discover and index blogs, the blogosphere has produced custom
blog search and analysis engines, systems that employ specialized information
retrieval techniques. With respect to blog analysis services on the Internet, there
are several commercial and non-commercial services such as Technorati3,
BlogPulse4 [13], kizasi.jp5, and blogWatcher6 [14]. With respect to multilingual blog
services, Globe of Blogs7 provides a retrieval function for blog articles across
languages. Best Blogs in Asia Directory8 also provides a retrieval function for Asian
language blogs. Blogwise9 also analyzes multilingual blog articles.
Compared to those previous studies, the fundamental idea of our complementary
navigation is roughly illustrated in Figure 1. In our framework of complementary
navigation of news and blogs, Wikipedia entries are retrieved when
seeking fundamental background information, news articles are retrieved
when seeking precise news reports on facts, and blog feeds/posts are retrieved
when seeking subjective information such as personal opinions and experiences.
In the proposed framework, we regard Wikipedia as a large-scale encyclopedic
knowledge base which includes well-known facts and relatively neutral opinions.
Its Japanese version includes about 627,000 entries (as of October
2009). Given a topic as the title of a Wikipedia entry, its Wikipedia entry body
text is analyzed as a fundamental knowledge source for the given topic, and terms
strongly related to the given topic are extracted. Those terms are then used
for ranking news articles and blog feeds/posts. This fundamental technique was
published in [15,16] and was evaluated in the task of blog feed retrieval from a
Wikipedia entry. [15,16] reported that this technique outperformed the original
ranking returned by the “Yahoo! Japan” API.
In the first scenario of complementary navigation, given a news article on a
certain topic, the system retrieves blog feeds/posts on closely related topics and
shows them to users. In the example shown in Figure 1, suppose
that a user found a news article reporting that “a long queue appeared in front
of a game shop on the day a popular game Dragon Quest 9 was published”.
Then, through the complementary navigation function of our framework,
a closely related blog post, such as one posted by a person who bought the
game on the day it was published, is quickly retrieved and shown to the user. In
this scenario, first, about 600,000 Japanese Wikipedia entries are
ranked according to the number of strongly related terms shared by the given
news article and each Wikipedia entry. Then, the 10 top-ranked entries are regarded
as indices for further retrieving closely related blog feeds/posts. The retrieved
blog feeds/posts are finally ranked all together and shown to users as blogs of
personal opinions and experiences that are
closely related to the given news article.
Fig. 1. Framework of Complementary Navigation among Wikipedia, News, and Blogs
1 http://www.newsinessence.com/nie.cgi
2 http://news.google.com/
3 http://technorati.com/
4 http://www.blogpulse.com/
5 http://kizasi.jp/ (in Japanese)
6 http://blogwatcher.pi.titech.ac.jp/ (in Japanese)
7 http://www.globeofblogs.com/
8 http://www.misohoni.com/bba/
9 http://www.blogwise.com/
In the second scenario of complementary navigation, which is the opposite
direction from the first one, given a blog feed/post on a certain topic, the system
retrieves news articles on closely related topics and shows them to users. This
scenario is primarily intended for the case where a blog feed/post refers to a
certain news article and includes some personal opinions regarding the news:
the system retrieves the news article referred to by the blog feed/post and shows
it to users.
Finally, in the third scenario of complementary navigation, given a news article
or a blog feed/post on a certain topic, the system retrieves one or more closely
related Wikipedia entries and shows them to users. In the example
shown in Figure 1, suppose that a user found either a news article reporting the
publication of Dragon Quest 9 or a blog post by a person who bought the game
on the day it was published. Then, through the complementary navigation function
of our framework, the most relevant Wikipedia entry, namely, that
of Dragon Quest 9, is quickly retrieved and shown to the user. This scenario is
intended to show users background knowledge found in Wikipedia, given a news
article or a blog feed/post on a certain topic.
Based on the introduction of the overall framework of complementary navigation among Wikipedia, news, and blogs above, this paper focuses on the
formalization of the first scenario of complementary navigation for retrieving
closely related blog posts given a news article of a certain topic. Section 2 first
describes how to extract terms that are included in each Wikipedia entry and
are closely related to it. According to the procedure to be presented in section 3,
those terms are then used to retrieve blog posts that are closely related to each
Wikipedia entry. Based on those fundamental techniques, section 4 formalizes
the similarity measure between the given news article and each blog post, and
then presents the procedure of ranking blog posts that are related to the given
news article. Section 5 introduces a user interface for complementary navigation, to be used for manually selecting Wikipedia entries which are relevant to
the given news article and are effective in retrieving closely related blog posts.
Section 5 also presents results of evaluating our framework. Section 6 presents a
comparison with previous work related to this paper.
2 Extracting Related Terms from a Wikipedia Entry
In our framework of linking news and blogs through Wikipedia entries, we regard terms that are included in each Wikipedia entry body text and are closely
related to the entry as representing conceptual indices of the entry. Those closely
related terms are then used for retrieving related blog posts and news articles.
More specifically, from the body text of each Wikipedia entry, we extract boldfaced terms, anchor texts of hyperlinks, and the title of a redirect, which is a
synonymous term of the title of the target page [15,16,17]. We also extract all
the noun phrases from the body text of each Wikipedia entry.
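A minimal sketch of the markup-based part of this extraction is shown below. It relies only on standard MediaWiki markup (`'''...'''` for bold text and `[[target|label]]` for internal links). Redirect titles are not part of an entry's own markup, so they are assumed to be supplied separately, and noun phrase extraction is omitted because it needs a language-specific parser. The function name and the sample text are illustrative.

```python
import re
from typing import Dict, List

# '''bold''' spans and [[target]] / [[target|label]] internal links.
BOLD = re.compile(r"'''(.+?)'''")
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_related_terms(wikitext: str, redirect_titles: List[str]) -> Dict[str, List[str]]:
    """Collect bold-faced terms, anchor texts, and redirect titles for one entry."""
    bold_terms = BOLD.findall(wikitext)
    # The anchor text is the label if present, otherwise the link target.
    anchors = [label or target for target, label in LINK.findall(wikitext)]
    return {"bold": bold_terms, "anchor": anchors, "redirect": list(redirect_titles)}


sample = (
    "The '''Kyoto Protocol''' is a treaty on [[global warming]] "
    "and [[carbon dioxide|CO2]] emissions."
)
terms = extract_related_terms(sample, ["Kyoto Accord"])  # hypothetical redirect title
```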
3 The Procedure of Retrieving Blog Posts Related to a Wikipedia Entry
This section describes the procedure of retrieving blog posts that are related to
a Wikipedia entry [15,16]. In this procedure, given a Wikipedia entry title, first,
closely related blog feeds are retrieved, and then, from the retrieved blog feeds,
closely related blog posts are further selected.
3.1 Blog Feed Retrieval
This section briefly describes how to retrieve blog feeds given a Wikipedia entry
title.
In order to collect candidate blog feeds for a given query, in this paper, we
use existing Web search engine APIs, which return a ranked list of blog posts
given a topic keyword. We use the Japanese search engine “Yahoo! Japan” API10.
Blog hosts are limited to 11 major hosts11. We employ the following procedure
for the blog distillation:
i) Given a topic keyword, a ranked list of blog posts is returned by a Web
search engine API.
ii) A list of blog feeds is generated from the returned ranked list of blog posts
by simply removing duplicated feeds.
iii) Re-rank the list of blog feeds according to the number of hits of the topic
keyword in each blog feed. The number of hits for a topic keyword in each
blog feed is simply measured by the search engine API used for collecting
blog posts above in i), restricting the domain of the URL to each blog feed.
[15,16] reported that the procedure above outperformed the original ranking
returned by “Yahoo! Japan” API.
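Steps ii) and iii) of the procedure above can be sketched as follows. The search engine itself is not modelled: the ranked posts of step i) are passed in as plain dictionaries, and the hit count of step iii) is abstracted as a caller-supplied function `count_hits`, which is a hypothetical stand-in for a site-restricted search API query.

```python
from typing import Callable, Dict, List


def distill_feeds(
    ranked_posts: List[Dict[str, str]],
    keyword: str,
    count_hits: Callable[[str, str], int],
) -> List[str]:
    """Turn a ranked list of posts into a list of feeds re-ranked by keyword hits."""
    # Step ii: keep the first occurrence of each feed, preserving the original order.
    feeds = list(dict.fromkeys(post["feed"] for post in ranked_posts))
    # Step iii: re-rank by the number of hits of the keyword within each feed.
    # sorted() is stable, so feeds with equal hit counts keep their original order.
    return sorted(feeds, key=lambda feed: count_hits(keyword, feed), reverse=True)


# Toy data standing in for search engine results (all URLs are made up).
posts = [
    {"url": "http://a.example/1", "feed": "http://a.example/"},
    {"url": "http://b.example/1", "feed": "http://b.example/"},
    {"url": "http://a.example/2", "feed": "http://a.example/"},
    {"url": "http://c.example/1", "feed": "http://c.example/"},
]
hits = {"http://a.example/": 3, "http://b.example/": 7, "http://c.example/": 3}
ranked_feeds = distill_feeds(posts, "Kyoto Protocol", lambda kw, feed: hits[feed])
```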
3.2 Blog Post Retrieval
From the retrieved blog feeds, we next select blog posts that are closely related to
the given Wikipedia entry title. To do this, we use related terms extracted from
the given Wikipedia entry as described in section 2. More specifically, out of the
extracted related terms, we use bold-faced terms, anchor texts of hyperlinks, and
the title of a redirect, which is a synonymous term of the title of the target page.
Then, blog posts which contain the topic name or at least one of the extracted
related terms are automatically selected.
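The selection rule in the last sentence can be sketched as a simple filter. Plain substring matching is an assumption of this sketch; the original system may match terms differently (for example after morphological analysis), and the function name is illustrative.

```python
from typing import List


def select_posts(posts: List[str], topic: str, related_terms: List[str]) -> List[str]:
    """Keep posts that contain the topic name or at least one related term."""
    needles = [topic, *related_terms]
    return [post for post in posts if any(needle in post for needle in needles)]


candidate_posts = [
    "My thoughts on the Kyoto Protocol",
    "Cutting carbon dioxide at home",
    "A weekend trip to the seaside",
]
selected = select_posts(candidate_posts, "Kyoto Protocol", ["carbon dioxide", "carbon offset"])
```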
4 Similarities of Wikipedia Entries, News, and Blogs
In the scenario of retrieving blog posts closely related to a given news article,
the most important component is how to measure the similarity between the
given news article and each blog post. This section describes how we design this
similarity.
In this scenario, the fundamental component is how to measure the similarity
$\mathrm{Sim}_{w,n}(E, N)$ between a Wikipedia entry E and a news article N, and the
similarity $\mathrm{Sim}_{w,b}(E, B)$ between a Wikipedia entry E and a blog post B. The
similarity measure $\mathrm{Sim}_{w,n}(E, N)$ is used when, given a news article on a certain
topic, ranking Wikipedia entries according to whether each entry is related to the
given news article. The similarity measure $\mathrm{Sim}_{w,b}(E, B)$ is used when, from the
highly ranked Wikipedia entries closely related to the given news article,
retrieving blog posts related to any of those entries. Then, based on those similarities
$\mathrm{Sim}_{w,n}(E, N)$ and $\mathrm{Sim}_{w,b}(E, B)$, the overall similarity measure
$\mathrm{Sim}_{n,w,b}(N, B)$ between the given news article N and each blog post B is
introduced. Finally, blog posts are ranked according to this overall similarity measure.
10 http://www.yahoo.co.jp/ (in Japanese)
11 FC2.com, yahoo.co.jp, rakuten.ne.jp, ameblo.jp, goo.ne.jp, livedoor.jp,
Seesaa.net, jugem.jp, yaplog.jp, webry.info.jp, hatena.ne.jp
4.1 Similarity of a Wikipedia Entry and a News Article / A Blog Post
The similarities $\mathrm{Sim}_{w,n}(E, N)$ and $\mathrm{Sim}_{w,b}(E, B)$ are measured in terms of the
entry title and the related terms extracted from the Wikipedia entry as described
in section 2. The similarity $\mathrm{Sim}_{w,n}(E, N)$ between a Wikipedia entry E and a
news article N is defined as a weighted sum of frequencies of the entry title and
the related terms:
\[ \mathrm{Sim}_{w,n}(E, N) = \sum_{t} w(\mathit{type}(t)) \times \mathit{freq}(t) \]
where the sum runs over the entry title and the related terms t of E, and the weight
$w(\mathit{type}(t))$ is defined as 1 when t is the entry title, the title of a redirect, a
bold-faced term, the title of a paragraph, or a noun phrase extracted from the
body text of the entry. The similarity $\mathrm{Sim}_{w,b}(E, B)$ between a Wikipedia entry
E and a blog post B is defined as a weighted sum of frequencies of the entry
title and the related terms:
\[ \mathrm{Sim}_{w,b}(E, B) = \sum_{t} w(\mathit{type}(t)) \times \mathit{freq}(t) \]
where the weight $w(\mathit{type}(t))$ is defined as 3 when t is the entry title or the title of
a redirect, as 2 when t is a bold-faced term, and as 0.5 when t is an anchor text of
hyperlinks12.
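Both weighted sums can be sketched with one function and two weight tables. The weights are the ones stated above; everything else is an assumption of this sketch: the term frequency is taken to be a non-overlapping substring count in the document text, term types missing from a table contribute nothing, and the names `NEWS_WEIGHTS`, `BLOG_WEIGHTS`, and `weighted_similarity` are illustrative.

```python
from typing import Dict, List, Tuple

# Weights per term type, as stated in the text.
NEWS_WEIGHTS = {"title": 1.0, "redirect": 1.0, "bold": 1.0,
                "paragraph_title": 1.0, "noun_phrase": 1.0}
BLOG_WEIGHTS = {"title": 3.0, "redirect": 3.0, "bold": 2.0, "anchor": 0.5}


def weighted_similarity(terms: List[Tuple[str, str]], text: str,
                        weights: Dict[str, float]) -> float:
    """Sum over terms of weight(type) * frequency of the term in the text."""
    # Assumption: frequency is a plain substring count; unlisted types weigh 0.
    return sum(weights.get(kind, 0.0) * text.count(term) for term, kind in terms)


# (term, type) pairs for one hypothetical Wikipedia entry.
entry_terms = [("Kyoto Protocol", "title"), ("carbon offset", "bold"), ("Japan", "anchor")]
blog_post = ("Kyoto Protocol and carbon offset: Japan must act. "
             "The Kyoto Protocol matters.")

sim_blog = weighted_similarity(entry_terms, blog_post, BLOG_WEIGHTS)
sim_news = weighted_similarity(entry_terms, blog_post, NEWS_WEIGHTS)
```

Here the same text is scored with both tables only to show how the weighting changes the result; in the framework the news table is applied to news articles and the blog table to blog posts.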
4.2 Similarity of a News Article and a Blog Post through Wikipedia Entries
In the design of the overall similarity measure $\mathrm{Sim}_{n,w,b}(N, B)$ between a news
article N and a blog post B through Wikipedia entries, we consider two factors.
One of them is to measure the similarity between a news article and a blog post
indirectly through Wikipedia entries which are closely related to both the news
article and the blog post. The other is to directly measure
their similarity simply based on their text contents. In this paper, the first factor
is represented as the sum of the similarity $\mathrm{Sim}_{w,n}(E, N)$ between a news article
N and a Wikipedia entry E and the similarity $\mathrm{Sim}_{w,b}(E, B)$ between a blog post
B and a Wikipedia entry E. The second factor is denoted as the direct document
similarity $\mathrm{Sim}_{n,b}(N, B)$ between a news article N and a blog post B, where we
simply use cosine measure as the direct document similarity. Finally, based on
the argument above, we define the overall similarity measure $\mathrm{Sim}_{n,w,b}(N, B)$
between a news article N and a blog post B through Wikipedia entries as the
weighted sum of the two factors below:
\[ \mathrm{Sim}_{n,w,b}(N, B) = (1 - K_{w,nb})\,\mathrm{Sim}_{n,b}(N, B) + K_{w,nb} \sum_{E} \bigl( \mathrm{Sim}_{w,n}(E, N) + \mathrm{Sim}_{w,b}(E, B) \bigr) \]
where $K_{w,nb}$ is the coefficient for the weight. In the evaluation of section 5.2, we
show results with this coefficient $K_{w,nb}$ as 1, since the results with $K_{w,nb}$ as 1
are always better than those with $K_{w,nb}$ as 0.5.
12 In [17], we applied machine learning technique to the task of judging whether a
Wikipedia entry and a blog feed are closely related, where we incorporated features
other than the frequencies of related terms in a blog feed and achieved improvement.
Following the discussion in [15,16], the technique proposed by [17] outperforms the
original ranking returned by “Yahoo! Japan” API. As a future work, we are planning
to apply the technique of [17] to the task of complementary navigation studied in
this paper.
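The weighted combination of the direct and indirect factors can be sketched as follows. The cosine measure here is computed over whitespace-separated, lower-cased word counts, which is an assumption of this sketch (the text only states that a cosine measure is used). The per-entry similarities are passed in as precomputed pairs, and all function names are illustrative.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[word] * vb[word] for word in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def overall_similarity(direct: float,
                       entry_sims: Iterable[Tuple[float, float]],
                       k: float = 1.0) -> float:
    """(1 - K) * direct similarity + K * sum over entries of (Sim_wn + Sim_wb)."""
    indirect = sum(sim_news + sim_blog for sim_news, sim_blog in entry_sims)
    return (1 - k) * direct + k * indirect


direct = cosine("kyoto protocol talks", "kyoto protocol problems")
# One (Sim_wn, Sim_wb) pair per Wikipedia entry; the values are made up.
pairs = [(3.0, 8.5), (1.0, 2.0)]
score_k1 = overall_similarity(direct, pairs, k=1.0)
score_k05 = overall_similarity(0.5, pairs, k=0.5)
```

With the coefficient set to 1, as in the reported evaluation, the direct cosine term drops out and the score is determined entirely by the Wikipedia-based similarities.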
4.3 Ranking Blog Posts Related to News through Wikipedia Entries
Based on the formalization in the previous two sections, given a news article N,
this section presents the procedure of retrieving blog posts closely related to the
given news article and then ranking them.
First, suppose that the news article N contains titles of Wikipedia entries
$E_1, \ldots, E_n$ in its body text. Then, those entries $E_1, \ldots, E_n$ are ranked
according to their similarities $\mathrm{Sim}_{w,n}(E_i, N)$ ($i = 1, \ldots, n$) against the given news
article N, and the 10 top-ranked entries $E_1, \ldots, E_{10}$ are selected. Next, each $E_i$
($i = 1, \ldots, 10$) of those 10 top-ranked entries is used to retrieve closely related
blog posts according to the procedure presented in section 3. Finally, the
retrieved blog posts $B_1, \ldots, B_m$ all together are ranked according to their
similarities $\mathrm{Sim}_{n,w,b}(N, B_j)$ ($j = 1, \ldots, m$) against the given news article N.
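The whole procedure can be sketched as a small pipeline. The three similarity and retrieval components are passed in as functions, so the sketch fixes only the control flow: filter entries by title occurrence, keep the top-ranked ones, pool the retrieved posts, and rank them. The toy scoring functions in the example are stand-ins and are not the measures defined above; all names are illustrative.

```python
from typing import Callable, Dict, List, Tuple


def rank_blog_posts(
    news: str,
    entry_titles: List[str],
    sim_wn: Callable[[str, str], float],
    retrieve_posts: Callable[[str], List[str]],
    sim_nwb: Callable[[str, str, List[str]], float],
    top_k: int = 10,
) -> Tuple[List[str], List[str]]:
    """Return the selected entries and the blog posts ranked against the news article."""
    # Entries whose titles appear in the body text of the news article.
    candidates = [title for title in entry_titles if title in news]
    # Rank entries by similarity to the news article and keep the top ones.
    top_entries = sorted(candidates, key=lambda e: sim_wn(e, news), reverse=True)[:top_k]
    # Pool the posts retrieved for every selected entry, dropping duplicates.
    posts = list(dict.fromkeys(p for e in top_entries for p in retrieve_posts(e)))
    # Rank all pooled posts together by the overall similarity.
    ranked = sorted(posts, key=lambda p: sim_nwb(news, p, top_entries), reverse=True)
    return top_entries, ranked


# Toy components for illustration only.
news_article = "Kyoto Protocol talks: Japan backs the Kyoto Protocol on carbon offset."
titles = ["Kyoto Protocol", "Japan", "carbon offset", "hotel"]
index: Dict[str, List[str]] = {
    "Kyoto Protocol": ["post about Kyoto Protocol and Japan", "Kyoto Protocol has problems"],
    "Japan": ["post about Kyoto Protocol and Japan", "travel in Japan"],
}
top, ranked_posts = rank_blog_posts(
    news_article,
    titles,
    sim_wn=lambda entry, text: text.count(entry),
    retrieve_posts=lambda entry: index.get(entry, []),
    sim_nwb=lambda text, post, entries: sum(post.count(e) for e in entries),
    top_k=2,
)
```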
5 Manually Selecting Wikipedia Entries in Linking News to Related Blog Posts
In this section, we introduce a user interface for complementary navigation
with a facility of manually selecting Wikipedia entries which are relevant to
the given news article. With this interface, a user can judge whether each candidate Wikipedia entry is effective in retrieving closely related blog posts. We
then evaluate the overall framework of complementary navigation and present
the evaluation results.
5.1 The Procedure
This section describes the procedure of linking a news article to closely related
blog posts, where the measure for ranking related blog posts is based on the
formalization presented in section 4.3. In this procedure, we also use an interface
for manually selecting Wikipedia entries which are relevant to the given news
article.
Fig. 2. Interface for Complementary Navigation from News to Blogs through Wikipedia
Entries
The snapshots of the interface are shown in Figure 2. First, in “News Article
Browser”, a user can browse through a list of news articles and can select one for
which he/she wants to retrieve related blog posts. Next, for the selected news article, “Interface for Manually Selecting Relevant Wikipedia Entries” appears. In
this interface, following the formalization of section 4.3, top ranked 10 Wikipedia
entry titles are shown as candidates for retrieving blog posts that are related to
the given news article. Then, the user can select any subset of the 10 candidate
Wikipedia entry titles to be used for retrieving related blog posts. With the
subset of the selected Wikipedia entry titles, “Browser for Relevant Blog Post
Ranking” is called, where the retrieved blog posts are ranked according to the
formalization of section 4.3. Finally, the user can browse through “High Ranked
Blog Posts” by simply clicking the links to those blog posts.
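The flow through the interface can be sketched as follows. This is only an illustrative
outline under stated assumptions: all function and parameter names are hypothetical, the
retrieval step is assumed here to be a simple title match, and `overlap_similarity` is a
stand-in word-overlap score, not the similarity measure of section 4.3.

```python
import re


def words(text):
    """Lowercased word tokens of a text."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_similarity(news, post):
    """Stand-in score (Jaccard word overlap); NOT the measure of section 4.3."""
    a, b = words(news), words(post)
    return len(a & b) / len(a | b) if a | b else 0.0


def link_news_to_blogs(news, candidates, selected, posts,
                       similarity=overlap_similarity):
    """Retrieve posts with the user-selected entry titles, then rank them."""
    unknown = set(selected) - set(candidates)
    if unknown:
        raise ValueError(f"not among the candidates: {sorted(unknown)}")
    # Assumed retrieval rule: keep posts that mention a selected title.
    retrieved = [p for p in posts
                 if any(t.lower() in p.lower() for t in selected)]
    # Rank the retrieved posts by similarity to the news article.
    return sorted(retrieved, key=lambda p: similarity(news, p), reverse=True)
```

Passing the similarity function as a parameter keeps the sketch independent of the
actual ranking measure, which can be substituted without changing the flow.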
Table 1 lists the four news articles on “Kyoto Protocol” that are used in the
evaluation in the next section. For each news article, the table shows its summary
and the top 10 ranked Wikipedia entry titles, where the entry titles judged as relevant
to the news article are enclosed in squares. The table also shows the summary of an
example relevant blog post.
Table 1. Summaries of News Articles for Evaluation, Candidates for Relevant
Wikipedia Entries, and Summaries of Relevant Blog Posts
(manually selected entries are shown in [square brackets])

News article ID: 1
  Summary of news article: Reports on Japan’s activities on “carbon offset”, reduction
    of electric power consumption, and preventing global warming. (date: Jan. 25, 2008)
  Top ranked 10 entries as candidates for relevant Wikipedia entries: environmental issues,
    [Kyoto Protocol], Japan, automobile, [carbon offset], transport, United States, hotel,
    [carbon dioxide], contribution
  Summary of relevant blog posts: “I understand the significance of Kyoto protocol, but I
    think it also has problems.” (blogger A)

News article ID: 2
  Summary of news article: Reports on a meeting for “carbon offset”. (date: Mar. 31, 2008)
  Top ranked 10 entries as candidates for relevant Wikipedia entries: [Kyoto Protocol],
    [carbon emissions trading], Japan, [post-Kyoto negotiations], [energy conservation],
    Poland, fluorescent lamp, technology, [greenhouse gases], industry
  Summary of relevant blog posts: “Japan has to rely on economic approaches such as
    carbon offset.” (blogger A)

News article ID: 3
  Summary of news article: Reports on issues towards post-Kyoto negotiations.
    (date: Aug. 28, 2008)
  Top ranked 10 entries as candidates for relevant Wikipedia entries:
    [post-Kyoto negotiations], United Nations, protocol, [carbon dioxide], United States,
    debate, Kyoto, [greenhouse gases], minister, Poland
  Summary of relevant blog posts: Referring to a news article on World Economic Forum.
    (blogger B)

News article ID: 4
  Summary of news article: Discussion on global warming such as issues regarding
    developing countries and technologies for energy conservation in Japan.
    (date: Jun. 29, 2008)
  Top ranked 10 entries as candidates for relevant Wikipedia entries: Japan,
    [global warming], environmental issues, United States, politics, resource,
    [34th G8 summit], India, [fossil fuels], society
  Summary of relevant blog posts: Engineers of Japanese electric power companies make
    progress in research and development. (blogger C)
5.2 Evaluation
The Procedure. To each of the four news articles on “Kyoto Protocol” listed
in Table 1, we apply the procedure of retrieving related blog posts described in
the previous section. We then manually judge the relevance of each of the top ranked N
blog posts on the following three levels: (i) closely related, (ii) partially related,
and (iii) not related. Next, we consider the following two cases in measuring the
rate of relevant blog posts:
(a) Only closely related blog posts (judged as (i)) are regarded as relevant.
(b) Both closely related blog posts (judged as (i)) and partially related blog
posts (judged as (ii)) are regarded as relevant.
For both cases, the rate of relevant blog posts is defined as
\[ \text{rate of relevant blog posts} = \frac{\text{the number of relevant blog posts}}{N} \]
In the evaluation of this section, we set N = 10.
Fig. 3. Evaluation Results of the Ratio of Relevant Blog Posts (%): Comparison of with
/ without Manual Selection of Relevant Wikipedia Entries. Panel (a): relevant blog
posts = closely related blog posts only. Panel (b): relevant blog posts = closely
related blog posts + partially related blog posts.
Evaluation Results. In terms of the rate of relevant blog posts, Figure 3
compares the two cases of with / without manually selecting Wikipedia entries
relevant to the given news article through the interface introduced in the previous
section. In Figure 3 (a), we regard only closely related blog posts as relevant,
where the rates of relevant blog posts improve from 0% to 10∼60%. In Figure 3
(b), we regard both closely and partially related blog posts as relevant, where
the rates of relevant blog posts improve from 0∼10% to 80∼90%.
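As a minimal sketch, the rate of relevant blog posts for the two cases (a) and (b) can
be computed as below. The label names and the example judgements are invented for
illustration and are not data from the evaluation.

```python
def relevance_rate(judgements, n=10, include_partial=False):
    """Rate of relevant blog posts among the top-n ranked posts.

    judgements: relevance labels in rank order, each one of
    "closely", "partially", or "not" (hypothetical label names).
    """
    relevant = {"closely", "partially"} if include_partial else {"closely"}
    return sum(label in relevant for label in judgements[:n]) / n


# Invented judgements for one news article, for illustration only.
example = ["closely"] * 3 + ["partially"] * 5 + ["not"] * 2
rate_a = relevance_rate(example)                        # case (a)
rate_b = relevance_rate(example, include_partial=True)  # case (b)
```

The denominator is always N, as in the definition, so a ranking with fewer than N
retrieved posts is penalized rather than normalized by its own length.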
This result makes clear that the current formalization presented in this paper is
weak in its similarity measure for ranking related Wikipedia entries. As can be seen
from the top 10 ranked Wikipedia entry titles in Table 1 and from those manually
selected out of the 10 entries, general terms and country names such as “automobile”,
“transport”, “Japan”, and “United States” are major causes of the low rates of
relevance. Those general terms and country names mostly damage the step of retrieving
related blog posts and the final ranking of the retrieved blog posts. However, the
result also clearly shows that, once closely related Wikipedia entries are manually
selected, the rates of relevant blog posts improve drastically. This indicates that the
most important issue to be examined first is how to model the measure for ranking
Wikipedia entries that are related to a given news article. We discuss this issue
as future work in section 7.
6 Related Works
Among several related works, [18,19] studied linking related news and blogs. Their
approaches differ from the one proposed in this paper in that our method conceptually
links topics of news articles and blog posts based on Wikipedia entry texts. [18]
focused on linking news articles and blogs based on citations from blogs to news
articles. [19] studied linking news articles to blogs posted within one week after
each news article is released, employing a document vector space model modified by
considering terms closely related to each news article.
[20] also studied mining comparative differences of concerns in news streams
from multiple sources. [21] studied how to analyze sentiment distribution in news
articles across 9 languages. Those previous works mainly focus on news streams
and documents other than blogs.
Techniques studied in previous works on text classification [1,2] as well as text
clustering [3,4,5] using Wikipedia knowledge are similar to the method proposed
in this paper in that they are based on related terms extracted from Wikipedia,
such as hyponyms, synonyms, and associated terms. The fundamental ideas of
those previously studied techniques are also applicable to our task. Major differences between our work and those works are in that we design our framework
as having the intermediate phase of ranking Wikipedia entries related to a given
news article.
7 Conclusion
This paper studied complementary navigation of news and blogs, where Wikipedia
entries are utilized as a fundamental knowledge source for linking news articles and
blog posts. In this paper, we focused on the scenario of complementary navigation
from a news article to closely related blog posts. In our preliminary evaluation,
we showed that the rate of successfully retrieving relevant blog posts improved
through an interface for manually selecting relevant Wikipedia entries. Future
work includes improving the measure for ranking Wikipedia entries which are
related to a given news article. So far, we have examined a novel measure which
incorporates clustering of Wikipedia entries in terms of the similarity of their
body texts. The underlying motivation of this novel measure is to prefer a small
number of entries which have quite high similarities with each other, and we
have already confirmed that this approach drastically improves the ranking of
Wikipedia entries. We are planning to evaluate this measure against a much
larger evaluation data set and the result will be reported in the near future.
References
1. Gabrilovich, E., Markovitch, S.: Overcoming the brittleness bottleneck using
Wikipedia: Enhancing text categorization with encyclopedic knowledge. In: Proc.
21st AAAI, pp. 1301–1306 (2006)
2. Wang, P., Domeniconi, C.: Building semantic kernels for text classification using
Wikipedia. In: Proc. 14th SIGKDD, pp. 713–721 (2008)
3. Hu, J., Fang, L., Cao, Y., Zeng, H.J., Li, H., Yang, Q., Chen, Z.: Enhancing text
clustering by leveraging Wikipedia semantics. In: Proc. 31st SIGIR, pp. 179–186
(2008)
4. Huang, A., Frank, E., Witten, I.H.: Clustering document using a Wikipedia-based
concept representation. In: Proc. 13th PAKDD, pp. 628–636 (2009)
5. Hu, X., Zhang, X., Lu, C., Park, E.K., Zhou, X.: Exploiting Wikipedia as external
knowledge for document clustering. In: Proceedings of the 15th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pp. 389–396
(2009)
6. Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data.
In: Proc. EMNLP-CoNLL, pp. 708–716 (2007)
7. Kazama, J., Torisawa, K.: Exploiting Wikipedia as external knowledge for named
entity recognition. In: Proc. EMNLP-CoNLL, pp. 698–707 (2007)
8. Oh, J.H., Kawahara, D., Uchimoto, K., Kazama, J., Torisawa, K.: Enriching
multilingual language resources by discovering missing cross-language links in
Wikipedia. In: Proc. 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 322–328 (2008)
9. Mihalcea, R., Csomai, A.: Wikify! linking documents to encyclopedic knowledge.
In: Proceedings of the 16th ACM Conference on Information and Knowledge Management, pp. 233–242 (2007)
10. Sumida, A., Torisawa, K.: Hacking Wikipedia for hyponymy relation acquisition.
In: Proc. 3rd IJCNLP, pp. 883–888 (2008)
11. McKeown, K.R., Barzilay, R., Evans, D., Hatzivassiloglou, V., Klavans, J.L.,
Nenkova, A., Sable, C., Schiffman, B., Sigelman, S.: Tracking and summarizing
news on a daily basis with Columbia’s Newsblaster. In: Proc. 2nd HLT, pp. 280–285
(2002)
12. Radev, D., Otterbacher, J., Winkel, A., Blair-Goldensohn, S.: NewsInEssence:
Summarizing online news topics. Communications of the ACM 48, 95–98 (2005)
13. Glance, N., Hurst, M., Tomokiyo, T.: Blogpulse: Automated trend discovery for
Weblogs. In: WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation,
Analysis and Dynamics (2004)
14. Nanno, T., Fujiki, T., Suzuki, Y., Okumura, M.: Automatically collecting, monitoring, and mining Japanese weblogs. In: WWW Alt. 2004: Proc. 13th WWW
Conf. Alternate Track Papers & Posters, pp. 320–321 (2004)
15. Kawaba, M., Nakasaki, H., Utsuro, T., Fukuhara, T.: Cross-lingual blog analysis
based on multilingual blog distillation from multilingual Wikipedia entries. In:
Proceedings of International Conference on Weblogs and Social Media, pp. 200–
201 (2008)
16. Nakasaki, H., Kawaba, M., Yamazaki, S., Utsuro, T., Fukuhara, T.: Visualizing
cross-lingual/cross-cultural differences in concerns in multilingual blogs. In: Proceedings of International Conference on Weblogs and Social Media, pp. 270–273
(2009)
17. Kawaba, M., Yokomoto, D., Nakasaki, H., Utsuro, T., Fukuhara, T.: Linking
Wikipedia entries to blog feeds by machine learning. In: Proc. 3rd IUCS (2009)
18. Gamon, M., Basu, S., Belenko, D., Fisher, D., Hurst, M., Konig, A.C.: Blews: Using
blogs to provide context for news articles. In: Proc. ICWSM, pp. 60–67 (2008)
19. Ikeda, D., Fujiki, T., Okumura, M.: Automatically linking news articles to blog entries. In: Proc. 2006 AAAI Spring Symp. Computational Approaches to Analyzing
Weblogs, pp. 78–82 (2006)
20. Yoshioka, M.: IR Interface for Contrasting Multiple News Sites. In: Proc. 4th AIRS,
pp. 516–521 (2008)
21. Bautin, M., Vijayarenu, L., Skiena, S.: International Sentiment Analysis for News
and Blogs. In: Proc. ICWSM, pp. 19–26 (2008)
A User-Oriented Splog Filtering
Based on a Machine Learning
Takayuki Yoshinaka1 , Soichi Ishii1 , Tomohiro Fukuhara2 ,
Hidetaka Masuda3 , and Hiroshi Nakagawa4
1 School of Science and Technology for Future Life, Tokyo Denki University,
2-2 Kanda Nishikicho, Chiyoda-ku, Tokyo 101-8457, Japan
[email protected]
2 Research into Artifacts, Center for Engineering, The University of Tokyo,
5-1-5, Kashiwanoha, Kashiwa, Chiba 277-0882, Japan
[email protected]
3 School of Science and Technology for Future Life, Tokyo Denki University,
2-2 Kanda Nishikicho, Chiyoda-ku, Tokyo 101-8457, Japan
[email protected]
4 Information Technology Center, The University of Tokyo,
7-3-1 Hongou, Bunkyo-ku, Tokyo 113-0033, Japan
[email protected]
Abstract. We describe a method for filtering spam blogs (splogs) based on a machine
learning technique, together with its evaluation results. Today, splogs have become
one of the major issues on the Web. A difficulty with splogs is that the value of a
blog site differs from person to person. We propose a novel user-oriented splog
filtering method that can adapt to each user’s preference for valuable blogs. We use
an SVM (Support Vector Machine) to create a personalized splog filter for each user.
We conducted two experiments: (1) an experiment on individual splog judgement, and
(2) an experiment on user-oriented splog filtering. From the former experiment, we
found that there exist ‘gray’ blogs that need to be treated on a per-person basis.
From the latter experiment, we found that we can provide appropriate personalized
filters by choosing the best feature set for each user. An overview of the proposed
method and the evaluation results are described.
1 Introduction
Today, many people own blog sites and publish articles on them. There are many types
of blog sites on the Web, such as blogs that advertise books and commodities, blogs on
programming, and personal diaries. At the same time, a lot of spam blogs (splogs) are
created by spam bloggers (sploggers). These splogs form a ‘splogosphere’[1]. Splogs
cause several problems on the Web. For example, splogs degrade the quality of search
results. Although splogs should be removed from search results, it is not easy to
identify splogs for each user, because a blog marked as a splog by one person may be
marked as an authentic (valuable) site by another person. Thus, a user-oriented splog
filtering method that can adapt to each user’s preference for valuable blogs is needed.
We propose a user-oriented splog filtering method that can adapt to each user’s
preference. To create a personalized filter, our method collects individual splog
data. Personalized filters are then created from this individual splog data using the
support vector machine[2].
This paper is organized as follows. In section 2, we review previous work. In
section 3, we describe an experiment on individual splog judgement. In section 4, we
describe the user-oriented splog filtering method. In section 5, we describe
evaluation results of the proposed method. In section 6, we discuss the evaluation
results. In section 7, we summarize the proposed method and the evaluation results,
and describe future work.
2 Previous Work
There are several related works on splog filtering. Kolari et al.[1] analyzed the
splogosphere and proposed a splog filtering method using SVM. They proposed three
feature sets for machine learning: (1) ‘words’, (2) ‘anchor text’, and (3) ‘url’
appearing in a blog article. Their method detected splogs with an F-measure of about
90%. Regarding Japanese splogs, Ishida analyzed the Japanese splogosphere[3]. He
proposed a splog filtering method that uses link structure analysis. His method
detects splogs with an F-measure of 80%. These works provide a single common filter
for all users and do not consider user adaptation.
Regarding e-mail and web applications, several works consider user adaptation.
Junejo et al. proposed a user-oriented spam filter for e-mail[4]. Because a user
receives a large number of spam e-mails, filtering spam on the user side is not easy.
They proposed a server-side spam mail filter that detects spam for each user. The
point of this method is that the filter does not require much computational cost on
the user side. The work of Jeh et al.[5] concerns web spam. They proposed a
personalized web spam filter based on the PageRank algorithm. This method, however,
needs the whole link structure among web pages, which makes user adaptation costly.
We need a simple method that does not require much cost for user adaptation.
Therefore, we propose a user-oriented splog filtering method that can adapt to each
user’s preference without requiring much cost for user adaptation.
3 Experiment of Individual Splog Judgement
We conducted an experiment to understand how individual persons judge splogs. We
asked 50 subjects to judge whether each of 50 blog articles is a splog or an authentic
article. For the test data (blog articles), we prepared ‘gray’ articles that are on
the borderline between splogs and authentic articles. We also asked subjects to judge
whether each blog article is valuable or not. We describe an overview of the
experiment and its results.
3.1 Overview
50 subjects (25 men, 25 women) took part in this experiment. Their ages range from
21 to 55 years. Their occupations are mainly ‘engineers of information technology’
and ‘general office worker’. For the test data, we prepared 50 blog articles. The
dataset consists of (1) ‘40 common articles’ that are common test articles for all
subjects, and (2) ‘10 individual articles’ that are chosen by each subject. For the
latter, we asked each subject to choose 10 blog articles: (1) the five articles that
they find the most interesting, and (2) the five articles that they find the most
boring.
For evaluation, we adopt two axes for splog judgement: (1) the spam-axis and (2) the
value-axis. The spam-axis indicates the degree of spam. The value-axis indicates the
degree of value of a blog article for a subject. Both axes consist of four values.
The choices for the spam-axis are ‘1:not splog’, ‘2:not splog(weak)’,
‘4:splog(weak)’, and ‘5:splog’. The choices for the value-axis are ‘1:not valuable’,
‘2:not valuable(weak)’, ‘4:valuable(weak)’, and ‘5:valuable’.
3.2 Results of Experiment
Figure 1 shows the result of individual judgement for the ‘40 common articles’. The
total number of judgements is 2,000 (50 subjects × 40 articles = 2,000 judgements).
There are three axes in Figure 1: the x-axis is the spam-axis, the y-axis is the
value-axis, and the z-axis is the number of judgements (judge count). In Figure 1, a
peak of 678 judgements appears at the intersection of spam=5 and value=1. These
judgements indicate that there are articles that most subjects regard as unnecessary
and valueless. On the other hand, the circled area on the right of Figure 1 indicates
the existence of gray blogs, which subjects judged as splogs but also judged as
valuable.
Fig. 1. The result of individual judgement for ‘40 common articles’
Fig. 2. The result of individual judgement for ‘10 individual articles’
Figure 2 shows the result of individual judgement for the ‘10 individual articles’.
The total number of judgements is 500 (50 subjects × 10 articles = 500 judgements).
The axes in Figure 2 are the same as in Figure 1. From Figure 2, we found that spam
judgements are few, because each subject chose the most interesting articles. From
these results, we found that user adaptation is needed for splog filtering.
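The judge counts plotted in Figures 1 and 2 are tallies over (spam, value) pairs. A
minimal sketch of such a tally is given below. The function names are hypothetical, and
the rule used for ‘gray’ judgements (spam of 4 or 5 together with value of 4 or 5) is
our reading of “judged as splogs, but judged as valuable”, stated here as an assumption.

```python
from collections import Counter

SCALE = (1, 2, 4, 5)  # the four choices on each axis


def tally(judgements):
    """Count judgements per (spam, value) cell.

    judgements: iterable of (spam, value) pairs, one per subject and article.
    """
    counts = Counter()
    for spam, value in judgements:
        if spam not in SCALE or value not in SCALE:
            raise ValueError(f"invalid judgement: {(spam, value)}")
        counts[(spam, value)] += 1
    return counts


def gray_judgements(counts):
    """Judgements that rate an article as a splog and as valuable (assumed rule)."""
    return sum(n for (spam, value), n in counts.items()
               if spam >= 4 and value >= 4)
```

With the data of section 3.2, the cell (5, 1) of such a tally would hold the peak of 678
judgements reported for the ‘40 common articles’.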
4 User-Oriented Splog Filtering Method
The proposed method accepts individual splog judgement data and feedback from a user,
and provides a personalized splog filter for each user. Figure 3 shows an overview of
the user-oriented splog filtering method. The figure shows the relation between a
user and the user-oriented splog filtering system that provides a personalized filter
for this user. At the beginning, a user provides his or her splog judgement data to
the system. This data is used to create an initial user model for that user. The
system creates the user model by learning from this judgement data. We use LibSVM
(version 2.88; see footnote 1) as the machine learning module in this system.
The system provides a user with an estimated judgement of a blog article while he or
she browses that article. A user can send feedback to the system to update his or her
user model. The system accepts feedback data that consists of a URL and the judgement
of that user. When the system accepts feedback from a user, it collects the HTML file
of the URL from the Web and extracts the features that are used in the machine
learning. Because we consider that there is a suitable feature set for each user, the
system chooses the best feature set for each user. We describe the details of the
feature sets and their evaluation results in section 5.
Fig. 3. The concept of the user-oriented splog filtering method
1 http://www.csie.ntu.edu.tw/~cjlin/libsvm/
5 Evaluation for User-Oriented Splog Filtering
In this section, we describe evaluation results of the proposed method. We prepared
three types of features as evaluation data: (1) ‘Kolari’s features’, (2) ‘Light-weight
features’, and (3) ‘Mixed features’. We compared the performance of personalized
filters among these feature sets. In addition, we conducted another evaluation in
which the best feature set is chosen for each user. As an evaluation metric, we used
the F-measure[6] defined in the following equation.
\[ F\text{-measure} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{1} \]
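Equation (1) can be computed from true and predicted labels as in the following
sketch. The label names and the convention of returning 0 when precision and recall
are both 0 are assumptions made for illustration.

```python
def f_measure(true_labels, predicted_labels, positive="splog"):
    """F-measure of Equation (1) for the class given by `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0  # assumed convention when nothing is detected correctly
    return 2 * precision * recall / (precision + recall)
```

Under this convention, a filter that never predicts ‘splog’ for a user obtains an
F-measure of 0.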
We evaluated the performance of each filter based on five-fold cross validation.
We used several kernel functions including linear kernel, polynomial kernel, RBF
(radial basis function) kernel, and sigmoid kernel. As kernel parameters, we used
default values of LibSVM for each kernel.
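For reference, the four kernel functions can be written as in the sketch below. The
forms follow the usual definitions of these kernels. The default values shown
(degree 3, coef0 0) and the choice gamma = 1 / (number of features) reflect our
understanding of the LibSVM documentation and should be treated as assumptions here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear(u, v):
    return dot(u, v)


def polynomial(u, v, gamma, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf(u, v, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def sigmoid(u, v, gamma, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)
```

Since gamma depends on the number of features, the same kernel behaves differently on
a 12-dimensional feature set than on one with thousands of dimensions.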
5.1 Dataset
As the dataset, we used the individual judgement data described in section 3. We
use 50 Japanese blog articles.
Table 1. Feature list for Kolari’s features[8]
Feature group      Name of feature  Dimension  Value type
Kolari’s features  Bag-of-words     9,014      tf*idf score
                   Bag-of-anchors   4,014      binary
                   Bag-of-urls      3,091      binary
5.2 Features for Machine Learning
We used the following sets of features: (1) ‘Kolari’s features’[1] described in the
previous work, (2) ‘Light-weight features’[7] that we propose, and (3) ‘Mixed
features’ that combine ‘Kolari’s features’ and ‘Light-weight features’.
Kolari’s features. Table 1 shows the list of features described in the previous work.
We use three types of Kolari’s features: ‘bag-of-words’, ‘bag-of-anchors’, and
‘bag-of-urls’. In our experiment, the ‘bag-of-words’ consists of morpheme words
extracted with a Japanese morphological analysis tool called Sen (see footnote 2).
This feature has 9,014 dimensions, and we use the tf*idf[8] values of the morpheme
words to create its feature vector. The ‘bag-of-anchors’ contains morpheme words
extracted from anchor text enclosed in <A> tags in the HTML. This feature has 4,014
dimensions, and the elements of its vector are binary (1 or 0). The ‘bag-of-urls’
contains the parts of URL text obtained by splitting all URLs appearing in a blog
article at ‘. (dot)’ and ‘/ (slash)’. This feature has 3,091 dimensions, and the
elements of its feature vector are tf*idf values. These features are prepared
faithfully following the method in the previous work[1].
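The URL splitting and the tf*idf weighting can be illustrated with the sketch below.
The function names are hypothetical, the example URLs in the usage are invented, and
the specific weighting tf × log(N / df) is an assumed variant, since the exact tf*idf
formula is not given here.

```python
import math
import re
from collections import Counter


def url_tokens(url):
    """Split a URL at '.' and '/' and drop empty pieces."""
    return [piece for piece in re.split(r"[./]", url) if piece]


def tfidf_vectors(documents):
    """documents: one token list per blog article.

    Returns (vocabulary, vectors) with weight tf * log(N / df);
    this particular tf*idf variant is an assumption.
    """
    n_docs = len(documents)
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    vocabulary = sorted(doc_freq)
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append([tf[tok] * math.log(n_docs / doc_freq[tok])
                        for tok in vocabulary])
    return vocabulary, vectors
```

For example, `url_tokens("http://a.example.com/x")` yields the pieces `http:`, `a`,
`example`, `com`, and `x`. Pieces that occur in every article receive weight 0 under
this weighting, so only distinguishing URL parts contribute to the vector.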
Light-weight features. We propose ‘Light-weight features’ that consist of several
simple features appearing in an HTML file. Table 2 shows the list of features.
There are 12 features in this feature set. These features have far fewer dimensions
than Kolari’s features.
We explain each feature. ‘Number of keywords’ is the number of morphemes, restricted
to nouns, extracted from the body part of a blog article. ‘Number of periods’ and
‘number of commas’ are the frequencies of the period and comma characters in a blog
article. ‘Number of characters’ is the length of the character string of a blog
article including HTML tags. ‘Number of characters without HTML tags’ is the length
of the character string of a blog article from which HTML tags are removed. ‘Number
of br tags’ is the number of <BR> tags in the HTML. ‘Number of in-links’ is the
number of links that connect to the same host (e.g., links to comment pages and
archive pages of the same domain are included). ‘Number of out-links’ is the number
of links to external domains. ‘Number of images’ is the number of images contained in
a blog article. ‘Average height of all images’ is the average height of the images
contained in a blog article. ‘Average width of all images’ is the average width of
the images contained in a blog article.
2 https://sen.dev.java.net
Table 2. The list of features defined in the Light-weight features
No.  Name of feature
1    Number of keywords
2    Number of periods
3    Number of commas
4    Number of characters
5    Number of characters without HTML tags
6    Number of br tags
7    Number of in-links
8    Number of out-links
9    Number of images
10   Average height of all images
11   Average width of all images
12   Number of affiliate IDs
Table 3. Average values of F-measure using each feature
Feature set            Linear  Polynomial  RBF    Sigmoid
Bag-of-words           0.608   0.592       0.533  0.522
Bag-of-anchors         0.603   0.615       0.519  0.533
Bag-of-urls            0.655   0.702       0.530  0.522
Light-weight features  0.573   0.601       0.583  0.548
Mixed features         0.615   0.590       0.526  0.515
‘Number of affiliate IDs[9]’ is the number of IDs extracted from affiliate links in
a blog article.
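Several of the Light-weight features can be extracted with a standard HTML parser, as
in the sketch below. It covers only the features that need no morphological analysis
or affiliate-ID rules. The class and function names are hypothetical, relative links
are assumed to count as in-links, and image sizes are read only from numeric `height`
and `width` attributes.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LightWeightParser(HTMLParser):
    """Collects a few of the Light-weight features from one HTML page."""

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.br_tags = self.in_links = self.out_links = self.images = 0
        self.text_chars = 0
        self.heights, self.widths = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "br":
            self.br_tags += 1
        elif tag == "a" and attrs.get("href"):
            host = urlparse(attrs["href"]).netloc
            # Assumption: relative links count as links to the same host.
            if host in ("", self.page_host):
                self.in_links += 1
            else:
                self.out_links += 1
        elif tag == "img":
            self.images += 1
            for name, sizes in (("height", self.heights), ("width", self.widths)):
                value = attrs.get(name) or ""
                if value.isdigit():
                    sizes.append(int(value))

    def handle_data(self, data):
        self.text_chars += len(data)


def mean(values):
    return sum(values) / len(values) if values else 0.0


def light_weight_features(html, page_host):
    parser = LightWeightParser(page_host)
    parser.feed(html)
    parser.close()
    return {
        "characters": len(html),
        "characters_without_tags": parser.text_chars,
        "br_tags": parser.br_tags,
        "in_links": parser.in_links,
        "out_links": parser.out_links,
        "images": parser.images,
        "average_image_height": mean(parser.heights),
        "average_image_width": mean(parser.widths),
    }
```

A single pass over the HTML suffices for all of these counts, which is consistent with
the low cost that motivates this feature set.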
Mixed features. ‘Mixed features’ combine ‘Kolari’s features’ and ‘Light-weight
features’. The number of dimensions is 16,131 (16,119 in Kolari’s features plus 12 in
Light-weight features).
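The dimension count can be checked with a few lines of arithmetic. The sketch assumes
that the mixed vector is a plain concatenation of the individual feature vectors, which
is not stated explicitly here.

```python
KOLARI_DIMENSIONS = {
    "bag-of-words": 9014,
    "bag-of-anchors": 4014,
    "bag-of-urls": 3091,
}
LIGHT_WEIGHT_DIMENSIONS = 12


def concatenate(*vectors):
    """Assumed construction of a mixed vector: plain concatenation."""
    mixed = []
    for vector in vectors:
        mixed.extend(vector)
    return mixed


kolari_total = sum(KOLARI_DIMENSIONS.values())
mixed_total = kolari_total + LIGHT_WEIGHT_DIMENSIONS
```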
5.3 Results
Results of Kolari’s features. Table 3 shows the average values of F-measure using
‘Kolari’s features’ for each kernel function. The best score (F-measure 0.702)
appears at the intersection of the ‘bag-of-urls’ row and the polynomial kernel column.
Figure 4 shows the F-measure for each user using ‘bag-of-urls’ and the polynomial
kernel. Figure 5 shows the F-measure for each user using ‘bag-of-words’ and the linear
kernel. Figure 6 shows the F-measure for each user using ‘bag-of-anchors’ and the
polynomial kernel. In Figures 4, 5, and 6, the y-axis is the F-measure and the x-axis
is the subject ID. Subject IDs are sorted in descending order of the F-measure values
of Figure 4. Subject ID 46 shows the best F-measure, 0.947, in Figure 4. From this
result, we found that the pair of ‘bag-of-urls’ and the polynomial kernel performs
well in personalized splog filtering.
Fig. 4. F-measure for each subject using the ‘bag-of-urls’ and the polynomial kernel
Fig. 5. F-measure for each subject using the ‘bag-of-words’ and the linear kernel
Performance of Light-weight features. Table 3 shows the average values of F-measure
using ‘Light-weight features’. In Table 3, the F-measure of 0.601 with the polynomial
kernel is the best for this feature set. Figure 7 shows each user’s F-measure using
this feature set and the polynomial kernel. Figure 7 shows a result similar to the
results of Kolari’s features. The best F-measure, 0.933, appears at subject ID 46 in
Figure 7.
Performance of Mixed features. ‘Mixed features’ combine ‘Kolari’s features’ and
‘Light-weight features’. Table 3 shows the average values of F-measure using this
feature set. In Table 3, the F-measure of 0.615 with the linear kernel is the best
score. Figure 8 shows each user’s F-measure using this feature set and the linear
kernel. The maximum F-measure, 0.933, appears at subject ID 46 in Figure 8.
Fig. 6. F-measure for each subject using the ‘bag-of-anchors’ and the polynomial kernel
Fig. 7. F-measure for each subject using the ‘Light-weight’ features and the polynomial
kernel
5.4 Analysis of the Best Feature Set for Each User
We consider, and indeed found, that there is a best feature set for each user. The
candidates for the best feature set are ‘1. bag-of-words’, ‘2. bag-of-anchors’,
‘3. bag-of-urls’, ‘4. Light-weight features’, and ‘5. Mixed features’. To find the
best feature set for a user, we choose the one with the highest F-measure among these
feature sets. When several feature sets have the same F-measure, the best one is
chosen based on Table 4, which ranks the feature sets by their number of dimensions.
The rank column in Table 4 gives the priority in choosing the best feature set: the
smaller the value, the higher the priority.
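The selection rule can be sketched as follows. The function name is hypothetical, and
ties are broken here directly by the number of dimensions (fewer first), which is the
basis on which the ranks are said to be calculated. The dimension of ‘Mixed features’
is taken as the sum of the other four.

```python
DIMENSIONS = {
    "bag-of-words": 9014,
    "bag-of-anchors": 4014,
    "bag-of-urls": 3091,
    "Light-weight features": 12,
    "Mixed features": 9014 + 4014 + 3091 + 12,
}


def best_feature_set(f_measures):
    """Pick the feature set with the highest F-measure.

    f_measures: mapping from feature set name to a user's F-measure.
    Ties are broken in favour of the feature set with fewer dimensions.
    """
    return min(f_measures, key=lambda name: (-f_measures[name], DIMENSIONS[name]))
```

Preferring fewer dimensions on ties favours the cheaper filter whenever two feature
sets perform equally well for a user.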
Fig. 8. F-measure for each subject using the ‘Mixed features’ and the linear kernel
Table 4. Rank of features based on feature dimensions
Feature name              Dimension  Rank
4. Light-weight features  12         1
3. Bag-of-urls            3,091      2
2. Bag-of-anchors         4,014      4
1. Bag-of-words           9,014      3
5. Mixed features         16,143     5
We chose the best feature set for each subject based on Table 4. The result is shown
in Table 5, which gives the best feature set and the best F-measure for each subject.
The best F-measure is 0.947 for subject ID 47, whose best feature set is
‘bag-of-urls’. The worst F-measure is 0.316 for subject ID 38, whose best feature set
is also ‘bag-of-urls’. In Table 5, no subject has an F-measure of 0. Although several
subjects have an F-measure of 0 in Figures 4 to 8, no subject has a value of 0 once
the best feature set is chosen for each subject.
Table 5. Results of the best pair of features and kernel, and its F-measure value
Subject ID  1      2      3      4      5      6      7      8      9      10     11     12     13
Feature ID  4      4      3      4      3      3      4      1      3      1      3      4      4
F-measure   0.848  0.679  0.455  0.772  0.933  0.841  0.831  0.848  0.588  0.839  0.824  0.780  0.691

Subject ID  14     15     16     17     18     19     20     21     22     23     24     25     26
Feature ID  4      4      3      4      4      4      3      4      4      3      4      5      3
F-measure   0.720  0.833  0.667  0.793  0.653  0.800  0.754  0.847  0.741  0.667  0.857  0.593  0.730

Subject ID  27     28     29     30     31     32     33     34     35     36     37     38     39
Feature ID  4      3      2      3      3      4      2      3      3      3      4      3      3
F-measure   0.692  0.560  0.714  0.571  0.632  0.667  0.410  0.381  0.857  0.63   0.813  0.316  0.904

Subject ID  40     41     42     43     44     45     46     47     48     49     50
Feature ID  3      4      3      1      N/A    3      3      3      3      3      4
F-measure   0.841  0.831  0.571  0.625  N/A    0.933  0.750  0.947  0.904  0.604  0.814
Table 6. Total number of feature ID
Feature name              Frequency
3. Bag-of-urls            24
4. Light-weight features  19
1. Bag-of-words           3
2. Bag-of-anchors         2
5. Mixed features         1
We counted the feature IDs in Table 5, and Table 6 shows the frequency of each feature
set. The most frequent feature set is ‘3. bag-of-urls’, which occurs 24 times.
‘4. Light-weight features’ appears 19 times, and our feature set occupies about 25% in
all subjects.
From these results, we found that there is a best feature set for each user.
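The frequencies in Table 6 can be reproduced by counting the feature IDs listed in
Table 5. In the sketch below, the list is transcribed from Table 5 in subject order,
with `None` standing for the N/A entry of subject ID 44. The variable names are
hypothetical.

```python
from collections import Counter

# Feature IDs of Table 5 for subject IDs 1 to 50 (None = N/A for subject 44).
FEATURE_IDS = [
    4, 4, 3, 4, 3, 3, 4, 1, 3, 1, 3, 4, 4,     # subjects 1-13
    4, 4, 3, 4, 4, 4, 3, 4, 4, 3, 4, 5, 3,     # subjects 14-26
    4, 3, 2, 3, 3, 4, 2, 3, 3, 3, 4, 3, 3,     # subjects 27-39
    3, 4, 3, 1, None, 3, 3, 3, 3, 3, 4,        # subjects 40-50
]

frequency = Counter(fid for fid in FEATURE_IDS if fid is not None)
ranking = frequency.most_common()
```

Sorting by frequency gives the same order of feature sets as Table 6.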
6 Discussion
We evaluated the performance of the user-oriented filtering method by comparing
combinations of several feature sets. From this experiment, we found (1) the effect of
‘Kolari’s features’ on personalized splog filters, and (2) that ‘Light-weight
features’ are effective in user-oriented splog filtering.
6.1 The Effect of ‘Kolari’s Features’
First, we consider the filter performance using ‘Kolari’s features’. In the results
of ‘Kolari’s features’, some subjects succeed in splog detection at over 90% in
Figures 4 to 6. On the other hand, there are subjects whose performance is not good
(the subjects enclosed with the circle in Figure 6). These subjects show very low
F-measure values when we use a single common kernel, but when we choose an appropriate
feature set for each user, their F-measure improves (see footnote 3).
6.2 The Effect of ‘Light-Weight Features’
Second, we consider the filter performance using the ‘Light-weight features’. Table 3
shows that their results are similar to those of ‘Kolari’s features’ and ‘Mixed
features’. The point is that ‘Light-weight features’ have far fewer dimensions than
‘Kolari’s features’ and ‘Mixed features’. We found that increasing the number of
dimensions does not improve F-measure values, and that a feature set with fewer
dimensions is sufficient. Therefore, we consider ‘Light-weight features’ to be more
practical than ‘Kolari’s features’ and ‘Mixed features’. We will evaluate the method
using a larger dataset.
3 For example, in Figure 6, the F-measures of the subject enclosed with the circle
(subject ID 27) for each kernel are 5.00 with the linear kernel, 0 with the polynomial
kernel, 0.604 with the RBF kernel, and 0.642 with the sigmoid kernel.
7 Conclusion
In this paper, we described a user-oriented splog filtering method that provides an
appropriate personalized filter for each user. We conducted two experiments: (1) an
experiment on individual splog judgement, and (2) an evaluation experiment on
personalized splog filters. We collected individual splog judgement data from an
experiment in which 50 subjects took part. We found that Light-weight features show an
effect equal to or greater than that of Kolari’s features. We also found that there is
a best feature set for each user, which indicates that our method is effective.
In future work, we will try to select features that improve the F-measure for each
user.
References
1. Kolari, P., Java, A., Finin, T., Oates, T., Joshi, A.: Detecting spam blogs: A machine
learning approach. In: Proceedings of the 21st National Conference on Association
for Advancement of Artificial Intelligence (AAAI 2006), pp. 1351–1356 (2006)
2. Drucker, H., Wu, D., Vapnik, V.: Support vector machines for spam categorization.
IEEE Transactions on Neural Networks, 1048–1054 (1999)
3. Ishida, K.: Extracting spam blogs with co-citation clusters. In: Proceedings of the
17th International Conference on World Wide Web (WWW 2008), pp. 1043–1044
(2008)
4. Junejo, K.N., Karim, A.: PSSF: A novel statistical approach for personalized service-side spam filtering. In: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI 2007), pp. 228–234 (2007)
5. Jeh, G., Widom, J.: Scaling personalized web search. In: Proceedings of the 12th
International Conference on World Wide Web (WWW 2003), pp. 271–279 (2003)
6. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)
7. Yoshinaka, T., Fukuhara, T., Masuda, H., Nakagawa, H.: A user-oriented splog
filtering based on machine learning method (in Japanese). In: Proceedings of the
23rd Annual Conference of the Japanese Society for Artificial Intelligence (JSAI
2009), vol. 2B2-4 (2009)
8. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)
9. Wang, Y.M., Ma, M., Niu, Y., Chen, H.: Spam double-funnel: connecting web spammers with advertisers. In: Proceedings of the 16th International Conference on World
Wide Web (WWW 2007), pp. 291–300 (2007)
Generating Researcher Networks with Identified
Persons on a Semantic Service Platform
Hanmin Jung, Mikyoung Lee, Pyung Kim, and Seungwoo Lee
KISTI, 52-11 Eueon-dong, Yuseong-gu, Daejeon, Korea 305-806
[email protected]
Abstract. This paper describes a Semantic Web-based method for acquiring researcher networks by means of an identification scheme, an ontology, and reasoning.
Three steps are required: resolving co-references, finding experts,
and generating researcher networks. We adopt OntoFrame as the underlying semantic service platform and apply reasoning to create direct relations between
far-off classes in the ontology schema. As a test set, 453,124 Elsevier journal articles with metadata and full-text documents in the information technology and biomedical domains
have been loaded and served on the platform.
Keywords: Semantic Service Platform, OntoFrame, Ontology, Researcher
Network, Identity Resolution.
1 Introduction
A researcher network, a social network among researchers based mainly on co-authorship and citation relationships, helps users discover research trends and the
behavior of its members. It can also help identify key researchers in a researcher
group, and further make it easier to find an appropriate contact point for
collaboration.
Several researcher network services are currently available on the Web. BiomedExperts
(http://www.biomedexperts.com) shows co-publication between researchers and
the researchers related to a selected one in the biomedical domain [1]. It also
provides researcher metadata and exploratory sessions on the network. Authoratory
(http://authoratory.com) is another service focused on co-authorship and article
details. ResearchGate (http://www.researchgate.net) provides additional
functions for grouping researchers by contacts, publications, and groups. Metadata
of a researcher is also offered on every node [2]. Microsoft’s network
(http://academic.research.microsoft.com) emphasizes attractive visualization as well
as detailed co-authorship information. However, none of them is built on a semantic
service platform that can support precise information backed by an identification
system, an ontology, and reasoning. As they are typical database applications based on
data mining technologies, achieving flexible and precise services, both in connecting
knowledge and in planning services, would be a serious challenge for them.
In order to surpass the qualitative limit of existing researcher network services, we
address three major issues in this paper: resolving co-references to assure a
precise service level, finding experts for a topic, and generating researcher networks triggered by them. The following sections explain how two types of researcher networks
can eventually be acquired from articles. As the first step, we gathered Elsevier
journal articles as the service target, since it is easy to recognize the subsets in the information
technology and biomedical domains, and all of them have metadata and full-text documents, which facilitates applying text mining technology to extract topics that
serve as the crucial connection point between users and the service.
2 Resolving Co-references on a Semantic Service Platform
OntoFrame is a semantic service platform for easily realizing semantic services regardless of application domain [3]. It is composed of a semantic knowledge manager
called OntoURI, a reasoning engine called OntoReasoner, and a commercial
search engine. Semantic services based on OntoFrame interact with the two engines
using an XML protocol and Web services.
Fig. 1. OntoFrame architecture
The manager transforms metadata gathered from legacy databases into semantic
knowledge in the form of RDF triples, referring to the ontology schema (see footnote 1) designed by
ontology engineers. The manager then propagates the knowledge to the two engines.
The reasoning engine intervenes in the process of loading into the repository to generate
induced triples inferred by user-defined inference rules. The coverage of the engine
can be roughly described as RDF++, as it supports the entailments of full RDFS (RDF
Schema) and some OWL vocabulary such as ‘owl:inverseOf’ and ‘owl:sameAs’.
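As an illustration of what generating induced triples means for the two OWL terms named above, the following sketch materializes the triples entailed by inverse properties and by the symmetry of ‘owl:sameAs’. It is a toy stand-in under stated assumptions, not OntoReasoner: triples are plain Python tuples, the property name ‘ex:hasCreated’ is hypothetical, and the transitivity of ‘owl:sameAs’ and all RDFS entailments are left out.

```python
def materialize(triples, inverse_of):
    """Return the input triples plus those entailed by inverse properties
    and by the symmetry of owl:sameAs.

    triples is a set of (subject, predicate, object) tuples.
    inverse_of maps a property to its declared inverse property.
    """
    inferred = set(triples)
    # An inverse declaration holds in both directions.
    inverses = dict(inverse_of)
    inverses.update({q: p for p, q in inverse_of.items()})
    for s, p, o in triples:
        if p in inverses:
            inferred.add((o, inverses[p], s))
        if p == "owl:sameAs":
            inferred.add((o, p, s))
    return inferred


# Hypothetical data; 'ex:hasCreated' is an assumed inverse property name.
triples = {
    ("art1", "aca:createdByPerson", "per1"),
    ("per1", "owl:sameAs", "per9"),
}
for triple in sorted(materialize(triples, {"aca:createdByPerson": "ex:hasCreated"})):
    print(triple)
```

Materializing such triples at loading time trades repository size for simpler queries, since a query can then follow a relation in either direction without an extra rule.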
1 Currently, the ontology schema is composed of 16 classes and 89 properties using RDF, RDFS, and OWL Lite.
Ontology individuals should be clearly identified, so the manager has the additional
role of resolving co-references between individuals as well as converting DB to
OWL. The whole process performed in the manager can be viewed as the syntactic-to-semantic process shown in Fig. 2. Annotation for generating metadata can be placed
before the DB-to-OWL conversion.
The process has two independent sub-processes distinguished by time: modeling
time and indexing time. The former includes ontology schema design and rule editing, and the latter concerns identity resolution and RDF triple generation.
Fig. 2. OntoURI process for generating semantic knowledge
OntoURI manages and applies several kinds of rules, such as URI generation, DB-to-OWL
mapping, and identity resolution [4]. For example, it assigns a different weight to
each clue for resolving ambiguous authors, as shown in Table 1. ‘Name’ is a pivot that
initiates the resolution; that is, identity resolution rules are triggered when
two authors in different articles share the same name. The ‘E-mail’
weight is the highest because it is very rare for different authors to share the same
e-mail address. The property ‘hasTopic’ is a threshold feature because it is not a binary
feature that can clearly determine whether two authors are the same person.
Table 1. Rules for resolving co-references between author individuals
Columns: Class | Resource | Kind | Match | Relation | Source | Weight
Class: Person (6 entries)
Resource: Name, hasInstitution, E-mail, hasCoauthor, hasTopic
Kind: Order, Pivot, Feature, Feature, Feature, Threshold
Match: Exact, Exact, Number, Number
Relation: Single, Single, Single, Multiple
Source: OntoURI, OntoURI, OntoReasoner
Weight: 1, 2, 4, 1, 0.8
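The weighted resolution described before Table 1 can be sketched as follows. This is a minimal sketch under loudly stated assumptions: the text confirms only that ‘Name’ is the pivot, that ‘E-mail’ carries the highest weight, and that ‘hasTopic’ is a threshold feature. The additive scoring, the remaining weights, the merge score, and the use of a Jaccard overlap for topics are all hypothetical choices made for illustration.

```python
# Illustrative weights. Only "e-mail carries the highest weight" comes from the
# text; the remaining numbers and the decision rule are assumptions.
WEIGHTS = {"email": 4.0, "institution": 2.0, "coauthor": 1.0, "topic": 1.0}
TOPIC_THRESHOLD = 0.8  # assumed reading of the 'Threshold' kind for hasTopic
MERGE_SCORE = 3.0      # hypothetical score needed to treat two authors as one


def overlap(a, b):
    """Jaccard overlap of two collections, 0.0 when both are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def same_person(x, y):
    """Decide whether two author records refer to the same person."""
    # 'Name' is the pivot: without an exact name match no rule is triggered.
    if x["name"] != y["name"]:
        return False
    score = 0.0
    if x.get("email") and x.get("email") == y.get("email"):
        score += WEIGHTS["email"]
    if x.get("institution") and x.get("institution") == y.get("institution"):
        score += WEIGHTS["institution"]
    # 'Number' match: each shared co-author adds to the score.
    shared = set(x.get("coauthors", ())) & set(y.get("coauthors", ()))
    score += WEIGHTS["coauthor"] * len(shared)
    # Threshold feature: counts only when the topic overlap is high enough.
    if overlap(x.get("topics", ()), y.get("topics", ())) >= TOPIC_THRESHOLD:
        score += WEIGHTS["topic"]
    return score >= MERGE_SCORE
```

With these illustrative numbers a shared e-mail address alone is decisive, while a shared institution needs at least one supporting clue, which mirrors the reasoning given for the e-mail weight.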
Fig. 3 shows a result of the resolution for author individuals named ‘Jinde Cao’.
Authority data (Table 2 shows an example) is also applied to normalize individual
names at the surface level.
After resolving co-references between individuals acquired from 453,124 Elsevier
journal articles with metadata and full-text documents in the information technology and
biomedical domains, the following identified individuals were loaded into the repository in the form of
RDF triples. The total number of triples in the repository is
283,087,518. We left the identified persons as they are, without actively trying to merge them
further, because two different identifiers can always be connected dynamically with a
‘sameAs’ relation.
1,352,220 persons
339,947 refined topics
91,514 authorized institutions
409,575 locations with GPS coordinates
Table 2. Example of authority data
Normalized form | Variant form | Kind | Class
IBM | International Business Machines Corporation | Abbreviation | Institution
Microsoft | MS | Abbreviation | Institution
Microsoft | 마이크로소프트 | Korean | Institution
London | 런던 | Korean | Location
Academic Inc. | Academic Press Inc, LTD | Alternative | Publication
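Surface-level normalization with authority data can be sketched as a simple lookup. The mapping below reads the entries of Table 2 as pairs of a variant form and its normalized form; the dictionary layout and the function name are assumptions made for illustration and say nothing about how OntoURI stores authority data.

```python
# Variant form -> normalized form, read from the example entries of Table 2.
AUTHORITY = {
    "International Business Machines Corporation": "IBM",
    "MS": "Microsoft",
    "마이크로소프트": "Microsoft",
    "런던": "London",
    "Academic Press Inc, LTD": "Academic Inc.",
}


def normalize(name):
    """Map a surface form to its normalized form, or keep it if unknown."""
    cleaned = name.strip()
    return AUTHORITY.get(cleaned, cleaned)


print(normalize("MS"))
print(normalize(" 런던 "))
```

Names without an authority entry pass through unchanged, so the lookup can be applied to every individual name before identity resolution.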
The OntoFrame service, including the researcher network, was designed as an academic research information service similar to Google Scholar. However, it controls individuals
with a URI-based (Uniform Resource Identifier) identification scheme and is built on a Semantic Web service platform; that is, in contrast with other similar services, it can be empowered by both search and reasoning.
It provides several advanced services: topic
trends, which shows relevant topics along a timeline; domain experts, which recommends dominant
researchers for a topic; researcher group, which reveals collaboration behavior among researchers; researcher network, which traces co-author and citation relationships in a group;
and similar researchers, which finds those who study topics relevant to a given researcher.
3 Finding Experts
Expert finding is very useful when seeking consultants, collaborators, and speakers.
Semantic Web technology can be a competent solution for recognizing identified researchers exactly through the underlying identification scheme. Deep analysis of
full-text documents is also needed, since documents that are topically classified with high precision
help ensure that the right persons are recommended for a given topic. We therefore propose an expert-finding method based on identity resolution and full-text analysis.
Fig. 3. Example of identified authors (‘Jinde Cao’)
Extracting topics from documents is the most basic task for acquiring topic-centric
experts. Several extracted topics are assigned to each article. An indexer extracts
index terms from an input document and matches the terms against topics
in a topic index DB; the successfully matched terms are ranked by frequency, and the
top n (currently five) are assigned to the document. The following workflow
shows how experts for a given topic can be found [5].
1. Knowledge expansion through reasoning
Make direct relations between far-off classes in ontology schema for constructing shorter access path.
2. Querying and retrieving researchers
Call SPARQL query with a given topic.
Convert the query to corresponding SQL query.
Exploit backward-chaining path to retrieve the researchers classified into
the topic.
3. Post-processing
Group the retrieved researchers.
Rank them by names or the number of articles.
Make an XML document as a result of expert finding.
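The topic assignment that precedes the workflow and its post-processing step can be sketched in memory as follows. This is a simplified stand-in under stated assumptions: the platform itself retrieves researchers through SPARQL, SQL, and backward chaining, whereas here articles are plain dictionaries. The function names, the dictionary keys, and the tie-break by name are hypothetical.

```python
from collections import Counter


def assign_topics(index_terms, topic_index, n=5):
    """Keep the index terms found in the topic index, rank them by
    frequency, and return the top n as the topics of the document."""
    counts = Counter(term for term in index_terms if term in topic_index)
    return [topic for topic, _ in counts.most_common(n)]


def rank_experts(articles, topic):
    """Group the authors of articles classified into the topic and rank
    them by number of articles (ties broken by name)."""
    counts = Counter()
    for article in articles:
        if topic in article["topics"]:
            counts.update(set(article["authors"]))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical input data.
terms = ["neural network", "stability", "neural network", "delay",
         "stability", "neural network", "approach"]
topic_index = {"neural network", "stability", "delay"}
articles = [
    {"authors": ["A", "B"], "topics": ["neural network"]},
    {"authors": ["A", "C"], "topics": ["neural network"]},
    {"authors": ["B"], "topics": ["stability"]},
]
print(assign_topics(terms, topic_index, n=2))
print(rank_experts(articles, "neural network"))
```

Ranking by article count is one of the two orderings named in the workflow; ranking by name would only require changing the sort key.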
As our researcher network service is initiated by finding experts for a topic, the
service requires one or more persons as input regardless of network type. A topic must
also be provided when generating a topic-based network.
4 Generating Researcher Networks
We designed two kinds of researcher networks, distinguished by the constraint used to connect researchers in the network. The first type is the topic-constrained network and the second is the person-centric network.
A topic-constrained network connects researchers under a given
topic. This implies that all of the relationships between researchers share the
same topic. The following pseudo code and SPARQL query generate a topic-constrained network. The first step uses a SPARQL query to retrieve the co-author pairs who together wrote an article
classified into the given topic, identified by <topURI>.
The second step searches for a given researcher among the pairs. That is, two arguments, a
topic and a researcher, are needed to acquire a topic-constrained network. The last step
recursively traces the pairs acquired in the first step, using the co-authors of the
seed (i.e., the given researcher) as further seeds.
Fig. 4. A topic-constrained network for topic ‘neural network’ and researcher ‘Jinde Cao’
1. Get co-author pairs for a given topic
   SELECT DISTINCT ?person1 ?person2
   WHERE {
     ?article aca:yearOfAccomplishment ?year .
     FILTER(?year>=startYear && ?year<=endYear) .
     ?article aca:hasTopicOfArticle <topURI> .
     ?article aca:createdByPerson ?person1 .
     ?article aca:createdByPerson ?person2 .
     FILTER(?person1 < ?person2) .
   }
2. Select a target researcher in the pairs
3. Trace the pairs through the seed
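Steps 2 and 3 can be sketched as a traversal over the pairs returned by the query in step 1. The sketch below is an assumption about how the tracing could be done: it replaces the recursion with an equivalent breadth-first traversal, and the function name and the optional `max_depth` limit are hypothetical additions that are not part of the described method.

```python
from collections import defaultdict, deque


def topic_constrained_network(pairs, seed, max_depth=None):
    """Collect the co-author edges reachable from the seed researcher.

    pairs is an iterable of (person1, person2) rows from step 1.
    Returns a set of edges, each stored as a sorted tuple.
    """
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Step 2: the seed must occur in the pairs for the given topic.
    if seed not in neighbours:
        return set()
    # Step 3: every newly reached co-author becomes a further seed.
    edges = set()
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        person = queue.popleft()
        if max_depth is not None and depth[person] >= max_depth:
            continue
        for other in neighbours[person]:
            edges.add(tuple(sorted((person, other))))
            if other not in depth:
                depth[other] = depth[person] + 1
                queue.append(other)
    return edges


# Hypothetical pairs for one topic.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
print(sorted(topic_constrained_network(pairs, "A")))
```

Pairs that are not connected to the seed, such as ("D", "E") here, share the topic but stay outside the resulting network.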
A person-centric network connects researchers around a given researcher without considering the topics shared between them. It is useful for
understanding in detail the relationship between a given researcher and his or her colleagues.
The first step uses a SPARQL query to acquire the co-authors who wrote articles together with the given researcher, identified by <perURI>. The second step ranks them by the number of co-authorships. The ranked results are applied to the visualized network as the
distance from the central researcher.
1. Get co-authors of a target researcher
   SELECT ?per1 ?per2
   WHERE {
     ?article aca:yearOfAccomplishment ?year .
     FILTER(?year>=startYear && ?year<=endYear) .
     ?article aca:createdByPerson ?per1 .
     ?article aca:createdByPerson ?per2 .
     FILTER(?per1 < ?per2) .
     FILTER(?per1=<perURI> || ?per2=<perURI>) .
   }
2. Rank them with the number of co-authorship
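Step 2 can be sketched as a count over the rows returned by the query in step 1. Because that query does not use DISTINCT, we assume that it yields one row per co-authored article, so the number of rows per co-author equals the number of co-authorships. The function name and the tie-break by name are hypothetical.

```python
from collections import Counter


def rank_coauthors(rows, target):
    """Rank the co-authors of the target by number of co-authorships.

    rows is an iterable of (per1, per2) results, one per shared article.
    """
    counts = Counter()
    for per1, per2 in rows:
        if per1 == target:
            counts[per2] += 1
        elif per2 == target:
            counts[per1] += 1
    # Most frequent co-authors first; ties are broken by name.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rows for target researcher "T".
rows = [("A", "T"), ("A", "T"), ("T", "Z"), ("B", "C")]
print(rank_coauthors(rows, "T"))
```

The position in this ranking can then be mapped to a distance from the central node, so that frequent co-authors are drawn closer to the target researcher.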
Fig. 5. A person-centric network for researcher ‘Jinde Cao’
5 Conclusion
This paper presented a solution to three major issues in researcher network services:
resolving co-references to assure a precise service level, finding experts for a topic,
and generating researcher networks triggered by them. By means of a semantic
service platform, two kinds of precise researcher networks were implemented with
ease. Resolution rules and authority data were applied to resolve co-references between ontology individuals. SPARQL queries and reasoning were utilized both in
finding experts and in generating the networks.
We plan to develop a pipelining system that assembles existing semantically operated services such as ‘finding experts’ and ‘generating person-centric network’ to
acquire a researcher network directly from a topic query without going through user interactions. This will make it easier for users to access researcher networks.
References
1. Whitaker, I., Shokrollahi, K.: BiomedExperts: Unlocking the Potential of the Internet to Advance Collaborative Research in Plastic and Reconstructive Surgery. J. Annals of Plastic
Surgery 63(2) (2009)
2. Pronovost, S., Lai, G.: Virtual Social Networking and Interoperability in the Canadian
Forces Netcentric Environment. Technical report CR 2009-090, Defence R&D Canada
(2009)
3. Sung, W., Jung, H., Kim, P., Kang, I., Lee, S., Lee, M., Park, D., Hahn, S.: A Semantic Portal for Researchers Using OntoFrame. In: 6th International Semantic Web Conference and
2nd Asian Semantic Web Conference (2007)
4. Kang, I., Na, S., Lee, S., Jung, H., Kim, P., Sung, W., Lee, J.: On Co-authorship for Author
Disambiguation. J. Information Processing & Management 45(1) (2009)
5. Jung, H., Lee, M., Kang, I., Lee, S., Sung, W.: Finding Topic-Centric Identified Experts
Based on Full Text Analysis. In: 2nd International ExpertFinder Workshop at ISWC 2007 +
ASWC 2007 (2007)
Towards Socially-Responsible Management of Personal
Information in Social Networks
Jean-Henry Morin
University of Geneva – CUI, Department of Information Systems,
Route de Drize 7, 1227 Carouge, Switzerland
[email protected]
Abstract. Considering the increasing amount of Personal Information (PI) used
and shared in our now common social networked interactions, privacy,
retention, and the way such information is managed have become important issues. Most
approaches rely on one-way disclaimers and policies that are often complex, hard to
find, and difficult for ordinary users of such networks to understand. This leaves little room for users to retain any control over how the
released information is actually used and managed once it has been put online.
Additionally, personal information (PI) may include digital artifacts and contributions for which people would legitimately like to retain some rights over their
use and their lifetime. Of particular interest in this category is the notion of the
“right to forget”, which we no longer control given the persistent nature of
the Internet and its ability to retain information forever. This paper examines
this issue from the point of view of the user and social responsibility, arguing
for the need to augment information with an additional set of metadata about its
usage and management. We discuss the use of DRM technologies in this context as a possible direction.
Keywords: Personal Information Management, social responsibility, social
networks, privacy, right to forget, DRM.
1 Introduction
Social networks and services have now penetrated most if not all of our activities
whether professional or personal. They range from social bookmarking, social tagging, slide sharing, micro blogging, photo and video sharing, friendships, colleagues,
etc. to more elaborate forms of interaction and collaboration through user centric
conversational threads such as Google Wave [1]. Most if not all of these Web 2.0
services have unclear if not unaddressed positions with respect to the management of
personal information and the corresponding social responsibility.
In a similar way as social responsibility has become an important issue in the corporate world (corporate social responsibility, CSR [2]), we argue, based on recent
evidence drawn from an increasing number of deceptive situations in social networks,
that there is an urgent need to address Personal Information management in a socially
responsible way to sustain social networks and services and consequently put the User
back in “control” of his digital fate.
While this lack of concern for users in IT services isn’t new, its
importance is growing with social networks. Each service has its own policies with
respect to how it manages users’ information, be it personal information or the content they share using such services. Often written in tiny characters (e.g., point 8
being the minimum according to marketing practitioners), such agreements are much
too verbose and complex for the vast majority of users. This puts users’ information
at risk without them even being aware of what they actually agreed to
when checking the “I agree to the terms and conditions” button or box upon signup.
Furthermore, once information is published and shared, users retain very limited or no control
whatsoever over it: how it will be used, for how long, etc. Consequently, the information becomes available forever, raising concern over the
notion of the “right to forget” or “forgetfulness” [3] and the associated data retention
and privacy policies. In what is often referred to as the panoptic society, following the description of Foucault [4], users are increasingly positioned as “prisoners” locked in a one-way panopticon service environment offering a bird’s-eye view to the service provider
on blindfolded users.
The key question in this context becomes, in addition to raising user awareness of
the issue, to study the conditions and requirements that should guide the design, implementation and use of social network services providing socially responsible management of personal information in the digital age.
This paper is structured as follows. In the next section we describe the problem and
the background. Section 3 describes a set of requirements that should be fulfilled to
address the problem and design future services in a socially and ethically responsible
electronic society. In section 4 we draw a parallel with Kim Cameron’s work on the
laws of identity establishing how these may also be applied in the context of Personal
Information (PI) and time bound social information. Finally section 5 discusses a
proposition of using DRM technologies in this context before concluding remarks in
section 6.
2 Background and Problem Statement
Let’s first consider some key factors as background behind the problem. Given the
global trend of democratization of Internet access and accessibility to broadband networks, more and more people are getting on-line. The cost of accessing the Internet,
even in mobile settings using smart phones has dropped significantly to a point where
it is becoming accessible to everyone even with mobile flat fee unlimited plans.
Recent evolution in social networking services has drawn an impressive number
of people to connect in on-line communities through services such as Facebook,
LinkedIn, Friendster, MySpace, etc. This further led to other social activities around
micro-blogging, bookmarking, pictures, presentations and videos, etc.
Smart phones, Netbooks and Tablets are gaining popularity and provide very interesting platforms for mobile location-based services in social networks. Business models surrounding these platforms increasingly subsidize the cost of the hardware by
billing the services rather than the devices. The revenue streams associated with service-based
billing models are far more stable and exhibit more value for service providers
in terms of customer retention. Also to be noted is the spectacular cost reduction of
storage and processing power, which, put into perspective with the increasing capabilities for searching and mining, raises new issues in our electronic society. Existing and
future searching and mining capabilities will become instrumental in finding
information at very low cost in very little time.
While all this may seem fine in terms of infrastructure, there is a growing concern
about managing our on-line identities and the information we increasingly share online. Information becomes accessible globally and indefinitely, ready for searching
and mining. Individuals can store information in quantities that nations themselves couldn’t have dreamt of in a not so distant past. Increasingly, people
are warned to be careful with the information they share and publish, based on
this idea of “publish once, publish forever”.
Assuming these trends are further confirmed in the future, a whole
range of issues opens up with respect to Personal Information Management. Most countries now
have strict laws addressing Personally Identifying Information (PII). These regulations impose specific practices on service providers for managing such information internally. However, what happens when users themselves share personal
information, not necessarily identifiable, on such sites is the focus of this paper. We argue for the need both to enhance awareness and, in particular, to address the issue of
technically allowing for the management of such information in terms of usage rights,
retention and scope of sharing.
Assuming there is a problem with how PI is managed in social networks and that
individuals should retain a right to control how their information is shared
and used, we contend there is a need for Managed Personal Information centered on
the user rather than on the service provider, in a social responsibility mindset. As a
result, our proposition is that PI should be augmented with an additional layer of
metadata governing its usage in a persistent way. This would allow users to retain control over
their information and manage the policies governing its use in the digital realm.
One example of such a generic policy might be a default expiry date for general information, whereby any information, unless otherwise defined as published with specific editorial requirements (e.g., scientific publications, cultural heritage, etc.), would
become obsolete after six months. Of course, users ultimately have control over such
policies, even to an extent allowing them to potentially “recall” published information
for any reason.
3 Requirements
In order to fulfill the above proposition, let’s review some of the requirements behind
this problem. We consider these requirements essentially from the viewpoint of the
user, since it is basically his information that is at stake, and of the underlying
enabling infrastructure.
3.1 User Consent, Awareness and Control
Releasing PI should require explicit consent from the user. He should be made aware
of the conditions under which his information is released as well as the extent to
which he will be able to retain any form of control over it.
Every piece of PI released should exhibit a default usage control policy, set either
explicitly by the user or by the service provider in the most conservative way possible
for the user.
Usage control policies may involve setting expiry dates on content, allowing PI to be recalled unilaterally, limiting the scope of sharing to specific people or groups, constraining PI usage to specific situations and conditions, etc. In summary, they may cover anything
deemed necessary by the PI owner, regardless of the infrastructure, service provider or
parties involved in the conversations.
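To make these requirements concrete, a usage control policy can be sketched as a small metadata record attached to a piece of PI. This is a minimal sketch under stated assumptions: the class name, its fields, and the `allows` check are hypothetical, the default lifetime approximates the six-month expiry given earlier as an example, and the conservative default is read here as "visible to the owner only".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, Optional


@dataclass
class UsagePolicy:
    """Hypothetical usage control metadata attached to one piece of PI."""
    owner: str
    published: datetime
    # Default expiry of roughly six months; None means no expiry.
    lifetime: Optional[timedelta] = timedelta(days=182)
    # Conservative default: nobody but the owner is in scope.
    audience: FrozenSet[str] = frozenset()
    public: bool = False
    recalled: bool = False

    def allows(self, requester: str, now: datetime) -> bool:
        """Return True if the requester may use the information at 'now'."""
        if self.recalled:
            return False
        if self.lifetime is not None and now >= self.published + self.lifetime:
            return False
        return self.public or requester == self.owner or requester in self.audience


t0 = datetime(2009, 9, 15)
policy = UsagePolicy(owner="alice", published=t0, audience=frozenset({"bob"}))
print(policy.allows("bob", t0 + timedelta(days=30)))    # within scope and lifetime
print(policy.allows("carol", t0 + timedelta(days=30)))  # outside the audience
```

Such a record only expresses the policy; enforcing it persistently once the information has left the owner's hands is the part that requires a technology such as DRM.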
3.2 Infrastructure Requirements
From the infrastructure standpoint, the overall service provisioning relying on PI
should be technology and provider independent thus allowing for interoperability and
portability of PI across platforms and operators.
Releasing PI can be considered as important as releasing Identity information or
paying. Therefore Infrastructures managing PI should exhibit similar process patterns
whereby the users are not only aware but also familiar with it. Moreover, the user
should be an explicit acting element in the process. Being involved explicitly in the
HCI interaction allows for better awareness upon PI release preventing them from
being released without consent.
4 From the Laws of Identity to the Laws of Personal
Information (PI)
Looking at the work done in the area of Identity Management we think this area
shares many similar requirements with the issue of PI management. Building on Kim
Cameron’s laws of identity [5], we propose to examine and transpose these laws in
the context of Personal Information (PI) Management. We argue this framework covers some essential requirements for socially responsible management of PI in social
networks.
Let us briefly review the seven laws putting them into perspective of PI Management Systems (PIMS) and what they mean in this context based on the requirements
discussed in the previous section.
1. User Control and Consent: users should explicitly consent to the information they release on PIMS and retain control over this information.
2. Minimal Disclosure for a Constrained Use: by default, and unless otherwise explicitly indicated by the user, any PIMS should reveal the minimum amount of information the user has explicitly, or not, agreed to and
limit its use to what it was intended for.
3. Justifiable Parties: PIMS should limit and enforce the disclosure of PI to
the parties clearly identified within a PIMS relationship (e.g., one-to-one
relationship, group relationship, public information, etc.).
4. Directed Identity: should be renamed to Directed Information, whereby
users can specify the scope of release of the information either for all to
discover or for specific private use.
5. Pluralism of Operators and Technologies: interoperability and portability
of PI across platforms and service providers.
6. Human Integration: people should be part of the human-computer process
involving the release of PI.
7. Consistent Experience Across Contexts: releasing and managing PI should
exhibit similar patterns of operation independently of the service provider or the technologies used.
We found Cameron’s work on the laws of identity particularly insightful in the context of PI management. Many of the proposed items appear to be directly usable when
transposed to PI. At this stage, it is beyond our purpose to decide which ones might be
part of a mandatory set while others might be part of an optional category. The first
four cover most of the essential user centered requirements identified in the previous
section while the last three are more oriented towards the enabling infrastructure.
Each of these aspects having a specific criticality level depending on how the
PIMS behaves with respect to the issue, can lead to a simple matrix of readiness level
(e.g., color codes) helping the users quickly and unambiguously picture what the level
of Social Responsibility of the Service Provider is with respect to PI management.
In order to implement some of these features, we have identified DRM technologies as a likely technical approach allowing to apply persistent protection and rights
management to PI.
5 Using DRM to Address Personal Information Management
DRM technologies have been mainly used to address the issue of persistent content
protection and distribution in the context of the Entertainment industry and the Enterprise sector to safeguard intellectual property (i.e., copyright, trade secrets) and more
recently to address compliance issues in the corporate world.
DRM is the acronym for Digital Rights Management. It denotes a technology that cryptographically associates usage rules, also called policies, with digital content. These rules govern the usage of the content they are associated with. They have to
be interpreted by an enforcement point prior to any access in order to determine
whether or not access can be granted. If the rules are successfully interpreted, a license is used to
decrypt and render the content using a trusted interface (e.g., browser, application,
sound or video device, etc.). Since the content itself is encrypted using strong cryptographic algorithms, it becomes persistently protected at all times, wherever it resides.
The general DRM scenario can be decomposed into the following four main steps:
1. Content preparation and packaging: this step requires the content owner
to securely package the content by encrypting it together with its usage
rules. The rules are also cryptographically attached to the content, thus
allowing superdistribution. Note that the rules could also be acquired
dynamically, provided the only attached rule is to acquire them. This is
particularly useful for retaining some control over the rules and their
associated content.
2. Content distribution (and superdistribution): from there on, the content
may be freely distributed (superdistributed) and shared through any media
(Web, CD, DVD, email, ftp, removable storage, streaming, etc.) since it is
persistently protected.
3. Content usage: this step involves a consumer trying to access and render the
content. It typically requires acquiring a license (from a license server)
based on the interpretation of the rules attached to the content. If successful,
the license is granted and returned to the user’s DRM enforcement point
for decryption and rendering of the content in a trusted interface. Note
that the license server is not necessarily the content owner; this role
may be outsourced to external actors such as content “clearing houses” or
service providers. This activity is of great importance, as it will provide the
usage data and metering information to the content owners for marketing
and market analysis purposes.
4. Settlement, clearing of transactions and usage metering: finally, this step
concerns the financial clearing and settlement of the completed transactions.
It is mostly back office and is based on the data collected from the
license acquisition request transactions.
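The four steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any real DRM product: the names `Package`, `LicenseServer` and `render`, the `allowed_users` rule, and the XOR `toy_cipher` (a placeholder for strong encryption, not secure) are all hypothetical and introduced only for this example.

```python
from dataclasses import dataclass, field
from itertools import cycle


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """XOR placeholder standing in for a real cipher (NOT secure)."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


@dataclass
class Package:
    content_id: str
    ciphertext: bytes
    rules: dict  # usage rules travel with the content (superdistribution)


@dataclass
class LicenseServer:
    keys: dict = field(default_factory=dict)
    usage_log: list = field(default_factory=list)  # raw data for step 4

    def package(self, content_id, plaintext, key, rules):
        # Step 1: encrypt the content and attach its usage rules.
        self.keys[content_id] = key
        return Package(content_id, toy_cipher(plaintext, key), rules)

    def request_license(self, pkg, user):
        # Step 3: interpret the attached rules; log every request for metering.
        granted = user in pkg.rules.get("allowed_users", [])
        self.usage_log.append((pkg.content_id, user, granted))
        return self.keys[pkg.content_id] if granted else None


def render(pkg, user, server):
    """Enforcement point: decrypt only when a license is granted."""
    key = server.request_license(pkg, user)
    return toy_cipher(pkg.ciphertext, key) if key is not None else None


server = LicenseServer()
pkg = server.package("photo-1", b"holiday photo", b"k3y",
                     {"allowed_users": ["alice"]})
# Step 2: pkg can now be copied anywhere; it stays encrypted.
print(render(pkg, "alice", server))  # licensed user gets the plaintext
print(render(pkg, "bob", server))    # unlicensed user gets None
```

The usage log is what a clearing house would draw on for settlement and metering in step 4.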
Among the most widely known DRM systems is Apple’s FairPlay, the DRM technology
used for all iTunes content (music, video, apps, etc.). It is now well understood
that any security system is bound to be broken given enough time and effort.
Security approaches therefore need not be foolproof, military-grade solutions,
especially when it comes to mundane usage situations and content. They should
strike a balance between an acceptable level of risk and its related cost,
assuming most users are not criminals a priori. This is exactly what Apple did
with FairPlay, and one can reasonably say the approach has proven successful
both economically and technically.
Considering personal information (PI) is intellectual property we contend there is a
need for its protection and management in a persistent way. Using DRM technologies
in ways similar to what Apple did with FairPlay to protect PI would provide the users
with a much-needed solution to safeguard their own information when shared over
social networks. As our society increasingly progresses towards User Created
Content (UCC, UGC) and Remix, DRM is poised to become a technology no longer
reserved for large companies and content publishers. Ordinary people now need the
tools and techniques allowing them to factor in their own creativity and
intellectual property while preserving the rights of others. This includes the
right to determine the conditions and the extent to which one is willing to share
and distribute one’s information.
As a result, our proposition is to give users access to a Personal Rights
Management system allowing them to specify and choose the rules and conditions
under which they will release personal information on social networks through
third party service providers. Ultimately, such rules will persistently apply and
govern the use of the released content no matter where it resides, thus giving
users increased confidence in how their PI is used.
There are however many limitations to this proposition that need to be mentioned.
First and foremost: interoperability and the lack of standards in DRM technologies.
This industry has been dominated by proprietary incompatible solutions mostly driven
by the rights holders or key players in the ecosystem. We think the need to
bring such functionality to the level of general users may drive new initiatives
that try to harmonize these technologies towards standards.
Another key challenge will be to design a policy expression language simple
enough for general users to be able to quickly and efficiently recognize and express
their needs in terms of PI management without the burden of having to enter into
complex settings that would defeat the whole purpose as most people would then
ignore them altogether. Designing such features will be instrumental in the adoption
of PI Management by users.
The lack of concern by users over PI during the growth of social networking
services will be hard to reverse for service providers who have become accustomed
to basically owning everything their members put online. This will require either
a significant, socially responsible change of mindset in understanding that users
have rights over their PI, or a set of public policies to force providers into
understanding the criticality of the issue. In the worst case, a legal step might
be considered, as in France for example, where a law on the right to be forgotten
is being proposed and discussed. Needless to say, this is a useless path given the
national territorial scope of such laws, which makes them basically inoperative
on a global medium such as the Internet.
Finally, there is the whole debate about the evil nature of DRM, as argued by
several activists who regard DRM as Digital Restriction Management that should be
banned altogether from every service because it limits our fundamental rights.
While we agree with many of their objections, their extreme position is highly
arguable, as they do not propose any alternative. Our position has been clear for
many years now, and we have worked on and proposed models [6] to handle exception
management in DRM environments, accommodating fair use and other exceptional
situations in traceable and accountable ways.
6 Conclusion and Future Work
Personal Information (PI) Management in social networks is becoming an area of
growing concern not only for service providers facing social responsibility issues but
also for ordinary users, who are increasingly unaware of how their information is
managed and used when shared. In addition to raising and describing the issue, we have
drawn a parallel between Identity Management and PI Management based on Cameron’s laws
of identity. In this context, and given some preliminary requirements, we argue DRM
technologies should be brought down to the level of ordinary users in order for them
to manage the rights to their PI.
We are well aware this paper is prospective in nature and therefore covers initial
ideas in terms of a proposed approach to address this issue. Future work will focus on
studying and designing a lightweight rights framework optimized for PI Management
and implementing a prototype within an existing social network. In addition, work is
needed on the usability and awareness issues.
References
1. Bekmann, J., Lancaster, M., Lassen, S., Wang, D.: Google Wave Data Model and
Client-Server Protocol, http://www.waveprotocol.org/whitepapers/internalclient-server-protocol
(retrieved January 2010)
2. Aaronson, S.A.: Corporate responsibility in the global village: the British role model and the
American laggard. Business and Society Review 108(3), 309–338 (2003)
3. Rouvroy, A.: Réinventer l’art d’oublier et de se faire oublier dans la société de
l’information? In: Lacour, S. (ed.) version augmentée du chapitre paru, sous le même titre,
dans La sécurité de l’individu numérisé. Réflexions prospectives et internationales,
pp. 249–278. L’Harmattan, Paris (2008),
http://works.bepress.com/antoinette_rouvroy/5
4. Foucault, M.: Surveiller et punir: Naissance de la prison. Gallimard, Paris (1975)
5. Cameron, K.: The Laws of Identity. In: Identity Weblog, December 5 (2005),
http://www.identityblog.com/stories/2005/05/13/TheLawsOfIdentity.pdf
(retrieved January 2010)
6. Morin, J.-H.: Exception Based Enterprise Rights Management: Towards a Paradigm Shift in
Information Security and Policy Management. International Journal On Advances in
Systems and Measurements 1(1), 40–49 (2008)
Porting Social Media Contributions with SIOC
Uldis Bojars, John G. Breslin, and Stefan Decker
DERI, National University of Ireland, Galway, Ireland
[email protected]
Abstract. Social media sites, including social networking sites, have captured
the attention of millions of users as well as billions of dollars in investment and
acquisition. To better enable a user’s access to multiple sites, portability between social media sites is required in terms of both (1) the personal profiles
and friend networks and (2) a user’s content objects expressed on each site.
This requires representation mechanisms to interconnect both people and objects on the Web in an interoperable, extensible way. The Semantic Web provides the required representation mechanisms for portability between social
media sites: it links people and objects to record and represent the heterogeneous ties that bind each to the other. The FOAF (Friend-of-a-Friend) initiative
provides a solution to the first requirement, and this paper discusses how the
SIOC (Semantically-Interlinked Online Communities) project can address the
latter. By using agreed-upon Semantic Web formats like FOAF and SIOC to
describe people, content objects, and the connections that bind them together,
social media sites can interoperate and provide portable data by appealing to
some common semantics. In this paper, we will discuss the application of Semantic Web technology to enhance current social media sites with semantics
and to address issues with portability between social media sites. It has been
shown that social media sites can serve as rich data sources for SIOC-based applications such as the SIOC Browser, but in the other direction, we will now
show how SIOC data can be used to represent and port the diverse social media
contributions (SMCs) made by users on heterogeneous sites.
1 Introduction
“Social network portability” is the term used to describe the ability to reuse one’s own
profile across various social networking sites. The founder of the LiveJournal blogging community, Brad Fitzpatrick, wrote an article1 from a developer’s point of view
about forming a “decentralized social graph”, which discusses some ideas for social
network portability and aggregating one’s friends across sites. However, it is not just
friends that may need to be ported across social networking sites (and across social
media sites in general), but content items as well.
Soon afterwards, “A Bill of Rights for Users of the Social Web”2 was authored by
Smarr et al. for “social web” sites that wish to guarantee ownership and control over
one’s own personal information. As part of this bill, the authors asserted that
participating sites should provide social network portability, but that they should also
guarantee users “ownership of their own personal information, including the activity
stream of content they create”, and also stated that “sites supporting these rights shall
allow their users to syndicate their own stream of activity outside the site”.
1 http://bradfitz.com/social-graph-problem/
2 http://opensocialweb.org/2007/09/05/bill-of-rights/
OpenSocial from Google is another related effort that has gained a lot of attention
recently. While, at the time of writing, OpenSocial has mainly focused on application
portability across various social networking sites, the following statement3 mentions
future reuse of data across participating sites: “an OpenSocial app added to your
website automatically uses your site’s data. However, it is possible to use data from
another social network as well, should you prefer.”
Enabling a person’s transition and/or migration across social media sites poses
significant challenges, both in terms of the person-to-person networks and the
content objects expressed on each site. As well as requiring APIs to access this
data (such as SPARQL endpoints or AtomPub interfaces), representation mechanisms
are needed to represent and interconnect people and objects on the Web in an
interoperable, extensible way.
The Semantic Web4 [1] provides such representation mechanisms: it links people
and objects to record and represent the heterogeneous ties that bind us to each other.
By using agreed-upon Semantic Web formats to describe people, content objects, and
the connections that bind them together, social media sites can interoperate by appealing to common semantics. Developers are already using Semantic Web technologies
to augment the ways in which they create, reuse, and link content on social media
sites. Some social networking sites, such as Facebook, are also starting to provide
query interfaces to their data, which others can then reuse and link to via the Semantic
Web5, 6.
The Semantic Web is a useful platform for linking and for performing operations
on diverse person- and object-related data gathered from heterogeneous social media
sites. In the other direction, social media sites can serve as rich data sources for Semantic Web applications. As Tim Berners-Lee said in a 2005 podcast7, Semantic Web
technologies can support online communities even as “online communities ... support
Semantic Web data by being the sources of people voluntarily connecting things together”. Such semantically-linked data can provide an enhanced view of individual or
community activity across social media sites (for example, “show me all the content
that Alice has acted on in the past three months”).
Social media sites should be able to collect a person’s relevant content items and objects of interest and provide some limited data portability (at the very least, for their
most highly used or rated items). We will refer to these items as one’s social media
contributions, or SMCs. Through such portability, the interactions and actions of a person with other users and objects (on systems they are already using) can be used to
create new person or content associations when they register for a new social media site.
In [2], it was shown that social media sites can serve as rich data sources for
SIOC-based applications such as the SIOC Browser. In the other direction, we will
demonstrate in this paper how SIOC data can be used to represent and port the diverse
SMCs made by users on heterogeneous sites.
3 http://code.google.com/apis/opensocial/container.html
4 http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21
5 http://www.openlinksw.com/blog/~kidehen/?id=1237
6 http://www.dcs.shef.ac.uk/~mrowe/foafgenerator.html
7 http://esw.w3.org/topic/IswcPodcast
2 Getting Content Items Using SIOC
The SIOC initiative [3] was initially established to describe and link discussion posts
taking place on online community forums such as blogs, message boards, and mailing
lists. As discussions begin to move beyond simple text-based conversations to include
audio and video content, SIOC has evolved to describe not only conventional discussion platforms but also new Web-based communication and content-sharing mechanisms. In combination with the FOAF vocabulary for describing people and their
friends, and the Simple Knowledge Organization Systems (SKOS) model for organising knowledge, SIOC lets developers link posted content items to other related items,
to people (via their associated user accounts), and to topics (using specific “tags” or
hierarchical categories).
Fig. 1. Porting social media contributions from data providers to import services
Various tools, exporters and services have been created to expose SIOC data from
existing online communities. These include APIs for PHP, Java and Ruby, data exporters
for systems like WordPress, Drupal and phpBB, data producers for RFC 4155
mailboxes, SIOC converters for Web 2.0 services like Twitter and Jaiku, and
commercial products like Talis Engage and OpenLink Virtuoso. A full set of applications
that create SIOC data is available online8.
All of these data sources provide accurate structured descriptions of social media
contributions (SMCs), that can be aggregated from different sites (e.g., by person via
their user accounts, by co-occurring topics, etc.). Figure 1 shows the process of porting SIOC data from various sources to SIOC import mechanisms for WordPress and
future applications. We will now describe the SIOC import plugin for WordPress.
3 Import SIOC Data, with a WordPress Example
The SIOC import plugin for WordPress9 is an initial demonstrator for social media
portability using SIOC. This plugin creates a screen (see Figure 2) in the WordPress
administration user interface which allows one to import user-created content in the
form of SIOC data.
Fig. 2. SIOC RDF import into WordPress
Data to be imported can be created from a number of different social media sites
using SIOC export tools (as described above). For example, a SIOC exporter plugin
for a blog engine would create a SIOC RDF representation of every blog post and
comment, including information about:
• The content of a post
• The author
• The creation / update date
• Tags and categories
• All comments on the post
• Information about the container blog
8 http://rdfs.org/sioc/applications/#creating
9 http://wiki.sioc-project.org/w/SIOC_Import_Plugin
The data representation used (RDF) enables us to easily extend this data model with
new properties when they become necessary. The import process implemented by the
WordPress SIOC import plugin is the following:
• Parse RDF data (using a generic RDF parser called ARC)
• Find all posts - sioc:Post(s) - which exhibit all of the properties required by the
target site
• For each post found:
  • Create a new post using WordPress API calls
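The import steps above can be sketched as follows. The actual plugin is written for WordPress and uses the ARC parser; the Python below only mirrors the logic. It assumes the RDF has already been parsed into (subject, predicate, object) tuples, and the set of required properties and the `create_post` callback are hypothetical stand-ins for the target site's requirements and for the WordPress API calls.

```python
RDF_TYPE = "rdf:type"
# Assumed requirements of the target site; a real importer would define its own.
REQUIRED = ("dcterms:title", "sioc:content", "dcterms:created")


def find_importable_posts(triples, required=REQUIRED):
    """Return (uri, properties) for every sioc:Post that has all required properties."""
    props = {}
    for subject, predicate, obj in triples:
        props.setdefault(subject, {})[predicate] = obj
    return [
        (uri, po)
        for uri, po in props.items()
        if po.get(RDF_TYPE) == "sioc:Post" and all(r in po for r in required)
    ]


def import_posts(triples, create_post):
    """Create one post on the target site per importable sioc:Post."""
    return [
        create_post(
            title=po["dcterms:title"],
            body=po["sioc:content"],
            date=po["dcterms:created"],
        )
        for _uri, po in find_importable_posts(triples)
    ]


triples = [
    ("ex:post1", RDF_TYPE, "sioc:Post"),
    ("ex:post1", "dcterms:title", "Hello"),
    ("ex:post1", "sioc:content", "First post"),
    ("ex:post1", "dcterms:created", "2008-03-03"),
    ("ex:post2", RDF_TYPE, "sioc:Post"),  # no content: skipped
    ("ex:post2", "dcterms:title", "Draft"),
]
imported = import_posts(triples, create_post=lambda **fields: fields)
print(imported)
```

Posts that lack a required property are skipped rather than imported half-empty, which matches the "exhibit all of the properties required by the target site" condition.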
The pilot implementation currently works with a single SIOC file and imports all the
posts contained within it. Figure 3 shows an example post imported into WordPress:
Fig. 3. Imported post in WordPress
Since SIOC is a universal data format, and not specific to any particular site, this
pilot implementation already allows us to move content between different blog engines
or even between different kinds of social media sites. However, the import of a
single file shown here is mainly useful for demonstration purposes.
We will now describe how a SIOC import tool can be extended to port all user-created
content from one social media site to another. By starting from a site’s main
SIOC profile, we can retrieve machine-readable information about all the content of
this site - starting with the forums hosted therein, and then retrieving the contained
posts, comments, and associated users. This extended SIOC import tool needs to
retrieve all SIOC data pages (possibly limited by some user-defined filters) and to
recreate all the data found in these SIOC pages on the target social media site.
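A minimal sketch of such a traversal is given below. It assumes a hypothetical `fetch(uri)` function that returns the SIOC triples published for a resource, and it follows the SIOC properties `sioc:host_of`, `sioc:container_of`, `sioc:has_reply` and `sioc:has_creator` from the site profile downwards. The `keep` parameter stands in for the user-defined filters; none of these function names come from an existing tool.

```python
from collections import deque

# SIOC properties followed when walking from a site to its content.
FOLLOW = ("sioc:host_of", "sioc:container_of", "sioc:has_reply", "sioc:has_creator")


def crawl_site(site_uri, fetch, keep=lambda uri: True):
    """Breadth-first retrieval of all SIOC data reachable from the site profile."""
    seen = {site_uri}
    queue = deque([site_uri])
    collected = []
    while queue:
        uri = queue.popleft()
        for subject, predicate, obj in fetch(uri):
            collected.append((subject, predicate, obj))
            # Follow links to forums, posts, comments and users exactly once.
            if predicate in FOLLOW and obj not in seen and keep(obj):
                seen.add(obj)
                queue.append(obj)
    return collected


# In-memory stand-in for the SIOC data pages of a small site.
PAGES = {
    "ex:site": [("ex:site", "sioc:host_of", "ex:forum")],
    "ex:forum": [("ex:forum", "sioc:container_of", "ex:post")],
    "ex:post": [
        ("ex:post", "sioc:has_reply", "ex:comment"),
        ("ex:post", "sioc:has_creator", "ex:user"),
    ],
    "ex:comment": [("ex:comment", "sioc:has_creator", "ex:user")],
}
data = crawl_site("ex:site", fetch=lambda uri: PAGES.get(uri, []))
print(len(data), "triples retrieved")
```

The `seen` set keeps the crawl from fetching the same page twice, which matters because posts, comments and users link to each other.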
This will result in a replica of the original site, including links between objects
(e.g., between posts and their comments). Often, a part of the content that a user
wants to port is not publicly available. SIOC exporters can also be used in this case,
but the user will first need to authenticate at the source site and ensure that they have
enough privileges to access all the data that need to be migrated.
Another step in social media portability is keeping two sites synchronised (if
required): having the same set of users, posts, comments, category hierarchies, etc. In
principle, this can be achieved by importing a full SIOC dataset and then monitoring
SIOC data feeds for newly added items (some SIOC export tools may need to be extended to do this). Implementing this in practice will undoubtedly reveal some interesting challenges.
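The "import once, then monitor" idea can be illustrated with a small sketch. It assumes each feed item carries a stable URI and that the importer remembers which URIs it has already handled; `sync_once` and `import_item` are hypothetical names, and updates or deletions at the source are deliberately left out.

```python
def sync_once(feed_items, imported_uris, import_item):
    """Import only those feed items whose URI has not been seen before."""
    new_items = [item for item in feed_items if item["uri"] not in imported_uris]
    for item in new_items:
        import_item(item)
        imported_uris.add(item["uri"])
    return len(new_items)


target_site = []   # stand-in for the target social media site
already = set()    # URIs imported so far

# Initial full import, followed by a later poll of the same feed.
feed = [{"uri": "ex:post1"}, {"uri": "ex:post2"}]
sync_once(feed, already, target_site.append)
feed.append({"uri": "ex:post3"})
added = sync_once(feed, already, target_site.append)
print(added, "new item(s) imported on the second pass")
```

Handling edits, deletions and conflicting changes on both sites is where the challenges mentioned above are likely to appear.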
Another example for using a complete site import would be for platform migration.
For example, this could occur if a person has been using a mailing list for a particular
community, and they then decide that the extended functionality offered to them by a
Web-based message board platform is required.
It is not just discussion-type content items that can be ported. Using the SIOC
Types module10, various content types can be collected in sioc:Container(s) and
ported in the same way (Sounds, MovingImages, Events, Bookmarks, etc.).
4 The Role of SKOS and FOAF
SIOC allows us to describe most user-created content, but it can also be combined
with other RDF vocabularies such as Dublin Core (DC), Friend-of-a-Friend (FOAF)
and Simple Knowledge Organisation Systems (SKOS). These vocabularies can be
used when there is a need to migrate some additional data specific to a particular site.
DC provides a basic set of properties and types for annotating documents and resources. DC’s “Type” vocabulary also defines various document types such as MovingImages, Sound, etc., that can be used to describe media elements from social media sites.
SKOS is designed for describing taxonomies such as category hierarchies. By exposing categories in SKOS we ensure the portability of this information to other social media sites.
Finally, FOAF is designed for describing information about people and their social
relations. This vocabulary is already used together with SIOC to describe information
about users, and additional properties from FOAF (e.g., foaf:knows) can be used to
describe users’ social networks. This can be useful when porting data from a social
networking site.
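As a rough illustration of how the vocabularies fit together, the sketch below stores a few triples as plain Python tuples and answers two questions over them. The `ex:` resources and the helper functions are invented for the example; the properties used (`sioc:has_creator`, `sioc:account_of`, `sioc:topic`, `foaf:knows`, `skos:broader`) belong to the SIOC, FOAF and SKOS vocabularies.

```python
TRIPLES = [
    ("ex:post1", "sioc:has_creator", "ex:bob_account"),  # SIOC: post -> account
    ("ex:bob_account", "sioc:account_of", "ex:bob"),     # SIOC: account -> person
    ("ex:alice", "foaf:knows", "ex:bob"),                # FOAF: social network
    ("ex:post1", "sioc:topic", "ex:semweb"),             # SIOC: post -> category
    ("ex:semweb", "skos:broader", "ex:web"),             # SKOS: category hierarchy
]


def objects(triples, subject, predicate):
    return {o for s, p, o in triples if s == subject and p == predicate}


def posts_by_friends(triples, person):
    """Posts created through the user accounts of people that `person` knows."""
    friends = objects(triples, person, "foaf:knows")
    accounts = {s for s, p, o in triples if p == "sioc:account_of" and o in friends}
    return {s for s, p, o in triples if p == "sioc:has_creator" and o in accounts}


def topics_with_ancestors(triples, post):
    """A post's topics plus every broader category above them."""
    found, todo = set(), list(objects(triples, post, "sioc:topic"))
    while todo:
        topic = todo.pop()
        if topic not in found:
            found.add(topic)
            todo.extend(objects(triples, topic, "skos:broader"))
    return found


print(posts_by_friends(TRIPLES, "ex:alice"))
print(topics_with_ancestors(TRIPLES, "ex:post1"))
```

Because all three vocabularies share the same triple model, a target site can import the posts, the friend links and the category hierarchy in one pass.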
5 Conclusions and Future Work
In this paper we have shown how SIOC data can be used to represent and port the
diverse social media contributions being made by users on various sites. We began by
describing the need and requirements for such portability, then talked about sources of
data including various SIOC data producers, and next we described how such SIOC
data can be imported into a system such as WordPress. We finally talked about how
this data can be augmented using other vocabularies such as FOAF. For future work,
we note the issue of who should be allowed to reuse certain data on other sites
(as spam blogs often duplicate other people’s content without authorisation for
SEO purposes). As well as collecting a person’s relevant content objects, social media
sites may need to verify that a person is allowed to reuse data / metadata from these
objects in external systems. This could be achieved by using SIOC as a representation
format, aggregating a person’s created items (through their user accounts) from various
site containers, and combining this with some authentication mechanisms to
verify that these items can be reused by the authenticated individual on whatever new
sites they choose.
10 http://rdfs.org/sioc/types
References
1. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American 284(5),
35–43 (2001)
2. Bojars, U., Breslin, J.G., Passant, A.: SIOC Browser - Towards a Richer Blog Browsing
Experience. In: Blogtalk Reloaded: Social Software - Research and Cases, Vienna, Austria
(October 2006) ISBN 3833496142
3. Breslin, J.G., Harth, A., Bojars, U., Decker, S.: Towards Semantically-Interlinked Online
Communities. In: Gómez-Pérez, A., Euzenat, J. (eds.) ESWC 2005. LNCS, vol. 3532,
pp. 500–514. Springer, Heidelberg (2005)
Reed’s Law and How Multiple Identities Make
the Long Tail Just That Little Bit Longer
David Cushman
Faster Future Consulting, United Kingdom
[email protected]
Abstract. Reed’s Law or “Group Forming Network Theory” (as Dr. David P.
Reed originally and modestly called it) is the mathematical explanation for the
power of the network. As with many great ideas, it is quite simple, easy to
understand and enlightening. This paper sets out to explain what Reed’s Law
describes and includes more recent understandings of the collaborative power
of networks, which should help to make sense of, and give context to, the
exponential growth it describes. It also suggests that the multiple complex identities we are adopting in
multiple communities are not necessarily a “bad thing”. The contention of this
paper is that the different modes of thought these actively encourage are to be
welcomed when viewed in the context of unleashing the power of self-forming
collaborative communities of interest and purpose.
1 Introduction
In the beginning, we had Sarnoff’s Law1: a mathematical description from a broadcast, mass media age. It was first applied to cinema screens, and latterly to TV.
Sarnoff’s Law states that the value of a network grows in proportion to the number of
viewers. It is basically a straight line: the more viewers, the more value the network
has. Most audience measurement techniques have simply followed this rule ever
since. Some (such as unique users/visitor counts) have, inappropriately, continued to
apply it to websites and social networks. However, this is a serious underestimation
when you move out of broadcast models.
Metcalfe’s Law offers a better fit and a better way of measuring the relentless
growth of the power of the Internet. This law states that the value of the network
grows in proportion to the square of the number of nodes on the network. For example,
one fax machine on its own is useless (1 squared = 1, but there is no one to connect
to). Two (2 squared = 4) have more utility. For each one that is added to the network,
the value of all nodes in the network is increased (3 squared = 9 etc). If your
website is getting 10,000 unique users a month more than a rival, the gap between you
and them in terms of potential value created is at least 10,000 squared (100 million!).
Having 10,000 more nodes on a network – even if there is just a linking from one to
one – is much more valuable than having 10,000 passive viewers for a broadcast
(assuming you have gone to the trouble of doing more digitally than simply
replicating the broadcast model).
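The arithmetic can be checked with a short sketch. Reading Metcalfe's value as N squared, the gap created by 10,000 extra users depends on how large the rival's audience already is, and 100 million is the smallest it can be. The function names are illustrative only, and `pairwise_links` is added to show the number of distinct one-to-one connections, which is zero for a single fax machine.

```python
def sarnoff(n):
    """Broadcast value: proportional to the number of viewers."""
    return n


def metcalfe(n):
    """One-to-one network value: proportional to the square of the nodes."""
    return n * n


def pairwise_links(n):
    """Distinct one-to-one connections among n nodes."""
    return n * (n - 1) // 2


def metcalfe_gap(rival_users, extra_users):
    """Difference in N-squared value between you and a smaller rival."""
    return metcalfe(rival_users + extra_users) - metcalfe(rival_users)


print(pairwise_links(1), pairwise_links(2), pairwise_links(3))  # 0 1 3
print(metcalfe_gap(0, 10_000))       # 100000000
print(metcalfe_gap(50_000, 10_000))  # 1100000000
print(sarnoff(60_000) - sarnoff(50_000))  # 10000 under the broadcast model
```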
1 http://en.wikipedia.org/wiki/Sarnoff's_law
Metcalfe’s N-squared value explains why the growth of networks used for one-to-one communication (e.g. phone services, e-mail, instant messaging) follows the pattern
it does. Simply, new users will almost always join the larger network because they
will reason it offers more value to them. Frankly, it usually does: a tipping point is
reached, and the floodgates open. For example, if and when a VoIP provider assumes
dominance over our mobile identities (as defined by our personal ‘number’), then the
operators may be in trouble. Metcalfe’s Law charts an impressive and rapid rate of
growth of utility. However, even this, according to Reed [2], is a gross underestimation. What Metcalfe’s Law fails to take account of is that each of the nodes on the
network can choose to form groups of their own, of whatever size or complexity they
choose, with near neighbours or distant, initially unrelated, nodes. They can choose
small groups, large groups, be part of multiple groups, uber- and subgroups, etc.
2 Wild Growth but Counterintuitive: Do the Maths
If we add up all of the two-person, three-person, four-person, etc. groups, the utility is
2 to the power of N (Figure 1). This represents astonishing, wild growth potential.
Reed quotes the example of a minister who was offered whatever he chose as a reward from his king. He asked for two copper coins on the first square of a chess
board, four on the next, eight on the third and so on, following the progression of 2 to
the power of N. The king thought he had gotten off lightly until, by the time they
reached the 13th square, the number of copper coins had reached 8192. If he had owned a big
enough abacus he would have discovered that by the 64th square he would be handing
over somewhere in the region of 18 quintillion copper coins, a number rather higher
than there are grains of sand in the world. However, he did not bother, beheading the
smart minister instead (not something we can do with our web rivals).
Fig. 1. Various measures of the growth of networks
There are those who argue that Reed’s Law [2] is counterintuitive. [1] states that:
“Reed's Law says that every new person on a network doubles its value. Adding 10
people, by this reasoning, increases its value a thousandfold (2^10). But that does not
even remotely fit our general expectations of network values - a network with 50,010
people can't possibly be worth a thousand times as much as a network with 50,000
people.” They argue that n log (n) offers a more accurate interpretation of the growth
of networks. This is illustrated in Figure 1, which is adapted from their
well-reasoned argument [1].
Reed has refined his law in the following mathematical terms. The number of possible
sub-groups of network participants is 2^N - N - 1, where N is the
number of participants. Therefore, what Reed’s Law describes is the possible number
of sub-groups, and this reveals a potential, not an actual, value. That theoretical value
gets closer to being realised if the network is used in a particular way - in a way
which is becoming more and more the norm as the digital native generation takes
charge. Groups offer their greatest potential value when they work together to do
much more than chat.
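To make the different growth rates concrete, the following sketch evaluates each measure for a few network sizes. The formulas are those stated in the text (N, N squared, n log n, and 2^N - N - 1 for the number of possible sub-groups); the function names are illustrative, and the natural logarithm is an assumption, since the base only changes the result by a constant factor.

```python
import math


def sarnoff(n):
    return n


def metcalfe(n):
    return n ** 2


def n_log_n(n):
    return n * math.log(n)  # natural log assumed


def reed(n):
    """Possible sub-groups of two or more among n participants."""
    return 2 ** n - n - 1


def coins_on_square(k):
    """Chessboard story: 2 coins on the first square, doubling each time."""
    return 2 ** k


for n in (2, 3, 10, 20):
    print(n, sarnoff(n), metcalfe(n), round(n_log_n(n), 1), reed(n))

print(coins_on_square(13))  # 8192
print(coins_on_square(64))  # 18446744073709551616
```

Subtracting N and 1 removes the single-member groups and the empty group, which is why two participants yield exactly one possible sub-group.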
3 Collaboration, Flow Not Focus
In charting the possible number of sub-groups, Reed’s Law reveals the collaborative
potential of a network. When collaboration happens, new value - often immense value
- emerges. What do we mean by collaboration? [7] offers: “Collaboration is more
than just ‘working together’ … Collaboration implies that multiple people produce
something that the individuals involved could not have produced acting on their own
… Technology advances have meant that some level of time-shifting and place-shifting
is now possible, reducing the simultaneity inherent in the original scenario.”
Stowe Boyd [3] suggests the greatest value unleashed by networks comes when the
group is not only one which has self-formed with a collaborative purpose, but one in
which people are willing to drop everything to join in the flow as and when they are
required to by their connections – i.e. drop everything to act in real time. Boyd believes that far from the pipe dream that some may regard this as, this is the natural
place our involvement in networks leads to. He asks us to think of attention (i.e. demand on our time) as being more about flow than focus [3]:
• “Don't listen to industrial era or information era (the last stage of industrialism)
nonsense about personal productivity. Don't listen to the Man.”
• “The network is mostly connections. The connections matter, give it value, not the
nodes.”
• “Time is a shared space - your time is truly not your own”
• “Productivity is second to Connection: network productivity trumps personal productivity.”
This belief in the power of the network - and his willingness to subsume personal
focus to it - is based on the simple notion that: “I am made greater by the sum of my
connections - so are my connections.”
4 Don’t Just Network for Networking’s Sake
When a network is for simple one-to-one communication, there is little potential for
collaboration (other than between one and another). There is no potential for unleashing the wisdom of crowds [8], for tapping into the notion that none of us is as clever
as all of us. Ross Mayfield, CEO of Socialtext, offered his own equation for the value
created by collaboration [5]. In his “Ecosystem of Networks” diagram (Figure 2,
Table 1), he argues that the growth and potential value of a network is not only defined by its freedom to self-form into groups, but crucially by what those groups do.
Fig. 2. “Ecosystem of Networks” by Ross Mayfield
Table 1. “Ecosystem of Networks” by Ross Mayfield
Network Layer       Unit Size   Distribution of Links   Social Capital         Weblog Mode
Political Network   1000s       Power Law/Scale-Free    Sarnoff's Law (N)      Publishing
Social Network      150         Random/Bell Curve       Metcalfe's Law (N^2)   Communication
Creative Network    12          Even/Flat               Reed's Law (2^N)       Collaboration
For example, if your blog (your place in the network) is simply about publishing
information, its value to you and your readers follows Sarnoff’s Law. If you use
Facebook to communicate with your friends, then the value derived follows Metcalfe’s Law. If your node, your place in the network, is used to collaborate with others
- to share information, mash-up images, create new ideas, services and products, then
the value growth can follow Reed’s Law.
In other words, Reed’s Law reveals the potential value growth in collaborative
networks of shared interest and purpose. It is in this that the runaway value Reed’s
Law describes is found. It is this which answers [1]’s concern: how is a network with
ten more people worth 1000 times more? Imagine a collaborative network of
1000 scientists who have been seeking the cure for cancer. If ten more join, there
are now roughly 1000 times as many potential subgroups. If just one of the new
juxtapositions/mergers/mashings of ideas results in that cure for cancer, few would
dispute that the network had delivered a value 1000 times greater than in its
previous state.
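The "1000 times" figure follows directly from the sub-group count 2^N - N - 1: adding ten participants multiplies the number of potential sub-groups by about 2^10 = 1024. A quick check, using Python's exact integer arithmetic and an illustrative helper name:

```python
def reed(n):
    """Possible sub-groups of two or more among n participants."""
    return 2 ** n - n - 1


def subgroup_growth_factor(n, added):
    """How many times more potential sub-groups exist after `added` people join."""
    return reed(n + added) // reed(n)


print(subgroup_growth_factor(1000, 10))    # 1024
print(subgroup_growth_factor(50_000, 10))  # 1024
```

This counts potential groups only; as the text goes on to say, not every potential group will actually form.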
Of course, not every single potential group will form. Not every new idea will
deliver cancer-curing value. This much we know. It is why simply enabling collaborative networks will never deliver a 100% replica of the Reed’s Law value curve. For
each potential group that does not form, a big chunk of subsequent value is lost. What
we find harder to know, or even imagine, is the value emerging from those collaborative groups which do form.
5 Self-forming Communities and Multi-fac(et)ed Identities
It seems to me that the pursuit of that emerging value is the best indicator of where
our development and investment efforts should be focused if we wish to create sustainable value. Networks which allow self-forming collaborative communities of
shared interest and purpose will create value. This is the reason networks of collaboration have real power, create value you could not predict and are fast becoming the
model for a new way of socioeconomic life. As networks gain increasing influence on the macroeconomics of life, they are also having greater and greater influence
on our microlife - on the creation of our own individual identities.
Our identities become increasingly complex. The more I collaborate in self-forming groups – the more complex ‘I’ become. It is a question of psychological self-determinism: “who do I think I am?” Our identity - whether it carries the label of a
name or a number - is a work in progress. Ever-shifting, responding to communities it
is part of, your identity is as much (perhaps more) created by those around you as by
yourself.
The desire and need for psychological self-determinism is working as a powerful
adjunct to the growing influence of global, digital networks. When communities were
fixed in location, your identity was created by your relationships within that fixed
community. Your identity was equally fixed. In terms of Group Forming Network
Theory you belonged to just one group, and it was pretty much fixed in size.
In a socially-networked world, the creation of your identity becomes a process
which is contributed to by more people, more often, and from very varied backgrounds. The community you exist in shapes your identity from its perspective and
from your own. Your identity varies from community to community. If once you were
the blacksmith’s son and village blacksmith-in-waiting, now you are a huge variety of
identities - depending on the community you are interacting with at any one time.
Our identities become increasingly multi-faceted. For example, on my blog, my
identity is relatively serious, thoughtful. On Facebook, it is more playful. I am displaying a different facet of a complex identity. The community I feel I am part of
when writing my blog joins in the construction of my serious and thoughtful persona
(by their comments and expectations) [6]. The community I feel part of on Facebook
also joins in the construction of my persona there - by the way it acts, by its response
to what I do, by the tools it offers me. The push and pull of the forces forging my
identity in all elements of my life are communal. I interact with a community, therefore I am. Each community creates a different facet of that identity - and in doing so
makes a contribution to subtly reshaping the core.
As a simple example, becoming a parent changes your personality. You have a
new role to play and a new set of relationships - with your child, with your partner
(now a parent, too), with other parents, grandparents, etc. Each interaction changes
you in small but important ways, and these result in changes at your core. This may
be an extreme and emotionally-loaded example, but the co-creation of facets of your
personality has more than a superficial impact. This may be why ‘the edglings’ that
Stowe Boyd describes or Generation-C that [4] describes, have a different set of
wants, and are not satisfied by the norms of mass production/media. It is through new
mobile, fluid, co-creating communities they find themselves. They find they want to
share in, to be part of, and to engage with these communities. They are people for
whom collaboration and participation are the norm - for whom Reed’s Law is right.
Understanding which facets of personalities you seek to engage with, understanding that you are dealing with personalities created from converged facets: these are
real challenges for those marketing and/or creating social media today. Furthermore,
the notion that we want just one digital identity is challenged by the emerging value
multiple identities offer.
6 An Adjustment to Reed’s Law: A Longer Tail
There is an adjustment required for Reed’s Law. If each of us is a node on the network, each time one new node is added the value of the network (assuming the caveats described above) doubles. However, if each node has multiple identities then the
potential value must be multiplied by X, where X = the number of identities per node.
This clearly can only add to the value in an “uber” network - one in which multiple
identities apply, with the Internet itself being the biggest of them all.
What is the value of X? The average number of social networks regularly used by
the average social networker is greater than one. For example, I am a user of five, but
a regular user of three. Plenty of people who use them will use just one. Those who
use none are not part of any network and are therefore not part of the Reed’s Law
value curve creation of the Internet. X must equal a factor somewhat larger than 1.
Some estimate the average at 2.5. This then goes some way to restoring some of the
value lost to the Reed’s Law calculation when potential groups do not form. This
suggests that the actual ‘real’ curve may be somewhere between Metcalfe’s and
Reed’s curves.
Of course, if the potential of every group were fulfilled and we apply the multiple
identities factor, then the result must be a steeper curve than even Reed predicted. The number of possible sub-groups of network participants becomes 2^(N·x) - N - 1,
where N is the number of participants and x is the average
number of identities of each participating node. The theoretical growth of value in
participatory and collaborative networks of multiple identity nodes is greater even
than Reed predicts. It seems reasonable to challenge this identity complication. Why
should not N simply be the value which encompasses the total of N(x)?
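The effect of the identity factor can be illustrated with a small sketch. It rests on two assumptions that are not settled by the text: that the adjusted count is read as 2^(N·x) - N - 1, and that the earlier "multiplied by X" wording is a separate, weaker reading. All function names are hypothetical, and the float arithmetic is only suitable for small N.

```python
def reed_subgroups(n: int) -> int:
    """Reed's count of possible subgroups with two or more members."""
    return 2 ** n - n - 1


def identity_multiplied(n: int, x: float) -> float:
    """Weaker reading (assumption): Reed's value multiplied by x identities per node."""
    return x * reed_subgroups(n)


def identity_in_exponent(n: int, x: float) -> float:
    """Stronger reading (assumption): x identities per node enter the exponent."""
    # Uses float arithmetic, so keep n * x small (roughly below 1000) to avoid overflow.
    return 2 ** (n * x) - n - 1


if __name__ == "__main__":
    n, x = 10, 2.5  # 2.5 is the average number of identities quoted in the text
    print(reed_subgroups(n), identity_multiplied(n, x), identity_in_exponent(n, x))
```

With x = 1 both readings fall back to Reed's original count, which is a useful sanity check on either interpretation.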
It is worth making the distinction because the N value does not reflect the diversity
of thought the multiple identities of one individual node (person) can offer. The
thoughts of my Facebook identity may differ from those of my Blogger.com one. That
is because my modes of thought, my openness, and my willingness to think differently in
different contexts/communities do vary. Environment/community counts. For example, Twitter thoughts may kick the whole long-winded reasoned argument out of
the window, resulting in a different kind of problem-solving thinking that short-cuts
the logical and makes leaps using instinct.
Fig. 3. The long tail in Reed’s Law
Given the notion that the converged identities that make up the network are collaborating to create, it is reasonable to suggest that the supply of what they create
should match the demand for what is created. If the network works unfettered, it
should make only that which there is a demand for. To discover the demand curve, all
we need do is tip Reed’s Law on its side (Figure 3).
7 Conclusions
In this paper, we presented an overview of Reed’s Law and various related theories
for describing the collaborative power of a network. We examined these laws and
suggested that the multiple complex identities we are adopting in various online
communities are not necessarily a negative development. We contend that the different modes of thought these actively encourage are to be welcomed when viewed
in the context of unleashing the power of self-forming collaborative communities of
interest. Adding our identity multiple to Reed’s equation makes the long tail just that
little bit longer still: the more identities, the longer the tail.
References
1. Briscoe, B., Odlyzko, A., Tilly, B.: Metcalfe’s Law is Wrong (2006),
http://www.spectrum.ieee.org/print/4109
2. Reed, D.P.: That Sneaky Exponential: Beyond Metcalfe’s Law to the Power of Community
Building. Context Magazine (1999),
http://www.reed.com/Papers/GFN/reedslaw.html
3. Boyd, S.: Overload Shmoverload. Message (2007),
http://www.stoweboyd.com/message/2007/03/overload_shmove.html
4. Moore, A., Ahonen, T.: Communities Dominate Brands, Futuretext (2005),
http://communities-dominate.blogs.com/
5. Mayfield, R.: Ecosystem of Networks (2003),
http://radio.weblogs.com/0114726/2003/04/09.html
6. Cushman, D.: I Am Part of a Community, Therefore I Am, Faster Future (2007),
http://fasterfuture.blogspot.com/2007/08/i-am-part-of-communitytherefore-i-am.html
7. Rangaswami, J.P.: Maybe It’s Because I’m a Calcuttan, Confused of Calcutta (2007),
http://confusedofcalcutta.com/2007/08/31/maybe-its-because-im-acalcuttan/
8. Surowiecki, J.: The Wisdom of Crowds, Random House (2005),
http://www.randomhouse.com/features/wisdomofcrowds/excerpt.html
9. Anderson, C.: The Long Tail (2006), http://www.thelongtail.com/
Memoz – Spatial Weblogging
Jon Hoem
Bergen University College
[email protected]
Abstract. The article argues that spatial webpublishing influences weblogging, and calls for a revision of the current weblog definition. The weblog
genre should be able to incorporate spatial representation, not only the sequential ordering of articles. The article shows examples of different spatial forms,
including material produced in Memoz (MEMory OrganiZer).
Keywords: Memoz, spatial webpublishing, spatial montage, spatial weblogging.
1 Introduction
What I call spatial weblogging can be seen as a natural response to a more general
development of media at large: from media biased towards making time one of
the most significant factors, towards media that emphasize space.
The following discussion of the cultural implications of personal media, and weblogs and spatial publishing systems in particular, should be understood in relation to
two major movements concerning the differences between editorial and conferring
media, and between evanescent and positioned media [1]. The first concerns the
movement of power, from a situation where central units were in control towards a
situation where large parts of the production, distribution, and use of media content
happens through collective processes. The second concerns a shift in media concerning time, from a situation where the time between an event and the public mediation
of this event was considered important towards an increasing importance of space.
This influences production, distribution and use.
Editorial media follow a tradition where a relatively small number of people select,
produce and redact media content before this is distributed to a public audience where
every individual user is addressed in the same manner. A distinctive mark of editorial
media is that production and publishing are controlled by formal procedures before
the content is made available to the public. Editorial media are contrasted with conferring
media, where there are no formalized procedures for controlling the content before it is
published. Those who edit and produce the content are individuals who are not part of an organization. Conferring media are also characterized by the users’ active participation.
What I choose to call evanescent media are characterized by a close relationship between events, the production of content and the moment of publishing. Evanescent
media are both contrasted and complemented by positioned media, that is media
where space becomes more important than time. Examples are digital media where
both the production and consumption of mediated content is made dependent on
where the user is situated in space. This shift is introduced by mobile devices that
enable their users to produce, store and distribute media content combined with
specific information about these devices' position. Some personal publishing solutions
already include positioning (e.g. by introducing geotagging in weblogs); other solutions let the users make explicit connections between maps and information (like
Google's My Maps).
2 Weblog Definition
In a situation where conferring and positioned media become more widely used it is
about time to discuss whether a weblog-definition, emphasizing chronology as a key
feature, should be revised. Weblogs are already one of the most important genres of
conferring media, but the full impact of mobility and positioning is yet to be seen.
Spatial relationships between mediated objects and the events they represent will form
sub-genres that we need to include when discussing the features of weblogs.
When Jorn Barger coined the term weblog, more than ten years ago, he presented
the following description: "A weblogger (sometimes called a blogger or a pre-surfer)
logs all the other webpages she finds interesting. /../ The format is normally to add the
newest entry at the top of the page, so that repeat visitors can catch up by simply reading down the page until they reach a link they saw on their last visit" [2].
Barger mentions the reverse chronological order as a common feature, but this is
probably of minor importance when trying to explain the success of weblogs. Far
more important are usability, the ability to publish with a personal voice, and the hypertextual connections (permalinks, trackbacks, linkbacks etc.) to and feedback from
other users. This makes an individual part of collaborative efforts that also facilitate
the creation of communities. Expressed differently: if one removes the chronological
sequencing of posts one would only lose the most obvious technical characteristic
of weblogs, but all the features that have made personal publishing a success remain.
Weblogs can be seen as online diaries, but arguably the most powerful metaphor is
the ship log. The reference to "log" emphasizes that a weblog is always about things
and events that belong to the past at the moment of publishing. The log includes different representations of events, the time when the references were made and finally
references to the locations where the events took place. The latter has not yet developed
into a significant feature of today's weblogs. "Log" also implies some assumptions
about frequency: in the classical log of a ship one would expect entries either on a regular basis, on the occurrence of important events, or both. In the ship-log
the relationship to a specific place is arguably just as important as the reference to
time. On the Internet a place can be virtual, defined by a URL, or it can be a reference
to a physical position, defined by a Geo-URL.
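The log metaphor suggests an entry that records an event, a time and a place. The following is a minimal sketch of such an entry, not a description of any existing weblog system: the `LogEntry` fields and helper names are assumptions made for illustration, the physical position is stored as plain latitude/longitude degrees, and distances use the standard haversine formula with a mean Earth radius of 6371 km.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt
from typing import List, Optional, Tuple


@dataclass
class LogEntry:
    text: str                                       # representation of the event
    logged_at: datetime                             # when the reference was made
    url: Optional[str] = None                       # virtual place
    position: Optional[Tuple[float, float]] = None  # physical place as (lat, lon) degrees


def newest_first(entries: List[LogEntry]) -> List[LogEntry]:
    """The familiar weblog view: reverse chronological order."""
    return sorted(entries, key=lambda e: e.logged_at, reverse=True)


def distance_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance between two (lat, lon) points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def near(entries: List[LogEntry], centre: Tuple[float, float], radius_km: float) -> List[LogEntry]:
    """A spatial view: entries with a position inside the given radius."""
    return [e for e in entries
            if e.position is not None and distance_km(e.position, centre) <= radius_km]
```

The same list of entries can then be read either by time (`newest_first`) or by place (`near`), which is the choice the revised weblog definition would have to accommodate.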
3 Spatial Montage
Until print technology emerged in the latter half of the 1400s, paintings and decorated
architecture were the dominating technologies when mediating stories. In Europe
the invention of printing made it possible to organize information more systematically
(represented linearly) and, because of relatively cheap reproduction, printed text became
dominant for almost 500 years. Visual representations were of course developed
along with print, but one can argue that print did become the preferred way of distributing knowledge. This is still clearly seen in education, where printed books dominate.
Spatial, visual representations are still largely the domain of art and entertainment.
There is, however, a lot to be learned from the long traditions that store, structure
and present information visually. Even though we do not remember all information as
images, information is often easier to remember when connected to familiar spatial
forms. In ancient Greece, a predominantly oral culture, techniques were developed to
help a speaker, who had no physical storage media available, remember long passages
of linked information. Simonides, who is considered the founder of mnemonics
(μνημονικός mnemonikos - "of memory"), developed the rhetorical discipline known
as memoria (memory). To remember information and the relevant arguments, Simonides recommended that a speaker associate the individual items to be remembered
with specific rooms in a house. During his performance the speaker would make a mental
walk through the imagined house and recall the items for each room he visited.
Fig. 1. Early maps conveyed values rather than a representation of what the world really looks
like. The Psalter map accompanied a 13th-century copy of the Book of Psalms. Being a mappa
mundi (world map), it was not designed for navigation, but the spatial representation was intended to illustrate a world view: for example, Jerusalem is located in the map's center, in accordance with the contemporary Christian world view.
Fig. 2. Aby Warburg, "Mnemosyne-Atlas" [4], 1924-1929. Warburg made compilations of
texts and illustrations from a variety of sources. The compilations were also photographed,
as spatial representations of specific themes. The photographs could later be used in other
juxtapositions.
A visual representation can be used to recall the linear sequence of information
elements, but a more significant feature is the spatial connections that can be read in
any order - not restricted by the sequence of pages predefined by an author. Spatial
representations enable multi-linearity and give more room for individual associations.
Lev Manovich uses the term spatial montage to describe situations where more than
one visual object shares a single frame at the same time. The objects can be of
different sizes and types, and their relationship forms a meaningful juxtaposition that is
perceived by the viewer. Where traditional cinematic montage privileges the temporal
relationship between images, computer screen interfaces introduce other, spatial and
simultaneous, relationships [3].
Our ability to remember information is often dependent on whether we are able to
construct mental maps. Mental maps are essential when we are learning - a process
where we have to make connections between new information and existing knowledge.
To use content in new contexts is an essential quality of the compilation of a digital
text. In As We May Think, the article that introduces the idea of hypertextual organization in its modern form, Vannevar Bush [5] described a trail blazer, a person whose
profession is to construct trails in large complexes of information for others to follow.
Bush's concern was how researchers should be able to keep themselves informed in
their fields. Today, the amount of available information forces every information user
to adopt similar methods. To make new combinations of existing information objects
can be such a method. In this context spatial forms of publishing seem to provide
flexible options for linking web-based resources with existing knowledge.
4 Spatial Web Publishing
Personal publishing on the Web affects all media, and a number of user-friendly, free
services contribute to the wide spread of new genres. These include weblogs, wikis,
and various services used to produce websites. All the different solutions have in
common that the expressions are constantly changing, they are often played out
within various social networks, and new aesthetic qualities are developed through
various forms of interaction between the users. Among the publishing solutions used
by young people we find a number of services that encourage visual and spatial expression. Using these solutions the users are able to easily create their own websites
as a complex composition of text, images and video, complemented with music players, chat boxes, guest books etc.
Fig. 3. A fragment of a typical page at Piczo.com. The visual appearance clearly illustrates who
the users are: most are girls in their early teens. The publishing concept used by Piczo is, however,
interesting to more than teenagers. When editing the page all elements can be moved freely.
Even though pictures and videos are widely used, most weblogs still present information in ways that have not changed significantly from how they were presented
ten years ago. This is contrasted by the most popular personal publishing-tools, which
are used by youngsters. These webpages are also frequently revised and updated, and
other users are able to respond. However, when it comes to chronological structure
this does not seem to be a significant feature. Instead some popular systems let their
users structure information spatially. During previous work with weblogs in education, attention was drawn towards systems like Piczo, publishing systems that were
well known and widely used by most youngsters a few years ago. What made Piczo
particularly interesting was the fact that the user interface made it possible to place
the published objects in any position on the screen, not limited by screen size or by whether a
specific position was occupied by another object.
Spatial publishing-systems (systems where media objects can be placed in a position on the screen chosen by the user) allow their users a lot of freedom to express
and present themselves. Where weblogs have analogue predecessors like diaries,
journals, personal letters and logs, the spatial publishing systems have strong relationships to poster walls and scrapbooks. When these media are taken into the digital domain new forms occur, where quite extensive reuse of media-objects seems to be an
integrated part of this publishing-culture.
Piczo had several shortcomings, especially as a potential tool for learning, but it
initiated the idea of a spatial Memory organizer – Memoz. Memoz is a publishing
environment that lets the users publish spatially on a "screen-surface" that is not
restricted by the physical screen size.
Fig. 4. An example of spatial publishing with Memoz. In the background the user has integrated
a map with geotagged pictures, made with Google My Maps. Note how a satellite photo is
placed in a position corresponding to the underlying map. When editing the surface all objects
can be moved, scaled and stacked on top of each other.
The design was inspired by some of the features known from commercial systems,
but the specific design was developed with education in mind. Memoz' key-features
are:
• Spatial publishing where there are no restrictions on the size of the publishing surface.
• Easy combination of different media-expressions (text, pictures, video, animations, maps, etc.).
• Features that allow collaboration between several users. A user is able to
  share editing-access to a publishing space with other users, giving Memoz
  some basic wiki-like features.
• Each object can be addressed by a unique URL, a spatial permalink (SPerL),
  facilitating links between objects on the publishing surface.
• Commenting on individual media-objects.
• Open architecture enabling compiling of a variety of web-resources.
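The key features listed above can be pictured as a small data structure. What follows is only a sketch under stated assumptions, not Memoz's actual implementation: the class names, the fields and the permalink format (`https://example.org/memoz/<surface>#<object>`) are all hypothetical, chosen to show an unbounded surface, stacked objects and one address per object.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MediaObject:
    object_id: str
    kind: str    # e.g. "text", "picture", "video", "map"
    x: int       # position on the surface; may lie far outside any screen
    y: int
    width: int
    height: int
    z: int = 0   # stacking order; higher values are drawn on top


@dataclass
class Surface:
    surface_id: str
    objects: Dict[str, MediaObject] = field(default_factory=dict)

    def place(self, obj: MediaObject) -> None:
        """Add an object on top of everything already on the surface."""
        obj.z = max((o.z for o in self.objects.values()), default=-1) + 1
        self.objects[obj.object_id] = obj

    def move(self, object_id: str, x: int, y: int) -> None:
        """Reposition an object; coordinates are not clamped to a screen size."""
        obj = self.objects[object_id]
        obj.x, obj.y = x, y

    def paint_order(self) -> List[str]:
        """Object ids from bottom to top of the stack."""
        return [o.object_id for o in sorted(self.objects.values(), key=lambda o: o.z)]

    def permalink(self, object_id: str) -> str:
        """Hypothetical spatial permalink: one URL per object on the surface."""
        if object_id not in self.objects:
            raise KeyError(object_id)
        return f"https://example.org/memoz/{self.surface_id}#{object_id}"

    def scroll_to(self, object_id: str, screen_w: int, screen_h: int) -> Tuple[int, int]:
        """Top-left scroll offset that centres the linked object on the screen."""
        obj = self.objects[object_id]
        return (obj.x + obj.width // 2 - screen_w // 2,
                obj.y + obj.height // 2 - screen_h // 2)
```

Keeping positions as unconstrained integers is the design choice that matters here: the screen becomes a movable window onto the surface rather than its boundary.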
A fully functional prototype of Memoz was made during fall 2007¹, making it possible to publish videos, pictures, maps (using Google Maps) and texts literally side by
side. Memoz was then tested in selected schools as part of a research project in 2008.
5 Spatial Weblogging with Memoz
Working in Memoz is closely related to content resources available on the Internet.
This gives the teacher a concrete base and a tool to teach students how they can work
with a variety of sources, how they perform source criticism, and how they should
refer to sources using hyperlinks.
In Memoz the spatial publishing can be directly connected to the use of digital
maps. Users can organize information objects spatially and visually, and collaborate
on and present information and media elements in relation to a geographical position.
In relation to education, the use of spatial publishing follows a long tradition of
'place-based education' in which teaching is related to local resources in the curriculum (the local fauna, culture, history, etc.). Location-based teaching is inspired by the
desire to bridge the gap between what happens in the classroom and places and events
in the learner's surroundings. This perspective expresses the wish that students
care about their local environment and the people living there, and become more able
to take action when local problems occur. The structuring of the learning experience
then becomes a way to create informed citizens who are more able to participate, and more interested in doing so.
A primary objective was to determine whether the students were able to take advantage of their everyday skills related to Internet use in school situations. Quite a
few students managed to draw on experiences with other systems when working with
Memoz, for example knowledge of HTML that they had acquired while mastering
other publishing solutions. Many of these students showed great creativity when they
faced problems with the technology. The students who master these techniques seem
to find that they have some skills that are relevant to problem-solving in the school
situation.
¹ Memoz was designed by Mediesenteret Bergen University College. The following research
project was funded by ITU (The Norwegian National Network for IT-Research and Competence in Education).
Fig. 5. A webpage made with Memoz where the students have used a huge image of a tree as
background in order to show how different authors of crime fiction are related to one another. Only a small part of the overall page is shown on the screen at a given time. The page
was used to assist an oral presentation, and the students made spatial hyperlinks making it possible to follow a walkthrough. Memoz automatically scrolls the page to show the objects that
are linked to.
Memoz seems to have a potential as a supplement to existing tools such as presentation tools, Learning Management Systems, weblogs etc. There are, however, also
examples where the students did not use the potential that the tool offers. These students used the publishing surface in Memoz more as decoration than exploration.
However, an important observation is that there is a wide range of activities related to
the processes of selecting the content and form of what was presented, which can often not be found as part of the final product. The objects are transformed, moved and
even deleted during the process of compiling the final page.
Increasing use of online sources makes it necessary to deal with new questions
about knowledge sources. Students' work in Memoz has been closely linked to online
content resources, giving teachers both a justification for and a tool to motivate students to reflect on source types, source criticism, and source references. The
students can draw on experiences from their private use of media, but this does not
mean that this can be implemented in education without a critical follow-up from
teachers. Some guidance is needed to give the application of students' competence a
direction, so that the work processes and the media products produced become valid
knowledge resources.
6 Digital Bricolage
Traditionally, media users were directed towards practices and cultural understandings
that provide relatively clear guidelines for what are considered good products. Previously, these socialization processes have been linked to social institutions, especially
dominated by education and mass media, supplemented with experience from the
individual's private sphere. New understandings and practices have been developed
within the framework of subcultures. Some subcultures have evolved, been transformed, and become a part of the established mass culture.
Today, individuals or groups of users can produce media products that are easily
distributed through social networks, and in some cases to a large audience. The consequences are rapid proliferation of aesthetic practices, which challenge the traditional
arenas of culture formation. Mediated expressions are transformed and recontextualized quickly, and the production and distribution of media texts involve
collective processes [6].
Without content with qualities that the users find interesting no information service
will ever become successful. However, when looking more closely at digital texts one
sees that one of their major qualities is their ability to provide context. A successful
text uses elements from different sources, and supports social relationships to the producers, most often through links and comments.
The process of producing a digital text often involves some kind of copying of content,
like material previously produced by the user on his own computer, or material found
on the Internet. In personal publishing most new texts are made in response to information already available, as comments, as additions to texts published by others or as
autonomous texts connected to other texts through different ways of hyperlinking.
The connections between texts may be characterised as communities. These communities constitute a public without any demands of formal connections between the
participants [7].
The production of new digital expressions almost always involves some kind of
copying of content, or selection from predefined functions, whether this is the copying
of material previously produced by the user on his own computer, or found on the
Internet. Copying and reuse of media material is an activity that resembles what
Claude Lévi-Strauss calls bricolage. Lévi-Strauss introduced le bricoleur as the antithesis of l'ingénieur, referring to modern society's way of thinking in contrast to how people in traditional societies solve problems. A bricoleur collects
objects without any specific purpose, not knowing how or if they might become useful. These objects become parts of a repertory which the bricoleur may use whenever
a problem needs to be solved [8].
Fig. 6. Part of a page made entirely of existing objects: pictures, videos and texts. The pupils
have not contributed with any material, but the overall compilation is nevertheless unique.
The copying and reuse of media material make young publicists able to produce
new expressions. These media-elements are likely to be taken from different, often
commercial, presentations and combined into new personal expressions that are
shared online. These activities resemble those of the bricoleur, understood as a person
who takes detours to achieve a result that in the end seems like an improvised
solution to problems of both practical and existential character. The process of bricolage may be highly innovative as objects often end up being used in contexts very
different from the ones they originated from. Thus those who behave as bricoleurs
often perform a complex set of aesthetic and practical considerations when using objects from their repertories in new media-expressions.
7 Further Research on Spatial Weblogging
Memoz has not been shipped as a service outside education, and the examples shown
are from school situations. They were produced within a limited timeframe, and can
hardly be directly compared to texts produced with ordinary weblogging tools. It is, however, possible to imagine how new posts can be
added to a page that grows over time, placed on the screen surface in relation to previous objects covering similar topics. New objects can even be placed over older ones,
resembling the development of a poster wall.
In Memoz any sequential relationship between objects has to be created manually,
using spatial hyperlinks. This is, however, a functionality that can be developed into a
system where the sequential ordering can be followed in ways known from tools like
Etherpad and Google Wave. In other words: if spatial weblogging is to be considered
a genre, it is to be further developed.
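How a sequence might be derived from manually created spatial hyperlinks can be sketched as follows. This is an illustration under assumptions, not Memoz functionality: links are modelled as a plain mapping from one object id to the next, and the `walkthrough` helper is hypothetical.

```python
from typing import Dict, List, Optional


def walkthrough(start: str, links: Dict[str, Optional[str]]) -> List[str]:
    """Follow 'next object' links from start and return the reading order.

    Stops at an object without an outgoing link, or when a link points back
    to an object already visited, so circular link chains terminate.
    """
    order: List[str] = []
    seen = set()
    current: Optional[str] = start
    while current is not None and current not in seen:
        order.append(current)
        seen.add(current)
        current = links.get(current)
    return order


if __name__ == "__main__":
    # Three objects linked in a circle; the walk visits each exactly once.
    print(walkthrough("a", {"a": "b", "b": "c", "c": "a"}))
```

A derived order of this kind could sit alongside the spatial layout, so that a page can be read either as a surface or as a sequence.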
The spatial expressions, which until now have been viewed on traditional computer
screens, can also be mediated by devices that in various ways can be connected to the
reader's position. Mobile devices such as small computers, cell phones, ebook-readers
etc. can establish a connection between the content and the location where the device
(and the user) is at the moment of reading. This creates opportunities for a new, virtual
level of information that comes in addition to what we otherwise experience in the
physical environment. In the meeting between these spheres one can see a continuous
interchange between physical and virtual expression. Space becomes a meeting place
where two types of environments influence each other or are built together.
It will be interesting to look more into how spatial web publishing can lead to new
aesthetic and social practices in other arenas. One can already see a number of examples of how spatial publishing is related to physical, place-bound aesthetic expression. These fields of practice can meet in what is often described as "shifting" or
"enhanced" reality (augmented reality).
The relationships between space, location and representation are issues that are the
basis for architecture and urban planning, but which also have great relevance for
understanding the Internet's texts. In large online hypertexts there is no natural beginning or end. Thus the user must develop an understanding of the text's structure
through the active use of text. With the increasing complexity of how information is
mediated, in both physical and virtual spaces, it becomes particularly relevant to look
more into how the spatial screen-based representations can be connected to physical
space in different contexts. This relates closely to Barger's original definition of
weblogging and to Vannevar Bush's thoughts about storing and exploring information, both emphasizing the making of connections to past events and experiences for
others to follow.
References
1. Hoem, J.: Personal Publishing Environments, Doctoral theses at NTNU, 3 (2009),
http://infodesign.no/2009/08/personal-publishingenvironments-all.htm
2. Barger, J.: Weblog resources FAQ (1999),
http://www.robotwisdom.com/weblogs/
3. Manovich, L.: The Archeology of Windows and Spatial Montage (2002),
http://www.manovich.net/macrocinema.doc
4. Freiling, R.: The Archive, the Media, the Map and the Text (2007),
http://www.medienkunstnetz.de/works/mnemosyne/
5. Bush, V.: As We May Think, The Atlantic Monthly (July 1945),
http://www.theatlantic.com/doc/194507/bush
6. Hoem, J.: Openness in Communication, First Monday Special Issue on Openness (2006),
http://www.firstmonday.org/issues/issue11_7/hoem/index.html
7. Hoem, J., Schwebs, T.: Personal Publishing and Media Literacy. In: IFIP World Conference
on Computers in Education (2005),
http://infodesign.no/artikler/personal_%20publishing_media_literacy.pdf
8. Lévi-Strauss, C.: The Savage Mind (La Pensée Sauvage). Oxford Univ. Press, Oxford
(1962)
Campus Móvil: Designing a Mobile Web 2.0 Startup
for Higher Education Uses
Hugo Pardo Kuklinski1 and Joel Brandt2
1 Digital Interactions Research Group, University of Vic, Catalunya, Spain
2 Human-Computer Interaction Group, Stanford University
[email protected], [email protected]
Abstract. In the intersection between the mobile Internet, social software and
educational environments, Campus Móvil is a prototype of an online application for mobile devices created for a Spanish university community, providing
exclusive and transparent access via an institutional email account. Campus
Móvil was proposed and developed to address needs not currently being met
in a university community due to a lack of ubiquitous services. It also facilitates
network access for numerous specialized activities that complement those normally carried out on campus and in lecture rooms using personal computers.
1 Introduction
The synergy between novel technology and use patterns has enabled the convergence
of mobile devices and Web 2.0 applications. This synthesis is a new conceptual space
called Mobile Web 2.0 [1], leading to an always-on empowered web consumer.
Handsets are becoming more powerful and sophisticated in terms of: processing
power; new multimedia capabilities; more network bandwidth for internet communications; access to already-available WiFi access points; more efficient web browsers;
larger high-resolution screens; novel hybrid mobile applications; and massive online
communities. The adoption of 3G mobile devices by hardware manufacturers and
operators has made available an infrastructure that promotes connected physical mobility and a new and attractive market for services.
The Mobile Web 2.0 concept [1] can be linked to each of the seven principles outlined by O’Reilly [2] in his article describing Web 2.0.
• The Web as a platform. A mobile device has never had as much computational
power or storage capacity as its non-mobile counterpart. The Web as a platform
emerges as a strong synergizing agent for mobile devices.
• Database management as a core competence. The alliance of mobile and Web
2.0 allows the integration of data with the ease of quick access from any place and
at any moment, supporting data ubiquity.
• Ending of the software release cycle. For mobiles, this can be an advantage given
certain system characteristics of these devices like reduced memory, minimal
graphical user interfaces and the security risk associated with the installation of
software by third-party developers.
• Lightweight programming models and the search for simplicity. Reduced interfaces and limited storage systems, graphical austerity as well as the use of application protocols, will be the base of any implementation for Mobile Web 2.0.
• Software above the level of a single device. Software is now being designed for
use on multiple hardware platforms, most commonly, personal computers and mobile devices.
• Both rich user experiences and harnessing collective intelligence are the key to
future development, given a current environment in which the mobile data industry
is based on content provided by the carriers.
The other attributes of Mobile Web 2.0 are: being able to capture the point of inspiration; global nodes and multi-language access; mashups on mobility; location-based
services (as an organic use); and mobile search emphasizing context.
In the intersection between the mobile Internet, social software and educational environments, Campus Móvil is the result of a research project with a strong business
focus1. The Campus Móvil startup project is a prototype of an online application that
will be used via mobile devices in a Spanish university community, with exclusive
and transparent access provided through an institutional email account.
The Campus Móvil project covers three main knowledge areas: Web 2.0 applications, mobile devices, and teaching innovation policies in the new European Space for
Higher Education. European structural changes in education were suggested through
the Bologna Process, which includes a special emphasis on innovation in technology
usage as an essential value proposition for new pedagogical strategies. In the European Union, different financial sources have been set out for the implementation
of Bologna. There is also growing interest from the IT business world towards contributing new and innovative ideas for Internet-based applications in the higher education domain.
The Campus Móvil proposal is not entirely different from other web community
strategies. The integration of tools as a sort of “cocktail” of existing products from the
mobile business market is its main added value. Many of these products are in their
initial phases of development and they have not previously been targeted towards the
academic community. This project aims to offer an attractive basic service where
currently there is an empty market. Also, various bonus activities (without any added
cost to the users) will be proposed.
2 Understanding the Consumer and the Market
It is necessary to take into account the evolution of the “always-on empowered Web
consumer” which is now being examined by all companies and marketing strategies.
This kind of user has driven the Internet industry in the last few years, especially in
the new mobile Internet market and on the almost deserted landscape of mobile
Web 2.0 services. In this sense, through Campus Móvil, a consumption and production space is proposed between institutions and students, at both the interaction and
pedagogical levels. The exigencies of an academic environment must be taken into
account along with the increasing necessities of connectivity, ubiquity and being able
to creatively adapt to these new converging technologies.
1 This paper was part of the research project “Mobile Web 2.0: A Theoretical-Technical
Framework and Developing Trends” carried out by Hugo Pardo Kuklinski (University of
Vic), Joel Brandt (Stanford University) and Juan Pablo Puerta (Craigslist.org Inc.).
Mobile Internet is a market with huge difficulties. These include high costs for
connectivity, the slow speeds encountered when web surfing due to broadband limitations, short battery life in mobile devices, the lack of mobile Internet consumers
beyond the iPhone, along with other factors. Nevertheless, expert predictions [1], [5]-[8] indicate that mobile Internet will be one of the technology/consumer markets with
strong growth during the next few years, especially as mobiles begin to support collaborative applications.
With the specific characteristics associated with mobile device usage, some things
must be taken into account when promoting a new product such as Campus Móvil: users
on the move want to be entertained with short, direct and friendly content items; the
offerings should be centered on value-added services related to specific instances (e.g.
time and space) based on ubiquity, capturing contexts from the current point in time
and using location-based services. These mobile uses can also be synchronized with
regular web applications.
The key values of the Campus Móvil project are:
• To keep the community informed (showing only the latest news in one’s
university);
• To enable reciprocity (by providing a useful platform to users within the mobile
application, users will respond and consume / create content in return);
• To provide social validation (we advocate the creation of a powerful university-oriented community without external users “polluting” this interaction model);
• To promote a desire to like and use the platform (if a useful service is offered, then
users will cooperate and aid with the service’s growth leading to new services and
content items); and
• To leverage institutional authorities (universities will be integrated into the project in
an institutional way, conferring higher prestige to the product).
3 Characteristics and Value Proposition
Campus Móvil2 will be designed for the Spanish university market, with a second
expansion phase to other Spanish-speaking countries, planned for the third year of the
product (2010).
The opportunities in focusing on Spanish university users are significant. Spain has
a high mobile phone density: for its 48 million users, there is a penetration level of
107.46 mobile phones per 100 people, and young people are the biggest market. The
Spanish university system also stimulates and finances the use of innovative tools in
the academic community.
Even though wireless networks have not been developed at a large scale in the
Spanish market, universities have their own networks. The government offers free
WiFi coverage on all public university campuses. Some issues that can be addressed
by this strategy include: the over-extended use of public transportation; the lack of
public access computers on campus; a low density of laptops per student; and a
limited number of Web 2.0 applications designed for the Spanish-speaking market
(Mobile Web 2.0 applications do not exist in this market).
2 www.campusmovil.net and www.campusmovil.es
Campus Móvil will cover three unmet needs: 1) capturing the point of inspiration in the academic environment; 2) generating snippets which can then be
retrieved and reused in other computing environments; 3) taking advantage of the
dead time without computing availability and network access (public transportation,
hours between lecture classes, libraries, public spaces outside campus) for keeping
connected and interacting with the university community, via services providing access to today’s news and events or knowledge-management level functionality. Campus Móvil will allow users to interact with their university on mobile devices more
quickly and in an improved manner.
4 The Main Concepts around Campus Móvil
Interaction on mobile devices happens in a different context, in which the physical environment plays a role in the interface and much depends on what the user currently has as
their primary activity. There must be a balance between functionality and complexity,
as discussed in [4]. These are the main concepts that Campus Móvil provides.
Partnership with Spanish public universities. By invitation, we will offer authorities, professors and administrative staff a free implementation of the administrative
and academic services available via Campus Móvil, based on the students’ needs
explained earlier.
The contents and services proposed for this concept will provide an integration of
online campus services adapted to the mobile environment: current campus news in
brief (i.e. 15-20 words); absences of professors; examination information; on-campus
and off-campus events agenda; location-based services; marks; brief responses to
students’ requests, such as FAQs; multimedia services about academic activities;
freshmen services; general alerts; and security alerts.
Exclusivity, transparent identity and real profiles. Network-free access will be
enabled by means of a university email account. We will organize the Campus Móvil
community into groups: universities and knowledge areas and faculties. If the user is
not a member of any of these groups, access will be restricted to only a few levels of
general information (not including personal profiles of students or academic data),
unless the user has been authorized by another person to be included in his or her user
network. Tags for related sub-communities will also be shown.
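The access rules described above can be sketched as follows. This is only an illustrative sketch, not the project's implementation: the domain list, the `User` fields and the function names are assumptions introduced here, and the paper does not specify how membership or authorization is stored.

```python
from dataclasses import dataclass, field

# Hypothetical institutional email domains; the paper does not list any.
UNIVERSITY_DOMAINS = {"uvic.example", "campus.example"}


@dataclass
class User:
    email: str
    # Groups the user belongs to: universities, knowledge areas, faculties.
    groups: set = field(default_factory=set)
    # Emails of members who have added this user to their own network.
    authorized_by: set = field(default_factory=set)


def has_institutional_email(user: User) -> bool:
    """True if the address ends in one of the (assumed) university domains."""
    domain = user.email.rsplit("@", 1)[-1].lower()
    return domain in UNIVERSITY_DOMAINS


def access_level(viewer: User, owner_email: str) -> str:
    """Return 'full' or 'general' for a viewer opening another member's page."""
    # Members of at least one group, identified by a university account.
    if has_institutional_email(viewer) and viewer.groups:
        return "full"
    # Non-members who were explicitly added to the page owner's network.
    if owner_email in viewer.authorized_by:
        return "full"
    # Everyone else sees only general information.
    return "general"
```

Under these assumptions, a visitor without a university account sees only general information unless the owner of the page has added that visitor to his or her network.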
The contents and services proposed for this concept will allow members of the
Campus Móvil community to consume, share and upload four kinds of data: text files,
pictures, audio and video (all with headlines, tags and a brief description) from their
private page or from public spaces.
Voice equal to value. We will promote the production and consumption of short-duration podcasts and videocasts, especially in academic interactions from professor-to-student and student-to-student. 3G technology is not suitable for transmitting high-quality
multimedia content, but it can be used to promote a podcasting service like
iTunesU. We will help the universities to create a similar tool and to facilitate content
production, by means of a common platform and an easy-to-use development pattern.
Producing short texts when mobile. This concept will promote the production and
reading of short texts during a mobility state, for example, taking notes, creating diary
entries, and microblogging (à la Twitter or Jaiku). Each Campus Móvil member has a
personal page with an interface for entering 20-word answers to questions like “What
do I need for tomorrow’s lecture?” or “What do I plan to do today after school?”
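The 20-word limit on these short answers could be enforced with a check like the one below. It is a minimal sketch: the function name is hypothetical, and counting whitespace-separated tokens as words is an assumption, since the paper does not say how words are counted.

```python
MAX_WORDS = 20  # limit taken from the text; the counting rule is an assumption


def validate_short_answer(text: str, max_words: int = MAX_WORDS) -> str:
    """Normalize whitespace and reject empty or over-long answers."""
    words = text.split()  # whitespace-separated tokens count as words
    if not words:
        raise ValueError("answer is empty")
    if len(words) > max_words:
        raise ValueError(f"answer has {len(words)} words; limit is {max_words}")
    return " ".join(words)
```

Normalizing whitespace before storing the answer keeps entries typed on a small keypad consistent when they are later shown on the desktop site.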
Production and retrieval of snippets. Mobile devices offer a platform to produce
snippets capturing the point of inspiration, and Campus Móvil offers the possibility
for retrieving and reusing them (from a regular website with password access) [3].
The contents and services proposed for this concept will include ideas from lecture
classes; data produced in public spaces where there is neither access to computers nor
any Internet access; help and memory queries for research meetings; and various kinds
of snippets for later retrieval in desktop computing applications.
5 Interface Design
To adapt Web 2.0 applications for mobile interfaces that are only 240 pixels wide and
have only limited graphical capabilities, the best approach is to have complementarity
with the desktop website. This is similar to other examples of complementarity between desktop and mobile interfaces in Mobile Web 2.0 applications (Mosh, Facebook, Twitter, Dodgeball, Netvibes, MySay, Mindjot, etc.).
Fig. 1. Providing complementarity between the mobile and desktop websites
Campus Móvil will be primarily designed for interactions on mobile devices, although there will also be regular web applications to support and complement the
mobile Web with the desktop Web (Figure 1). More complex procedures, such as
subscriptions, long-time consumption, retrieval of snippets generated from mobile
applications or channels, and group-to-group communication, will be reserved for use
via the desktop Web.
The desktop interface (levels shown in Figure 2) will cover three operations: marketing content; more complex interactions; and extending the data available via
the mobile version. The personal page (in level 2) will be the main interface in the
community.
Fig. 2. Regular desktop Web architecture
The mobile interface will cover four operations (levels shown in Figure 3): producing and reading short texts of fewer than 20 words; permitting one to carry out easier
interactions; producing snippets while on campus; and listening to audio files.
Fig. 3. Mobile architecture
Fig. 4. Sample desktop interface views
Fig. 5. Corresponding mobile interface views
The first prototype of our interface designed for both the regular, i.e. desktop Web
(Figure 4) and for the mobile Web (Figure 5) is shown here as a work in progress.
6 Conclusions and Future Work
In this short paper, we have described a project which intersects the mobile Internet,
social software and educational environments. Campus Móvil is a prototype for an
online application that will run on mobile devices and has been created for a Spanish
university community. It provides exclusive and transparent access via an institutional
email account. The system was proposed and developed to address needs not currently being met in the university community due to a lack of ubiquitous services.
Campus Móvil facilitates network access for numerous specialized activities that
complement those normally carried out on campus and in lecture rooms using personal computers.
The main activity in the next phase is to prototype the interactions, and to evaluate
the functionality in focus groups so as to find out where breakdowns will happen. We will
research the interactions carried out via both the desktop and mobile interfaces, looking at how users interact with the system, and exploring whether to follow existing
trends or to innovate new ones. Following definitions from the original partners, the
action lines described here and further research on the prototype, we will finish all the
software and content segments of the desktop and mobile applications, keeping the
service in beta version for four months and accessible only by invitation. This will
allow us to solve any technical and usability problems.
We will then identify future development trends in Mobile Web 2.0 applications,
which will enable us to create synergies with other mobile Internet applications with a
services proposal related to the Campus Móvil project (especially those geared towards academic uses and higher education institutional management). We will then
launch the initial commercial phase of Campus Móvil. The market niche will be
Spanish public universities. Before the final launch, it will be necessary to close an
agreement with a financial support agency or venture capitalist.
References
1. Jaokar, A., Fish, T.: The Innovator’s Guide to Developing and Marketing Next Generation
Wireless/Mobile Applications, futuretext (2006)
2. O’Reilly, T.: What Is Web 2.0: Design Patterns and Business Models for the Next Generation of Software (2005),
http://tim.oreilly.com/news/2005/09/30/what-is-web-20.html
3. Brandt, J., Weiss, N., Klemmer, S.R.: txt 4 l8r: Lowering the Burden for Diary Studies
Under Mobile Conditions. In: CHI 2007 Extended Abstracts on Human Factors in Computing Systems, San Jose, CA, USA (2007)
4. Brandt, J., Weiss, N., Klemmer, S.R.: Designing for Limited Attention, Technical Report,
CSTR 2007-13 (2007),
http://hci.stanford.edu/cstr/reports/2007-13.pdf
5. Castells, M., Fernandez-Ardevol, M., Linchuan Qiu, J., Sey, A.: Mobile Communication and
Society: A Global Perspective. MIT Press, Cambridge (2007)
6. Levinson, P.: Cellphone: The Story of the World’s Most Mobile Medium and How It Has
Transformed Everything. Palgrave Macmillan, Basingstoke (2004)
7. Steinbock, D.: The Mobile Revolution: The Making of Worldwide Mobile Markets. Kogan
Page (2005)
8. Steinbock, D., Noam, E.M.: Competition for the Mobile Internet. Springer, Heidelberg
(2003)
The Impact of Politics 2.0 in the Spanish Social Media:
Tracking the Conversations around the Audiovisual
Political Wars
José M. Noguera and Beatriz Correyero
Journalism Department, Faculty of Communication, UCAM,
30107 Murcia, Spain
{Jose M.Noguera, Beatriz.Correyero, jmnoguera}@pdi.ucam.edu,
[email protected]
Abstract. After the consolidation of weblogs as interactive narratives and
producers, audiovisual formats are gaining ground on the Web. Videos are
spreading all over the Internet and establishing themselves as a new medium for
political propaganda inside social media with tools as powerful as YouTube.
This investigation proceeds in two stages: on the one hand, we examine
how these audiovisual formats enjoyed an enormous amount of attention in
blogs during the Spanish pre-electoral campaign for the elections of March
2008. On the other hand, this article investigates the social impact of this
phenomenon using data from a content analysis of the blog discussion related to
these videos, centered on the most popular Spanish political blogs. We also
study when audiovisual political messages (made by politicians or by users)
are “born” and “die” on the Web, and what kinds of rules they follow.
Keywords: Political communication, Spanish blogosphere, political blogs,
Spanish General Elections 2008, YouTube.
1 Introduction
Since Joe Trippi started a "political campaign 2.0" for Howard Dean in the USA, there have
been many examples of how politicians seek to profit from social networks. Virtual worlds like Second Life or networks like Facebook are just some of the platforms
that politicians want to explore. And in a World Live Web that is increasingly "visual",
the audiovisual political wars are an evident reality.
Often, the aim of these messages is to obtain more visibility in the mass media, but
at the same time the result in the social media might be unexpected and might even
turn against the very politicians who designed the message. Political parties are beginning to
measure the potential of the social networks, but these "audiovisual wars" on the Web
are still a part of a phase of experimentation.
What elements encourage the visibility of the political messages on the Web? Is it
a collective social credibility? Are the social media able to identify the aims of some
political messages? Are users "trained" to make good spaces for political debate? Do
the institutional political messages reach these political spaces? The Spanish
blogosphere can show the answers to these questions if its behavior is observed during a political campaign.
The methodology used in this paper to describe that role begins with a selection of some
of the most important Spanish political blogs. According to several ranking tools,
web sites like Escolar [1], Periodistas21 [2], Eurogaceta [3], Internet Política [4] and
others were collected. The sample gathered different kinds of blogs, written by
journalists, teachers, politicians and ordinary Web users. With this sample of blogs,
the most popular tags in the political conversations around the audiovisual messages
can be selected.
After that, a sample of videos made by the most important Spanish parties (PSOE
and PP) was chosen and their contents were tagged. From that point it is possible to
follow how people (social media) use these messages. In addition, the conversations
are studied with tools designed for tracking buzz, like Technorati, Google Trends,
Meneame (the Spanish version of Digg) or YouTube. The aim is not to assess the
efficiency of the audiovisual political campaigns on the Web. The real point is to
follow the political messages on the blogs; this kind of data could be
very interesting for people who design political campaigns for social media on the Internet.
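The tracking step described above can be illustrated with a small sketch. It is not the procedure actually used in this study, which relied on tools such as Technorati and Google Trends: the post structure, the function names and the substring-matching rule are all assumptions made here for illustration.

```python
from collections import Counter


def count_tag_mentions(posts, tags):
    """Count, for each tag, how many posts mention it (case-insensitive)."""
    counts = Counter()
    for post in posts:
        text = post["text"].lower()
        for tag in tags:
            if tag.lower() in text:
                counts[tag] += 1
    return counts


def mention_lifespan(posts, tag):
    """Return (first, last) ISO dates on which a tag is mentioned, or None."""
    dates = sorted(
        post["date"] for post in posts if tag.lower() in post["text"].lower()
    )
    return (dates[0], dates[-1]) if dates else None
```

With posts represented as dictionaries holding an ISO date and the post text, the first function gives a rough measure of how widely a tagged video is discussed, and the second gives the dates between which the conversation is "born" and "dies" in the sample.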
1.1 The Context: New Trends of Modern Political Campaigns
If 2005 was the year of blogging, 2007 was the year of "videocracy". Nowadays, viral
propaganda begins to appear as one of the main forces and videos are the best tool for
it. Discussion and dialogue are not trapped by the image, but stimulated by it. With
tools like YouTube, Google Video or Vimeo, image is again a first-order political
weapon, especially combined with humor.
Nobody doubts that blogs and videos play an important role in presidential campaigns. Politicians all over the world use blogs and video-sharing sites like YouTube
to run an intensive, scheduled campaign in parallel to the campaign in traditional
media. There are many examples of the growth of blogs and videos on the political
scene during 2007 in the United States, France, the United Kingdom and Spain.
For this reason, the main purpose of this article is to determine whether political blogs in
Spain have been influential in increasing the blogosphere's activity and encouraging
conversation about these audiovisual formats. To achieve this goal, the following
tasks are required first:
• Explaining the candidates’ use of technology and their online behaviour.
• Analysing the impact on Technorati, in terms of reactions and authority, of the
videos produced by the political parties on YouTube.
• Tracking how these political videos travel around the blogosphere.
• Explaining these paths through the Spanish blogosphere and trying to find “tipping
points”.
But before that, it is useful to present the Spanish presidential campaign for the elections of 9 March 2008 in order to show, first of all, the attitudes of Spanish politicians
towards the new media and the use that they make of modern technologies, blogs,
videos and social networks; secondly, the responses of online citizens
engaged with politics, especially with the main political parties; and finally, the
impact of the videos uploaded by the parties to YouTube, in order to demonstrate how
video is emerging as a vehicle for promoting the political process in the Spanish
blogosphere and how political videos have a special way of travelling around
the blogs.
Since the beginning of 2007, Spanish parties have been using video on their own web
sites or on YouTube, not only to shape opinion among the sector of the population
that surfs the Internet, but also to gain and maintain political and media attention
and to counter the political apathy of young people, which is expressed especially
in low electoral turnout. Politicians seem to have found a way to
fight against this apathy. The question is whether politicians have real online communication skills, and this is related to entertainment. According to William McGaughey
[5], "political campaigns are today a branch of the entertainment culture. Experienced
entertainers make successful political leaders".
We wonder, if today's political campaigns should entertain, what are the main
components to be used to achieve this goal? In this paper, these elements are the following ones: Videocracy, videopolitic, buzz marketing, permanent campaign, crowdsourcing, metacoverage and targeting.
Videocracy. Nowadays, the power of visual images over society has been clearly
demonstrated. The great impact of television, cinema, the Internet and advertising on public
opinion and political affairs is a fact. In Italy, for instance, one man has dominated the
world of images for more than three decades. In the recent documentary "Videocracy", the director, Erik Gandini, explores the power
of television in this country, and specifically Berlusconi's personal use of it during the
last 30 years.
Videopolitic. Drawn by the power of the image, political leaders around the world are
turning to the web to deliver video messages to voters in an effort to win more
sympathizers. The video format opens the door to originality and spontaneity and,
as everybody knows, "visual images can be more powerful than words". In this
sense, George Lakoff [6], an expert in communication and language at the University of California, Berkeley,
explains that 98% of thoughts are unconscious and based on "values, images, metaphors and narratives, which are what really can convince voters in
one either way". Do you tube? Hillary does, Obama does and, of course, the Spanish
leaders Zapatero and Rajoy do. During the pre-campaign, the two main parties in
Spain, the Populars (PP) and the Socialists (PSOE), started a war of videos in order to create
buzz and reach public conversations. Sometimes popular Spanish TV contests have
been the source of inspiration for political videos that aim to ridicule
the adversary. For instance, we can see the video made by the Socialist Youth to promote
the subject of Education for Citizenship [7] (a new course that the current Government imposed in schools). Paradoxically, the slogan of this controversial video was:
"Por la igualdad, por la convivencia, Educación para la Ciudadanía SÍ" ("For
equality, for coexistence: Education for Citizenship YES").
Audiovisual formats can be used to promote the candidates' own merits, as in Zapatero’s video distributed on the Internet called "Con Z de Zapatero" ("With Z of Zapatero")
[8] and another video presented by Mariano Rajoy, who praised the Spanish National
Holiday (12th October, "Día de la Hispanidad"). We describe the objectives of these
videos later on.
Buzz Marketing. But online political video is not only for candidates. It is also for
citizens who want to reach out to other people and communicate their own points of
view about politicians and political issues. Nowadays parties have political war rooms
with teams of communications experts who monitor and listen to the media and the
public, respond to inquiries, and synthesize opinions to determine the best course of
action for creating buzz (buzz marketing), getting people to talk about the candidates
and promoting political viewpoints. Clues to the dominant issues among voters are
provided by blogs, podcasts, wikis, vlogs and social networks like MySpace. This
allows the political parties to target them, too. Art Murray [9] believes that the most
important aspect of the modern campaign is "targeting".
Crowdsourcing. But besides that, politicians try to involve the public in political
activities in order to enrich political discussions. They are exploring ways to use Web 2.0
technology to encourage supporters to participate and, in this way,
to collect data from them. They are creating their own social networks. A party's
own social network is a powerful and easy-to-use tool for building community, creating links between
supporters and segmenting them. Such networks are also much more customizable for campaign activities
than more general networks like Facebook. Their basic objective is to make
people feel engaged with the campaign, but at the same time anyone who wants to participate in a more active way can find helpful material or information there
(and the party, in turn, can use what its supporters produce). This basic form of crowdsourcing means taking what supporters do for the party, hearing what they propose and
keeping them in contact with each other: creating community.
The PP candidate for president of the Government in 2008 saw the benefits of
crowdsourcing. On his website, Mariano Rajoy asked for volunteers to design his election videos.
During the Spanish political campaign, crowdsourcing was also used by traditional
media and online newspapers. Spanish public television (RTVE) and
YouTube created an official home on YouTube for presidential candidate videos and
provided a platform that let people engage in dialogue with the candidates. This format had
already been used in the United States as "You Choose" in March 2007, when
YouTube and CNN created a platform for co-sponsored presidential debates.
Permanent Campaign. Another important component of the modern political campaign
is the "permanent campaign". Patrick Caddell, an advisor to then President-Elect Jimmy
Carter in 1976, gave a name, the Permanent Campaign, to a political mind-set that
had been developing since the beginning of the television age. According to Time
Magazine [10], Caddell wrote, "Essentially, governing with public approval requires a
continuing political campaign".
Nowadays, candidates for the presidency are in a perpetual campaign mode. The
frontier between campaign and government has almost disappeared. For instance, in
Spain, although the official electoral campaign period only lasts for the 15 days before the election (with the exception of the day just before the election), many parties,
especially the PP and PSOE, start their "pre-campaigns" months in advance, often
before having closed their electoral lists.
Finally, the last significant element in modern political campaigns is metacoverage, that is, the interest of the media and political parties in reporting on campaign
strategies and their design.
2 2008 Electoral Campaign in Spain
In the last five years, political parties in Spain have been working to encourage political
dialogue in several ways. "In the 2004 Spanish General Elections we witnessed the political
birth of the smart crowd, the emergence of Policy 3.0 and the triumph of mobile
phones as a mobilizing tool. The 2005 regional elections were the starting point of political blogs in Spain" [11]. 2008 was the year of video. Mass media, politicians and
political parties have discovered the viral power of video on the Internet, but videocracy prevails over a more open and participatory videopolitics¹. These videos are
part of the so-called campaign 2.0, based on the creation of electoral communities in which
thousands of volunteers spread the candidate's message, create videos that ridicule the rival, and build their own online networks of conversation and
support. The problem is that politicians pay attention to people during election
cycles, but once they are elected, the two sides ignore each other until the next election cycle
comes around.
In designing their campaigns, Spanish political parties took the following as starting
points: the United States presidential elections (2008), the last presidential
elections in Europe (UK and France), the use of new technologies in electoral
strategies, and the rise of Web 2.0 tools, which offer the chance to engage interested
citizens.
2.1 PSOE's Electoral Campaign Through Videos
José Luis Rodríguez Zapatero, the current Spanish Prime Minister, is the leader of the Spanish Socialist Workers' Party (PSOE). The first phase of the Socialist Party's campaign
in the 2008 General Elections ran under the slogan "Con Z de Zapatero" ("With Z
of Zapatero"), a joke based on the Prime Minister and socialist candidate's habit of
pronouncing words ending with D as if they ended with Z. The campaign
was linked to terms like equality (Igualdad-Igualdaz) or solidarity (Solidaridad-Solidaridaz), emphasizing the policies carried out by the current government.
We have studied how the conversation about this video travelled through the Spanish
blogosphere.
The second phase ran under two slogans, "La Mirada Positiva" [12] ("The
positive outlook") and "Motivos para creer" [13] ("Reasons to believe"), emphasizing
the future government platform.
¹ Videoyvoto.tv was the first online journalism initiative exclusively in video format
launched in Spain. The Zeta Group launched it to report on the regional and municipal elections of
May 2007. It combined the editorial opinion of Zeta Group professionals, video reports
supplied by the EFE agency and contributions from citizen journalism. It also
collected testimonies about the campaign that members of the public recorded and sent in.
The Socialists built the 2008 election campaign around Spain's social progress under Zapatero (the laws he introduced to protect battered women, to promote sexual
equality and to allow gays to marry). The videos created by this party focused on
accusing Mr Rajoy of creating problems where there are none. They were meant to be decisive in mobilising PSOE supporters who were thinking of abstaining. Tacticians estimate that the party needs a turnout of close to 75 per cent to win enough seats to claim victory
and form a government.
One of Zapatero's most characteristic features is his eyebrows (they are the symbol
for his surname in sign language). Making this gesture is a message of support in one
of the videos of the main group that supports Zapatero [14].
The Socialists even used instant messaging (IM) in their electoral campaign. They created the nick iZ ([email protected]), Zapatero's identity on Messenger chat, the first of
a political nature in the world. The automated program answered questions about
the lists of candidates or the election manifesto.
2.2 PP's Electoral Campaign Through Videos
Mariano Rajoy is the leader of the Popular Party (PP). The PP believes it can weaken the
Socialist party by attacking its management of the economy, which is declining, and
of immigration, which is increasing. For the pre-campaign, the PP used the slogan "Con Rajoy es Posible" ("With Rajoy it is Possible"). The PP usually emphasized its
campaign proposals, such as "Llegar a fin de mes, Con Rajoy es Posible" ("Making it
to the end of the month, with Rajoy it's possible").
The PP videos try to portray Spain as a country on the verge of recession and overrun by immigrants. These videos criticize Zapatero's Government for being
incompetent in solving economic problems and blame it for failing to create
employment.
One of the most visited videos launched on YouTube by the Popular Party was "Zapatero's Big Lie" [15], which accuses the Prime Minister of not telling the truth about
the suspension of negotiations with the Basque separatist group ETA. We have studied
how the conversation about this video travelled through the Spanish blogosphere.
In other videos Rajoy shows the letter C for Canon, the digital tax added when
buying a CD, a DVD or any device for storing information. He promised to
eliminate this tax if he won the elections. Spain pays the highest prices
in Europe.
To conclude this section, we can state that both candidates know that videos won't
convince voters to change parties, but they could convince supporters to go out
and vote instead of staying at home.
3 Methodology and Case Study
Research on ICT requires methodological innovations in order to work with social media
(blogs, wikis, video and all kinds of Web 2.0 tools) and to obtain results that explain the
new forms of social shaping. The role of traditional political communication
has changed due to these social and technical innovations. Viral actions have emerged
as a new goal for political purposes; however, virality does not always follow the same
rules to be successful.
Diffusion based on crowdsourcing, buzz marketing or the possibility of spreading the same messages across several platforms is part of these rules. The new media ecosystem has problems applying old methods such as content analysis (tag clouds could be a
special, Web-specific version of it), and there are new techniques to improve
within this field, related for example to tracking conversations on the Web in real
time.
Messages were tracked at the same time as the conversations about them. These messages
become viral actions thanks to users, and for this reason we need to improve the
field of communication research that focuses on the mechanisms explaining how
information flows through social media on the Web. In this research, the case study is
the political information conveyed by viral videos in Spain. At this point, the methodological innovation tries to identify where the influence is. In the study, influence was
identified with a significant change in the volume of conversation in the Blogosphere regarding the videos. The first step is gathering a sample of the most relevant viral videos; the second step is obtaining their average numbers of reactions and the Technorati
authority of the blogs involved. Technorati is, together with Google BlogSearch, the most popular blog search engine.
The objective is to show where the influence is. Do we have real tipping points in
the Blogosphere? To answer this question, on the one hand we gathered a sample
of fifteen relevant Spanish blogs, maintained by politicians, journalists and
other kinds of users. This list was selected according to several rankings (Alianzo [16],
Wikio [17], 86400 [18] and Compareblogs [19]), which use criteria like RSS subscribers or incoming links. Part of the final sample consists of five journalistic/political blogs: Escolar, Periodistas 21, NetoRatón 2.0 [20], Guerra Eterna [21]
and La Huella Digital [22]. On the other hand, two political videos were selected to
track reactions: "Zapatero's Big Lie" (PP) and "With Z of Zapatero" (PSOE).
With the advent of Web 2.0, communication research has developed topics such as
collaborative publishing, social filters or viral messages, but not yet others like influence, especially in terms of how some nodes of social media could be considered
real tipping points on the Web.
3.1 Influence of Journalistic and Political Blogs
The influence of several blogs, in this case in the Spanish blogosphere, cannot
be tracked without a clear issue (the conversation around the selected political videos)
and a closed sampling period. Using Technorati, the number of reactions after
each video was recorded according to the hypertext (links) generated by each one.
After that, the Technorati authority of the blogs posting on each day was measured to see whether
the presence of the most important blogs increased the volume of conversation around
the videos on the days that followed.
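The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the daily figures are invented, the authority threshold is arbitrary, and the function name growth_after_peaks is hypothetical. It is not the procedure or the data used in the study.

```python
from statistics import mean

# Hypothetical daily series, invented for illustration:
# (posts linking to the video, highest Technorati authority among that day's blogs)
daily = [
    (5, 40), (6, 35), (4, 50), (3, 620), (3, 30), (2, 45), (2, 25),
]


def growth_after_peaks(series, authority_threshold):
    """For each day on which a high-authority blog posted, return the change
    in post volume on the following day (next-day posts minus same-day posts)."""
    changes = []
    for today, tomorrow in zip(series, series[1:]):
        posts, authority = today
        if authority >= authority_threshold:
            changes.append(tomorrow[0] - posts)
    return changes


changes = growth_after_peaks(daily, authority_threshold=500)
# A non-positive mean change suggests the big node did not trigger more conversation.
echo_not_cause = bool(changes) and mean(changes) <= 0
```

With real data, the same check would be run for each video over its tracking period.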
With Web 2.0, we are clearly in a context in which we need to track users' demand
for information on the Web. For this reason, tools like Google Trends show us what
people are looking for. In a similar way, Technorati gives us data about how many
people are talking in the Blogosphere, according to the number of links related to the
political videos. On the Web, when a concept is named, that concept is linked too.
The conversation was recorded over eighteen days (from 18th January to 4th
February). During this time, 47 posts related to the political video "Zapatero's Big Lie"
were published in the Spanish Blogosphere. These posts were considered the first
video reactions. In this case, the largest presence of relevant blogs occurred around the
middle of the period. After that presence, there was no growth in blog publications.
In other words, the moments with the most reactions do not come after the posts by the biggest authorities.
The conversation around the second video of our sample was recorded for the same
number of days, from 18th October to 3rd November, and 112 posts containing links
to the video "With Z of Zapatero" were published. During the weeks of
conversation around the second video, Technorati authority showed sharp falls and peaks with no regularity. Because of this, the video's travel through
the Blogosphere is related more to the power of networks than to the power of rankings (in
this case, a ranking of Technorati authority, built from outgoing links to political
videos).
3.2 Two Ways for Drawing Internet Information Flows
The traditional opinion about how messages circulate on the Web has in most cases been tied to
a view of the Internet based on rankings: a Web based on the power of
big nodes (sites) whose publications and opinions have a relevant influence.
In the case of social media, we are talking about turning points created by users (blogs,
podcasts, social filters…). In this case study, the main question is whether there
were any visible big nodes in the political conversation of the Spanish blogosphere.
Specifically, we tracked the conversations about political videos made by the PP and
the PSOE during the 2008 campaign.
However, according to the data obtained in this study, the political conversations fit a different view of the Web, based on networks
and not just on big nodes. As we have seen over almost three weeks with each
video, many reactions on blogs arrive without a large presence of authority before them.
And the appearance of big nodes (relevant blogs maintained by influencers) does not
bring a bigger volume of conversation: the information flows are the same as, or
even smaller than, in the first steps of the period.
In this sense, we could describe the Web under a broad paradigm of networks and not just
of relevant sites (tipping points). From this point of view, the big nodes are an
echo of the conversations, not the cause of their visibility. Because of this, the
importance of big nodes on the Internet and in social media might be overvalued. The power
lies in the interconnected clusters, and political strategists should consider whether the
myth of tipping points in social media is real or not.
4 Conclusions
In the 2008 Spanish electoral campaign, the most popular political videos generated
conversations in the Blogosphere and gave us the opportunity to measure the role of
influencers. The main objective was to track the travel of these videos through one of the
most relevant social media: blogs. Tracking the content of political videos was
useful for identifying how parties used this tool to discredit their adversaries.
There is a clear problem for political strategists: parties still design their campaigns for the mass media, even those messages focused on social media on the Web.
According to the content gathered on the political campaigns, we can conclude that parties only want to generate enough controversy to turn a video into news and jump to
TV news programmes. In this way, the main message during part of the political
campaign is meta-coverage (content about the content): news about the reactions to the video and not about the storytelling and content inside it.
When we track Technorati reactions after each video, we measure the influence of
blogs every day. If the conversation does not grow after the appearance of big
nodes, we can consider that these tipping points are not a cause. They could be just
an echo of the conversation.
From this point of view, the most effective way to pitch an idea is creating mass
marketing through social media. But if video is understood as political advertising, it
will not be useful. Campaign 2.0 is not about using technology to put a funny or
controversial video on platforms like YouTube. And according to the gathered data, the key is
not presence in a particular node; it is a question of a radical change in the
tone of the political conversation during all its phases.
The presence of important nodes (in terms of Technorati authority) has no relevant effect
on the volume of political conversations. Because of this, and as we have underlined, the
key to successful campaigns could lie in the capacity to manage the interaction with people, listening to them and taking part in a real conversation, the clearest
paradigm of Web 2.0.
References
1. Escolar.net, http://www.escolar.net
2. Periodistas 21, http://periodistas21.blogspot.com
3. Eurogaceta, http://eurogaceta.com
4. Internet Política, http://internetpolitica.com
5. McGaughey, W.: Five Epochs of Civilization: World History As Emerging in Five Civilizations. Thistlerose Publication, Canada (2000)
6. Lakoff, G.: The Political Mind. Why You Can't Understand 21st-Century Politics With an
18th-Century Brain. Viking/Penguin (2008)
7. Educar para la ciudadanía,
http://www.youtube.com/watch?v=WjXja39DH1s&feature=fvw
8. Con Z de Zapatero, http://www.youtube.com/watch?v=7rNSY6v01zg
9. Murray, A.: The Most Important Component of The Modern Political Campaign Strategy,
http://www.completecampaigns.com/article.asp?articleid=53
10. Time Magazine,
http://www.time.com/time/columnist/klein/article/0,9565,1124237,00.html
11. Varela, J.: El año de los blogs políticos (2005),
http://periodistas21.blogspot.com/2005/03/el-ao-de-losblogs-polticos.html
12. La mirada positiva, http://www.20minutos.tv/video/JoWyCucNeg-lamirada-positiva-del-psoe/0/
13. Motivos para creer, http://www.youtube.com/watch?v=9qiM2os3pgk
14. Vídeo de promoción de la Plataforma de Apoyo a Zapatero (PAZ),
http://www.youtube.com/watch?v=RZxjCvq44ng&feature=channel
15. La gran mentira de Zapatero,
http://www.youtube.com/watch?v=MiNhMJHRkcQ&feature=player_embedded
16. Alianzo, http://www.alianzo.com
17. Wikio, http://www.wikio.es
18. 86400, http://86400.es
19. Compareblogs, http://www.compareblogs.com
20. NetoRatón, http://www.netoraton.es
21. Guerra Eterna, http://www.guerraeterna.com
22. La Huella Digital, http://lahuelladigital.blogspot.com
Extended Identity for Social Networks
Antonio Tapiador, Antonio Fumero, and Joaquín Salvachúa
Universidad Politécnica de Madrid, ETSI Telecomunicación,
Avenida Complutense 30, 28040 Madrid, Spain
{atapiador,amfumero,jsalvachua}@dit.upm.es
Abstract. Nowadays we are experiencing the consolidation of social networks
(SN). Although there are trends trying to integrate SN platforms, they remain
data silos isolated from each other. Information can't be exchanged between them.
In some cases, it would be desirable to connect this scattered information, in
order to build a distributed identity. This contribution proposes an architecture
for distributed social networking. Based on distributed user-centric identity, our
proposal extends it by attaching user information. It also bridges the gap
between distributed identity and distributed publishing capabilities.
Keywords: WWW, social networks, social software, digital identity,
architecture.
1 Introduction
Social networks (SN) is one of the key terms when we talk about the Web 2.0 realm.
We are currently experiencing the consolidation of SN platforms. The recent
launch of the Facebook platform has officially opened up the opportunity to deliver the next wave of value-added applications for next-generation SN platforms.
When we talk about SN in general terms, we are considering a wide range of web
services spanning content-oriented and contact-oriented social networks, which
can be understood as social networks in a wide and a narrow sense respectively. The former are platforms whose main goal is content publication, and social relations are a side effect of
the interactions. The latter specialize in contact creation and management, and so
they are social networks in a narrow sense.
The trend seems to be for contact-oriented platforms to integrate content sharing
services, Facebook again being the best example. Meanwhile, we still have in place a
lot of independent, isolated content sharing services (e.g. Flickr, blip.tv, YouTube,
Slideshare) and contact-management-centered social networks (e.g. Xing, LinkedIn)
that we will be using for a while.
The social networking services scenario is going through a consolidation stage in terms of
platforms (e.g. Xing A.G. has acquired the two major professional networks in Spain)
and, at the same time, with the announcement of the OpenSocial initiative from
Google, all the actors in this scenario are positioning themselves to start the
race for those value-added services.
At this time, the Web is full of social network services.
Users carry out their personal or professional activities on different platforms. A user
may use a blogging platform to express her thoughts, a photo platform to publish her
images, and a social network platform to be in contact with her friends. But all these
platforms ignore the activity the same user is carrying out on the other web sites. This
is sometimes desirable for privacy reasons, but in many other cases interoperability
adds value. Things would be easier if images were available in the blogging
platform to be integrated within posts, and the latest posts appeared in the user's social
network profile.
This contribution describes an architecture for distributed SN that supports this interoperability. It proposes a distributed model for contacts and content integration. It
is built around distributed user-centric identity frameworks, such as OpenID.
2 Distributed Identity
Such 'new age' services will depend on the basic functionalities delivered by our open platforms. One of the key functionalities of such platforms is
identity. We are still using the ancient user/password combination as the de facto standard
for identifying ourselves on them. As a lot of different services appear in
the Web 2.0 explosion, we have to manage not only a lot of different passwords for a
variety of services, but also a growing number of different profiles on a series of distributed platforms all around the world.
This situation represents a fragmentation of user activity across all the SNs. There
is no coherence between the different profiles a user creates in each SN. Furthermore,
there is no way to combine digital contents created on each platform.
The current scenario can be desirable for privacy reasons. Sometimes we don't want
our activities traced and linked across every place we log in. But other times,
especially when we want to build a coherent identity and reputation, such interoperability would make things easier. In these cases, from the user's point of view, it would be
desirable to have one single profile that could be validated against any service she
accesses. Decentralized, user-centric, soft identity schemes are a way
of implementing such a requirement. The most popular of these schemes is OpenID
[1]. It is being implemented within the main development communities of the Web
2.0, e.g. Wordpress or Blogger in the blogging platforms segment. Some security
problems [2] have been identified, so a different kind of decentralized, user-centric
identity framework should be used in cases where the security requirements are
stricter.
Fig. 1. Digital identity is distributed among SN platforms
3 Architecture for Extended Identity
The idea behind our proposal is the qualitative leap entailed by using dereferenceable IDs, like OpenID. Until now, using a new SN platform has meant providing a login
and a password. In most cases, an email address is also required. The address is
verified by the web site, which sends an email including a link that confirms the
validity of the address. If we analyze the information the SN platform has about us at
this point, it is limited to some credentials for authenticating to the web site, and an email
address for contacting the user.
On the other hand, we have authentication frameworks like OpenID. In these
frameworks, we provide the SN platform with a dereferenceable ID, that is, a URI
representing a resource that can be located and fetched. If the SN platform can
dereference the ID, then it will be able to obtain extended information from it. By
dereferencing an OpenID URI we obtain an HTML document, which can contain a lot of
information about the user.
We can use this ID for registering ourselves in a whole range of web services.
These web services could include blogging, photo, slide and video sharing services, and
narrow-sense social networking services. Those services obtain more information
about us by dereferencing the ID and discovering the information attached to it. We also
incorporate information from the web services back into the ID. In this way, we are able
to connect information across sites. We will, for instance, allow our blog to know where
our photos (or some of them) are, or our friends to know where our videos are.
3.1 Architecture Components
Our proposed architecture is a client-server solution composed of three elements: Client Agents, Identity Servers and Resources Servers.
Client Agent. A client agent (CA) is any application environment on a local machine
or device controlled by a user. Examples of CAs are mobile applications and browsers
running on a PC. A network connection is assumed. The environment where the CA is
running can provide it with API facilities such as authentication, resource fetching or
content publishing.
Identity Server. An identity server (IS) provides users with user-centric, decentralized IDs belonging to a given identity framework. Users authenticate with their IS and
then have a mechanism for identifying themselves on the rest of the SN platforms.
It is an "OpenID Provider" in the OpenID world, with extended capabilities. It also
supports mechanisms for allowing other parties to access private resources, using protocols like OAuth [3].
The IS acts as a user proxy. It stores the main, authoritative user profile. This profile is composed of links to user resources (such as presence, geolocation, personal data) and collections (e.g. contact lists, blogs, albums, podcasts). It may also
provide the information needed to edit these resources and to post new resources to collections.
Fig. 2. Example of the described architecture, including a client agent, an identity server and two
resources servers
Profile information is controlled by access control lists. Users grant or
restrict access to their resources and collections stored in the IS. The IS should
provide mechanisms to manage different profile data sets easily. Users typically want
to show different kinds of profile to different SNs. We want to provide the minimum
required information to some SNs, but detailed information to other trusted SNs. This
is also the case for other IDs apart from ours. Other users are expected to query the IS to
obtain more information about somebody when they discover her ID and want to
know more about that person. The IDs should therefore be dereferenceable URIs.
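As a minimal sketch of how an IS might serve different profile data sets to different requesters, the Python fragment below filters a profile through per-item access control lists. The class ProfileItem, the function visible_profile, the "*" wildcard convention and all URLs are assumptions made for illustration. The architecture itself does not prescribe a data model.

```python
from dataclasses import dataclass, field


@dataclass
class ProfileItem:
    """A link to a resource or collection held in the user's profile."""
    name: str                                   # e.g. "blog", "contacts"
    url: str                                    # where the resource or collection lives
    allowed: set = field(default_factory=set)   # requesters granted access


def visible_profile(items, requester):
    """Return the links a given requester may see.

    The special entry "*" in an item's ACL makes it public (an assumed convention).
    """
    return {
        item.name: item.url
        for item in items
        if "*" in item.allowed or requester in item.allowed
    }


# Hypothetical profile stored at the IS for one user.
TRUSTED = "http://trusted-sn.example.org/"
profile = [
    ProfileItem("blog", "http://blog.example.org/alice/", {"*"}),
    ProfileItem("photos", "http://photos.example.org/alice/", {TRUSTED}),
    ProfileItem("contacts", "http://id.example.org/alice/contacts", {TRUSTED}),
]

minimal = visible_profile(profile, "http://other-sn.example.org/")
detailed = visible_profile(profile, TRUSTED)
```

Here an unknown SN only sees the public blog link, while the trusted SN receives the full set of links.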
CAs obtain the main or favorite editable resources and collections. When the user
logs into her IS using the CA, she obtains not only authentication capabilities, but also
publishing information usable by the CA. This information enables the CA to write
blog entries or post photos or videos, even on other SN platforms. These are the resources servers.
Resources server. A resources server (RS) is any web service providing resource
management. Examples of RSs are content-oriented web services (e.g. blogs, podcasts,
social bookmarking), but also contact-oriented ones. We can also think of contacts as
resources. Any SN platform can be an RS if it allows users to sign up and post
resources.
An RS supports authentication using a distributed identity framework. It is a "Relying
Party" in the OpenID world: RSs rely on ISs for authentication. A user signs up to an
RS using the identity framework provided by her IS. At this step, she selects which
kind of profile she will show to this new RS. Then the RS obtains the user's profile
information by querying the IS.
RSs publish resource collections owned by users, as an IS may do. CAs obtain from
each RS a complete list of the information the user has generated in that specific
RS. Users can bookmark this information in their IS, controlling who can access
it. This way, they build their main, authoritative profile, which resides in their
IS. The user also controls in the IS which information (e.g. collections and resources)
to show to other RSs. In this way, RSs can discover and mash up resources and collections from their users. Access to RS resources is controlled by the RS itself. There
may be resources announced at the IS but not accessible at the RS and vice versa.
Synchronization between IS ACLs and RS ACLs should be specified.
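Since that synchronization is left unspecified, the following is only a sketch of one possible consistency check. It assumes that both ACLs can be expressed as a mapping from resource URL to the set of requesters allowed to access it. The function name acl_mismatches and the sample data are hypothetical.

```python
def acl_mismatches(is_acl, rs_acl):
    """Compare two ACLs mapping resource URL -> set of allowed requesters.

    Returns (announced_only, accessible_only): the (resource, requester) grants
    present at the IS but not at the RS, and the reverse.
    """
    is_grants = {(res, who) for res, allowed in is_acl.items() for who in allowed}
    rs_grants = {(res, who) for res, allowed in rs_acl.items() for who in allowed}
    return is_grants - rs_grants, rs_grants - is_grants


# Hypothetical ACLs for two resources of the same user.
ALBUM = "http://photos.example.org/alice/album"
BLOG = "http://blog.example.org/alice/"
is_acl = {ALBUM: {"bob", "carol"}, BLOG: {"bob"}}
rs_acl = {ALBUM: {"bob"}, BLOG: {"bob", "dave"}}

announced_only, accessible_only = acl_mismatches(is_acl, rs_acl)
```

In this example carol would see the album announced at the IS but could not fetch it from the RS, while dave could read the blog at the RS without it being announced to him.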
4 Information Flows
The main information attached to a user of this architecture is composed of her ID,
along with the resources and collections associated with it. The ID is entered by the
user when logging into the RS, which discovers the IS location, as explained before.
This procedure is described in the specification of the OpenID protocol [1].
Later, there may be exchanges of information between the IS and the RSs, which
can flow in two directions, seen from the RS point of view.
4.1 Pull
The RS obtains more information about the ID by asking the IS. This allows the SN platforms to know more about the user, finding out the latest entries in her blog or her
contacts network, for example. The following technologies currently support these information exchanges.
OpenID Attribute Exchange. OpenID Attribute Exchange [4] is an OpenID protocol
extension that supports the exchange of key-value pairs between the Relying Party
(the RS in our architecture) and the Identity Provider (the IS). This technique is limited by the format of the attributes.
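To make the key-value exchange concrete, the sketch below builds the openid.ax.* parameters that an RS could add to an authentication request in order to fetch attributes. The parameter names follow the Attribute Exchange 1.0 extension. The helper ax_fetch_params is hypothetical, and the two attribute type URIs are commonly used axschema.org identifiers chosen here as an assumption.

```python
from urllib.parse import urlencode

AX_NS = "http://openid.net/srv/ax/1.0"  # namespace URI of Attribute Exchange 1.0


def ax_fetch_params(required, optional=None):
    """Build the openid.ax.* parameters of a fetch request.

    `required` and `optional` map a short alias to an attribute type URI.
    """
    optional = optional or {}
    params = {"openid.ns.ax": AX_NS, "openid.ax.mode": "fetch_request"}
    for alias, type_uri in {**required, **optional}.items():
        params[f"openid.ax.type.{alias}"] = type_uri
    if required:
        params["openid.ax.required"] = ",".join(required)
    if optional:
        params["openid.ax.if_available"] = ",".join(optional)
    return params


params = ax_fetch_params(
    required={"email": "http://axschema.org/contact/email"},
    optional={"fullname": "http://axschema.org/namePerson"},
)
# These parameters would travel alongside the other openid.* request parameters.
query = urlencode(params)
```

Each attribute is a flat alias, type URI and string value, which illustrates the format limitation mentioned above.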
HTML. OpenID identifiers are typically HTTP URLs. The RS can dereference the
URL and get the HTML document. This document provides information about the ID.
Two different formats are available:
1. Microformats [5] are semi-structured information embedded in the HTML
markup. Currently, there are formats defined for personal cards (hCard),
events (hCalendar) and tags (rel-tag). Other object types are in the definition
process.
2. HEAD links: the HTML <head> section provides support for <link> tags.
These tags are already used for providing extended information about the
HTML document, e.g. blog subscriptions in RSS or Atom format.
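As an illustration of the second format, the sketch below extracts <link> tags from an HTML document using Python's standard html.parser module. The sample document, its URLs and the class name HeadLinkParser are invented. The rel value openid2.provider is the one used by OpenID 2.0 HTML-based discovery, and the use of rel="meta" for a FOAF file is a common convention assumed here.

```python
from html.parser import HTMLParser


class HeadLinkParser(HTMLParser):
    """Collect (rel, type, href) for every <link> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            a = dict(attrs)
            # A rel attribute may hold several space-separated values.
            for rel in (a.get("rel") or "").split():
                self.links.append((rel.lower(), a.get("type"), a.get("href")))


# Hypothetical document served at the user's ID URL.
html = """
<html><head>
  <link rel="openid2.provider" href="http://id.example.org/server">
  <link rel="alternate" type="application/atom+xml" href="http://blog.example.org/alice/feed.atom">
  <link rel="meta" type="application/rdf+xml" href="http://id.example.org/alice/foaf.rdf">
</head><body></body></html>
"""

parser = HeadLinkParser()
parser.feed(html)
provider = [href for rel, _, href in parser.links if rel == "openid2.provider"]
feeds = [
    href
    for rel, typ, href in parser.links
    if rel == "alternate" and typ == "application/atom+xml"
]
```

An RS could follow the feed links found this way to show the user's latest blog entries in her profile.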
Other data formats. The HTTP protocol supports a mechanism for requesting documents in a specific format. This is achieved by including the Accept header in the
request. This mechanism, along with the HEAD links described above, allows us to obtain the
representation of the ID in different formats. One example is Atom feeds [6], a format used for content syndication. Another example is RDF (Resource Description
Framework) and its schema languages (RDFS, OWL), the basis of the Semantic Web. FOAF
[7] is an RDF-based vocabulary used to describe agents (Users, Groups, Organizations)
and their attributes. The experimental property foaf:openid supports the association of
the user's profile information with her ID. SIOC (Semantically-Interlinked Online Communities) [8] provides a vocabulary for describing resources and resource collections.
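A minimal sketch of such a format-specific request, using Python's standard urllib, is shown below. The ID URL and the helper negotiated_request are hypothetical, and the request objects are only prepared, not sent.

```python
from urllib.request import Request


def negotiated_request(id_url, media_type):
    """Prepare a GET request asking for the ID in one specific format."""
    return Request(id_url, headers={"Accept": media_type})


id_url = "http://id.example.org/alice"  # hypothetical OpenID URL
atom_req = negotiated_request(id_url, "application/atom+xml")
foaf_req = negotiated_request(id_url, "application/rdf+xml")
# Passing one of these to urllib.request.urlopen would send it; a server that
# does not support the requested format may still answer with HTML.
```

The same URL can thus yield an Atom feed, an RDF description or the HTML page, depending on what the IS supports.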
4.2 Push
The RS publishes information about the user in her IS. This case is interesting, for
example, when the IS gathers the activity the user generates in the SNs she participates in.
OpenID Attribute Exchange. The OpenID extension works in both directions. It can also
be used by RSs to store key-value pairs in the IS.
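For the push direction, the sketch below builds the parameters of an Attribute Exchange store request. The mode and parameter names come from the Attribute Exchange 1.0 extension. The helper ax_store_params, the attribute type URI and the stored value are hypothetical placeholders.

```python
AX_NS = "http://openid.net/srv/ax/1.0"  # namespace URI of Attribute Exchange 1.0


def ax_store_params(attributes):
    """Build openid.ax.* parameters for a store request.

    `attributes` maps a short alias to a (type_uri, value) pair.
    """
    params = {"openid.ns.ax": AX_NS, "openid.ax.mode": "store_request"}
    for alias, (type_uri, value) in attributes.items():
        params[f"openid.ax.type.{alias}"] = type_uri
        params[f"openid.ax.value.{alias}"] = value
    return params


# Hypothetical attribute: the URL of the user's latest post on this RS.
store = ax_store_params({
    "latest": (
        "http://schema.example.org/activity/latest-post",
        "http://blog.example.org/alice/2009/09/hello",
    ),
})
```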
Atom Publishing Protocol. AtomPub [9] is a protocol designed by the IETF for publishing and editing web resources. One of the documents defined by the specification
is the Service document. Service documents describe available Collections grouped in
Workspaces. Collections are sets of resources. The Service document describes what
kinds of resources can be posted to a Collection. AtomPub can be extended, so Collections could also describe which kinds of resources they contain, for better integration
with the CA's resource publishing and management capabilities.
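The sketch below shows how a CA might read such a Service document with Python's standard xml.etree module. The two XML namespaces are the ones defined for AtomPub and Atom. The sample document, its URLs and the function list_collections are invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "app": "http://www.w3.org/2007/app",    # AtomPub namespace
    "atom": "http://www.w3.org/2005/Atom",  # Atom namespace
}

# Hypothetical Service document published for one user.
SERVICE_DOC = """
<service xmlns="http://www.w3.org/2007/app" xmlns:atom="http://www.w3.org/2005/Atom">
  <workspace>
    <atom:title>Alice</atom:title>
    <collection href="http://blog.example.org/alice/entries">
      <atom:title>Blog entries</atom:title>
      <accept>application/atom+xml;type=entry</accept>
    </collection>
    <collection href="http://photos.example.org/alice/album">
      <atom:title>Photos</atom:title>
      <accept>image/png</accept>
      <accept>image/jpeg</accept>
    </collection>
  </workspace>
</service>
"""


def list_collections(xml_text):
    """Return one dict per Collection: workspace, title, href, accepted media types."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for ws in root.findall("app:workspace", NS):
        ws_title = ws.findtext("atom:title", default="", namespaces=NS)
        for col in ws.findall("app:collection", NS):
            found.append({
                "workspace": ws_title,
                "title": col.findtext("atom:title", default="", namespaces=NS),
                "href": col.get("href"),
                "accept": [a.text for a in col.findall("app:accept", NS)],
            })
    return found


collections = list_collections(SERVICE_DOC)
```

With this list, a CA can decide where to post a new photo or blog entry by matching the media type against each Collection's accepted types.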
5 Validation
OpenSocial [10] is one of the latest technologies to emerge in social software. OpenSocial is a public API launched by Google in late 2007. It provides management support
for three kinds of resources attached to a user's personal information: contacts, activities
in SN platforms and persistent data.
The OpenSocial proposal fits smoothly into our architecture model. In OpenSocial, a user's
contacts and activities are exported using Atom feeds. Activity publishing uses the
Atom protocol. Finally, persistent data support shares the same principles as the
OpenID Attribute Exchange extension.
To achieve a practical validation of the architecture, we are working on a plugin
[11] for the Ruby on Rails web development framework. This plugin provides an application with an authentication framework, supporting several authentication schemes
including OpenID. It also provides authorization, as well as content and contact generation.
We plan to evaluate the technologies mentioned in the previous section that support
information exchanges between IS and RS. This plugin is currently used in several
SN platforms, including the VCC [12], a rich web content management system
that supports conferences.
6 Conclusions
This article proposes an architecture that solves the problem of fragmented user identities on SN platforms. The architecture is based on OpenID, a user-centric distributed
identity framework. The IS stores the authoritative user information. The RSs use the
IS to obtain the extended identity of their users, as well as to publish new information about users' activities. Several technologies currently support information flows between IS and RS. In this sense, we are working on a Ruby on Rails
plugin supporting several of these technologies. This plugin is used as the base of SN
platforms that will validate the proposed architecture. Finally, the proposed architecture fits with the latest protocols emerging in the field, such as Google's Social API.
References
1. Recordon, D., Reed, D.: OpenID 2.0: a Platform for User-Centric Identity Management.
In: Proceedings of the second ACM workshop on Digital identity management, pp. 11–16
(2006)
2. Brands, S.: The problem(s) with OpenID,
http://idcorner.org/2007/08/22/the-problems-with-openid/
3. OAuth, An open protocol to allow secure API authentication in a simple and standard
method from desktop and web applications, http://oauth.net
4. Hart, D., Bufu, J., Hoyt, J.: OpenID Attribute Exchange 1.0 – Final,
http://openid.net/specs/openid-attribute-exchange-1_0.html
5. Khare, R.: Microformats: the next (small) thing on the semantic Web? IEEE Internet
Computing 10(1), 68–75 (2006)
6. Nottingham, M., Sayre, R.: The Atom Syndication Format, RFC 4287 (2005),
http://tools.ietf.org/html/rfc4287
7. Brickley, D., Miller, L.: FOAF Vocabulary Specification (2007),
http://xmlns.com/foaf/spec/
8. Semantic Interlinked Online Communities, http://sioc-project.org/
9. Gregorio, J., de hOra, B.: The Atom Publishing Protocol. RFC 5023 (2007),
http://tools.ietf.org/html/rfc5023
10. OpenSocial, http://code.google.com/apis/opensocial/
11. Rails Station Engine, http://rstation.wordpress.com
12. Virtual Conference Center, http://vcc.dit.upm.es
NeoVictorian, Nobitic, and Narrative: Ancient
Anticipations and the Meaning of Weblogs
Mark Bernstein
Eastgate Systems Inc., 134 Main St, Watertown MA 02472 USA
[email protected]
Abstract. What makes a good weblog, a superb wiki, or an exceptional contribution to a social networking site? Though excellence is a frequent source of
anxiety amongst weblog writers, it has not been a concern of weblog scholarship. In contemporary social software, we encounter once more the deep
controversies of 19th century art, reinterpreted to meet the exigencies of time
and technology.
1 Neovictorian Social Software
What makes a good weblog, a superb wiki, or an exceptional contribution to a social
networking site¹? The readers and the writers of weblogs are so numerous and so
diverse that they defy classification. It might be tempting to assert that a good weblog
is a popular weblog, or conversely that whatever satisfies the weblog author is, by
definition, good. Neither approach is entirely satisfactory, because neither guides our
critical appreciation of weblogs or our ability to make them more original, more effective, or more beautiful.
We sometimes compare journalistic methodologies or explore the properties of the
social graph but, in discussing or prescribing ideals, most contributors to the BlogTalk
and WikiSym conferences (and to the monograph literature) have been reluctant to
move beyond the box office measurement of audience size and power. Though excellence is a frequent source of anxiety amongst weblog writers, it has not been a concern of weblog scholarship.
Critics often find in new media a mirror of their dreams and anxieties. The student
of the novel finds the shadow of the Victorian serial.
The literary quality of blogs arises from a complex negotiation between discrete and
often random daily entries and the often invisible arc that they together sketch. [1]
¹ In discussing social networking sites, I follow the definition of boyd and Ellison [6]: “We
define social network sites as web-based services that allow individuals to (1) construct a
public or semi-public profile within a bounded system, (2) articulate a list of other users with
whom they share a connection, and (3) view and traverse their list of connections and those
made by others within the system. The nature and nomenclature of these connections may
vary from site to site.”
The student of hypertext looks at social media and finds emergent nonlinear narrative
[3], and the disciple of the codex book descries a formless void [5].
The very nature of the blogosphere is proliferation and dispersal; it is centrifugal
and represents a fundamental reversal of the norms of print culture.
Faced with such contradictory critical frameworks, what is to guide the aspiring
young writer as she sets out to craft her weblog, or the reluctant grandfather who is
eager to join Facebook or Cyworld but loath to make a fool of himself?
Much can be learned by abandoning the unfounded bias that everyday writing
lacks intellectual foundations [14], or that people writing for themselves and their
inner circle of acquaintance lack ideas or interests beyond the surface concerns of
their prose. In contemporary social software, we encounter once more the deep
controversies of 19th century art, reinterpreted to meet the exigencies of time and
technology.
Fig. 1. A fire control panel from an early Louis Sullivan skyscraper, now in the Art Institute of
Chicago. Form follows function, but “the tall office building should be a proud and soaring
thing.” In electronic media, everything that can be inscribed will carry meaning [12].
2 Ancient Airs: Of Intellectual, Artistic, Social, and Sexual
Concerns
Our grandparents loved manifestos.
Das Endziel aller bildnerischen Tätigkeit ist der Bau! The ultimate aim of all
creative activity is the building! (Gropius; see [13])
Chastened by the twentieth century’s terrors or the shadows of their elders², artists
today seldom declare their intention, but we should not therefore conclude that artists
— even young artists — are witless and rudderless. We cannot know where any artistic movement is headed until it arrives, but it does not seem difficult to discern in
current concerns the shadows and refractions of older movements.
What do weblogs want? Critics have argued that weblogs chiefly seek admiration,
that they are narcissistic cries for attention from undisciplined techies eager to tell the
world about their cheese sandwich lunch [4]. This cannot be right³: too many people
choose to read and write weblogs for us to dismiss them as mere childishness, and a
sympathetic reading of the best weblogs reveals many marks of skill and talent.
Weblogs want to be right, and to be seen to be right. Examples abound. Blog writers from Matt Drudge to Joshua Micah Marshall stand conspicuously in the forefront
of political and social reporting, and the disjoint, ad hoc social networks of Grace
Davis and of Kathryn Cramer played significant roles in addressing urgent needs in
the wake of Hurricane Katrina. We have seen this concern before: in the 19th century,
we called it Realism.
Weblogs want to excel. New bloggers, from classroom bloggers to noted novelists
blogging their book tours, frequently express overt concern for their ability to perform well in the blogosphere.
Sorry
What a whinge. What a dreadful self-pitying whine. I do apologise, everyone. But I
have the writer’s primary vanity which is to suppose that if I have experienced
something and been somewhere then others will have too. [9]
Concern with excellence is the dominant issue in wikis, which seek to deploy many
eyes and many pens in order to arrive at a correct and useful statement [7]. When
bloggers seek to situate the excellence of their blogs in their own intrinsic wonderfulness, we recognize Romanticism. When the excellence resides in correct reasoning,
we see dialectic. And where the excellence resides in brilliant craftsmanship, we see
echoes of the pre-Raphaelites and of Aestheticism.
Weblogs want to be connected. Bloggers and social networkers alike want to be
liked, and well liked, and they are anxious that the world’s admiration for them be
manifest. At its best, this impulse connects writers to the world and interconnects a
world of writers. Popular politics has been decried as an echo chamber before:
² Or by their knowledge of the intentional fallacy, now taught in the cradle. The impact of the
contingency of meaning on the creation of manifestos must not be underestimated. In Europe,
the politics of the Bauhaus conventions of slender supports (floating buildings above the corrupt soil) and flat roofs (uncrowned) meant one thing, while in postwar America the same
gestures came to mean Lever and Seagram — that is, multinational soap and liquor.
³ Though it cannot be right to disregard weblogs as mere narcissism, the accusation is so commonly made and so generally accepted that we should not dismiss it out of hand. A useful
guide might be the strikingly similar accusations that were leveled against several late Victorians — Shaw and Wilde come to mind — whose interests seem particularly congruent with
notable blogs, wikis, and blog clusters. Consider, in particular, Shaw’s pragmatic political
radicalism and critical sparring, as well as Wilde’s delight in sexual display and eagerness to
engage in very public and destructive flame wars.
In politics: I’m Vermont Democrat — You know what that is, sort of double dyed;
The News has always been Republican. Fairbanks, he says to me, “Help us this
year” Meaning by us their ticket. “No,” says I, “I can’t and won’t. You’ve been in
long enough. [8]
The anxiety of weblogs for connection is evidenced by the rich vocabulary that has
evolved for describing those who seek links (spammers, splogs, SEO consultants, link
whores), by the popular fascination with lengthy friend lists, and by a plethora of
tools for estimating readership and influence. To the extent that social sites have ever
expressed a manifesto, it is a message of widespread or universal participation that
advocates of the Reform Act or Suffrage would instantly recognize.
Fig. 2. Interest in NeoVictorian aesthetics of self-(re)presentation is not limited to weblogs, but
it is a prominent strand of contemporary culture. Photo: Lady Raven Eve, Singapore.
Many weblogs are candidly sexual, not merely in the direct mode of sexual reference, but also in their emphasis on the authentic discussion of the writer’s relationship
to the quotidian and physical world. The concerns of the body are conspicuous in
weblogs, couched in confessional autobiography or expressed in cheese sandwiches.
Though the nineteenth century is remembered for prudery and inhibition, the nature of
sexuality and the discovery of useful ways to talk about it were among its chief intellectual and artistic projects.
3 Plot or Character?
A defining concern of particular interest to the student of social software⁴ is the tension between expression of plot and expression of character: between exploring
what occurred and describing to whom it happened.
This tension is, of course, very much in the air as writers seek to negotiate and assimilate the achievements of late modern fiction and postmodern metafiction. But
weblogs and social sites share a further concern: both are performed across a span of
time, during which the action unfolds in notionally chronological sequence. Dispute
over the status of hypertext and experimental fiction should not lead us to overlook
how exceptional and restrictive this condition is [15]: even Homer is replete with temporal excursions, flashbacks, and foreshadowing [11], tools that social writers are currently expected to hide or forgo.
Weblog criticism has been strangely concerned with the question of whether a weblog accurately depicts its protagonist. The revelation that some weblog protagonists
(Kaycee Nicole, or LonelyGirl15) are partly or entirely fictitious has been greeted with
surprising degrees of shock and outrage, even among sophisticated readers who are
familiar with fiction and the complexities of (re)presentation and construction of meaning. In social sites, issues of authenticity seem even more hotly contested [6], though
here they are often cloaked in concern for the presence of pedophiles. These debates
recall earlier scandals over realistic figures (cf. Olympia) and themes (cf. Social Realism), informed here by a postcolonial reinterpretation of class, ethnicity and gender.
The weblog’s interest in the discovery and presentation of true or inherent character
finds many echoes, both historical and contemporary. Of singular interest, though, is
the (mostly-Japanese) interest in cosplay — elaborate self-representation in the form
of highly formal costumed character styles. While many attributes of the costume are
fixed, the cosplay, which characteristically is situated in busy urban streets, foregrounds the artist’s craft and personality; the goth lolita does not represent (or parody)
Nabokov’s character but rather uses that character as a springboard for exploring
experience and personality. Similarly, cosplay (and blogs) frequently blur or transgress conventional boundaries of gender and class. We readily apprehend that the
artist both is and is not the character, just as Oscar Wilde was not (always) Reginald
Bunthorne, Rossetti’s model Lizzie Siddal was not Beatrice, and the Divine Sarah
sometimes slept.
4 Nobitic Weblogs: Writing for Yourself
Crucially for our understanding of contemporary social media from Facebook to
Flickr, from Bento to Tinderbox, the writer's first concern is a nobitic audience, an
audience that is notionally “amongst ourselves”, intended for the writer’s circle of
acquaintance⁵. Social software’s writers write first for themselves and their inner
⁴ Though not yet a concern of wikis, because the wiki tradition has been to avoid narrative.
⁵ Nobitic audiences need not be inconsequential; the politics of the American and French revolutions were carried on by committees of correspondence, and most early scientific writing
took the form of letters and after-dinner speeches.
circle, and often regard the prospect of a broad audience of strangers as abhorrent.
Will the author’s mother see the work? Will their future employer, or their hypothetical grandchild?
That is, the writer's chief concern is to satisfy themselves, to ensure that their
(re)presentation is artistically honest. In his masterful survey of the history of published journals and diaries, Thomas Mallon identifies seven distinct kinds of journal
writers: chroniclers, travelers, pilgrims, creators, apologists, confessors, and prisoners
[17]. The same taxonomy applies usefully to social software, which serves much the
same purpose with regard to the nobitic circle, the general reader, and especially the
often-overlooked but crucial question of representational talkback [16], of the effect
of the author’s writing on their own future sense and sensibility.
Social media shares with the personal diary an insistence on frequent consultation
and writerly reading; the artistic practice of both the diarist and the blogger demands
frequent contemplation and equally frequent addition to the work. These media demand spontaneity and immediacy; they resist sentimental attachment to what the
writer ought to feel. The same concerns, in 19th century art, gave rise to impressionism and expressionism, to the desire to capture the sense of the moment en plein air
or on the writer’s soul.
Fig. 3. Everyday people face extraordinary tasks all the time. A curator sorts through debris at
the National Museum Baghdad, 2003. Photo: D Miles Cullen, US Army.
5 The Artisan’s Touch
In Abroad, Paul Fussell observed that the travel book emerged from the decline of the
audience for books of essays and sermons [10]. Weblogs and social sites, as a form of
serious writing, play a very similar role, providing a home for brief, occasional pieces
of observation and commentary. The blogger’s quest for immediacy, like the painter’s,
sometimes conflicts with conventionally valorized matters of craft; the painter,
racing to capture the light, was obliged to work swiftly, without taking time
to mix complex colors or to efface brushmarks. The work that resulted seemed at first
to be casual, slipshod, or unfinished; in time, critics observed that the work was not
unfinished but rather raised new questions about the academic idea of “finish”. Similarly, the blog valorizes personal voice, even sacrificing grammar and spelling or
adopting dialects like Singaporean “Singlish” or the SMS-abbreviated, LOLcat-derived
l33tspeak. Bad graphic design seems to have become a badge of honor in social sites,
where noisy backgrounds, distracting animations, and intrusive music tracks abound;
here, too, we may see an echo of Fauvist painting or Brutalist architecture.
A key figure in understanding the quality of blogs, wikis, and social sites is the
flâneur, that sophisticated observer of the urban scene, unbound by class and
unconstrained by convention. Current social software writing is remarkably short on
characters; it has voice in abundance, but dialogue is rare and developed secondary
characters and foils, even cardboard cutouts like Mike Royko’s Slats Grobnik or
Don Marquis’ Archy and Mehitabel, are seldom in evidence. Throughout, the weblog
spotlight falls on the writer/observer, her struggle to understand or wry amusement at
the absurdities of the daily scene. We are then, perhaps, at the cusp of a development
that parallels Thespis’ innovative approach to theater: we may, at the boundaries of
social software and the craft of interlinked writing, be prepared to let a second actor, a
true character, onto the stage.
References
1. Fitzpatrick, K.: The Pleasure of the Blog: the Early Novel, the Serial, and the Narrative
Archive. In: The 4th International Conference on Social Software (BlogTalk 2006),
Vienna, Austria (2006)
2. Bernstein, M.: The Social Physics of Weblogs. In: BlogTalk 2, The 2nd International
Conference on Social Software, Vienna, Austria (2004)
3. Bernstein, M.: Saving the Blogosphere, BlogTalk Down Under, Sydney, Australia (2005)
4. Bernstein, M.: What Good Is A Weblog?, BlogHui, Wellington, New Zealand (2006)
5. Birkerts, S.: Lost in the Blogosphere: Why Literary Blogging Won’t Save Our Literary
Culture, The Boston Globe (2007)
6. boyd, d.m., Ellison, N.B.: Social Network Sites: Definition, History, and Scholarship.
Journal of Computer-Mediated Communication 13(1) (2007)
7. Leuf, B., Cunningham, W.: The Wiki Way: Quick Collaboration on the Web. Addison-Wesley, Reading (2001)
8. Frost, R.: A Hundred Collars. North of Boston, Henry Holt and Company, New York
(1914)
9. Fry, S.: I Give Up (2007), http://stephenfry.com/blog/?p=21
10. Fussell, P.: Abroad: British Literary Traveling Between the Wars. Oxford University
Press, Oxford (1980)
11. Lowe, N.J.: The Classical Plot and the Invention of Western Narrative. Cambridge University Press, Cambridge (2004)
12. Landow, G.P.: Hypertext 3.0: Critical Theory and New Media in an Era of Globalization.
Johns Hopkins Press, Baltimore (2006)
13. Noble, J., Biddle, R.: Notes on Notes on Postmodern Programming. In: OOPSLA 2004,
Vancouver, Canada (2004)
14. Rose, J.: The Intellectual Life of the British Working Classes. Yale University Press, New
Haven and London (2001)
15. Walker, J.: Piecing Together and Tearing Apart: Finding the Story in Afternoon. In: Hypertext 1999, pp. 111–118. ACM, New York (1999)
16. Yamamoto, Y., Nakakoji, K., Aoki, A.: Spatial Hypertext for Linear-Information Authoring: Interaction Design and System Development Based on the ART Design Principle.
In: Hypertext 2002, pp. 35–44. ACM, New York (2002)
17. Mallon, T.: A Book of One’s Own: People and Their Diaries, Penguin (1984)
Author Index
Bernstein, Mark
Bojars, Uldis
Boulain, Philip
Brandt, Joel
Breslin, John G.
Broß, Justus
Chalhoub, Michel S.
Correyero, Beatriz
Cushman, David
Decker, Stefan
Fukuhara, Tomohiro
Fumero, Antonio
Gibbins, Nicholas
Han, Jeong-Min
Han, Sangki Steve
Hoem, Jon
Ishii, Soichi
Jung, Hanmin
Kawaba, Mariko
Kim, Kanghak
Kim, Pyung
Kim, Sang-Kyun
Kim, Young-rin
Ko, Joonseong
Kuklinski, Hugo Pardo
Lee, Mikyoung
Lee, Seungwoo
Lim, Yon Soo
MacNiven, Sean
Masuda, Hidetaka
Matsuo, Yutaka
Meinel, Christoph
Morin, Jean-Henry
Nakagawa, Hiroshi
Nakasaki, Hiroyuki
Noguera, José M.
Okazaki, Makoto
Park, Hyunwoo
Quasthoff, Matthias
Salvachúa, Joaquín
Sato, Yuki
Shadbolt, Nigel
Song, Mi-Young
Tapiador, Antonio
Utsuro, Takehito
Yokomoto, Daisuke
Yoshinaka, Takayuki
Zimmermann, Jürgen
