D9.2: Benchmark Report on Structural, Behavioural and Linguistic Signifiers of Community Health
Report D9.2, V1.0
Dissemination Level: PU – Public
EC Project 257859: Risk and Opportunity management of huge-scale BUSiness communiTy cooperation (ROBUST)

D9.2: Benchmark Report on Structural, Behavioural and Linguistic Signifiers of Community Health

30th October 2012
Version: 1.0
Toby Mostyn, [email protected]
Polecat Limited, Garden Studios, 71-75 Shelton Street, Covent Garden, London WC2H 9JQ

© Copyright Polecat Ltd and other members of the EC FP7 ROBUST project consortium (grant agreement 257859), 2012

Executive Summary

This is the detailed report for deliverable 9.2. Work package 9 focuses on public data: technology-orientated fora and wikis. Specifically, D9.2 states: "Benchmark report on structural, lexical and behavioural indicators of community health: a full report including benchmarks on structural, lexical and behavioural indicators of community health."

The report is a deliverable of WP9 task T9.2, which states: "This task will create benchmarks for structural, behavioural and lexical signifiers of health vis-à-vis the desired output (KPIs) of a community, through comparison between multiple communities, to surface trends, patterns and outliers in community interaction set against the context of the purpose of the community." In particular, it will:

- Identify baseline indicators of community success (KPIs) from community owners, such as thread generation, email receipts, core code changes, or unique users.
- Analyse community discussion for sentiment, opinion and motivation lexical signifiers.
- Create an ontology of lexical signifiers.
- Analyse the behaviour of community members and assign role motifs using psychographic profiling.
- Produce a report showing the connection between lexical, behavioural and structural signifiers. These connections will be within the context of risk management, productivity, strength of relationship between members, etc.

This deliverable will feed WP3 [T3.1, T3.2, T3.3, T3.4] – the modelling of motivation incentives to promote healthy community interaction based on healthy behavioural, lexical and structural signifiers.

Table of Contents

1. Introduction
2. Overview of the Document
3. Lexical Signifiers
   3.1. Psycholinguistic signifiers
      3.1.1. Success Language Analysis
      3.1.2. Motivation Language Analysis
      3.1.3. Action Language Analysis
      3.1.4. Indicators of less healthy language
      3.1.5. User type analysis
   3.2. Sentiment Based Signifiers
      3.2.1. Healthy/Unhealthy Language Analysis
      3.2.2. Automated Classification
   3.3. Topic Based Signifiers
      3.3.1. Topic by health (derived)
      3.3.2. Topic by health (explicit)
4. Behavioural Signifiers
   4.1. Identification of health indicators
   4.2. Measurement of user behaviour
   4.3. Discovering user roles
   4.4. Analysing role/health relationship
5. Structural Signifiers
   5.1.1. Community owner feedback on structural signifiers of health
   5.1.2. Analysis of TiddlyWiki structural factors
6. Project Integration
   6.1. The dJST topic by sentiment extraction model
   6.2. Graphic Equalizer
   6.3. Metaphor base visualisation
7. Conclusion
8. Appendix
   8.1. User classification training set errors
   8.2. Sub topic density using cosine similarity (example)
   8.3. Snapshots of the WikiGroup Network
   8.4. WikiGroup Network Statistics
List of Figures
List of Tables
List of Abbreviations
References
1. Introduction

This document examines some of the work done in the ROBUST project around signifiers of health in on-line communities, and provides benchmarks and empirical results on those signifiers where applicable (as the basis for further research). It examines three distinct types of signifiers, which are defined below:

Linguistic signifiers: These are indications of health that can be gleaned from the content of the interaction between users; in most cases, the posts of those users. The most basic example of a health signifier in this context would be the sentiment of the text.

Behavioural signifiers: The behaviour of users is a strong indication of health, and can be expressed in a number of ways. For this reason, these signifiers overlap with both linguistic and structural data. An example of a behavioural signifier of health might be "engagement" – the proportion of all users a user has communicated with.
Structural signifiers: Typically, these are statistics that describe the community as a whole, and can be analysed statistically. These are the signifiers that community owners generally know most about, and upon which they currently assess the health of their community. An example of a structural signifier is the number of users in a community.

Against each of these signifiers, the document examines data from a number of different communities. Much of the analysis was against TiddlyWiki. The TiddlyWiki project (http://www.tiddlywiki.com/) is a development community; for the most part, the members of the community are a geographically disparate collection of programmers writing either the core TiddlyWiki application or plug-ins. TiddlyWiki was chosen in most instances for two reasons. Firstly, the dataset is a single community with a common purpose, so it provides a suitable test set for algorithms before they are run over much larger data sets containing many sub-communities (where verification of results is much more difficult). Secondly, the network structure consists (generally speaking) of a main development core community, and sub-communities for plug-in development. Again, this makes it a good test set for network analysis, in part because it provides a useful baseline against which results from more complex communities can be compared.

Other communities used by research within this document were larger and more complex, involving a myriad of sub-communities: IBM Connections, SAP SCN and _connect. The _connect community (https://connect.innovateuk.org/) is a forum of the Technology Strategy Board, part of the UK government department for Business, Innovation and Skills. Their relevance lies not only in their data, but also in the fact that they are customers of Polecat, and therefore a potential point of exploitation.

The identification and analysis of signifiers in this document is the result of two main approaches. Firstly, Polecat gathered information by interacting directly with the community owners and analysing the content manually using specialist analysts. Secondly, various algorithms and software were tested and developed to extract further information.

2. Overview of the Document

The document is broken up into four main sections. Firstly, the work done around the research and benchmarking of linguistic indicators of health in communities is examined. This includes psycholinguistic profiling of communities' content, sentiment analysis of that content, and a description of two novel ways to understand the health within a community around a given topic. Secondly, the document examines the behavioural signifiers of health that were identified and the analysis performed around them. The third section focuses on structural signifiers of health, and examines both metrics gleaned from real-life communities and those extracted automatically from community content. Lastly, a section has been included to briefly describe some of the integration of this research into the ROBUST platform (and therefore the application of research likely to be exploited by Polecat).

3. Lexical Signifiers
3.1. Psycholinguistic signifiers

Building on previous research carried out in the field of psycholinguistics, and working closely with the TiddlyWiki community and its leaders, Polecat analysed the language of the community and developed a number of standard classes representing the various types of discourse found within it. This analysis was performed by analysts within Polecat who have expertise in linguistics – it was not work carried out by algorithms, but the result of expert human judgement.

The analysts worked to the following methodology. Firstly, they examined the goals of both the community and the individuals within it. Oftentimes these aims overlapped; at other times they were observed to be at cross purposes to one another. For example, there were instances of individuals attempting to introduce new functionality to solve a particular problem they had, with little thought to the overall impact on the software as a whole. Secondly, the analysts examined the intent of the language with regard to other users (in relation to the goals identified in stage one). In other words, the analysis discarded any specific technical terms of the discourse, and focused instead solely on generic language around what the author of the post was trying to achieve. The rationale was that the output from an analysis of this type would be conceptually and lexically independent (as far as possible), and could be applied to any community that shared a common goal towards which the users were working. Thirdly, using these techniques, the analysts were able to identify the aforementioned classes and their linguistic properties. These classes were:

1. Language indicative of success
2. Language indicative of motivation
3. Language indicative of action
4. Language indicative of encouragement
5. Language indicative of negativity

Using these as a basis, the classes were refined into four basic lexicons to identify the types of behaviour typically found in such communities: a success lexicon, a motivation lexicon, an action lexicon and a lexicon indicative of negativity. Language indicative of encouragement was, after some analysis, deemed to be a sub-set of the motivation lexicon. Given the analysis, the success, motivation and action classes were deemed most representative of the health of a community for those communities working towards a common goal (usually involving a tangible output). Below is further information on the analysis carried out to derive these lexicons.

3.1.1. Success Language Analysis

The language of success for communities was identified in three key areas. The first was the success of the community as a whole, or of sub-sections of the community where relevant. The second was technical success, such as the achievement of better software, new knowledge or novel applications. The third accounted for the personal success of individual users: the achievement of personal challenges, or meeting the particular needs of a community user. The aspects that make up these types of success can be seen in Figure 1 (Success language patterns). One area in isolation is not enough to define success within a community. For example, personal benefit at the expense of the community or the system would not be a success.
Interaction with, and feedback from, the TiddlyWiki community suggested success was dependent on all three territories, and the lexicon therefore had to reflect this. It should be noted, however, that although there is a linguistic emphasis on the success of a non-hierarchical community, success is also highly dependent on strong leadership and organisation, which ultimately protects the quality of the output. This means that the insight gleaned from analysing text for the language of success is limited, and is best understood in conjunction with other structural factors.

Figure 1: Success language patterns

3.1.2. Motivation Language Analysis

In the case of motivation, it was discovered that the technological and the personal also form important patterns of communication. Again, these patterns are cross-hatched; motivation is never just technological, personal or collective, but some combination of the three. Thus any lexicon tool needs to span the scope of these motivation types. In the case of technical communities, it was found that the identity of the individual formed around personal motivations is always made in relation to the collective identity of the TiddlyWiki community: "I am" what "it is". There is a mixture of high-altitude motivations, such as freedom, emancipation, creativity, affirmation and respect, together with ruthlessly pragmatic motivations around cost saving and software quality.

Figure 2: Motivating language patterns

3.1.3. Action Language Analysis

Action language was found to reveal a clear trajectory between the singular determined act and a more fluid engagement with a light, dynamic, agile, never-ending system. This takes the action language beyond the practical application of core skills: engineering/development in the case of the TiddlyWiki community. The most potent and compelling space lies in fluid engagement, where the singular acts are transactional tools in this bigger picture. There is a strong correlation in the action language between the language of fluid engagement and the descriptor of success as learning, adapting and configuring a system to meet your needs (and in the process helping to meet other people's needs, and extending, developing and maintaining the system and community). Action language is also a proxy language of belonging – a rite of passage into the community in which you display your ability through your technical eloquence.

3.1.4. Indicators of less healthy language

Less healthy language was found to be that which communicates the closed rather than the open, is singular rather than multiple, lacks a discourse of invention, creativity and innovation, and is interested in that which can be measured rather than that which can be tested and tried out. Less healthy language also fails to recognise the dual-benefit principles of communities, and focuses on altruism.
In a developer community, there also appears to be some debate on the recycling of "hacker" language; some people find it a potent term that, taken literally, aptly characterises the nature of fluid engagement, whereas other people see it as a derogatory or populist term that fundamentally misunderstands the process. Language that implies hobby/hobbyist can also be seen as derogatory or hierarchical (separating the user from the serious developer) and as negating the positive feedback-flow principle.

Figure 3: Less healthy indicators

3.1.5. User type analysis

It should be noted that this user type analysis was not related to the role analysis performed by WP3 (which is described in more detail later in the document). That work was based on behavioural features of community users, whereas the techniques outlined below took a purely linguistic approach. As such, they provide a similar output via differing methods. The linguistic approach consisted of, initially, an identification of the types of users in the community by the Polecat analysts, based primarily on the lexical indicators discovered in the previous section.

Table 1: User types identified by linguistic analysis

Newbie: Members new to a community. They might also be new to online interaction.

Elder: Elders may not be held accountable to the same community norms or scrutiny as the other members. Elders can dominate new members with a few words, regardless of the value of the words of others around them.

Core participants: There is usually a small group of people who quickly adapt to online interaction and provide a large proportion of an online group's activity. These individuals visit frequently and post often. They are important members.

Flamer: Flaming is defined as sending hostile, unprovoked messages. Name calling, innuendo and such are the tools of flamers. Flamers can also be the source of new ideas, however.

3.2. Sentiment Based Signifiers

3.2.1. Healthy/Unhealthy Language Analysis

Alongside the analysis of the types of language in postings, Polecat also benchmarked the general "health" of language for a technical community. Traditionally, sentiment ratings consist of negative, neutral and positive ratings (commonly, a Likert scale with a bipolar response to the statement "is this positive or negative?"). However, this approach is usually found lacking because it sets the expectation of subjectivity: whether something is positive or negative depends on who is reading the document, and on their reaction to it. By contrast, looking for healthy or unhealthy language makes the measure objective, and therefore more accurate as a single measure for a community. After analysis of the postings by the linguistic analysts, health was split into five bands (or ratings) and the data annotated accordingly:

Table 2: Health bands across communities

5 (v. healthy): There is evidence of collective success, such as reciprocal trust between the participants and healthy feedback between the user and the developer. There is evidence of technical success and personal success. Often, the participant wants to share newly developed technology within the community.
In other cases, the participant is mentoring, delivering full explanations and information along with plenty of encouragement to newer participants.

4 (healthy): The conversation is full of information, questions and discussions, but the participants are not particularly excited. The success can be a mixture of one or two types of success, such as personal and technical, or technical and collective.

3 (neutral): The conversation is short (although sometimes a short conversation can be very healthy). The success is generally collective, with evidence of feedback. The conversation lacks enough information.

2 (unhealthy): The conversation has some hint of dissatisfaction or criticism regarding the technology, a user, a developer and/or the community. The success is generally collective, but can sometimes be personal or technical depending on the content.

1 (v. unhealthy): The conversation is simply rude and offensive about a participant, the community and/or the technology. There is generally no evidence of success; the conversation is negative.

3.2.2. Automated Classification

Using the analysis above, and the subsequently annotated documents, Polecat then trained several classifiers using different approaches, and benchmarked both the sentiment of the conversation and the user types previously identified. In terms of experimental set-up, the classifiers were trained to discover two classes: healthy language and unhealthy language. This was deliberately distinct from classes of positive and negative, because it removes the implicit subjectivity of those terms and instead presents the classification as an objective metric.

The training data for both classifications was selected by the Polecat analysts, and consisted of around 300 postings in each class. These postings were taken from the TiddlyWiki community between its inception in 2005 and 2011. A further 182 postings were selected by the Polecat analysts as testing data, and annotated as either healthy or unhealthy. Although the classifiers actually return probabilities of a posting being in one class or another, the training data simply assigned each document as either healthy or unhealthy. Because there were only two classes in the classification, a document was considered to belong to a particular class if it had a probability of greater than 0.5.

The training data was run against six classifiers, and the accuracy of each of these classifiers was assessed. These algorithms were: Balanced Winnow, C4.5, decision tree, maximum entropy, MCMaxEnt and Naive Bayes. The feature set was entirely linguistic and contained no structural information. The results of the classification are shown below. The evaluation shows, for each classifier, the precision of the healthy class and the precision of the unhealthy class, where precision is calculated as the number of documents classified correctly for the given class divided by the number of documents assigned to that class (true positives / (true positives + false positives)). The "All" column shows the overall accuracy of the classifier: simply the number of correct classifications as a fraction of the number of classifications made. A sketch of this experimental set-up is given below.
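As an illustration of the set-up just described, the following sketch trains and evaluates a two-class healthy/unhealthy text classifier. scikit-learn's multinomial Naive Bayes stands in for whichever implementations Polecat used (the report does not name a toolkit), and the inline example postings are placeholders for the ~300 training and 182 test postings.

```python
# Sketch of the healthy/unhealthy classification experiment; MultinomialNB
# stands in for the six benchmarked classifiers. The inline postings are
# placeholders for the annotated TiddlyWiki data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score

train_posts = ["thanks, this plug-in solved my problem perfectly",
               "great work, happy to help test the next release",
               "this is useless and the developers clearly don't care",
               "what a waste of time, nothing here ever works"]
train_labels = ["healthy", "healthy", "unhealthy", "unhealthy"]
test_posts = ["really helpful explanation, much appreciated",
              "another broken release, utterly pointless"]
test_labels = ["healthy", "unhealthy"]

# Purely linguistic features, as in the report (no structural information).
vectoriser = TfidfVectorizer()
model = MultinomialNB().fit(vectoriser.fit_transform(train_posts), train_labels)

# The classifier returns class probabilities; with only two classes, a
# document is assigned to a class when its probability exceeds 0.5.
healthy_idx = list(model.classes_).index("healthy")
probs = model.predict_proba(vectoriser.transform(test_posts))[:, healthy_idx]
predictions = ["healthy" if p > 0.5 else "unhealthy" for p in probs]

for label in ("healthy", "unhealthy"):
    print(label, "precision:",
          precision_score(test_labels, predictions, pos_label=label, zero_division=0))
print("overall accuracy:", accuracy_score(test_labels, predictions))
```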
Table 3: Sentiment classifier results

Classifier        Healthy  Unhealthy  All
Balanced Winnow   1.000    0.000      0.680
C4.5              0.333    0.707      0.453
Decision Tree     0.398    0.724      0.503
MaxEnt            0.650    0.845      0.713
MCMaxEnt          0.943    0.379      0.762
Naïve Bayes       0.919    0.448      0.768

As can be seen from the table above, Naïve Bayes scored best overall in this classification, although it had a poor precision for the unhealthy class. By contrast, MaxEnt had a lower overall precision, but showed better results as an average across positive and negative precision.

This classification was run across a number of the most influential forums collected by Polecat and made available to the ROBUST project. As a benchmark, the results of the classification are shown in the table below, to show the expected positive and negative distributions for on-line forums.

Table 4: Sentiment split for on-line forums

Forum                 Positive  Negative  % Positive  % Negative
ASP.NET               585       693       0.46        0.54
Android Forums        500       535       0.48        0.52
Digital Point Forums  420       521       0.45        0.55
Electronic Arts UK    311       354       0.47        0.53
NoNewbs               86        77        0.53        0.47
Tech Support Guy      513       595       0.46        0.54
TechArena             466       525       0.47        0.53
Ubuntu Forums         369       471       0.44        0.56
iPhone Dev SDK        236       266       0.47        0.53
XDA Developers        613       811       0.43        0.57

Finally, Polecat performed classification against the user types it had identified. Across the TiddlyWiki community, the following classifications were extracted for the various algorithms. For reference, the classification error rates during training are included in the appendix (section 8.1).

Table 5: Classified user types for the TiddlyWiki community

Algorithm         Newbie  Core Participant  Elder
C4.5              2675    159               139
Decision Tree     1       2805              167
Maximum Entropy   268     2557              148
Naive Bayes       787     1884              302

3.3. Topic Based Signifiers

All of the work above is focused on the health of a community. Polecat's specific use case, by contrast, means that they are also focused on metrics specifically tailored to the information needs of large corporations and decision makers. On-line community data provides an essential channel for companies to monitor the way they and their products are being discussed, allowing them to react accordingly. Moreover, it also provides essential information about entire sectors and industries, informing strategic decision making. Needless to say, the volume of this data is often prohibitively large, meaning key messages and indicators can be missed. Analysing the health of the community from the perspective of a particular external party, rather than that of the internal members, is therefore a key metric, providing essential information to these companies. Typically, this is in an information retrieval scenario where the user (in this case the company) wishes to search communities and view the discussion around themselves, to understand the health of this conversation and react accordingly.
The work above already allows these users to query community data and get a sense of the health of the discussion. However, users often require more granular results. One important filter is that of topic: what is the health in the community around company A and topic B? There are two approaches to this. The first is deriving a topic model from the data, to see the topics that are being discussed and the health of each. The second, a less well explored area, is finding the community health around explicitly defined topics.

3.3.1. Topic by health (derived)

WP3 have proposed a dynamic joint sentiment-topic model (dJST), based on Latent Dirichlet Allocation (LDA), which allows the detection and tracking of views of current and recurrent interests, and of shifts in topic and sentiment. This means that users and readers of a community are able to easily see the health around particular subjects, and thus gain a deeper insight into the discourse.

The dJST model makes the assumption that documents at the current epoch are influenced by documents from the past. Therefore, the sentiment-topic word distributions are generated from the word distributions calculated previously by the model. This can be done using three different techniques:

- A sliding window, where the current sentiment-topic word distributions are dependent on the sentiment-topic specific word distributions of the last S epochs.
- A skip model, where historical sentiment-topic word distributions are considered by skipping some epochs in between.
- A multi-scale model, where previous long (and short) timescale distributions are taken into consideration.

As data is received in a stream, results are buffered until the end of the specified epoch (an epoch is defined as a number of milliseconds, as a configurable parameter of the system). At that point, a model is extracted, where documents are represented as term vectors. Because of the assumption that documents at the current epoch are influenced by documents from the past, the sentiment-topic specific word distributions at the current epoch are generated according to the word distributions at previous epochs. A sketch of this buffering and history mechanism is given below.
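The following sketch illustrates the epoch buffering and a simple sliding-window history of the kind described above. It is an assumption-laden illustration: the stream interface, field names and uniform history weighting are hypothetical, not the actual dJST implementation (which maintains full sentiment-topic word distributions rather than plain word counts).

```python
# Sketch of epoch buffering with a sliding-window history, illustrating the
# streaming behaviour described above. Names and the uniform weighting are
# hypothetical; dJST itself weights each historical epoch's distributions.
from collections import Counter, deque

EPOCH_MS = 3_600_000            # configurable epoch length in milliseconds
S = 4                           # sliding window: number of past epochs kept

buffer = []                     # documents received during the current epoch
history = deque(maxlen=S)       # per-epoch word counts from the last S epochs
epoch_start = 0

def on_document(doc: str, timestamp: int):
    """Buffer incoming documents; close the epoch when EPOCH_MS has elapsed."""
    global epoch_start
    if timestamp - epoch_start >= EPOCH_MS:
        # End of epoch: fold the buffered documents into the model history.
        history.append(Counter(w for d in buffer for w in d.split()))
        buffer.clear()
        epoch_start = timestamp
    buffer.append(doc)

def sliding_window_prior(word: str) -> float:
    """Prior weight of a word over the last S epochs (uniform weights here)."""
    total = sum(sum(c.values()) for c in history)
    return sum(c[word] for c in history) / total if total else 0.0
```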
Evaluation dataset

The dataset for this evaluation was review documents from the Mozilla Add-ons web site between March 2007 and January 2011. These reviews cover six different add-ons: Adblock Plus, Video DownloadHelper, Firefox Sync, Echofon for Twitter, Fast Dial, and Personas Plus. All text was converted to lowercase and non-English characters were removed. Documents were further pre-processed by stop-word removal (based on a stop-word list) and stemming. The final dataset contains 9,114 documents, 11,652 unique words, and 158,562 word tokens in total.

The unit epoch was set to quarterly, giving a total of 16 epochs, and the total number of reviews for each add-on was plotted against the epoch number (Figure 4). It can be observed that, at the beginning, there were only reviews on Adblock Plus and Video DownloadHelper. Reviews for Fast Dial and Echofon for Twitter started to appear at Epochs 3 and 4 respectively, and reviews on Firefox Sync and Personas Plus only started to appear at Epoch 8. The review occurrences have a strong correlation with the release dates of the various add-ons. There was also a significantly high volume of reviews about Fast Dial at Epoch 8. As for the other add-ons, reviews on Adblock Plus and Video DownloadHelper peaked at Epoch 6, while reviews on Firefox Sync peaked at Epoch 15.

Figure 4: dJST by number of reviewers

Each review is also accompanied by a user rating on a scale of 1 to 5. This rating represents the quality of the user's experience using the add-on, ranging from 1, representing a negative experience, to 5 for a very positive experience. The average user rating across all the epochs for Adblock Plus, Video DownloadHelper and Firefox Sync is 5-star, 4-star and 2-star respectively. The reviews of the other three add-ons have an average user rating of 3-star.

Figure 5: dJST by average rating

Word polarity prior information was incorporated into model learning, with polarity words extracted from two sentiment lexicons: the MPQA subjectivity lexicon (http://www.cs.pitt.edu/mpqa/) and the appraisal lexicon (http://lingcog.iit.edu/arc/appraisal_lexicon_2007b.tar.gz). These two lexicons contain lexical words whose polarity orientations have been fully specified. Words with strong positive and negative orientation were extracted; the final sentiment lexicon consists of 1,511 positive and 2,542 negative words.

Evaluation metrics and results

The model was evaluated using two metrics:

- Predictive perplexity. This is defined as the reciprocal geometric mean of the likelihood of a test corpus given a trained model's Markov chain state. Lower perplexity implies better predictiveness, and hence a better model (a standard formulation is sketched below).
- Sentiment classification. Document-level sentiment classification is based on the probability of a sentiment label given a document. For the data used here, since each review document is accompanied by a user rating, documents rated as 4 or 5 stars were considered as true positives, and other ratings as true negatives.
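For concreteness, the standard per-word formulation of predictive perplexity used for topic models, consistent with the definition above (the report does not write the equation out; this is the usual form):

```latex
\mathrm{Perplexity}(D_{\mathrm{test}})
  \;=\; \exp\!\left(-\,\frac{\sum_{d=1}^{D} \log p(\mathbf{w}_d)}
                            {\sum_{d=1}^{D} N_d}\right)
```

where \(\mathbf{w}_d\) are the words of test document \(d\), \(N_d\) is its length, and \(p(\mathbf{w}_d)\) is the likelihood assigned by the trained model; lower values indicate a better predictive model.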
Evaluation performed by WP5, as part of D5.2, studied the influence of the topic number setting on dJST model performance, in comparison with other models. With the number of time slices fixed at S = 4, the topic number was varied. Figure 6 shows the average per-word perplexity over epochs with different numbers of topics. JST-all has higher perplexities than all the other models, and its perplexity gap with the dJST models increases with the number of topics. All the variants of the dJST model have fairly similar perplexity values, and they outperform both JST-all and JST-one.

Figure 6 also shows the average document-level sentiment classification accuracy over epochs with different numbers of topics. dJSTs outperform JST-one, with skip-EM and multiscale-EM having sentiment classification accuracies similar to JST-all beyond topic number 1. Also, setting the number of topics to 1 achieves the best classification accuracy for all the models. Increasing the number of topics leads to a slight drop in accuracy, though it stabilises at topic number 10 and beyond for all the models. Nevertheless, the drop in sentiment classification accuracy from modelling more topics is only marginal (about a 1% drop) for sliding-EM and skip-EM.

Figure 6: dJST perplexity and classification

The conclusion from WP5's evaluation was that both the skip model and the multi-scale model achieve similar sentiment classification accuracies to JST-all, but they avoid taking all of the historical context into account and are hence computationally more efficient. On the other hand, dJST models outperform JST-one in terms of both perplexity values and sentiment classification accuracies, which indicates the effectiveness of modelling dynamics.

3.3.2. Topic by health (explicit)

Users, in the Polecat use case, are often more interested in tracking the health of a community conversation around a specific topic. For example, Polecat recently did some work with the Irish government, who were trying to set an agenda for the Irish Economic Forum. They wanted to understand the main areas of the Irish economy that were being discussed, and the health of these areas, so that they could make the forum as relevant as possible. Querying using the Boolean expression "Ireland and economy" fails here for a number of reasons. Firstly, the recall is affected, because many documents related to the economy will not mention the term "economy". Secondly, the results give no indication of which specific sub-topics of the economy are most and least important.

A derived topic model does not generally add information here; generally speaking, the main topics around which companies are discussed are well known. By contrast, the health around less-discussed topics might be of more interest; identifying unhealthy topics allows companies to react accordingly, either by changing policy or by engaging with the community to allay negative sentiment.

The aim of the research, therefore, is the creation of a retrieval system that allows users to query data in the traditional manner, but additionally allows the user to specify a topic and see the density of this topic over the result set, as well as the density of the most pertinent sub-topics. Both topic and query are provided as keywords, and the techniques below can both be thought of as a form of query expansion using one of two approaches: the use of explicit lists to describe a topic, or the automatic creation of a word list. To meet this use-case requirement of identifying and analysing explicit topics, Polecat developed a number of techniques.

Retrieving documents by topic query

Traditional information retrieval systems use simple queries. Whilst some techniques are often applied to the query before the search is submitted, such as disambiguation or spelling correction, there is little attempt to treat any query word as a topic or subject and retrieve documents that do not directly include the term. This is partly because it is very difficult to anticipate when a user has specified a term as a precise query, or as a topic. However, the Polecat use case shows a customer requirement for topic searching, so treating certain query terms as topics is an essential feature. It should be noted here that this is not a traditional query expansion problem (though it is certainly related).
Query expansion has tended to focus on discovering synonyms, discovering alternative term morphologies, or fixing spelling errors (see http://en.wikipedia.org/wiki/Query_expansion). By contrast, this research aims to retrieve documents for an entire topic; for example, a user interested in the topic of "obesity" may be interested in results concerning "heart disease". Heart disease is a concept related to obesity, but certainly not a synonym. Polecat has found no research to date where query expansion is used to discover a topic.

The initial problem was to define the concept of "topic", and how a topic could be represented as a bag of words. Research into explicit semantic analysis [1] suggests that single Wikipedia pages can be treated as semantic concepts (referred to hereafter as "concepts"). In that research, the semantic similarity of documents was calculated by finding the Wikipedia pages for the major words in each document, and creating a vector of cosine similarity scores for each document against each of the discovered concepts. The similarities of these vectors thus gave the semantic similarity. Building on the positive results of this research, Polecat created a graph for the entire set of English Wikipedia pages, with the structure shown in Figure 7.

Figure 7: Graph structure for Wikipedia topic service

When a user specifies a topic term, such as "obesity", this term is sent to the topic service. The service then finds the Wikipedia page that best matches the topic term, in the following way:

1. If the topic term matches a concept exactly, return that concept.
2. Disambiguate the topic term (Polecat imported the Wikipedia disambiguation data and created a lookup service). If the disambiguated term matches a concept exactly, return that concept.
3. Look for redirects for the topic term (Polecat imported the Wikipedia redirect data and created a lookup service). If the redirected term matches a concept exactly, return that concept.
4. Otherwise, search the content of each concept and return the highest-ranking concept (dependent on the implemented ranking algorithm).

Once the topic has been identified, the service returns the TF-IDF (http://en.wikipedia.org/wiki/Tf%E2%80%93idf) vector for that concept. This vector is used to perform query expansion on the topic term, meaning that the final query submitted to the search is:

<original-query> AND (<topic-term> OR <tf-idf-term1> OR ... OR <tf-idf-termN>)

There is no empirical evidence from previous research into using Wikipedia for query expansion as to how many terms should be used in the expansion itself. Therefore, evaluation was performed for 1 to 30 terms in the expansion, against a gold standard created by the Polecat analyst team on the community data pulled every day by Polecat as part of the MeaningMine product. An abbreviated version of the recall, precision and f-measure statistics is shown below, for expansion at increments of 5 terms up to 30. Although the results are shown at increments of five, the evaluation was run using term expansion at each interval between these increments too. A sketch of the lookup-and-expansion pipeline follows, before the result tables.
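The sketch below illustrates the topic lookup and query expansion just described. The dictionaries and helper names are hypothetical stand-ins for the lookup services Polecat built over Wikipedia dumps, and the ranking used in the content-search fallback is likewise only illustrative.

```python
# Sketch of the Wikipedia topic lookup and query expansion described above.
# The dictionaries stand in for Polecat's lookup services built from
# Wikipedia dumps; all names here are hypothetical.

concepts = {}         # page title -> TF-IDF vector, e.g. {"obesity": {"bmi": 0.4}}
disambiguations = {}  # ambiguous term -> page title
redirects = {}        # redirecting term -> page title

def find_concept(topic_term):
    """Resolve a topic term to a concept: exact match, then disambiguation,
    then redirects, then a full content search."""
    if topic_term in concepts:
        return topic_term
    if disambiguations.get(topic_term) in concepts:
        return disambiguations[topic_term]
    if redirects.get(topic_term) in concepts:
        return redirects[topic_term]
    return content_search(topic_term)

def content_search(topic_term):
    # Illustrative ranking only: score each concept's vector for the term.
    return max(concepts, key=lambda page: concepts[page].get(topic_term, 0.0),
               default=None)

def expand_query(original_query, topic_term, n_terms=10):
    """Build the final Boolean query from the concept's top-N TF-IDF terms."""
    page = find_concept(topic_term)
    if page is None:
        return original_query
    top_terms = sorted(concepts[page], key=concepts[page].get, reverse=True)[:n_terms]
    expansion = " OR ".join([topic_term] + top_terms)
    return f"{original_query} AND ({expansion})"
```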
Table 6: Precision for topic query expansion

No. of query terms       1     5     10    15    20    25    30
Cardiac innovation       0.17  0.15  0.19  0.16  0.08  0.08  0.08
Offshore gas             1.00  0.53  0.52  0.39  0.39  0.39  0.39
Energy census            0.97  0.32  0.90  0.59  0.59  0.39  0.32
Energy appliances        0.37  0.34  0.15  0.14  0.13  0.10  0.08
Shell arctic             1.00  0.53  0.44  0.37  0.30  0.29  0.28
Cyber security           1.00  0.15  0.14  0.08  0.06  0.04  0.03
CO2 and climate change   0.86  0.26  0.26  0.27  0.22  0.25  0.18
Intellectual property    0.86  0.37  0.32  0.26  0.19  0.17  0.16
Innovation in Ireland    1.00  0.98  0.19  0.18  0.10  0.08  0.06
Aging population         1.00  0.22  0.01  0.01  0.01  0.01  0.01

Table 7: Recall for topic query expansion

No. of query terms       1     5     10    15    20    25    30
Cardiac innovation       0.44  0.46  0.58  0.79  0.81  0.78  0.77
Offshore gas             0.92  1.00  1.00  1.00  1.00  1.00  1.00
Energy census            0.01  0.01  0.33  0.35  0.35  0.37  0.37
Energy appliances        0.74  0.77  0.82  0.82  0.84  0.87  0.88
Shell arctic             0.76  0.81  0.82  0.83  0.84  0.84  0.84
Cyber security           0.09  0.52  0.52  0.89  0.89  0.91  0.92
CO2 and climate change   0.12  0.44  0.46  0.48  0.61  0.78  0.88
Intellectual property    0.61  0.37  0.32  0.26  0.19  0.17  0.16
Innovation in Ireland    0.58  0.58  0.60  0.60  0.63  0.66  0.70
Aging population         0.01  0.01  0.18  0.13  0.14  0.14  0.13

Table 8: F-Measure for topic query expansion

No. of query terms       1     5     10    15    20    25    30
Cardiac innovation       0.24  0.23  0.29  0.27  0.15  0.15  0.15
Offshore gas             0.96  0.69  0.68  0.56  0.56  0.56  0.56
Energy census            0.01  0.02  0.48  0.44  0.44  0.38  0.35
Energy appliances        0.49  0.47  0.25  0.24  0.22  0.18  0.14
Shell arctic             0.86  0.64  0.57  0.51  0.44  0.43  0.42
Cyber security           0.17  0.24  0.22  0.15  0.12  0.07  0.06
CO2 and climate change   0.22  0.33  0.34  0.35  0.33  0.38  0.30
Intellectual property    0.71  0.37  0.32  0.26  0.19  0.17  0.16
Innovation in Ireland    0.74  0.73  0.29  0.28  0.18  0.15  0.12
Aging population         0.02  0.03  0.02  0.01  0.01  0.01  0.01

As is expected, recall tends to increase as the terms are expanded (although this is not the case for all queries). For the most part, there is little significant increase in recall beyond an expansion to 15 terms. By contrast, precision falls more rapidly, tending to bottom out for most queries around an expansion to 10 terms. As a result, identifying the most suitable number of terms to expand to has been difficult. As can be seen, the f-measure does not give a clear signal on this, because of the variance between different topics. However, an expansion to between 10 and 12 terms appears to be optimal (because it gives the highest score based on the average variance from the mean for the f-measure of each topic).
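As a reminder of how each cell in Tables 6–8 is computed, the sketch below evaluates one query at one expansion size against the analysts' gold standard; the document-ID sets are hypothetical.

```python
# Per-query evaluation against the gold standard: one cell of Tables 6-8.
# relevant and retrieved are hypothetical sets of document IDs.

def evaluate(relevant: set, retrieved: set):
    """Precision, recall and f-measure for one query at one expansion size."""
    true_positives = len(relevant & retrieved)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Example: a query retrieving three documents, two of which are relevant.
p, r, f = evaluate(relevant={"doc1", "doc2", "doc4"},
                   retrieved={"doc1", "doc2", "doc3"})
print(p, r, f)   # 0.667, 0.667, 0.667
```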
Calculating Topic Density

As well as refining the document set in order to give the user a more granular view of community health, Polecat also developed methods to measure the density of topics over a given result set. This allows any health metric to be correctly assessed in terms of its impact: simply seeing the number of matching documents gives no indication of how prevalent the topic is over the documents themselves, nor of how correlated the query and the topic are. This research was performed in two stages.

The first stage used lists of terms that described a topic, rather than extracting a topic automatically as in the technique described above. This allows analysts and users to have full control over the topics they are querying for, by creating relevant word lists tailored to their exact information need. There are instances where customers are searching for topics that either cross a number of Wikipedia "concepts", or represent a sub-set of one. Given that the user or analyst creates the relevant terms in this scenario, the research challenge was to identify the algorithm that best calculated the density of a topic, given a user query. More formally: given a user query q, how can the density of a topic T (represented by a term vector) be calculated over the documents D, taking into account the relevance (score) of each document to the query?

The density of the terms in this taxonomy was tested with three techniques: simply counting the terms, using a TF-IDF metric, and using an altered implementation of BM25. This density was weighted for each document based on its correlation to the original query. Further insight was added by finding the relative density of the topic, i.e. how dense the topic was for this query compared to how dense the topic is against the background corpus. To calculate this, the density of the topic was calculated using the same technique, but replacing the original query with a stop word to represent a background distribution. Query scores were profiled and the distribution was found to be exponential; this distribution was used to calculate the relative density of the topic against the query, in comparison to the expected density. This was done because some topics have almost no density, so any mention of them is significant; similarly, other topics are discussed frequently, so a high density does not represent a significant statistic. The background distribution allows the discovered density of the topic to be normalised against its expected density. A sketch of this weighted density calculation is given below.
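A minimal sketch of the score-weighted density calculation, assuming the simple term-count metric (one natural reading of the description above; the exact formula and the amended BM25 variant are not given in the report, and all data structures here are hypothetical):

```python
# Sketch of score-weighted topic density over a result set, using the simple
# term-count metric; the TF-IDF and amended BM25 variants weight term matches
# differently but follow the same shape.

def topic_density(results, topic_terms):
    """results: list of (tokens, score) pairs, where score is the document's
    relevance to the original query; topic_terms: the analyst's word list.

    Density is the proportion of topic-term tokens per document, weighted by
    each document's query-relevance score."""
    weighted_density, total_weight = 0.0, 0.0
    for tokens, score in results:
        if not tokens:
            continue
        hits = sum(1 for t in tokens if t in topic_terms)
        weighted_density += score * (hits / len(tokens))
        total_weight += score
    return weighted_density / total_weight if total_weight else 0.0

def relative_density(results, background_results, topic_terms):
    """Observed density normalised by the expected (background) density,
    obtained by re-running the query with a stop word as described above."""
    observed = topic_density(results, topic_terms)
    expected = topic_density(background_results, topic_terms)
    return observed / expected if expected else 0.0
```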
The data used in the evaluation consisted of human judgements by the Polecat analyst teams for ten queries, using two taxonomies, over community data collected by Polecat. This data was all of the community data collected by Polecat in the period 01-05-2012 to 31-05-2012, and included a variety of traditional communities, but did not include social media data. The analysts examined the documents in the result set for each of the ten queries and, for each query, estimated the total percentage of the conversation that was around the given topic (which is called here "topic density") compared to the density of that topic in the background corpus. For example, the analysts suggested that the topic "energy reputation" was discussed within documents matching the query "Nigeria" around 60% more than would be expected for the entire corpus. Figures 8 and 9 show the human judgement scores for the five topic density scores assigned to the two word lists by the analysts, and the results for the different density functions.

Figure 8: Topic density for taxonomy "energy reputation" (human judgement compared with the count, TF-IDF and BM25 density functions)

Figure 9: Topic density for taxonomy "financial distress" (human judgement compared with the count, TF-IDF and BM25 density functions)

The second stage added further capability to this functionality: instead of using a word list created by the user, it built on the document searching technique by using the aforementioned Wikipedia service to discover topics. This technique allows the user to measure topic density given only a single topic term, and to find relevant sub-topics and measure the density of these sub-topics. This involves two stages:

a – Finding the most relevant sub-topics. The most relevant sub-topics are discovered by querying the Wikipedia topic service. Firstly, this gets the Wikipedia page that best matches the topic word, using the steps outlined in the previous section. It then uses an algorithm to find the most relevant connected pages, and the TF-IDF vectors from these pages form the sub-topics. Polecat tested various techniques for discovering the best sub-topics: page rank, shared categories, document similarity and number of shared links.

The data set for testing was made against human judgements from the Polecat analyst team. Three analysts were each presented with 15 topic terms (page names or "concepts" in Wikipedia), along with every associated topic. Here, an associated topic was either a page that was linked to from the Wikipedia page of the original topic, or a category that had an associated page. The analysts then selected, for each topic, an ordered list of the twenty most closely associated topics. These judgements were then merged into a single list using a simple scoring technique, to output a single ranked judgement of the most closely associated sub-topics for each topic. (Any time a sub-topic was selected, it was assigned a ranking score: 20 at position 1, decrementing by one thereafter. Each topic then received a score that was the sum of these individual ranking scores, and a new ranked list was compiled. Where a tie occurred, the judgement made by the senior member of the team was favoured.)

In order to run the experiment, Polecat then calculated a ranked list of the most closely associated sub-topics for each of the given topics, using the four techniques. A sketch of the comparison metrics is given below.
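A sketch of the three comparison metrics used in the evaluation (shared elements, Spearman's rank correlation coefficient and Jaccard distance), assuming each list is a ranking of sub-topic page names; how the original evaluation paired ranks for partially overlapping lists is not specified, so the pairing here is an assumption.

```python
# Sketch of the three metrics (A, B, C in Table 9) comparing an algorithm's
# ranked sub-topic list against the merged human judgement.
from scipy.stats import spearmanr

def compare_rankings(algorithm_list, judgement_list):
    shared = sorted(set(algorithm_list) & set(judgement_list))

    # A: number of elements the two ranked lists share
    n_shared = len(shared)

    # B: Spearman's rank correlation over the positions each list assigns
    # to the shared elements (an assumption; the report does not say how
    # non-shared elements were treated)
    rho = float("nan")
    if n_shared >= 2:
        ranks_a = [algorithm_list.index(t) for t in shared]
        ranks_b = [judgement_list.index(t) for t in shared]
        rho, _ = spearmanr(ranks_a, ranks_b)

    # C: Jaccard distance between the two lists treated as sets
    union = set(algorithm_list) | set(judgement_list)
    jaccard_distance = 1 - n_shared / len(union) if union else 0.0

    return n_shared, rho, jaccard_distance
```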
However, further investigation suggests that the technique may depend on the type of topic being queried. The next stage of the research looks to classify topic types and select sub-topic by an associated algorithm most suitable to that class of topic specified. Table 9: Evaluation of sub-topic retrieval algorithms Category algorithm extraction Category Association Page Rank Evaluation Measure A B C A B Document Similarity C A B Shared Links C A B C biofuel 7.00 -1.00 0.17 1.00 -1.00 0.02 14.00 -1.00 0.34 health 9.00 -1.00 0.28 5.00 -0.85 0.16 13.00 -0.28 0.41 14.00 -0.17 0.44 sustainability 13.00 -0.87 0.25 4.00 -1.00 0.08 insurance 1.00 -1.00 0.02 fraud 0.00 -1.00 0.00 13.00 -0.91 0.33 hydraulic fracturing 5.00 -1.00 0.14 palliative care 10.00 -0.95 0.26 Topic pension 9.00 -1.00 0.17 8.00 -1.00 0.20 9.00 -0.85 0.17 9.00 -0.66 0.22 16.00 -0.25 0.39 19.00 -0.04 0.46 6.00 -1.00 0.15 3.00 -0.99 0.08 4.00 -1.00 0.11 10.00 -0.84 0.27 11.00 -0.74 0.30 8.00 -1.00 0.21 19.00 -0.53 0.49 5.00 -1.00 0.13 6.00 -1.00 0.16 7.00 -1.00 0.18 13.00 -1.00 0.34 17.00 -0.73 0.45 nigeria 2.00 -1.00 0.05 6.00 -1.00 0.15 6.00 -0.99 0.15 0.00 -1.00 0.00 obesity 4.00 -0.92 0.10 6.00 -0.98 0.15 7.00 -0.78 0.18 3.00 -1.00 0.07 5.00 -1.00 0.14 14.00 -0.79 0.38 5.00 -1.00 0.14 old age 8.00 -1.00 0.26 16.00 0.03 0.52 11.00 -0.46 0.35 5.00 -0.66 0.16 bankruptcy 8.00 -1.00 0.20 8.00 -0.99 0.20 7.00 -0.74 0.17 14.00 -0.86 0.34 10.00 -0.64 0.25 5.00 -1.00 0.13 7.00 -0.95 0.18 6.00 -0.88 0.15 6.00 -1.00 0.14 5.00 -1.00 0.12 7.00 -0.88 0.16 1.00 -1.00 0.02 159 -11.5 4.13 120 -11.9 3.11 business intelligence 14.00 -0.83 0.38 climate change innovation 103 -14.2 2.65 Key explanation 102 -13.4 2.7 9 A = Number of links that the human judgement and algorithm results share B = Spearman’s rank correlation coefficient between the two human judgement and algorithm results C = Jaccard distance between the human judgement and algorithm results © Copyright Polecat Ltd and other members of the EC FP7 ROBUST project consortium (grant agreement 257859), 2012 26/45 9 Report D9.2, V1.0 Dissemination Level: PU b - Calculating the density of the sub-topics was carried out using two techniques. One, in order to compare with the baseline recorded in section above, used the amended BM25 algorithm. In an effort to improve this, and given the use of Wikipedia data, this built on the results from the original research into explicit semantic analysis and calculated the cosine similarity of each document against the TF-IDF vector for that concept. More formally, the topic T is represented by the vector t: To date, the human judgements have not yet been completed, so definitive results from are not yet available. The experiment is set up as follows: ten topics have been identified, and the ten most closely associated sub-topics were selected using the document similarity technique described above. These sub-topics have been given to the Polecat analyst team, who have been asked to estimate the density of these topics over the document results for a given query. It is then planned that the density calculated from the algorithm described above can be compared with this human judgement, and the efficacy of the technique assessed. It should be noted that it may well be the case that the sub-topics on which the assessment is made are not the best sub-topics to choose given the contents of the documents in the result set. 
However, the aim of this experiment is not to assess the quality of the selected sub-topics, only to measure the accuracy of the density estimated by the algorithm.

Application and feedback

Alongside the evaluation of algorithms described above, Polecat also developed a test application for topic searching over community data. This allows the user to query data with a standard query plus a topic query (denoted with a leading slash). Results that match this query and topic are retrieved, and the density of the sub-topics is also visualised. A screenshot is shown below (Figure 10).

On the right of the page, the documents that match the query and topic term are displayed, and the user is able to page through the results. On the left, three graphs are shown. At the top is a radar diagram; shown on its outer axis is each of the selected sub-topics for the topic specified in the query. The red area displays the density of each sub-topic over the documents retrieved by the query. It should be noted, however, that the density score is normalised by the visualisation, so that the largest is always shown as a maximum density. It therefore gives no indication of actual density, only of density relative to the other sub-topic densities.

To display the overall density of the topic relative to the density seen in the background corpus (and therefore put the radar chart in context), a speedometer chart is displayed below the radar diagram. The axis of this chart runs from -10 (meaning that the topic is not discussed at all), through 0 (meaning the topic is discussed exactly as much as would be expected), to 10 (meaning that the topic is discussed a great deal more than expected; more precisely, that the topic density is at or above 2.5 standard deviations from the mean of the background distribution). A line chart is shown alongside the speedometer. Given that the data displayed is temporal, this simply shows the number of posts that match the query for each day in the specified time period.

Figure 10: Feedback prototype for topic density evaluation

Each action from the user is logged in the application, so that an accurate picture of usability is built up. Further, users are able to feed back on a number of aspects of the results, namely:

- whether a given retrieved document is considered irrelevant;
- whether a sub-topic is not relevant;
- whether the density for a given sub-topic looks incorrect.

This feedback will provide further judgements against the results. It is envisaged that the most useful aspect of this will be the assessment of the selected sub-topics. It is the opinion of Polecat that the technique for selecting the most relevant and closely associated sub-topics for any given topic differs depending on the type of topic. Anecdotally, it has been noticed that the degree of document similarity between topic and sub-topics varies according to what the topic is, suggesting that this technique is limited in its utility. For example, document similarity has been seen to work poorly for quite technical topics. The topic "computer programming" returns sub-topics such as "Method", which is certainly a topic associated with programming, but not one in which the user is likely to be interested.
It is the opinion of Polecat that the technique for selecting the most relevant and closely associated sub-topics for any given topic differs depending on the type of topic. Anecdotally, it has been noticed that the degree of document similarity between topic and sub-topics varies according to what the topic is, suggesting that this technique is limited in its utility. For example, document similarity has been seen to work poorly for quite technical topics. The topic "computer programming" returns sub-topics such as "Method", which is certainly a topic associated with programming, but not one in which the user is likely to be interested. They are more likely to be interested in aspects of the activity of computer programming, such as debugging, gathering requirements etc. By contrast, the results for more nebulous topics are better. For example, the query "equality" gives a number of aspects of that topic, such as "social equality", "liberalism" etc.

To date, the application is still being evaluated. Polecat plans to exploit the capabilities of this application provided the feedback is deemed to have reached a certain level of quality. This exploitation will be in two phases. The first is to allow the user to query using a topic term (probably using the slash notation). One of the major difficulties users have with MeaningMine is identifying the dataset in which they are interested, so this will provide a significant step forward in unlocking the potential of the product. The second is to add the sub-topic densities as an insight into the product. (MeaningMine presents a number of ways of summarising the result-set from a query via various facets of extracted information, such as topic models, sentiment etc.; these are termed "insights" within the product.)

4. Behavioural Signifiers

Identifying behavioural signifiers of health within communities was work predominantly carried out in WP3 (by partner OU) as part of D3.2. They focused on detecting and understanding the correlation between community social behaviour and its overall health. Their assumption was that the type and composition of the behaviour roles exhibited by the members of a community (e.g. experts, novices, initiators) could be used to forecast change in community health. They therefore framed their main research question as: "can we accurately and effectively detect positive and negative changes in community health from its composition of behaviour roles?"

Behavioural health was thus understood in terms of the actions and interactions of users with other community users; the role that a user assumes is the label associated with a given type of behaviour. Roles were identified by a set of behaviours or interactions, such as "engagement, contribution, popularity, participation, etc".

4.1. Identification of health indicators

The work identified four health indicators, based on previous research: loyalty, participation, activity and social capital. These correlated to four specific sets of community features (a sketch of how they might be computed follows the list):

1) Churn rate: the number of users that have posted in the community for the final time, as a proportion of all users that have posted in the same time period.
2) User count: the number of users that posted in the community at least once in a given time period.
3) Seed/non-seed proportion: the number of seed posts (thread starters) that generate at least one reply, as a proportion of seed posts that generate no replies.
4) Clustering coefficient: the average network degree of users in the graph, as an indication of how inter-connected users are.
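A sketch of how the four indicators might be computed for a single, non-empty time window. The Post structure and the last_post dictionary (each user's final posting date over the whole data-set) are illustrative assumptions, not the WP3 implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Post:
    author: str
    timestamp: datetime
    thread_id: str
    is_seed: bool                   # True if the post starts a thread
    reply_to: Optional[str] = None  # author replied to, None for seeds

def health_indicators(posts, start, end, last_post):
    window = [p for p in posts if start <= p.timestamp < end]
    users = {p.author for p in window}

    # 1) Churn rate: users whose final post (over the whole data-set)
    #    falls inside this window, over all users posting in the window.
    churn_rate = sum(1 for u in users if last_post[u] < end) / len(users)

    # 2) User count: users posting at least once in the window.
    user_count = len(users)

    # 3) Seed/non-seed proportion: seeds with at least one reply
    #    against seeds with none.
    replied = {p.thread_id for p in window if not p.is_seed}
    seeds = [p for p in window if p.is_seed]
    answered = sum(1 for s in seeds if s.thread_id in replied)
    unanswered = len(seeds) - answered
    seed_ratio = answered / unanswered if unanswered else float(answered)

    # 4) Average degree of the reply graph, the report's measure of
    #    how inter-connected users are.
    edges = {frozenset((p.author, p.reply_to)) for p in window
             if p.reply_to and p.reply_to != p.author}
    avg_degree = 2 * len(edges) / len(users)

    return churn_rate, user_count, seed_ratio, avg_degree
```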
4.2. Measurement of user behaviour

As well as health indicators, the research examined user behaviour. The rationale for this was: "understanding the behaviour of community users, and how that relates to community health indicators, could provide community managers with information about healthy and unhealthy behavioural traits found in their communities". Behaviour roles were calculated from six key numerical features of the community for each user:

1) Focus dispersion: the forum entropy of the user, where a high value indicates that the user is active on a large number of forums within a community.
2) Engagement: the proportion of all users that the user has replied to, where a high value indicates that the user has wide engagement.
3) Popularity: the proportion of all users in the community that have replied to the user.
4) Contribution: the proportion of all thread replies that were created by the user.
5) Initiation: the proportion of threads that were started by the user – essentially a measure of how active the user is in instigating discussion.
6) Content quality: the average points per post awarded to the user. Obviously, this measure is only applicable to communities where this feature is available.

4.3. Discovering user roles

In order to map user behaviour to user roles in the system, the research took the following steps. First, the user behaviour features above were analysed pairwise for correlation (correlated features are not useful in describing unique behaviour). Results suggested that engagement, contribution and popularity were all highly correlated with each other (as might be guessed intuitively), resulting in the use of only the dimensions focus dispersion, initiation, content quality and popularity. With this trimmed list of user behaviour features, the users were then clustered using unsupervised clustering techniques: Expectation Maximisation, K-Means and hierarchical clustering. The best result, judged according to the cohesion and separation of the clusters, came from K-Means (see the sketch at the end of this section).

Clusters were labelled using a maximum-entropy decision tree that divided clusters into branches so as to maximise the dispersion of dimension levels. This process was repeated until single clusters, or previously merged clusters, were in each leaf node; the path to the root node was then used to derive the label. This gave the labels in the table below:

Table 10: Derived cluster labels

Focussed novice
Focussed expert participant
Knowledgeable member
Knowledgeable sink
Focussed expert initiator
Mixed novice
Mixed expert
Distributed novice
Distributed expert

4.4. Analysing the role/health relationship

Having identified the health indicators and behavioural roles, it was then possible to identify patterns that explain the relationship between a degradation or improvement in a community's health and the behaviour of its members. This was done using two distinct techniques:

Health indicator regression – this used the role composition in each community as a predictor for each of the health indicators.
Health change detection – this performed a binary classification task to detect changes in community health from one time step to the next, investigating whether it is possible to detect changes to communities that could result in bad health.

It should be noted here, however, that the research drew no empirical conclusions on the nature of the relationship; it simply demonstrated that a relationship did exist.
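Returning to the clustering step in Section 4.3, a minimal sketch of K-Means over the four retained behaviour dimensions, with cohesion and separation judged by silhouette score. The feature rows are invented for illustration, and the silhouette criterion is a stand-in for whatever cluster-quality measure WP3 actually applied.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# One row per user: focus dispersion, initiation, content quality, popularity.
X = np.array([
    [0.91, 0.10, 3.2, 0.02],
    [0.15, 0.55, 1.1, 0.21],
    [0.40, 0.05, 4.8, 0.35],
    [0.88, 0.02, 0.9, 0.01],
    [0.12, 0.48, 1.4, 0.19],
    [0.45, 0.07, 4.1, 0.30],
])
X = StandardScaler().fit_transform(X)  # put the dimensions on one scale

# Pick the number of roles by cluster cohesion and separation (silhouette).
best_k = max(range(2, len(X)),
             key=lambda k: silhouette_score(
                 X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)))
roles = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
```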
5. Structural Signifiers

The investigation of structural signifiers had two main elements. The first was research carried out by Polecat, in discussion with community owners, into those signifiers that most reflect the health of a community. The second was analysis carried out by WP5 on the TiddlyWiki community to gather initial statistics on the shape and size of the community. This is particularly valuable in light of the linguistic signifiers of health identified earlier in the document.

5.1.1. Community owner feedback on structural signifiers of health

In order to provide a benchmark of the structural indicators of health that are important to a community, Polecat talked to community owners from a range of currently running on-line communities and, using a simple questionnaire, asked them which statistics from the community they monitor to understand its health at a given time. These communities were: TiddlyWiki, SAP SCN, IBM Connections and _connect. The results are shown in the table below. What is most striking is that, for four vibrant communities, there is mostly overlap between the metrics, and very few metrics at that. This suggests either that community owners have a limited grasp of the structural metrics that inform them of their community's health, or that very little is needed to build an adequate picture of how a community is performing.

Table 11: Structural health signifier matrix

User Metric                      TiddlyWiki   IBM   SCN
# logins per day                 No           Yes   Yes
Average time spent logged in     No           Yes   Yes
# posts                          Yes          Yes   Yes
# posts generating replies       Yes          Yes   Yes
# content views/visitors         Yes          Yes   Yes
# likes                          No           Yes   No
Average user connections         No           Yes   No
# new members in time period     Yes          Yes   Yes
# users                          Yes          Yes   Yes
Max concurrent users per hour    No           No    Yes

5.1.2. Analysis of TiddlyWiki structural factors

WP5 carried out an initial analysis of the TiddlyWiki community [4] as a benchmark of the structural statistics that contribute to the health of a community, against which other communities could be compared. This relates to the TiddlyWiki community from 2005 until the end of 2011, and includes three separate sub-sections:

WikiGroup: the core TiddlyWiki development community
WikiDevGroup: a development community for TiddlyWiki software
WebGroup: a development community focused on associated web technologies

The basic statistics of the community over all time steps are shown below in Table 12. These statistics are the number of users in the community (represented by "nodes"), the number of relationships between these users, calculated as the occurrence of direct communication between two users (represented by "edges"), and the number of postings in the community (represented by "emails").

Table 12: Basic statistics for the TiddlyWiki groups

Data set       Nodes (users)   Edges (replies)   Emails
WikiGroup      2774            16804             51662
WikiDevGroup   698             4345              14703
WebGroup       77              464               3166

With the goal of analysing the sub-networks, a 12-month window size turned out to be the most suitable, as it was then possible to detect any significant communities in the other sequences. Given that the WebGroup data-set spanned only two years and no significant community structure was detected in it, it is not discussed further here.
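The figures in Table 12 correspond to a straightforward reply-graph construction; a sketch using networkx follows, assuming the archive has been reduced to (sender, replied-to) pairs, one per email, which is an illustrative input format rather than WP5's own.

```python
import networkx as nx

def basic_stats(messages):
    """messages: iterable of (sender, replied_to) pairs, one per email,
    with replied_to set to None for thread-starting messages."""
    g = nx.Graph()
    emails = 0
    for sender, replied_to in messages:
        emails += 1
        g.add_node(sender)  # every poster is a node
        if replied_to is not None and replied_to != sender:
            g.add_edge(sender, replied_to)  # one edge per communicating pair
    return g.number_of_nodes(), g.number_of_edges(), emails
```

Restricting the messages to a single 12-month slice before calling this yields the per-slice networks discussed next.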
The partitions that were discovered were, for the most part, associated with a relatively low modularity value of 0.3. (Modularity is a quality function with values approaching 1 indicating good community structure, and values approaching 0 indicating poor or completely missing community structure.) As a result of this low quality, the results of that clustering method are not presented here. Further investigation of the community structure using the OSLOM method, which finds only statistically significant communities, confirmed that there is only weak community structure, with the majority of nodes not being members of any community.

In terms of the WikiGroup, visual inspection of the community structure of the dataset suggested that there is apparently only a single stable community (see Table 13). All the other communities were significantly smaller, and all of them either dissolved or disappeared completely; that is, the vertices of the subgraph disappear from the network in the following step-graph. This suggests that there is one stable community of core users discussing amongst themselves and a couple of ad-hoc, short-lived communities on the periphery, while the majority of users do not form any significant cluster of frequently mutually communicating users. The sizes of the WikiGroup and WikiDevGroup communities over time-slices are shown below.

Table 13: WikiGroup and WikiDevGroup sizes by time-slice

Time Slice                 WikiGroup Size   WikiDevGroup Size
2005-06-15 – 2006-06-15    602              273
2006-06-15 – 2007-06-15    693              210
2007-06-15 – 2008-06-15    715              142
2008-06-15 – 2009-06-15    691              151
2009-06-15 – 2010-06-15    574              101
2010-06-15 – 2011-06-15    285              61

By contrast, the WikiDevGroup was constantly shrinking over time. Its community structure is similar to that observed in the WikiGroup data-set: a stable core community with a handful of small, short-lived communities on the periphery, of which the majority of users are not members.

The results from this study form the basis for further study. Having the data annotated from both a network perspective and a linguistic perspective should allow us to view the correlations between the two. The suggestion is that various forms of language cause adaptations not only in the health of the language used, but also in the way that users congregate into networks of cooperation. We hope to address this in D9.6. Further network statistics for the WikiGroup are shown in the appendix (Section 8.4).
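For completeness, a sketch of the per-slice community analysis, using networkx's greedy modularity maximisation as a stand-in for the clustering and OSLOM methods applied by WP5 (which are not reproduced here). Modularity near 0, as reported above, signals weak structure.

```python
from networkx.algorithms import community

def slice_structure(g):
    """Communities and modularity for the reply graph of one 12-month slice."""
    parts = community.greedy_modularity_communities(g)
    q = community.modularity(g, parts)
    sizes = sorted((len(c) for c in parts), reverse=True)
    return q, sizes  # low q (e.g. ~0.3) indicates weak community structure
```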
6. Project Integration

Many of the outputs from the work described above have fed into software deployed to the ROBUST platform, and are in the process of exploitation by Polecat.

6.1. The dJST topic-by-sentiment extraction model

This is currently being implemented on the ROBUST platform. It is being used by Polecat to monitor a stream of tweets for growth in negative topics surrounding a particular query. In most cases, and certainly in the case of eventually exploitable code, this query pertains to a company or product name. Community content is buffered by a bespoke component. Then, every n hours (referred to hereafter as an epoch), a topic model is extracted from that content using the dJST module, which has been deployed as a service. The topic model contains the "size" of each topic, and this metric is sent to a predictive algorithm that uses Gibbs sampling to infer predictions. If any of the negative topics looks likely to grow significantly in the near future, the Gibbs sampler alerts a service of the problem so that appropriate action can be taken. The flow of the application is shown below:

Figure 11: Flow of negative topic prediction architecture
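The epoch-based monitoring loop can be sketched as follows. The Topic structure and the naive growth predictor are illustrative stand-ins for the dJST service output and the Gibbs-sampling predictor, neither of which is specified in detail in this report.

```python
from dataclasses import dataclass

@dataclass
class Topic:
    label: str
    sentiment: str     # "positive", "neutral" or "negative"
    size_history: list # topic "size" per epoch, newest last

def predict_growth(size_history):
    """Stand-in for the Gibbs-sampling predictor: a one-step growth ratio."""
    if len(size_history) < 2 or size_history[-2] == 0:
        return 0.0
    return size_history[-1] / size_history[-2] - 1.0

def topics_to_alert(topics, growth_threshold=0.5):
    """Run once per epoch over the dJST output: return the negative topics
    whose predicted growth warrants alerting the downstream service."""
    return [t for t in topics
            if t.sentiment == "negative"
            and predict_growth(t.size_history) >= growth_threshold]
```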
6.2. Graphic Equalizer

The graphic equalizer (as described in D9.3 and D9.4) is software designed to present the health of an on-line community both visually and aurally. It displays a number of different health metrics, from a potentially disparate number of sources, and offers a single view of these. It performs two primary user functions: firstly, it allows users to review the different elements of the health of a community over a set time-frame; secondly, it allows users to monitor, in real time, health as it changes and adapts in the community. It should be noted that, since the health metrics from the different sources are normalised, the software can in theory be used to describe any temporal data. The data used by the graphic equalizer to date has been the role-analysis work from WP3 described above under behavioural analysis, and structural network data from WP5. It uses a selection of remote services from various partners (WP3 and WP5) to achieve this, utilizing the enterprise service bus. The architecture, and how these technologies have been deployed to the ROBUST image, are shown in the figure below:

Figure 12: Graphic equalizer architecture overview

6.3. Metaphor-based visualisation

Whilst still in development, as part of D9.4 and D9.5 Polecat is currently developing a metaphor-based visualisation to display the health metrics described above. The current prototype works from the same services as the graphic equalizer (indeed, it utilizes the same UI service), but displays the metrics as part of an intuitive, easily understood metaphor.

7. Conclusion

This document has outlined the work performed to benchmark the structural, behavioural and linguistic signifiers of on-line communities. Benchmarking linguistic signifiers was approached from three directions. Firstly, a manual analysis of community data was performed by Polecat experts, and the core linguistic aspects identified. Secondly, the accuracy of traditional classification techniques over community data, used to measure the health of the language, was assessed. Thirdly, in accordance with the specific Polecat use case, an analysis of explicit and implicit topic modelling was described. The benchmarking of behavioural signifiers followed a more linear path, from the identification of key indicators through to techniques that measured this behaviour, extracted roles and assigned these to users.

Because structural signifiers are the most tangible of the three community metrics, and therefore least in need of algorithmic investigation, the benchmarking focused on the needs of community leaders and which metrics they already monitor. It also included a benchmark of network statistics against which linguistic correlations could be discovered, leading (it is hoped) to insights about how the use of language can affect user interaction.

The next phase, building on the results outlined above, will examine statistical and linguistic processes created specifically for social media, and improve on some of the baselines. This will include research into whether and how the document result-set from an information retrieval query can be truncated so that aggregated results of information extracted from it are of a similar or better quality to the aggregated results from the entire result-set.

8. Appendix

8.1. User classification training set errors

Below are the error rates from the classification of roles derived from Polecat's linguistic analysis of community data. The error rate, in this context, is the percentage of documents that were predicted incorrectly.

Role               C4.5     Decision Tree   MaxEnt   Naïve Bayes
Core participant   2.46%    50.64%          0.00%    2.08%
Newbie             0.19%    0.40%           54.84%   45.70%
Elder              31.26%   13.10%          0.00%    1.19%

8.2. Sub-topic density using cosine similarity (example)

Below is an example of five sub-topics of the topic "north sea oil", and the average cosine similarity measure between each sub-topic and the documents in the result-set. This average cosine similarity is calculated for differing numbers of terms describing the sub-topic. For example, "5" means that the sub-topic is treated as 5 terms, and the cosine similarity is between these terms and each document in the result-set.

Sub-topic                                   1       5       10      15      20      25      30
Oil platform                                3.30    1.94    5.03    4.24    3.18    3.89    3.31
Offshore oil and gas in the United States   19.06   16.91   10.50   20.67   15.11   13.28   12.55
Rhum gasfield                               0.00    5.64    4.82    4.25    4.91    5.88    5.03
Bahar oilfield                              0.00    1.18    10.32   11.39   16.18   18.53   18.81
Pallas gas field                            0.00    0.09    3.16    3.74    2.77    3.23    4.43

8.3. Snapshots of the WikiGroup Network

Below are pictorial representations of the WikiGroup network for each of the years of the data that were analysed.

[Network snapshots for 2005-2006, 2006-2007, 2007-2008, 2008-2009, 2009-2010 and 2010-2011]

8.4. WikiGroup Network Statistics

Shown below are some of the key network statistics extracted by the work of WP5 on the TiddlyWiki data. These statistics are calculated for each year of the community (shown on the x-axis of each graph). The y-axis represents the scale of each individual metric.
[Plots: clustering coefficient and density; connected components and average degree; modularity; network distances]

List of Figures

Figure 1: Success language patterns
Figure 2: Motivating language patterns
Figure 3: Less healthy indicators
Figure 4: dJST by number of reviewers
Figure 5: dJST by average rating
Figure 6: dJST perplexity and classification
Figure 7: Graph structure for Wikipedia topic service
Figure 8: Topic density for taxonomy "energy reputation"
Figure 9: Topic density for taxonomy "financial distress"
Figure 10: feedback prototype for topic density evaluation
Figure 11: Flow of negative topic prediction architecture
Figure 12: Graphic equalizer architecture overview

List of Tables

Table 1: User types identified by linguistic analysis
Table 2: Health bands across communities
Table 3: Sentiment classifier results
Table 4: Sentiment split for on-line forums
Table 5: Classified user types for the TiddlyWiki community
Table 6: Precision for topic query expansion
Table 7: Recall for topic query expansion
Table 8: F-Measure for topic query expansion
Table 9: Evaluation of sub-topic retrieval algorithms
Table 10: Derived cluster labels
Table 11: Structural health signifier matrix
Table 12: Basic statistics for the TiddlyWiki groups
Table 13: WikiGroup and WikiDevGroup sizes by time-slice

List of Abbreviations

Abbreviation   Explanation
ROBUST         Risk and Opportunity management of huge-scale BUSiness communiTy cooperation
dJST           Dynamic joint sentiment topic model
WP             Work package
TF-IDF         Term frequency – inverse document frequency

References
[1] E. Gabrilovich, S. Markovitch (2007): Computing semantic relatedness using Wikipedia-based Explicit Semantic Analysis: Proc. 20th International Joint Conference on Artificial Intelligence (IJCAI)
[2] S. Angeletou, M. Rowe, H. Alani (2011): Modelling and analysis of user behaviour in online communities: The Semantic Web – ISWC 2011
[3] T. Mostyn (2012): Polecat Use Case Data and Requirements: WP9-D9.1Report_v2.0.docx
[4] V. Belak (2011): Analysis of Community Structure and Dynamics in Tiddlywiki Email Fora: 2011_April_Tiddlywiki_Analysis.pdf
[5] S. Staab, T. Gottron (2010): ROBUST Description of Work: ROBUST_Description of Work.pdf
[6] A. Hogan, M. Kunstedt (2012): Suite for behaviour analysis and topic/sentiment tracking: WP5-behaviour-analysis.docx

Version history

Version   Date         Author        Comments
0.1       11/10/2012   Toby Mostyn   Initial draft
0.2       29/10/2012   Toby Mostyn   Response to feedback
1.0       30/10/2012   Toby Mostyn   Release version

Acknowledgement

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 257859, ROBUST.

© Copyright Polecat Ltd and other members of the EC FP7 ROBUST project consortium (grant agreement 257859), 2012