Int J Software Informatics, Volume 6, Issue 1 (2012), pp. 43–59
International Journal of Software and Informatics, ISSN 1673-7288
© 2012 by ISCAS. All rights reserved.
E-mail: [email protected]
http://www.ijsi.org
Tel: +86-10-62661040
Iterative Visual Clustering for Learning Concepts from Unstructured Text
Qian You1,2, Shiaofen Fang2, and Patricia Ebright3
1 (Computer Science Department, Purdue University, West Lafayette, IN 47907, USA)
2 (Department of Computer and Information Science, Indiana University-Purdue University Indianapolis, Indianapolis, IN 46202, USA)
3 (Indiana University School of Nursing, Indiana University-Purdue University Indianapolis, Indianapolis, IN 46202, USA)
Abstract Discovering concepts from vast amounts of text is an important but hard explorative task. A common approach is to identify meaningful keyword clusters with interesting temporal distributive trends in unstructured text. However, such clustering usually lacks a clearly defined objective function, so users' domain knowledge and interactions need to be fed back to drive the clustering process. We therefore propose iterative visual clustering (IVC), a novel visual text analytical model. It uses different types of visualizations to help users cluster keywords interactively, and uses graphics to suggest good clustering options. The most distinctive difference between IVC and traditional text analytical tools is that IVC has a formal on-line learning model which learns users' preferences iteratively: during iterations, IVC transforms users' interactions into model training input, and then visualizes the output for further user interaction. We apply IVC as a visual text mining framework to extract concepts from nursing narratives. With the engagement of domain knowledge, IVC has produced insightful concepts with interesting patterns.
Key words: text and document visualization; iterative visual analytics; nursing data processing
You Q, Fang SF, Ebright P. Iterative visual clustering for learning concepts from unstructured text. Int J Software Informatics, Vol.6, No.1 (2012): 43–59. http://www.ijsi.org/1673-7288/6/i126.htm
1 Introduction
Inferring higher-level concepts is an important application of discovering knowledge from unstructured text data. This is often achieved by identifying meaningful keyword clusters and their temporal distributions within texts. For example, interpreting the dynamics of word usage in the large amount of on-line unstructured or loosely structured narratives can greatly help people understand shifts in web users' mainstream opinions. Research efforts have been dedicated to analyzing themes and concepts[1−3] and interpreting their in-text distribution patterns.
Corresponding author: Qian You, Email: [email protected]
Received 2010-12-30; Revised 2012-03-08; Accepted 2012-03-12.
Detecting such concepts from vast quantities of text is an explorative task, and has therefore posed a number of difficulties for the traditional text analysis community. Analyzing the text with Natural Language Processing (NLP) techniques can fall short due to the lack of grammatical structure and syntactic rules; deriving concepts with supervised learning algorithms requires relatively complete prior knowledge, yet in our explorative task the intended concepts are generally unknown beforehand. Semi-supervised learning models, such as on-line learning, can start with limited information or examples and refine the learnt concepts as newer instances are input. Nevertheless, there are very few visual interfaces through which users can interact with the learning process, including observing the visual phenomena of the intermediate results and inputting their preferences to the process.
Meanwhile, many novel information visualization techniques and interaction schemes map text to visual patterns or “plots”[4−6]. Typically, the distributions of keywords are visualized as trends, or correlations among the keywords are visualized as graphs, so that domain experts can interpret the visual cues to gain insights into the text in a way automatic text analysis can hardly emulate. However, in most current visual methods, the formed insights and hypotheses about the text content usually remain at a descriptive level. There are few visual text analysis platforms through which users can quantitatively feed back their preferences to guide the text analysis process.
Health care applications particularly need such visual platforms, for two major reasons. First, the large amount of unstructured clinical and patient care text records and narratives accumulated over decades has become a rich source for investigating daily nursing tasks and nurses' working patterns; effectively understanding and re-designing nurses' assignments can greatly improve nurses' daily performance. Second, the analysis of patient-care-related text requires intensive involvement of medical care experts, whose experience and domain expertise are the keys to driving the text analysis process to converge to meaningful concepts. Yet few frameworks have provided a formal model to feed back their preferences, so the loop of the “visual reasoning cycle” has not been fully established in text analytic applications, especially in health care domains.
In this paper we propose an interactive visual text analytics framework, the iterative visual clustering (IVC) method. It aims to derive meaningful high-level concepts from unstructured text through an iterative clustering process driven by users' interactions with visualizations. We therefore propose several visualization methods in IVC, including concept trend visualization, concept visual layout and concept terrain surface visualization, which present visual cues that help users understand and cluster keywords into candidate concepts: concept trend visualization and concept visual layout represent the concepts formed in the current step of clustering, so users can grasp an overview of the concepts and identify interesting candidate concepts; concept terrain surface visualization, leveraging the terrain surface visualization technique, builds on the concept visual layout to suggest good clustering options as landmark features. To learn from users' interactions with those visualizations, IVC uses an on-line learning model, a multi-class discriminant function (MCD). Users' interactions mainly change the training samples and their properties, and thereby drive the training of the learning model. The output of the updated learning model is then visualized and interacted with by users again. The process continues over iterations, and the
model continuously learns to generate and visualize newer concept clusters.
The major contribution of IVC is that it is a novel, formalized visual mining model for unstructured text that enables on-line learning from users' interactions with visualizations. Compared to existing visual text analytics systems[4,7−9], IVC is advantageous because it models and feeds back users' preferences to the underlying text mining model. It can also continuously learn when new input is available, differing from traditional text mining methods where training is off-line. The second contribution of IVC is that we leverage existing visualization techniques to highlight the recommended solutions from the learnt model: we extend terrain surface visualization[10,11] to visualize the best possible clustering score suggested for clustering neighbouring candidate concepts. The third contribution is that IVC provides interactions with different types of visual metaphors, offering multiple perspectives for collecting visual evidence for analysis. The final contribution is that we apply IVC to health care applications, i.e. to identify concepts from nursing narratives. The identified concepts and their visual patterns offer valuable insights into nurses' daily working patterns and workflows.
We will walk through the main method of IVC using concept extraction from nursing narratives as an example. The next section reviews relevant work in text mining and visual text analytics. We then introduce nursing narratives and their applications in Section 3. The section after that describes concept trend visualization, concept visual layout and concept terrain surface visualization, which visualize candidate concept clusters and enable users' visual analytics. For concept terrain surface visualization, we also discuss in depth the choice of multiple criteria functions to evaluate the clustering, and how the best clustering scores are visualized to suggest good clusters. Section 5 first presents a perceptron-based iterative learning model for IVC to learn clustering from users' interactions; in the same section, we then describe how users' interactions with visualizations can be transformed to change the training of the learning model, thereby driving the iterative clustering process to generate newer clusters. In Section 6 we discuss the results of applying IVC to nursing narratives data sets, and then we conclude the paper with additional remarks and future work.
2 Related Work
Text Mining. The term-document model[12] is widely used in text analysis tasks[13] to model context by keyword features. Those keywords can then be clustered by unsupervised learning methods[14,15]. However, defining an objective function for clustering is still an open research topic. In addition to non-parametric clustering methods, factor analysis techniques, such as LSI[16], PLSI[17], LDA[18] and Hidden Markov Models[19], have also been proposed to detect potential concepts from text. Essentially, those techniques focus on frequently occurring keywords, which do not necessarily contribute to humans' comprehension of the text. There are also supervised learning approaches where the best labels of the keyword clusters are sought given models trained with known concepts or topic classes. Typical models include maximum likelihood models[20] and Bayesian models[21]. However, they usually require relatively heavy prior knowledge of the training set and sample distributions, which is infeasible in explorative text analysis. To overcome the difficulty of limited background information, on-line learning methods have recently been used in text mining[22,23] to iteratively update
the detected concepts using newly streamed-in training samples.
Visual Text Analytics. Understanding the content by reading is not feasible for massive amounts of text data, and automatic analysis methods such as computational linguistic models can easily fail due to noise. Graphical patterns of keywords and events[24] are therefore strong cues for understanding the overview of free texts. A number of existing works use layered graphs to visualize the trends of groups of keywords together, to present an overview of the texts: ThemeRiver[7] is one of the first to explore computations that enhance the layered graph representation; CareView[25] uses the layered graph to visualize personal medical care reports; Stacked Graph[26] uses a mathematical model to optimize the geometry of the layered graphs in terms of legibility and aesthetics; TIARA[8] is a layered-graph-based user interface which presents a visual summary of topics extracted from large text corpora. Parallel coordinates are also widely used in visualizing the dynamics of the content of large text data sets. Parallel Tag Clouds[27] uses traditional tags as points on the time axes to provide a rich document overview as well as an entry point for exploration of the individual texts; Rose et al.[28] have used a similar representation to both show and link essential information from streaming news.
Graph visualizations are used to represent in-text correlations among phrases and words to reveal the semantics of the text corpora: Phrase Net[29] builds graphs where words are nodes and user-specified syntactic or lexical relations are edges, providing overviews of in-text concepts from different perspectives; Chen et al.[30] use an example-based approach to visualize keyword clusters, reducing visual clutter when representing large-scale text data sets. Essentially, it first provides a compact low-dimension approximation of the clusters, and then provides details upon users' selection of examples or desired neighbourhoods. Building on top of the clusters of words, other systems, such as IN-SPIRE[31,32], VxInsight[33] and ThemeScape[32], use interpolated surfaces to represent the results of content keyword clustering and local densities.
While the aforementioned works successfully render insightful visual representations, they usually require the results of statistical text analysis as a preceding step. For example, the Word Tree[9] and Phrase Net[29] visualizations use the n-gram model[34] to extract frequently occurring phrases; TIARA[8] uses LDA[35] and Lucene[36] models to extract and index topics; RAKE[37] and Sorensen similarity coefficients[38] are used to identify and cluster major themes in streaming text in[39]. However, choosing the right type and parameters for the text analysis model requires sufficient knowledge of the text, which is usually not available for explorative text analysis. Although most visual text analytics systems support interactive visual explorations to a certain extent, users' preferences for certain visual results largely remain at a descriptive level and cannot be fed back.
Recently a few applications in explorative text analysis have presented innovative ways to engage users in the analytical loop: our previous work[40,41] visualizes the patterns of a great many keywords, and then forms keyword clusters using an interactive version of a genetic algorithm; in the visual evaluation proposed by Oelke et al.[42], text features can be refined by comparing intermediate rendering results and changing parameter thresholds; in LSAView[43], users can adjust parameters for the SVD decomposition to investigate the resulting document cluster mosaics. A number of formalized schemes have been proposed to complete the “visual reasoning cycle”[44] by feeding back users' insights drawn from visual evidence. Vispedia[45] allows users to construct desired visualizations, based on which the system can recommend relevant content by conducting an A* search on a semantic graph. Koch et al.[46] develop an interesting process for a visual querying system on a patents database. Schreck et al.[47] feed back users' preferences as certain shapes for supervised learning. Wei et al.[52,53] visualize the topic evolution trends from topic mining of multiple text corpora, and use certain users' interactions to trigger the process of updating the backend text mining model. However, very few established visual analytics frameworks that attempt to derive general themes or high-level concepts from massive amounts of unstructured text are able to learn from users' preferences.
3 Nursing Narratives Processing
A procedure for manually recording direct observations of Registered Nurses' (RN) work was developed to help working nurses and nursing domain experts investigate and understand the numerous activities RNs engage in to manage the environment and patient flow. Observation data were recorded on legal pads, line by line, using an abbreviated shorthand method, as unstructured text. A segment of a sample session of nursing narratives is shown below:
• Walks to Pt #1 Room
• Gives meds to Pt #1
• Reviews what Pt #1 receiving and why
• Assesses how Pt#1 ate breakfast
• Teaches Pt#1 to pump calves while in bed
• Explains to Pt#1 fdngs- resp wheezes
• Reinforces use of IS Pt#1
• Positions pillow for use in coughing Pt#1
• .....................
What appears to the casual outside observer to be single task elements becomes a much more complicated array of overlapping functions with inter-related patterns and trends. There are two basic anticipated applications of analyzing RN work: (1) identification of work patterns related to non-clinical work or basic nursing work; (2) staffing and assignment implications based on work patterns across time.
4 Iterative Visual Clustering Visualizations
Iterative visual clustering is essentially a user-interaction driven clustering process. Visualizations in IVC serve two main purposes: first, they represent the concepts formed in the current step of clustering, so users can grasp an overview of the concepts and identify interesting candidate concepts; second, they are also the interface where users' interactions with clusters are collected and input to drive the clustering process. In this section we describe several visualization methods for candidate concepts. These visualizations not only help users understand the current concepts extracted from the text, but also guide users to generate more clusters.
Figure 1. Stacked trend visualizations of candidate concepts “interactions”, “documentation” and “procedures”
4.1 Candidate concepts trend visualization
A candidate concept is a keyword or a group of keywords. IVC iteratively clusters candidate concepts xi into larger groups to form higher-level concepts, i.e. x1 ∨ x2 ∨ · · · ∨ xp. The clustering process is non-partitioning and non-exhaustive: not all of the candidate concepts will participate in larger concepts, and one candidate concept can appear in more than one larger concept.
From unstructured text like nursing narratives, we identify representative keywords as the basic candidate concepts to be clustered. Stop words are filtered out, and the remaining words go through the standard procedures of tokenization and stemming. We then rank words according to their overall occurrences and use a percentile threshold (in this study, 80%) on the occurrences to keep representative keywords. The choice of threshold ensures that each keyword has a sufficiently large occurrence. We also make sure that informative domain keywords, such as “iv” and “stethoscope”, are kept.
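The keyword-extraction step above can be sketched as follows; note that the stop-word list and the suffix-stripping stemmer here are illustrative placeholders, not the actual components used in the study.

```python
from collections import Counter

# Placeholder stop-word list (the study's list is not specified)
STOP_WORDS = {"to", "the", "of", "and", "in", "for", "a", "on", "with", "what"}

def stem(word):
    # Toy stemmer: strips a few common suffixes; a real pipeline
    # would use a standard stemmer such as Porter's
    for suf in ("ing", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def extract_keywords(lines, percentile=0.8):
    """Tokenize, drop stop words, stem, then keep words whose total
    occurrence is at or above the given percentile of all word counts."""
    counts = Counter()
    for line in lines:
        for tok in line.lower().split():
            tok = "".join(ch for ch in tok if ch.isalpha())
            if tok and tok not in STOP_WORDS:
                counts[stem(tok)] += 1
    ranked = sorted(counts.values())
    cutoff = ranked[int(percentile * (len(ranked) - 1))]
    return {w for w, c in counts.items() if c >= cutoff}
```

A curated whitelist of domain keywords (e.g. “iv”) would be unioned into the result regardless of the threshold.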
We visualize the trend of a candidate concept to represent its dynamics and progression. In the text preprocessing step, each candidate concept in each document has an occurrence vector, depending on the segmentation points of the documents (in this study, each line of the narratives). The occurrences of the keywords in a concept are counted as the occurrences of the concept itself. We then accumulate its occurrence vectors over all data sets into one, by averaging their frequency domain representations obtained with the Discrete Fourier Transform[48] and then applying the inverse transform. The accumulated occurrence vector of a concept is then smoothed by a Gaussian filter and visualized as a trend over a time line.
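The accumulation and smoothing steps can be sketched with NumPy as follows; the zero-padding to a common length, the value of σ and the kernel radius are our assumptions, since the paper does not specify them.

```python
import numpy as np

def accumulate_trend(occurrence_vectors, sigma=1.0):
    """Average per-document occurrence vectors in the frequency domain
    (DFT), invert the transform, then smooth with a Gaussian kernel."""
    length = max(len(v) for v in occurrence_vectors)
    # Zero-pad every vector to the same length so spectra are comparable
    spectra = [np.fft.rfft(np.pad(np.asarray(v, float), (0, length - len(v))))
               for v in occurrence_vectors]
    trend = np.fft.irfft(np.mean(spectra, axis=0), n=length)
    # Gaussian smoothing by direct convolution
    radius = max(int(3 * sigma), 1)
    xs = np.arange(-radius, radius + 1)
    kernel = np.exp(-xs ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    return np.convolve(trend, kernel, mode="same")
```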
To study the differing progressions of several concepts, we can stack the trends of three concepts vertically in ThemeRiver[7] style, filled with different colors. For example, we cluster “tell”, “explain”, “answer” and “listen” as the concept of nurse-patient interactions, “write”, “review” and “chart” as the concept of documentation, and “iv”, “tube” and “pump” as the concept of daily procedures. The different progressions of trend
patterns over the time line (X axis) enable a better understanding of the three types
of behaviors nurses perform daily.
4.2 Candidate concepts visual layout
The candidate concepts visual layout uses graph visualization to arrange the thumbnails of candidate concept trends. Each candidate concept is a node, and the similarity between candidate concepts defines the distance between nodes (see Fig. 2). With a distance measure between any two candidate concepts, graph drawing algorithms, e.g. spring-embedder graph drawing[49] and Multi-dimensional Scaling[50], can be used to lay out all concepts in a two-dimensional plane. The thumbnail of each concept's trend pattern is placed at the 2D position of the concept (Fig. 2 left).
Figure 2. Candidate concepts visual layout
We compute a category vector pp(xi) for each candidate concept xi, and define the distance between concepts as the distance between their category vectors. Concept categories are usually what users are interested in when investigating the text, and can be predefined based on domain knowledge. A category vector is pp(xi) = {p(c0|xi), p(c1|xi), ..., p(cm|xi)}, where p(cj|xi) is the posterior probability that xi belongs to category cj. These probabilities are estimated by an on-line trained multi-class discriminant function (MCD) introduced later.
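A minimal sketch of laying out concepts from their category vectors, here using classical MDS via eigendecomposition (one of the two layout options the text mentions; the category-vector values are toy inputs):

```python
import numpy as np

def category_distance(pp_a, pp_b):
    """Euclidean distance between two category vectors."""
    return float(np.linalg.norm(np.asarray(pp_a) - np.asarray(pp_b)))

def mds_layout(category_vectors):
    """Classical MDS: embed concepts in 2D from pairwise category distances."""
    P = np.asarray(category_vectors, float)
    n = len(P)
    D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:2]    # two largest eigenvalues
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))
```

A spring-embedder layout over the same distance matrix would be a drop-in alternative.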
The layout changes during the clustering process, as the category vectors of the candidate concepts are updated when the estimates of the category probabilities change. Thus the layout provides an up-to-date overview of the concept thumbnails generated by the current step of the clustering process. By zooming onto individual thumbnails, users can inspect the detailed trend and the keywords of a candidate concept, following the scheme of “detail on demand”[51].
The layout thumbnails also provide visual cues for users to interactively manipulate the formed clusters: two candidate concepts with similar patterns may indicate parallel occurrences over time and can be clustered further. The visual layout also partially supports scaffolding the history of cluster generation. We use transparency to indicate the age of clusters: the more transparent a thumbnail pattern is, the older the candidate concept is, so the visual layout tends to draw users' attention to the newly formed candidate concepts.
4.3 Candidate concepts terrain surface visualization
Trend and graph visualizations can help users manipulate existing clusters; however, users can only qualitatively evaluate the clustering after their manipulations. Therefore we also propose a terrain surface visualization to suggest good options for clustering neighbouring concepts. Terrain surface visualization[10,11] renders a surface profile over a 2D base network (Fig. 3(a)) by treating a numeric attribute of the nodes as a response variable, and interpolating that variable into elevations at every point of the 2D plane (Fig. 3(b)). It has the advantage of exposing continuous global changing patterns over a network, and can assist users in identifying interesting local regions through pre-attentive landscape features, such as characteristic peaks and valleys[10,31,33].
Figure 3. Terrain surface formed by interpolating a response variable on a base network
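The elevation interpolation can be sketched with inverse-distance weighting; this is a stand-in for whichever interpolant [10,11] actually use, and the power and epsilon parameters are our assumptions.

```python
import numpy as np

def terrain_height(grid_xy, node_xy, node_value, power=2.0, eps=1e-9):
    """Inverse-distance-weighted interpolation of node response values
    (e.g. best clustering scores) over points of the 2D layout plane."""
    g = np.asarray(grid_xy, float)[:, None, :]   # (G, 1, 2) query points
    p = np.asarray(node_xy, float)[None, :, :]   # (1, N, 2) node positions
    d = np.linalg.norm(g - p, axis=-1)           # (G, N) distances
    w = 1.0 / (d ** power + eps)                 # closer nodes weigh more
    w /= w.sum(axis=1, keepdims=True)
    return w @ np.asarray(node_value, float)
```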
In IVC we render a terrain surface to help find good clusters of candidate concepts, by treating the candidate concepts visual layout graph as the base network, and treating the best available clustering score of a candidate concept as the response variable. Finding the best clustering score first requires an objective function that the current candidate concept can be evaluated against when being clustered with its nearby concepts. Because the objective function of clustering is in general subjective, and in our case is highly dependent on the semantics of the context, we do not limit IVC to a single objective. Instead, we use multiple criteria, and for each criterion we render a terrain surface based on the best clustering score evaluated against that criterion. Figure 4(a) shows a panel of the contours of the terrain surfaces for the three criteria functions used:
Pattern templates: Pattern templates are interesting patterns found in candidate concepts (shown in Fig. 4(c), Pattern Templates). When pattern templates are used as objective functions, the similarity between a template and a candidate concept is evaluated by computing a cosine score between the two occurrence vectors. Pattern templates can be added or deleted during the iterations. The terrain surface of the best clustering scores with respect to the templates is shown as Criteria 1 (Fig. 4(a)). We use a saturation-hue model to color the scalar value of the surface height.
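The cosine score between a template's occurrence vector and a candidate concept's occurrence vector can be written directly:

```python
import numpy as np

def template_score(template, concept):
    """Cosine similarity between a pattern template's occurrence vector
    and a candidate concept's occurrence vector (0 if either is zero)."""
    t, c = np.asarray(template, float), np.asarray(concept, float)
    denom = np.linalg.norm(t) * np.linalg.norm(c)
    return float(np.dot(t, c) / denom) if denom else 0.0
```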
Statistical dependencies: The bigram is used to evaluate statistical dependencies. A large bigram value indicates a strong temporal dependency between two concepts in the text, and is therefore a good indicator that the two can be clustered into a higher-level concept. We use the bigram because n-gram evaluation can become unreliable as n grows larger than three. The terrain surface using the bigram as the objective function is shown in Fig. 4(a) above Criteria 2.
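A simple conditional bigram score over the token stream can illustrate the idea; counting adjacency within each line, rather than across a whole narrative, is our simplifying assumption.

```python
from collections import Counter

def bigram_score(lines, a, b):
    """Estimate P(b follows a) as bigram count over the unigram count
    of `a`, counting adjacent tokens within each line."""
    uni, bi = Counter(), Counter()
    for line in lines:
        toks = line.lower().split()
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return bi[(a, b)] / uni[a] if uni[a] else 0.0
```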
Posterior probability to the predefined categories: Using this objective function, we try to cluster the current candidate concept with its neighbors such that the largest score in the category vector pp(xi) = {p(c0|xi), p(c1|xi), ..., p(cm|xi)} of the resulting cluster is maximized.
Figure 4. The Iterative Visual Clustering (IVC) user interface and interactive visualizations
A number of other criteria functions, especially statistical measures widely used in NLP, can be included to help identify semantically meaningful clusters.
Finding the best clustering score also requires searching for one or several neighbors to cluster with the current concept, so that the clustering score against the objective function is maximized. We use a best-first heuristic process to maximize the clustering score: we first sort the nearest neighbours in descending order according to the criteria function; we then keep merging the sorted neighbours into the current candidate concept one by one until the score of the clustering result no longer increases.
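The best-first merging heuristic above can be sketched generically over any criterion function; sorting neighbours by their individual scores is our reading of "sort ... according to the criteria function".

```python
def best_first_cluster(current, neighbours, score):
    """Greedy best-first merging: sort neighbours by the criterion score,
    then merge them in one by one while the cluster score keeps improving."""
    cluster = set(current)
    best = score(cluster)
    for n in sorted(neighbours, key=lambda n: score({n}), reverse=True):
        s = score(cluster | {n})
        if s > best:
            cluster = cluster | {n}
            best = s
        else:
            break   # score no longer increases: stop merging
    return cluster, best
```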
The resulting shape of the concept terrain surface visualizes the local best clusters for individual concepts, and highlights the regions where good clustering could occur. This is because the terrain surface is rendered using the best clustering score as the response variable for each thumbnail node, with the global layout as the base network. Users can identify landmark peaks and compare the shapes and scalar color-encoding of the landscape features in order to investigate the region and the quality of the clustering.
5 Iterative Visual Clustering Process
5.1 Perceptron based iterative learning model
The iterative clustering process is driven by users' interactions with the visual representations described in the previous section. At the same time, users' interactions need to be transformed into input learnt by IVC iteratively, and the learning model needs to output a category vector pp(xi) = {p(c0|xi), p(c1|xi), · · · , p(cm|xi)} for each candidate concept xi, indicating the probability that it belongs to each predefined category. The category vectors can then be used to define the distance measure for the visual layout. As the model learns, the output is updated during iterative clustering, which causes the visualizations to change and thereby reflect users' input.
We therefore propose a multi-class discriminant function (MCD) as the learning model, which learns from users' interactions and outputs the category vector. MCD consists of multiple Perceptrons[21], each of which estimates the posterior probability of one category p(ci|x). Each Perceptron is essentially a hyperplane classifier represented by a high-dimensional multiplicative weight vector. It has been used as a bi-class classifier that can be trained by batched training samples as well as streamed-in new training samples. The training procedure and pseudo code for one perceptron are presented in Table 1. The multiplicative vector of a perceptron is trained by iterating through all training samples labeled for its category. When evaluating an unlabeled sample with MCD, we use a sigmoid transform, which maps a very large perceptron response to a probability near 1 and a very small response to a probability near 0. We treat the category with the largest p(ci|x) as the category of the sample.
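The sigmoid transform and category assignment can be sketched as follows; treating each perceptron's margin f_i(x) − C as the sigmoid's argument is our assumption about the exact form.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def category_vector(responses, C=0.0):
    """Squash each perceptron response f_i(x) - C into a pseudo-probability;
    the predicted category is the index of the largest entry."""
    pp = [sigmoid(f - C) for f in responses]
    return pp, pp.index(max(pp))
```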
In IVC, candidate concepts assigned with categories are the training samples. We extract visual features and in-context features to represent a sample, i.e. a candidate concept. Each sample has the following seven feature values that depict the characteristics of the trend pattern of the concept: the number of peaks, the maximum difference among the magnitudes of the peaks, the least difference among the magnitudes, the position of the largest peak in the accumulated occurrence vector, the position of the smallest peak, the position of the first peak, and the position of the second peak. The in-context features are the normalized occurrences this concept has in each position of the text.
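The seven trend features can be sketched as below; the local-maximum peak definition and the fallback for trends with fewer than two peaks are our assumptions.

```python
import numpy as np

def trend_features(trend):
    """Seven visual features of a concept trend (our reading of the
    feature list in the text): peak count, the largest and smallest
    pairwise differences among peak magnitudes, and the positions of
    the largest, smallest, first and second peaks."""
    t = np.asarray(trend, float)
    peaks = [i for i in range(1, len(t) - 1)
             if t[i] > t[i - 1] and t[i] > t[i + 1]]
    if len(peaks) < 2:                       # degenerate trends
        p = peaks[0] if peaks else int(np.argmax(t))
        return [len(peaks), 0.0, 0.0, p, p, p, p]
    mags = t[peaks]
    diffs = [abs(a - b) for i, a in enumerate(mags) for b in mags[i + 1:]]
    return [len(peaks), float(max(diffs)), float(min(diffs)),
            peaks[int(np.argmax(mags))], peaks[int(np.argmin(mags))],
            peaks[0], peaks[1]]
```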
Table 1  Pseudo code for training a Perceptron

IVC_PERCEPTRON(wi, xi, yi, C)
    wi = (wi1, wi2, ..., wiN) is the multiplicative weight vector for this perceptron
    Each xi is a training feature vector
    Each yi is the binary expected label for xi
    ρ is the learning rate
    C is a constant number
    Initialize each wi1, wi2, ..., wiN to 1; initialize wi0 to −N
    For each xi = (xi1, xi2, ..., xiN):
        f(xi) = Σ_{j=1..N} xij · wij
        y(xi) = 0 if f(xi) − C < 0, 1 if f(xi) − C > 0
        if y(xi) == yi:  // prediction is right, do nothing
        else: loss = −|f(xi) − C|;  wij = wij + ρ · loss · xij
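A runnable version of Table 1, following its structure (weights initialized to 1, bias w0 to −N): the table's sign convention for `loss` is ambiguous in the transcription, so here the update magnitude |f(x) − C| is signed by the prediction error, with a +1 term keeping the update nonzero at the decision boundary; both are our assumptions.

```python
import numpy as np

def train_perceptron(samples, labels, rho=0.5, C=0.0, epochs=50):
    """Train one perceptron of the MCD on (feature vector, 0/1 label) pairs."""
    N = len(samples[0])
    w = np.ones(N)       # weights start at 1, as in Table 1
    b = float(-N)        # bias w0 starts at -N, as in Table 1
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            x = np.asarray(x, float)
            f = float(w @ x) + b
            pred = 1 if f - C > 0 else 0
            if pred != y:
                # signed loss pushes f(x) toward the correct side of C
                loss = (y - pred) * (abs(f - C) + 1.0)
                w += rho * loss * x
                b += rho * loss
    return w, b

def predict(w, b, x, C=0.0):
    return 1 if float(np.dot(w, x)) + b - C > 0 else 0
```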
After the multiple perceptrons of MCD are trained, we use them to estimate the category vector of any candidate concept and its most probable category. MCD has the following advantages desired by IVC: training the multiplicative weight vector w of one Perceptron involves only linear complexity; the learning process can be broken into iterations, and each iteration can be interactive, i.e. waiting for and collecting users' interactions; and the learning is on-line, so the model can continuously learn as new training samples are input during each iteration.
5.2 User-interaction driven iterative learning
Users' interactions reflect their preferences and intentions. IVC provides a rich user interface (Fig. 4) integrating the visualizations described in Section 4. Users' interactions with the visualizations mainly change the number and categories of candidate concepts; users' preferences therefore drive the clustering process by affecting the training samples of the iterative learning model. The output is the updated category vectors, which are then transformed into the distance measures visualized by IVC.
IVC supports the following interactions where users can directly change the training samples of the learning model:
Create/delete candidate concepts: candidate concepts are added to or deleted from the present pool of candidate concept training samples. A newly created concept, if without a user-specified concept category, goes through the evaluation of MCD to be assigned a label.
Change candidate concepts' categories: after investigating the candidate concepts, users can also decide whether a concept is a good example of a specific category. Once the desired category of a candidate concept is set or changed by the user, the concept becomes a new training sample solely for that category.
IVC also supports other interactions, such as changing the criteria functions, which can indirectly change the input of the learning model. For example, users can create or delete template patterns in a criteria function. Doing so changes the best clustering scores and therefore the shape of the rendered surface. Observing the differing visual shapes, users may make different decisions about creating or deleting concepts, clustering, or assigning categories.
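For instance, a template-pattern criteria function might score a candidate concept's temporal trend by its best match against the current set of templates, so that adding or deleting a template immediately reshapes the score surface. The normalized-correlation similarity below is an assumption for illustration, not the paper's exact formula.

```python
import numpy as np

def pattern_score(trend, templates):
    """Best-match score of a concept's temporal trend against a set of
    template patterns; adding/removing a template changes the best score
    and hence the rendered terrain surface."""
    def norm(v):
        v = np.asarray(v, dtype=float)
        s = np.linalg.norm(v)
        return v / s if s else v
    t = norm(trend)
    return max(float(norm(tpl) @ t) for tpl in templates)
```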
Upon the first iteration and candidate-concept visual layout, we train MCD with a very limited number of training examples for each concept category. As the iterations proceed, the training samples are enriched by the clusters users manipulate. During iterations, users can interactively cluster neighbouring candidate concepts and then assess the quality on the terrain surfaces. Users can also pick a consensus cluster, i.e., a cluster with strong visual cues in all terrain surfaces, indicating that it has high best clustering scores when evaluated against all criteria functions. Such a cluster can be found by correlating visual cues across the isoline visualizations of several terrain surfaces. For example, in Fig. 4(a) the circled regions in the three isoline visualizations show the same spot with very saturated color, indicating that the spot represents one cluster that is relevant with regard to all criteria. As we locate and highlight this cluster in the Candidate Concept Layout (Fig. 5(b)), the enlarged temporal trend and the keywords "gather" and "towel" are shown in the Candidate Concept view (Fig. 4(b)). We can also stack several temporal patterns in the Concept River (Fig. 4(c)) to derive insights into whether this candidate concept, when stacked with others, generates implications. It is up to the users' subjective judgment to interactively change the clusters, which are then learnt by the model.
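The iteration just described can be summarized in a single driver sketch. Every name here (`render_surfaces`, `collect_interactions`, the `apply`/`update` protocols) is a placeholder assumption standing in for IVC's visualization and UI layers.

```python
def ivc_iteration(model, pool, render_surfaces, collect_interactions):
    """One IVC round: visualize, gather user edits, retrain on-line."""
    render_surfaces(pool)                  # terrain / isoline / layout views
    for action in collect_interactions():  # blocks until the user acts
        action.apply(pool)                 # create/delete/re-label concepts
    for features, label in pool:           # on-line update with the new samples
        model.update(features, label)
    return model, pool
```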
6 Results and Discussions
We applied IVC to 43 data sets of nursing narratives. After pre-processing, 163 keywords remained. We asked nursing experts to give two to three representative examples for each of the three concept categories, i.e., "procedures", "interactions" and "documentation". These examples were used to train the initial MCD. In choosing pattern templates, the number of peaks and the general trends among differing peaks are the two major characteristics.
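Since the number of peaks is one of the two template characteristics, a minimal way to extract it from a discretized trend is to count strict local maxima; this simple detector is our simplification, as the paper does not specify one.

```python
def count_peaks(trend, min_height=0.0):
    """Count strict local maxima above min_height in a discretized
    temporal trend -- one simple peak-count characteristic."""
    return sum(
        1
        for i in range(1, len(trend) - 1)
        if trend[i] > trend[i - 1]
        and trend[i] > trend[i + 1]
        and trend[i] > min_height
    )
```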
Figure 5 shows several pre-defined concept examples with interesting patterns. The concept "gurney" "hall" "supply" (Fig. 5(a)) represents procedures of fetching supplies from the hall and has roughly five peaks alternating between smaller and stronger ones; "nurse" "physician" "secretary" (Fig. 5(b)) represents interactions with colleagues and has seven peaks over three periodic intervals; "explain" "answer" "reply" represents nurses' verbal communications and has only two distinctive peaks. These examples are used for initially training the learning model, and their patterns also become the initial pattern templates in the criteria function. Their patterns are shown in Fig. 6(a)-(c).
More interesting patterns were detected during the iterations. Several of them were included in the pattern templates, shown in Fig. 6(d)-(f): (d) has three distinctive peaks with decreasing intensities; (e) has only one peak at the beginning of a session; (f) has almost symmetric peaking patterns.
In the experiments, we set three criteria functions: interesting pattern templates, bi-grams in the context, and values associated with any of the predefined categories. We investigated the three terrain surfaces, where peaking areas indicate new and better clusters with regard to each criteria function. We also explored the candidate concepts near the suggested good ones and interactively created larger concepts as feedback. As a result, we found clusters that infer high-level concepts with interesting patterns. Figure 7 shows several identified concepts: (a) "mother" "staff" "family" is a concept indicating nurses' interactions with non-medical professionals; (b)-(d) and (f) are four concepts for four specific procedures; Fig. 7(e) indicates the pattern of a documentation concept, label/document/record.
Comparing the patterns of the identified concepts in Fig. 7 and Fig. 8 with the examples in Fig. 6, we can draw three conclusions. First, we can identify concepts with similar patterns as well as similar semantics. Figure 8(a) shows two stacked patterns, where the purple concept (the same as Fig. 7(d)) has trends similar to the orange-red one (a representative example, Fig. 6(a)), because their numbers of peaks are the same and the trends of peak intensities are consistent. Both concepts represent specific procedures. The second conclusion is that semantically similar concepts may have differing patterns. In Fig. 8(b), both patterns are interaction concepts, yet the orange-red pattern (an example, Fig. 6(b)) has decreasing peaks in each of its periodic intervals, whereas the purple one (the detected concept in Fig. 7(a)) has two strong peaks in each interval. Also, Fig. 7(b)-(d) and (f) are all procedure concepts, yet they exhibit essentially different patterns. These discrepancies discovered during the iterations imply that criteria functions based on pattern templates are necessary for detecting diverse shapes for similar concepts. The third conclusion is that introducing later-identified patterns into the templates helps prioritize and detect meaningful clusters with new shapes. As in Fig. 8(c), the cornflower-blue pattern was introduced later (as in Fig. 7(e)) but helped us find the cluster in Fig. 7.
Figure 5. Representative examples with interesting patterns: (a) an example of "procedures"; (b)(c) examples of "interactions"
Figure 6. Template patterns with different numbers of peaks and differing trends
Figure 7. Identified concepts with interesting patterns
Figure 8. Stacked patterns of identified concepts and representative examples
7 Conclusion and Future Work
In this paper we presented iterative visual clustering (IVC), a visual analytics framework for unstructured text mining. It aims to identify keyword clusters that can infer higher-level concepts with interesting temporal patterns. Due to the explorative nature of the problem, the clustering process lacks objective functions, and domain experts need a way to engage in the clustering process. To address these difficulties, we proposed IVC with an on-line model that can be continuously trained by users' interactions. To interface with users' interactions, IVC uses a number of interactive visualizations, including concept trend visualizations, the concept visual layout, and concept terrain surface visualizations. These visualization techniques not only present individual current candidate concepts but also suggest good clusterings by rendering them as graphical features. IVC is a novel visual text analytical model because it has the following advantages: it has a formal model that learns users' interactions and preferences; it supports on-line learning, so users' interactions can be learnt across clustering iterations; and it provides a variety of interactive visualizations that assist users in manipulating clusters. After applying IVC to unstructured text such as nursing narratives, we found concepts whose semantics and patterns are relevant to the examples and pattern templates.
We are now working on extending IVC into a more general framework by applying it to other types of unstructured text. Better features for indicating semantic relationships can also be investigated. In addition, formal evaluations or user studies need to be set up for a comprehensive assessment of IVC.