Generating Incremental Length Summary Based on Hierarchical Topic Coverage Maximization

JINTAO YE, MOZAT PTE.LTD of Singapore
ZHAO YAN MING, Digipen Institute of Technology
TAT SENG CHUA, National University of Singapore

Document summarization plays an important role in coping with information overload on the Web. Many summarization models have been proposed recently, but few try to adjust the summary length and sentence order according to the application scenario. With the popularity of handheld devices, presenting key information first in summaries of flexible length is of great convenience: it speeds up reading and decision making and reduces network consumption. Targeting this problem, we introduce a novel task of generating summaries of incremental length. In particular, we require that the summaries be able to automatically adjust their coverage of general versus detailed information as the summary length varies. We propose a novel summarization model that incrementally maximizes topic coverage based on the document's hierarchical topic model. In addition to the standard ROUGE-1 measure, we define a new evaluation metric based on the similarity of the summaries' topic coverage distributions in order to account for sentence order and summary length. Extensive experiments on Wikipedia pages, DUC 2007, and general noninverted writing style documents from multiple sources show the effectiveness of our proposed approach. Moreover, we carry out a user study on a mobile application scenario to show the usability of the produced summaries in terms of improving judgment accuracy and speed, as well as reducing the reading burden and network traffic.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering

General Terms: Algorithms, Performance, Experimentation

Additional Key Words and Phrases: Multi-document summarization, data reconstruction

ACM Reference Format: Jintao Ye, Zhao Yan Ming, and Tat Seng Chua. 2016. Generating incremental length summary based on hierarchical topic coverage maximization. ACM Trans. Intell. Syst. Technol. 7, 3, Article 29 (February 2016), 33 pages. DOI: http://dx.doi.org/10.1145/2809433

1. INTRODUCTION

Summarization is a great way to conquer information overload by compressing long document(s) into a few sentences or paragraphs. With the popularity of handheld devices, the need for summarization is even greater given the inherent limits of screen size and wireless bandwidth [Zhang 2007; Otterbacher et al. 2006]. To cater to specific summarization scenarios, it is desirable that a model can generate summaries
of various lengths for the same target document(s). To ensure flexibility, it is better that the summary be generated in an incremental manner so that users can choose to read whatever length suits the scenario at hand. Together with the basic requirements of high topic coverage and low redundancy, the goal of incremental length summarization is to let users consume information both quickly and accurately while easing the burden on network traffic.

The idea of arranging text content by importance, from high level to low level, is not new. In journalism, the inverted pyramid structure is widely adopted: the overview of an event is put at the beginning, followed by supporting details arranged from important to trivial. This structure is intended for readers who may stop reading at any point. In this work, we follow a similar idea in generating varying length summaries for a broad spectrum of text genres and for document sets on focused topics.

Fig. 1. Scenario of an incremental length summary model applied to help a user distinguish interesting articles from candidates on the topic "MH370." The server at the top right supplies a service that generates summaries of incremental length for a specific article. The MS diamonds stand for the "More Sentences" requests made by the user; the Decision diamond represents the user making a final decision on her interest in the target article. Each gray circle represents an interaction between user and server, in which the user requests more sentences and the server delivers several subsequent sentences.

Figure 1 shows a scenario using incremental length summaries (cf. Section 2.4) to preview a set of search results on the topic "MH370." Instead of viewing a short snippet for each article, the user can swipe down the page to request more sentences until she can identify her level of interest in the target article. Such information consumption gives the user more flexibility than either reading a snippet of unknown representativeness or reading the full article. A key point here is that the summary sentences need to be ordered so that they cover information from general to detailed. Otherwise, the user is prone to misjudging the article's interest by reading some trivial details first and then stopping; alternatively, reading time is prolonged, and network resources are wasted, because digesting trivial details costs the user extra time. To achieve efficiency, we need to generate a set of summaries of incremental lengths, rather than a fixed-length summary, for any target document (set).

Fig. 2. Process of generating an incremental length summary for documents about Microsoft Products.
Nodes and lines in the hierarchy denote the topics and their relationships. The size of a node reflects the information it contains; the gray part of a circle represents uncovered information, and the black part covered information. The next sentence is selected from the node surrounded by a square.

In this setting, the longer summaries grow from the shorter ones; thus the on-screen display never needs to be totally refreshed.

To generate an incremental length summary, we need a model that can adjust the coverage of general versus detailed information and automatically order the sentences. Various summarization models [Mani and Bloedorn 1998; Radev 2000; Erkan and Radev 2004; He et al. 2012; Wang et al. 2013] have been developed to ensure high topic coverage and low redundancy of content in the summary. However, few of them try to adapt the length of the summaries to specific scenarios. Worse, almost none of the current models explicitly considers the order in which sentences are arranged in the summary, and none tries to solve the two issues of sentence order and varying length together. This usually results in machine summaries that achieve high scores in standard evaluations but have low interpretability or readability. Therefore, the research questions we try to solve in this work are:

—What principles should we follow to add and order sentences in the summary?
—What is a good model to incorporate the content coverage, order, and length requirements in order to generate incremental length summaries?

These two questions are explored in works in the broad area of summarization, including those on learning and cognition. Studies [Endres-Niggemeyer et al. 1995; Brandow et al. 1995] of the human abstraction process have shown that a human abstractor usually extracts sentences to compose a summary according to the document's topic structure (e.g., the hierarchical topic structure at the top right of Figure 2). It is understood that a single integrated document, as well as multiple documents under the same topic, contains subtopics [Brandow et al. 1995; Harabagiu and Lacatusu 2005]. A high-quality summary should cover as many subtopics as possible [Hearst and Plaunt 1993], as has been done in some of the latest summarization methods [Harabagiu and Lacatusu 2005; Arora and Ravindran 2008; Wang et al. 2009]. Moreover, the topic and subtopic structure provides valuable information that can be exploited when arranging the content and order of sentences in the summary.

In this article, we propose a new hierarchical topic coverage-based summarization model. At the intratopic level, it is preferable to pick out sentences close to the topic cluster centroid. At the intertopic level, sentences about more general topics are selected before those associated with detailed topics. For example, under the topic Microsoft Product, the subtopic Office is more general than Excel. Thus, sentences closely related to Office are selected into the summary earlier than their counterparts under Excel. When Office has been covered to a certain extent, sentences about Excel will have a chance to be selected.

Our proposed framework first constructs a hierarchical topic structure and assigns each sentence to its most related topic. Then we restrict each sentence's coverage scope according to the position in the topic structure of the subtopic it belongs to.
To generate the summary, we extract sentences one by one, maximizing the coverage of all topics under these scope restrictions. Figure 2 illustrates the summarization process of our model. During sentence selection, the sentence that maximizes all topics' coverage is picked at each step. From the figure, we see that sentences from general (top-level) topics are selected into the summary ahead of those about detailed (bottom-level) ones.

We conduct both quantitative and qualitative evaluations of the proposed model and several state-of-the-art summarization models. It is worth mentioning that our method can be applied to both single and multiple documents. For quantitative evaluation, we perform experiments on Wikipedia pages for single-document summarization and on DUC 2007 data [1] for multidocument summarization. In addition, a general noninverted writing style collection from multiple sources is adopted to eliminate the influence of the inverted pyramid writing style during the summarization process. We measure performance using the ROUGE-N [Lin 2004] score and the similarity of topic coverage distributions, a novel metric that we propose. For qualitative evaluation, we carry out a user study that aims to help users identify their level of interest in an article. The user study was performed on both inverted and noninverted writing style document sets to evaluate the usability of the generated summaries on four indicators: the user's reading burden, network traffic, and efficiency and accuracy in making judgments. The experimental results show the effectiveness of our proposed model.

In summary, the contribution of this work is fourfold:

(1) To the best of our knowledge, our model is the first to treat document summarization as a hierarchical topic coverage problem. Our model also pioneers a method that tries to comply with the order that a human abstractor follows during the summarization process.
(2) We introduce a new task for summarization that generates summaries of varying lengths and allows the automatic adjustment of general versus detailed information in the content. The summary is well suited for applications where the summary length is dynamically decided by the user in order to gauge interest in the document(s).
(3) We propose a novel summarization model that incrementally maximizes topic coverage based on the underlying document's topic hierarchy and that automatically adjusts the coverage of high- and low-level information when generating summaries of varying length.
(4) We define a novel summarization evaluation method for measuring the similarity of the topic coverage distributions of two summaries on a hierarchical topic structure.

The remainder of the article is organized as follows. Section 2 introduces related work. Our problem formulation is detailed in Section 3. We describe our document summarization framework in Section 4. Section 5 presents the experimental results along with some discussion, and Section 6 concludes the article.

[1] http://duc.nist.gov.

2. RELATED WORK

In this section, we first introduce general summarization and update summarization. Then we shift our attention to topic coverage and sentence order in summarization. Finally, we focus on incremental length summarization, the subject of this article.

2.1. General Summarization
Document summarization techniques have been studied for a long time. Earlier works used heuristics such as sentence position, keywords, and cue words [Edmundson 1969]. More recently, a centroid-based method [Wang et al. 2008] applied clustering techniques to assign each sentence a salience score based on the cluster it belongs to and its distance from the cluster centroid; the sentences with the top salience scores are then selected into the summary. Graph-based methods [Mihalcea 2004; Wan and Yang 2006] are inspired by PageRank [Brin and Page 1998]. Using redundancy reduction techniques such as Maximal Marginal Relevance [Carbonell and Goldstein 1998] and Cross-Sentence Information Subsumption [Radev 2000], graph-based models have been shown to usually outperform centroid-based models [Radev et al. 2000]. With the popularity of machine learning techniques, several machine learning-based summarization models [Kupiec et al. 1995; Li et al. 2009] have been proposed; these models are generally limited by the lack of available training data. Whereas most summarization models are extractive, Cohn et al. [2013] propose an abstractive approach to sentence compression. In recent years, more works [Gong and Liu 2001; Haghighi and Vanderwende 2009] have concentrated on summarizing according to topic-level information about the original document(s) and show that summaries of high quality can be obtained. More recently, He et al. [2012] propose to extract summary sentences from a data reconstruction perspective, selecting the sentences that best reconstruct the original document(s). Our proposed model also adopts a data reconstruction perspective.

2.2. Update Summarization

After being piloted in DUC 2007, TAC 2008 formally proposed the update summarization task. Update summarization works on two datasets, A and B, that both focus on the same topic or event but where all articles in A are timestamped earlier than those in B. The summarizer is requested to produce a summary of B under the assumption that the reader has already digested all articles in A. The most challenging issue for update summarization is to include novel information about B that is not expressed by A, while avoiding redundancy between A and B. Following its proposal, update summarization soon drew plenty of attention from both researchers and practitioners. Various kinds of models have been proposed in recent years, such as graph-based models [Wenjie et al. 2008; Li et al. 2011] and models that work in a latent semantic space [Steinberger and Ježek 2009; Kogilavani and Balasubramanie 2012; Delort and Alfonseca 2012]. In addition, some works [Wang et al. 2009; Ming et al. 2014; Wang and Li 2010] focus on generating an update summary in real time for scenarios where articles arrive in sequence. The difference between an update summary and our proposed incremental length summary is described in Section 2.4.

2.3. Topic Coverage and Sentence Order in Summarization

Most existing summarization techniques do not take the document's topic distribution into consideration. However, a human abstractor usually extracts sentences according to the document's topic structure, moving from top level to low level, until enough information has been extracted [Endres-Niggemeyer et al. 1995; Brandow et al. 1995].
Topic models such as Hierarchical Dirichlet Processes [Teh et al. 2005] and the Pachinko Allocation Model [Li and McCallum 2006; Mimno et al. 2007] are both based on LDA [Blei et al. 2003] and support hierarchical topic structure. Building on the latent topics discovered by topic models, topic-modeling-based summarization methods have been proposed. Arora and Ravindran [2008] propose a summarization model that combines a centroid-based method with LDA; however, it does not explore the use of topics with hierarchical structure.

Although all summarization models can generate summaries of varying lengths, they seldom explicitly consider the order in which information is covered. To help users effectively navigate the related documents under a main topic returned by a web search, Lawrie and Croft [2003] construct a hierarchical summary instead of the traditional ranked list. For browsing documents on small mobile devices, a summary with hierarchical structure is generated in Otterbacher et al. [2006], in which sentences related to the top-level information in a document are shown to the user first; after the user chooses one sentence, the sentences that describe the chosen sentence's information in more detail are delivered. Zhang [2007] also proposes organizing a summary of web pages in a hierarchical structure according to the Document Object Model (DOM) tree of the web page and successfully adapts the summary for mobile handheld devices. The most closely related work is by Yang and Wang [2003], in which a document's literal tree is taken into account and fractal theory is utilized during the summarization process. The root element of the tree is allocated a quota that equals the summary length. Each element's quota is inherited by its child elements in proportion to their importance, and the most salient sentence under an element with a quota of one is selected into the summary. As the summary length increases, quotas can be passed to deeper elements, because elements located more deeply in the literal tree usually express more detailed information. As a result, as the summary length increases, more sentences with low-level information are selected into the summary. In the process of generating summaries of varying lengths, this fractal theory-based model selects sentences from different elements independently. What's more, a summary generated with a large quota may convey only low-level information. Our proposed method both considers a document's topic structure and adopts a global perspective to determine the exact amount of high- and low-level information to be covered for a specific summary length.

2.4. Incremental Length Summary

Differing from traditional fixed-length summaries, an incremental-length summary provides the user with the flexibility of changing the length by appending new sentences to the summary. During the summarization process, sentences are selected one by one, and a short summary is a proper subset of a longer one. To generate an incremental-length summary of high quality, the sentence order should be considered explicitly, and sentences should be generated in order of importance. Some existing summarization models produce incremental-length summaries, such as LexRank [Erkan and Radev 2004] and LDA [Arora and Ravindran 2008]. But those generated by DSDR with a non-negative reconstruction model [He et al.
2012] and the fractal summarization model [Yang and Wang 2003] are not of incremental length. For DSDR, the sentence order is unspecified because all sentences in the summary are considered as a whole and selected at the same time. For the fractal summarization model, as the summary length increases by one, a single sentence in a short summary may be replaced by two sentences from deeper elements. In this case, the summaries generated by the fractal summarization model are not of incremental length; the method simply delivers new sentences to users to make the summaries appear incremental.

Although update summarization and incremental-length summarization both supply novel information based on already generated summaries, there are some important differences:

(1) Update summarization is applied to two datasets, where all articles in one dataset are chronologically earlier than the articles in the other. Incremental-length summarization deals with only a single dataset.
(2) Update summarization aims to supply novel information that is not covered by the earlier dataset or its summary, whereas the purpose of incremental-length summarization is to provide high-level, informative content that has not yet been covered sufficiently.
(3) Update summarization does not explicitly consider sentence order. Incremental-length summarization, in contrast, concentrates on supplying sentences from high to low level, producing a summary in the inverted pyramid writing style.

The incremental-length summary is extremely important for applications where sentences are consumed one after another and where the user decides whether the summarization model should generate more sentences once he or she has read the latest ones.

3. PROBLEM FORMULATION

3.1. Preliminary and Problem Definition

First, the input and output of the incremental-length summarization task are defined as follows:

Input: A collection of documents D on a topic t, and the incremental summary lengths in terms of the number of sentences, M = {m_1, m_2, ..., m_i, ..., m_n}, where m_j < m_k when j < k.

Output: A series of summaries with incremental lengths for D. The jth summary contains m_j sentences. If we view a summary as the set of all sentences in it, the jth summary is a proper subset of the kth summary for any j < k.

To generate such a set of summaries so that general information is covered before more detailed information, we first need to analyze D in terms of the subtopics of t and their relations. Next, we introduce the concepts needed for developing our method.

Preliminary 1. Given a collection of documents D and a main topic mt, we define a Topic Hierarchy (TH) for D as a tree-like hierarchical structure in which each node corresponds to a unique subtopic st and a child node can be shared by different parent nodes. The root node, whose level is 0, is the most general subtopic [2] in the whole tree. Every child node is a subtopic of its parent node.

Preliminary 2. For a document set D and the topic hierarchy TH based on it, each sentence s in D is allocated to the most related node in TH. [3] For a node, its exclusive data are the sentences allocated to it, and its subsumed data are the sentences allocated to the largest (most nodes) subhierarchy rooted at it. Thus, the root node's subsumed data are all sentences in D.

[2] To avoid confusion between a node's level in the tree and the level of information related to a subtopic (a subtopic whose corresponding node has a lower level in the tree expresses higher-level information), we use the terms "general subtopic" and "detailed subtopic" in the remainder of the article. We also use "general sentence"/"detailed sentence" to denote a sentence from a general/detailed subtopic.
[3] In the remainder of this article, we use the term "topic hierarchy" to refer both to the hierarchical topic structure and to that structure with all sentences of the document set allocated to it.
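The data structure behind Preliminaries 1 and 2 can be sketched as follows. This is our own minimal illustration, not the authors' code; the names (Node, exclusive_data, subsumed_data) are ours, and the traversal guards against child nodes shared by multiple parents, which the definition allows.

```python
class Node:
    def __init__(self, subtopic):
        self.subtopic = subtopic        # the subtopic label st
        self.children = []              # a child may be shared by parents
        self.exclusive_data = []        # sentences allocated to this node

    def subsumed_data(self):
        """Sentences allocated anywhere in the subhierarchy rooted here."""
        seen, stack, result = set(), [self], []
        while stack:                    # iterative DFS; children can be shared
            node = stack.pop()
            if id(node) in seen:        # visit shared children only once
                continue
            seen.add(id(node))
            result.extend(node.exclusive_data)
            stack.extend(node.children)
        return result

# Toy usage mirroring the Microsoft Product example below:
root, office, excel = Node("Microsoft Product"), Node("Office"), Node("Excel")
root.children.append(office)
office.children.append(excel)
office.exclusive_data.append("Microsoft Office is an office suite ...")
assert office.subsumed_data() == root.subsumed_data()   # root subsumes all
```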
For example, we construct a topic hierarchy from a set of documents about Microsoft Product. The root, or main topic, Microsoft Product has several subtopics, among which Office has further subtopics such as Excel. The sentence "Microsoft Office is an office suite of desktop applications, servers and services for Microsoft Windows and OS X operating systems" is in the exclusive data of Office, as well as in the subsumed data of the Microsoft Product node. However, it is not part of the data of the Excel node. In other words, a node contains more sentences than any of its subtopic nodes. The hierarchical structure at the upper right of Figure 2 gives an instance of a topic hierarchy in a real application.

With the topic hierarchy for a set of documents, we are now able to make use of subtopic relations to generate summaries. In a sense, the topic hierarchy is already a high-level, or topic-level, summary of the documents, and the summary we are going to generate embodies this outline with real sentences. Before proposing our summarization method, we first point out the desirable properties of an incremental length summary that fully takes advantage of its topic hierarchy:

—Each summary maximizes hierarchical topic coverage within its specific summary length limitation.
—Sentences in the summary have a dynamically balanced distribution over subtopics according to the summary length. As the summary length increases, the most related sentences in general subtopics are selected first, followed by those in detailed subtopics.

For the first property, we formally define the phrase "hierarchical topic coverage":

Definition 3.1. A summary's Hierarchical Topic Coverage for a collection of documents organized in a topic hierarchy TH is defined as the sum of the information expressed by the summary for all subtopic nodes in TH. Given two sentence sets V and X, the information in V expressed by X is measured with a function IE(V, X); the more information in V is expressed by X, the higher the value of IE(V, X). A summary's hierarchical topic coverage for a TH is therefore

$$\sum_{st \in TH} IE(SD_{st}, S),$$

where SD_st is the subsumed data of subtopic node st in TH and S is the set of all sentences in the summary. In this work, we take a data reconstruction perspective to implement the measure of expressed information between two sentence sets, as detailed in Section 4.3.1.
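Under the same assumed Node sketch above, Definition 3.1 reduces to a sum over nodes with a pluggable information-expressed measure IE; Section 4.3.1 instantiates IE via data reconstruction.

```python
def hierarchical_topic_coverage(nodes, summary, ie):
    # Definition 3.1: THC = sum over all subtopic nodes st of IE(SD_st, S),
    # where SD_st is st's subsumed data and S is the summary sentence set.
    return sum(ie(node.subsumed_data(), summary) for node in nodes)
```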
4. INCREMENTAL SUMMARIZATION FRAMEWORK

The proposed framework consists of two major steps. In the first step, we construct a topic hierarchy from the original documents. In the second step, summary generation, we first define, for each sentence associated with a node in the topic hierarchy, the scope of data that the sentence can cover or approximate, and we then propose a hierarchical topic coverage maximization algorithm to select sentences.

4.1. Topic Hierarchy Construction

The hierarchical topics in the documents guide the generation of incremental-length summaries. In the first step of our proposed framework, we capture the subtopics and their relations in the original documents in the form of a topic hierarchy. With the topical structure, each sentence from those documents is assigned to a unique subtopic node.

Of the many topic hierarchy construction methods proposed in recent years, the Hierarchical Pachinko Allocation Model (hPAM) [Mimno et al. 2007] satisfies all our requirements for topic hierarchy construction. Like the more general Hierarchical Dirichlet Processes (HDP) [Teh et al. 2005], hPAM can capture general-to-detailed information by constructing a hierarchical structure of topics. In addition, hPAM improves on HDP by allowing a child topic to be shared by different parent topics. This is a desirable property because it allows more flexible topic relations. Other topic hierarchy generation methods are also available [Ming et al. 2010b].

Fig. 3. The generative structure of the three-level hPAM model 2. hPAM is a directed acyclic graph; each node (gray circle) corresponds to a topic, and a node at a given level has a distribution over all nodes at the child level. The black square represents all the words under a topic. The thin arrow in the hierarchy illustrates the process of sampling a subtopic from a specific parent-level topic according to a multinomial distribution. The thick gray arrow represents the process of sampling a word from a specific topic according to a different multinomial distribution.

In particular, we adopt the three-level hPAM model 2 [Mimno et al. 2007] to construct the topic hierarchy for the document(s). Model 1 needs an additional process for sampling the level of the topic for a target word, which is not essential in our task; model 2 is thus preferred in terms of efficient implementation and more interpretable analysis of evaluation results. The choice of the three-level model is based on our empirical study, detailed in Section 5.2.3. Figure 3 illustrates the generative structure and the sampling process of the three-level hPAM model 2.

During the Gibbs sampling process in this model, for a word ω, a topic τ with n subtopics samples ω according to an (n+1)-dimensional multinomial distribution that is itself sampled from a Dirichlet distribution with hyperparameter $\vec{\alpha} = \langle \alpha_1, \ldots, \alpha_{n+1} \rangle$. Among these n+1 dimensions, only one corresponds to τ itself and enables τ to sample ω directly; otherwise, the same sampling logic is imposed on one of the n subtopics, recursively, until the word is sampled. Therefore, if τ has a large number of subtopics and all dimensions of $\vec{\alpha}$ are equal (symmetric), it is less likely that τ samples a word directly.
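The recursive routing just described can be sketched as follows. This is our own illustration of the sampling logic, not the Mallet implementation adopted later; Topic, route_dist, and word_dist are assumed names. A leaf topic has no subtopics, so its route distribution has a single dimension and it always emits a word directly.

```python
import random

class Topic:
    def __init__(self, name, subtopics=(), word_dist=None):
        self.name = name
        self.subtopics = list(subtopics)   # the n subtopics of this topic
        self.word_dist = word_dist or {}   # P(w | this topic)

    def sample_word(self, route_dist):
        # route_dist[name]: an (n+1)-way multinomial, one weight per subtopic
        # plus a final weight for emitting a word directly from this topic.
        weights = route_dist[self.name]
        picked = random.choices(self.subtopics + [self], weights=weights)[0]
        if picked is self:                 # the extra dimension fired: emit
            words, probs = zip(*self.word_dist.items())
            return random.choices(words, weights=probs)[0]
        return picked.sample_word(route_dist)   # recurse into a subtopic
```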
During the sentence allocation process, for each ⟨sentence, topic⟩ pair we find the topic t with the highest P(t|s), which denotes how probable it is that sentence s belongs to t in the topic hierarchy. Here, we adopt the bag-of-words model to represent sentences in a document. More sophisticated approaches can easily fit into the framework, however; for example, domain-specific term weighting [Ming et al. 2010a] and semantic-based approaches [Moon and Erk 2013]. Given a sentence s in document d, for s to belong to t, two conditions must be satisfied at the same time: (1) all words in s belong to the topic t, and (2) topic t appears in the document d. This is expressed in the following equation:

$$P(t|s) = \frac{P(t|d) \prod_{w \in s} P(w|t)}{\sum_{t' \in T} P(t'|d) \prod_{w \in s} P(w|t')}, \qquad (1)$$

where T stands for all topics that appear in document d, P(w|t) is the probability that word w belongs to topic t, and P(t|d) is the probability that topic t appears in document d. During Gibbs sampling, we accumulate the sampled results and use them to compute P(w|t) and P(t|d):

$$P(t|d) \propto \sum_{l \in L} \frac{\alpha_l + n_d^l}{\sum_{l' \in L} (\alpha_{l'} + n_d^{l'})} \cdot \frac{\alpha_{lt} + n_d^{lt}}{\sum_{t' \in T} (\alpha_{lt'} + n_d^{lt'})}, \qquad P(w|t) \propto \frac{\beta_w + n_t^w}{\sum_{w' \in V} (\beta_{w'} + n_t^{w'})}, \qquad (2)$$

where L stands for all topics located one level closer to the root than the target topic t (i.e., t's potential parents), T stands for all topics at the same level as t, and V is the vocabulary of the target documents. α_l and α_lt are dimensions of the hyperparameters of the Dirichlet distributions for topic sampling: α_l for sampling l from the root topic and α_lt for sampling t from the parent topic l. β is the hyperparameter of the Dirichlet distribution for word sampling. The hyperparameters for Gibbs sampling act as prior probabilities, assuming a certain number of subtopics or words have been sampled, so as to avoid zero values of P(w|t) and P(t|d) for words and topics not sampled at all. Moreover, n_d^lt is the number of words sampled by topic t, itself sampled by parent topic l, in document d, and n_t^w is the number of times word w is sampled by topic t across all documents.

With the hierarchical topic structure constructed by hPAM, we have a topic hierarchy outline for the documents to be summarized, and all sentences from the documents have been assigned to subtopics in the hierarchy.
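In code, the allocation rule amounts to an argmax over the numerator of Equation (1), since the denominator is the same for every candidate topic. The following sketch is ours; p_w_t and p_t_d are assumed to hold the smoothed estimates of Equation (2), and log space is used to avoid underflow on long sentences.

```python
import math

def allocate_sentence(sentence_words, doc_topics, p_w_t, p_t_d):
    """Return the topic t maximizing P(t|s) for sentence s in document d."""
    best_topic, best_score = None, -math.inf
    for t in doc_topics:                      # the topics T appearing in d
        score = math.log(p_t_d[t]) + sum(math.log(p_w_t[t][w])
                                         for w in sentence_words)
        if score > best_score:
            best_topic, best_score = t, score
    return best_topic
```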
4.2. Linear Data Reconstruction for Coverage Measurement

Now we define topic coverage based on the generated topic hierarchy. Summaries are then generated by selecting sentences that maximize the hierarchical topic coverage.

4.2.1. Linear Data Reconstruction Perspective. We view hierarchical topic coverage from a data reconstruction perspective [He et al. 2012]. For linear data reconstruction, a sentence V_i in a document set D can be approximated by k sentences X = {x_1, x_2, ..., x_k} through the linear combination

$$V_i \approx P(V_i, X) = (X, a_i) = \sum_{j=1}^{k} a_{ij} x_j, \qquad (3)$$

where P(V_i, X) is the projection of V_i on X, and a_i = {a_{i1}, ..., a_{ik}} are the corresponding linear combination parameters. As in He et al. [2012], the reconstruction error for a vector V_i and a set of vectors X is defined as the squared L2-norm of the difference between V_i and P(V_i, X). For text summarization, the summary is used to reconstruct the whole document, so the summary's overall reconstruction error is the sum of the reconstruction errors for all sentences in the documents:

$$L(V, X, A) = \sum_{i=1}^{|V|} \|V_i - (X, a_i)\|_2^2 = \sum_{i=1}^{|V|} \|V_i\|_2^2 - \sum_{i=1}^{|V|} \|(X, a_i)\|_2^2, \qquad (4)$$

where ||·||_2 is the L2-norm, V is the set of all n sentences in the documents to be summarized, X is the extracted summary of m sentences, and A = [a_1, ..., a_i, ..., a_n]^T with a_i ∈ R^m. In He et al. [2012], the sentence set that minimizes the documents' overall reconstruction error is selected as the summary. As an improvement, our framework adds restrictions on the sentences in both X and V.

4.3. Hierarchical Topic Coverage Measurement

4.3.1. From Reconstruction Error to Coverage Measurement. Instead of minimizing the reconstruction error directly, we propose to maximize the hierarchical topic coverage as the summarization criterion. In other words, the hierarchical topic coverage measurement is based on both the reconstruction error and the hierarchical topic structure.

First, we measure a sentence set's coverage of the topics in the topic hierarchy. According to the definition of reconstruction error in Equation (4), for a set of sentences V, we can define the information it contains as

$$info(V) = \sum_{v \in V} \|v\|_2^2. \qquad (5)$$

We can then rewrite the reconstruction error in Equation (4) as info(V) − info(P(V, X)), where P(V, X) is the vector set containing the projections on X of all vectors in V. We define the information covered (or reconstructed) by a sentence set X for another sentence set V as

$$RC(V, X) = \sum_{i=1}^{|V|} \|(X, a_i)\|_2^2, \qquad (6)$$

where RC stands for reconstruction contribution. In this work, we adopt RC as a specific implementation of the IE measure described in Definition 3.1. When V is the whole subsumed data of a subtopic node st in a topic hierarchy, RC(V, X) is X's coverage of st. According to the definition of hierarchical topic coverage, a sentence set X's hierarchical topic coverage for a main topic mt is

$$THC(X, mt) = \sum_{st \in TH_{mt}} RC(SD_{st}, X), \qquad (7)$$

where TH_mt is the topic hierarchy for the main topic and SD_st is the subsumed data of st in the topic hierarchy.

Based on this formulation, the size of the target document set may affect the reconstruction contribution. We thus normalize the reconstruction contribution by the total information in the target data. In particular, we further introduce the Reconstruction Ratio (RR) to measure the proportion of the information in sentence set V covered by sentence set X:

$$RR(V, X) = \frac{RC(V, X)}{info(V)}. \qquad (8)$$

A higher RR of X for V indicates higher representativeness of X for V.
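Equations (5) through (8) can be realized with ordinary least squares, as in the following numpy sketch (our illustration under assumed names; sentences are rows of term-frequency vectors, and the projection (X, a_i) is the least-squares reconstruction of V_i from the summary rows X).

```python
import numpy as np

def info(V):
    # Equation (5): information contained in sentence set V (rows = sentences).
    return float((V ** 2).sum())

def rc(V, X):
    # Equation (6): RC(V, X) = sum_i ||(X, a_i)||^2 with least-squares a_i.
    A, *_ = np.linalg.lstsq(X.T, V.T, rcond=None)   # solve X^T a_i = V_i
    P = (X.T @ A).T                                 # rows are projections
    return float((P ** 2).sum())

def thc(subsumed, X):
    # Equation (7): `subsumed` holds one matrix SD_st (rows = subsumed
    # sentences) per subtopic node of the hierarchy.
    return sum(rc(SD, X) for SD in subsumed)

def rr(V, X):
    # Equation (8): proportion of the information in V covered by X.
    return rc(V, X) / info(V)
```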
4.3.2. Scope-Restricted Coverage. The topic hierarchy integrates all information about the relations among subtopics and the sentence allocation; it thus enables us to further restrict the scope of topic coverage for each sentence that might be selected into the summary. Taking a bottom-up view of the topic hierarchy, the sentences belonging to a node are part of the subsumed data of all its ancestor nodes, and the ancestor nodes also have exclusive data that do not come from their descendants. When applying data reconstruction in such a structure, we can impose restrictions on the scope that a sentence set can cover.

Before introducing the formal definition of scope-restricted coverage, we first clarify our assumption: only high-level information can cover low-level information about the same topic. In our topic hierarchy, a sentence allocated to a subtopic node st covers some information about the node's descendants, but not vice versa. What's more, the sentence makes no contribution to covering subtopics that are not descendants of st. Based on this assumption, we can concisely define the coverage scope restriction: a sentence in a node's exclusive data can only be used to reconstruct that node's subsumed data. With this restriction, we can redefine Equation (6) as the scope-restricted coverage: given a topic hierarchy TH, a subtopic node st in TH, and its subsumed data SD_st, the information of st covered by a set of sentences S is defined as

$$CovInfo(TH, st, S) = \sum_{i=1}^{|SD_{st}|} RC(SD_{st}^i, X_i), \qquad (9)$$

where SD_st^i is the ith sentence in SD_st and X_i is the subset of S containing all sentences in S that are able to cover SD_st^i. A sentence's coverage scope enables the data reconstruction method to make use of the hierarchical topic structure and keeps the overall order of the sentences selected into the summary running from general to detailed.
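A sketch of Equation (9), reusing the rc helper above (again ours, not the authors' implementation): may_cover encodes the scope restriction, that is, a summary sentence may reconstruct a target sentence only if its node is the target's node or an ancestor of it.

```python
import numpy as np

def cov_info(subsumed, summary, may_cover):
    # subsumed: list of (vector, node) pairs, the sentences SD_st^i under st
    # summary:  list of (vector, node) pairs, the selected sentences S
    total = 0.0
    for v, node in subsumed:
        # X_i: the subset of S allowed to cover this particular sentence
        X = np.array([x for x, n in summary if may_cover(n, node)])
        if X.size:
            total += rc(v[None, :], X)   # RC of the single sentence SD_st^i
    return total
```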
4.4. Incremental-Length Summarization Algorithm

The basic assumption of incremental-length summarization is that a short summary must first cover the high-level information about the target topic, with low-level information added as more length allowance is given. An incremental-length summarization algorithm should automatically balance high- against low-level information and cover as much information as possible; accordingly, we propose the following two properties:

(1) Uncovered general information should be covered by the summary first. As the summary length increases, more detailed information is covered.
(2) Among sentences at a similar general-detailed level, those that express more uncovered information should be selected first.

Our summarization approach is outlined in Algorithm 1. After the initialization in Lines 1–4, Lines 5–21 pick out sentences one by one in a greedy manner.

ALGORITHM 1: Incremental-Length Summary Generation Algorithm According to Hierarchical Topic Coverage Maximization

Input: TH: V_i are the sentences belonging to Node_i (the subsumed data of Node_i); V_1 contains all the sentences. M{m_1, ..., m_i, ..., m_k}: summary lengths in terms of the number of sentences, in ascending order.
Output: S{S_1, ..., S_i, ..., S_k}: summaries of lengths M, and the reconstruction ratios RR_i ∈ R^n for all nodes in TH (n is the number of nodes in TH).

1   S = {}, RR = {}, NI = {}                  // NI: information for all nodes
2   for each node in TH do
3       NI_node = info(V_node)
4   end
5   for each m_i in M do
6       while |S_temp| < m_i do
7           max_ci = 0, next_s = null, NCI = {}
            // max_ci: the maximum covered information
            // next_s: the next sentence to be selected into the summary
            // NCI: covered information for all nodes
8           for each v in V_1 do
9               ci = 0, NCI_temp = {}, TS = v ∪ S_temp
10              for each node in TH do
11                  NCI_temp_node = CovInfo(TH, node, TS)
12                  ci = ci + NCI_temp_node
13              end
14              if max_ci < ci then
15                  max_ci = ci, next_s = v, NCI = NCI_temp
16              end
17          end
18          S_temp = S_temp ∪ next_s, RR_temp_node = NCI_node / NI_node
19      end
20      S_i = S_temp, RR_i = RR_temp
21  end
22  return (S, RR)

For the incremental-length summary, the next sentence is the one that maximizes the corresponding scope-restricted hierarchical topic coverage after being combined with the already selected sentences. In this manner, with respect to a sentence in a subtopic's exclusive data, all currently uncovered subsumed data of this subtopic are its candidate supporters. Among them, those that fall into the intrinsic space of the sentence set containing both this sentence and the already selected ones are its effective supporters. Our greedy method picks the sentence whose effective supporters contain the most information and appends it to the incremental-length summary. Once a new sentence is picked and appended, the summary's coverage of all nodes in the topic hierarchy is updated as well.

The overall order of the selected sentences is from general to detailed. Because a subtopic located at a low level in the topic hierarchy is a general subtopic, sentences allocated to it are viewed as expressing, or covering, general information. Moreover, a low-level subtopic node has more subsumed data than its children, so sentences allocated to that subtopic can be used to reconstruct more data. As a result, sentences of a low-level subtopic have some priority in being picked first by our algorithm. However, the inverse order, in which a sentence depicting detailed information precedes one about general information, is also possible. In this situation, the general sentence contributes less uncovered information than the detailed one because other general sentences containing redundant information have already been selected, so the detailed sentence is selected first by our algorithm. Afterward, if the unselected general sentence can now contribute the most uncovered information, it will be selected, and an inverse order occurs.

Hierarchical topic coverage is the sum of all subtopics' coverage, which makes our algorithm able to discriminate among nodes even at the same level of the topic hierarchy. As a non-leaf node subsumes all its children's data, a summary's reconstruction contribution for the exclusive data of a high-level node, being part of the subsumed data of all its ancestor nodes, may be added several times into the final hierarchical topic coverage. Thus, our method prefers to select sentences allocated to subtopics that can be divided into sub-subtopics recursively and that contain substantial data. In real applications, these subtopics are usually the most general and important ones. Child nodes shared by different parent nodes in the topic hierarchy can also contribute more than once to the final hierarchical topic coverage. Usually, such subtopics are more general than their siblings that have only one parent. This bias is also captured in the hierarchical topic coverage computation and results in nodes shared by many parents contributing more candidate supporters to sentences allocated to their ancestor nodes.
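For concreteness, here is a compact Python rendering of the greedy loop of Algorithm 1 (ours, omitting the reconstruction ratio bookkeeping of Lines 18 and 20; cov_info is the sketch above, and each node is assumed to expose its subsumed (vector, node) pairs).

```python
def incremental_summary(sentences, nodes, lengths, may_cover):
    # sentences: list of (vector, node) pairs for all sentences in V_1
    # lengths:   ascending summary lengths m_1 < ... < m_k
    selected, summaries = [], []
    for m in sorted(lengths):
        while len(selected) < m:
            best_i, best_ci = None, -1.0
            for i, cand in enumerate(sentences):
                if i in selected:
                    continue
                trial = [sentences[j] for j in selected] + [cand]
                ci = sum(cov_info(n.subsumed, trial, may_cover) for n in nodes)
                if ci > best_ci:      # keep the coverage-maximizing sentence
                    best_i, best_ci = i, ci
            selected.append(best_i)
        summaries.append([sentences[j] for j in selected])  # prefix property
    return summaries
```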
The time complexity of Algorithm 1 is analyzed as follows. Given a document set of n sentences, term size d, and maximum summary length m, Lines 5–21 iteratively select the sentence that, combined with the already selected sentences, maximizes the topic hierarchy coverage under the sentences' coverage-scope limitation; the complexity is O(nm^2(m + d)). For simplicity, the analysis uses Equation (6) instead of Equation (9), that is, without the sentences' coverage-scope limitation; the time complexity of the algorithm is the same for these two ways of measuring the information of a subtopic covered by the summary sentences. It should be noted that the number of nodes in the topic hierarchy cannot exceed the number of sentences in the original document(s); this can be guaranteed by removing from the topic hierarchy any node to which no sentence is allocated.

When k sentences have already been selected, Equation (3) requires computing $X(X^T X)^{-1} X^T v_i$, that is, $v_i$'s reconstruction approximation by X. The time complexity of the matrix multiplications in $X(X^T X)^{-1} X^T v_i$ is O(k^2 + kd). Based on these k selected sentences, we need to compute an inverse matrix for each remaining candidate sentence to find the next sentence according to topic hierarchy coverage maximization. This requires computing the inverses of a series of matrices whose ith element has the form $X_i^T X_i$, where $X_i = [x_1, x_2, \ldots, x_k, x_i] \in R^{d \times (k+1)}$, ⟨x_1, x_2, ..., x_k⟩ corresponds to the k sentences already selected, and x_i represents the ith candidate sentence. During the kth iteration, the time complexity of computing the inverse matrix for one candidate sentence is O(k^2), and for all candidate sentences O(nk^2). In summary, over all m iterations, the time complexity of computing all inverse matrices is $O(n \sum_{k=1}^{m} k^2) = O(nm^3)$. Combined with the cost of the matrix multiplications, the time complexity of Lines 5–21 is $O(n \sum_{k=1}^{m} (k^2 + kd)) = O(nm^3 + nm^2 d)$. The cost of initialization is trivial; thus the overall time cost is O(nm^2(m + d)). Because any method that is able to construct the topic hierarchy defined in this article can be integrated into our framework, [4] we do not discuss the complexity of constructing it.

5. EVALUATION

In this section, we first describe the datasets adopted in our experiments. Then we introduce our evaluation methods, including three suites of experiments and five state-of-the-art models compared against our proposed model. The subsequent sections cover two kinds of evaluations, based on the traditional ROUGE-N score and on our proposed similarity of topic coverage distribution, respectively. Finally, we analyze the influence of the topic hierarchy on our summarization framework.

5.1. Datasets

We evaluate our methods on three collections, covering single-document and multidocument summarization and data in both the typical inverted pyramid and the noninverted writing style.

The first is a Wikipedia page set, used for the single-document experiments. Wikipedia articles are usually written in the inverted pyramid writing style and have a multilevel hierarchical outline. In total, we collected 110 Wikipedia pages spanning a variety of categories, as summarized in Table I. The Wikipedia corpus was collected from March 10, 2014 to March 20, 2014.

The second is the DUC 2007 main summarization task data, [5] used for the multiple-document experiments. There are 45 topics in the DUC 2007 main summarization task data, and each topic contains 25 articles, as summarized in Table II. Most articles in this collection are in the inverted pyramid writing style.

The third is a noninverted writing style collection from multiple sources. We collected 100 articles from BBC, CNN, Discover Magazine, ESPN, and Google Blog. Table III illustrates samples of topics in the corpus. Because each topic contains one single article, this collection is for single-document summarization.

[4] For some of our experiments, we apply a heuristic topic hierarchy construction method on Wikipedia pages.
[5] http://www-nlpir.nist.gov/projects/duc/data/2007_data.html.
Table I. Statistics on the Wikipedia Corpus

Category         | #Topic | Avg #Sentence | Topics
Animal           | 10     | 712.6         | Horse, ...
Botany           | 10     | 783.1         | Tomato, ...
Building         | 10     | 509.4         | Forbidden City, ...
Company          | 10     | 419.3         | Google, ...
Computer Science | 10     | 571.4         | Data mining, ...
Event            | 10     | 1,173.8       | Renaissance, ...
Geography        | 10     | 1,281.0       | Shanghai, ...
Geophysics       | 10     | 632.9         | Tornado, ...
Person           | 10     | 1,063.3       | Isaac Newton, ...
Publication      | 10     | 640.7         | Bible, ...
Transport        | 10     | 1,563.3       | Automobile, ...

Table II. Statistics on the DUC 2007 Corpus. The 45 topics, each with between 265 and 2,224 sentences, are: Southern Poverty Law Center; Art and music in public schools; Amnesty International; Basque separatism; Turkey and the European Union; Israel/Mossad "The Cyprus Affair"; Pakistan and the Nuclear Non-Proliferation Treaty; Jabiluka Uranium Mine; Unemployment in France in the 1990s; US missile defense system; Iran's nuclear capability; World-wide chronic potable water shortages; Microsoft's antitrust problems; Napster; Interferon; Linda Tripp; Acupuncture treatment in U.S.; Deep water exploration; Round-the-world balloon flight; Earthquakes in Western Turkey in August 1999; Steps toward introduction of the Euro; Burma government change 1988; Angelina Jolie; Salman Rushdie; International Land Mine Ban Treaty; Fen-phen lawsuits; Oslo Accords; Senator Dianne Feinstein; Al Gore's 2000 Presidential campaign; Eric Rudolph; Kenya education developments; Reintroduction program for wolves in U.S.; Mining in South America; Day trader killing spree; Organic food; Starbucks Coffee; Matthew Shepard's death; Obesity in the United States; Newt Gingrich's divorce; Line item veto; Public programs at Library of Congress; Oprah Winfrey TV show; After "Seinfeld"; John F. Kennedy, Jr., dies in plane crash; OJ Simpson developments.

Table III. Statistics on the Multisource Corpus of Non-Inverted Pyramid Writing Style

Category    | #Topic | Avg #Sentence | Topics (Source)
Economics   | 10     | 399.6         | How an independent PayPal creates great prospects for payments and long-term value (Google Blog), ...
Environment | 10     | 413.8         | Buy a fish, save a tree (Discover Magazine), ...
Food        | 10     | 349.0         | What's in a name (Discover Magazine), ...
Health      | 10     | 424.4         | The science of sleep (BBC), ...
History     | 10     | 410.4         | WW1 was it really the first world war (BBC), ...
Nature      | 10     | 268.4         | Battle of the Ants (BBC), ...
Politic     | 10     | 203.3         | Is Obama tarnishing his legacy (CNN), ...
Sports      | 10     | 404.7         | Who deserves All-Rookie honors (ESPN), ...
Technology  | 10     | 262.2         | Superbooks high-tech reading puts you inside the story (CNN), ...
Travel      | 10     | 161.2         | The last unexplored place on Earth (Discover Magazine), ...

For preprocessing, we remove the HTML tags from Wikipedia pages and append some content from linked pages. For all documents, we conduct stop-word removal and stemming. Each sentence is then represented using a term-frequency vector.
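A minimal sketch of this preprocessing pipeline, assuming NLTK (the paper does not name a specific toolkit, and the NLTK stopword and tokenizer data must be installed):

```python
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP = set(stopwords.words("english"))
STEM = PorterStemmer()

def sentence_vector(sentence, vocab):
    """Term-frequency vector of one sentence over a fixed vocabulary list."""
    terms = [STEM.stem(w.lower()) for w in word_tokenize(sentence)
             if w.isalpha() and w.lower() not in STOP]
    counts = Counter(terms)
    return [counts[t] for t in vocab]
```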
5.2. Evaluation Methods

5.2.1. Ground Truth Labeling. Six volunteers were involved in generating summaries of incremental lengths. Each was given a target document (set) and a series of summary lengths (4, 8, 12, 16, and 20 sentences in our experiments). Each target document (set) was worked on by three volunteers to generate three versions of summaries. A fourth volunteer then consolidated a final set of summaries for each target document (set) given the three inputs. In particular, we did not require a short summary to be a subset of a longer one. Based on our observations, volunteers usually generate a longer summary from a shorter one in one of two ways: (i) by appending sentences that present uncovered or more detailed information, or (ii) by completely rewriting the longer summary without consulting the shorter ones. The first approach is more commonly adopted.

5.2.2. Evaluation Process. We designed three suites of experiments for evaluation.

First, we performed a standard evaluation comparing our method with a set of baseline methods. The ROUGE [Lin 2004] score is adopted as the evaluation metric. ROUGE provides various measures, and it has been shown that the unigram-based ROUGE-1 score agrees best with human judgment. Although ROUGE is a recall-oriented metric, it can supply evaluation scores based on both recall and precision, as well as the F1 score, the harmonic mean of precision and recall. For evaluating model-generated summaries of fixed length, as in the DUC main summarization task, the recall-based score is preferred. However, the F1 score is more suitable for evaluating summaries of different lengths, as in our experiments. Therefore, the ROUGE-1 F1 score is adopted in our evaluations.
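For reference, the unigram-based ROUGE-1 F1 score can be sketched as follows; this is our simplified reading of Lin [2004], whereas the reported experiments presumably use the standard ROUGE package.

```python
from collections import Counter

def rouge1_f1(candidate_tokens, reference_tokens):
    """F1 over clipped unigram overlap between candidate and reference."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())        # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```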
Second, we propose a topic coverage-based evaluation metric and perform another set of evaluations. Since the standard evaluation considers neither sentence order nor the actual topic coverage, we conduct this evaluation based on the topic coverage distribution over the topic hierarchy. This reveals more insight into the quality of the generated summaries.

Third, we conducted a user study to compare our methods and some of the baselines. Since incremental-length summarization is intended to have practical impact, the usability of the generated summaries is the focus of this evaluation. We consider a few dimensions of usability and compare summaries generated by different methods along these dimensions.

5.2.3. Comparing Methods. We compare our proposed method with the following state-of-the-art methods.

—Leading [Wasson 1998]: After ordering the documents chronologically, the leading sentences are selected one by one. For a single document, the leading sentences are selected one by one directly. Because this model greatly benefits from the inverted pyramid writing style, we include it to determine the fluctuation of performance on documents in different writing styles.

—DSDR [He et al. 2012]: One of the state-of-the-art summarization models, DSDR is a data reconstruction-based method in which the sentences that best linearly reconstruct the original document(s) are selected into the summary. We chose this method to observe the result of maximizing coverage of the whole data in the original document. [6]

—Sum-LDA [Arora and Ravindran 2008]: A Latent Dirichlet Allocation (LDA) topic modeling-based summarization method. This method runs in the latent topic space instead of the original term space. The most related sentences for each latent topic are selected into the summary iteratively.

—Sum-LDA-DSDR: A combination of the topic modeling-based approach and the DSDR approach. Each sentence is first assigned to a unique latent topic. Then, for a specific latent topic, DSDR is adopted to select the sentences that maximize the coverage of that topic, rather than picking the most related sentences as in Sum-LDA.

—FS [Yang and Wang 2003]: One of the state-of-the-art varying-length summarization models. This is the fractal theory-based summarization method introduced in Section 2. Fractal theory is adopted to analyze the document structure and decide how many sentences should be selected into the summary for each element of the document structure. The sentences with the top salience scores under each element are picked. The score of a sentence is computed from four features: the term's thematic feature (TF-IDF is used), heading feature (keywords, titles), cue phrase feature, and the sentence's position. In this study, we use our topic hierarchy to represent the document structure because documents may not have the structural outline needed by the original implementation.

Our method is denoted ILS-HTCM: Incremental-Length Summarization based on Hierarchical Topic Coverage Maximization. To construct the topic hierarchy for Wikipedia articles, we adopt the section outline as the topic hierarchy. The topic identifiers in the hierarchy are generated from the section names in an outline after removing Wikipedia-exclusive sections such as See also, References, and External links. All paragraphs under a section are allocated to the corresponding subtopic node in the topic hierarchy. For DUC 2007 and the noninverted pyramid writing style corpus, a literal hierarchical topic structure does not exist; we use hPAM model 2, as described in Section 4.1, to construct the topic hierarchy, with the Mallet toolkit [7] as the basis of our hPAM implementation. Because the hPAM method is general, we also apply it to the Wikipedia corpus. To denote the models by the source of their topic hierarchies, for DUC 2007 and the noninverted pyramid writing style corpus we use ILS-HTCM_hPAM and FS_hPAM, both based on the topic hierarchy constructed by hPAM. For the Wikipedia dataset, in addition to ILS-HTCM_hPAM and FS_hPAM, we also use ILS-HTCM_wiki and FS_wiki, which are based on the literal section outline.

[6] There are two models for DSDR: DSDR-lin (DSDR with linear reconstruction) adopts a greedy algorithm and generates a summary by selecting sentences one by one; DSDR-non (DSDR with non-negative reconstruction) restricts the linear reconstruction parameters to be non-negative and generates a summary that is not of incremental length. DSDR-lin is adopted in this article.
[7] http://mallet.cs.umass.edu/.
Our method is denoted ILS-HTCM: Incremental-Length Summarization based on Hierarchical Topic Coverage Maximization. To construct the topic hierarchy for Wikipedia articles, we adopt the section outline as the topic hierarchy. The topic identifiers in the hierarchy are generated from the section names of an outline after removing Wikipedia-exclusive sections such as See also, References, and External links. All paragraphs under a section are allocated to the corresponding subtopic node in the topic hierarchy. For DUC 2007 and the noninverted pyramid writing style corpus, a literal hierarchical topic structure does not exist, so we make use of hPAM model 2, as described in Section 4.1, to construct the topic hierarchy; the Mallet toolkit (http://mallet.cs.umass.edu/) is adopted as the basis of the hPAM implementation. Because the hPAM method is general, we also apply it to the Wikipedia corpus. To denote the models by the source of their topic hierarchies: for the DUC 2007 and noninverted pyramid writing style corpora, we use ILS-HTCM_hPAM and FS_hPAM, which are both based on the topic hierarchy constructed by hPAM; for the Wikipedia dataset, in addition to ILS-HTCM_hPAM and FS_hPAM, we also use ILS-HTCM_wiki and FS_wiki, which are based on the literal section outline.

Table IV. Wikipedia Corpus

Method          4       8       12       16       20
Leading         0.5324  0.5751  0.5086   0.4861   0.4635
DSDR            0.2954  0.3614  0.4046   0.4164   0.4414
Sum-LDA         0.2850  0.3584  0.3796   0.3978   0.4236
Sum-LDA-DSDR    0.2917  0.3493  0.3952   0.4132   0.4324
FS_hPAM         0.3255  0.3693  0.3509   0.3459   0.3442
FS_wiki         0.4737  0.4987  0.4713   0.4330   0.4191
ILS-HTCM_hPAM   0.3429  0.3882  0.4401   0.4782   0.4964†
ILS-HTCM_wiki   0.5028  0.5151  0.5293‡  0.5543†  0.5555†

Comparison of the proposed summarization method with five baselines on summaries of 4, 8, 12, 16, and 20 sentences, in terms of ROUGE-1 F1 score. † and ‡ denote significant differences (t-test, p-value < 0.05) over all baselines and over all baselines but Leading, respectively.

Table V. DUC 2007 Corpus

Method          4       8        12       16       20
Leading         0.4978  0.4241   0.3745   0.3415   0.3306
DSDR            0.2692  0.3163   0.3478   0.3830   0.4023
Sum-LDA         0.2598  0.3093   0.3406   0.3640   0.3780
Sum-LDA-DSDR    0.2643  0.3108   0.3342   0.3694   0.3871
FS_hPAM         0.2633  0.2871   0.2873   0.2884   0.2789
ILS-HTCM_hPAM   0.2822  0.3705‡  0.4022‡  0.4249†  0.4440†

Comparison of the proposed summarization method with five baselines on summaries of 4, 8, 12, 16, and 20 sentences, in terms of ROUGE-1 F1 score. † and ‡ denote significant differences (t-test, p-value < 0.05) over all baselines and over all baselines but Leading, respectively.

Table VI. Non-Inverted Pyramid Writing Style Corpus

Method          4       8       12      16       20
Leading         0.1844  0.2029  0.2215  0.2154   0.2137
DSDR            0.2866  0.3337  0.3856  0.4089   0.4358
Sum-LDA         0.2737  0.3213  0.3467  0.3807   0.4070
Sum-LDA-DSDR    0.2847  0.3245  0.3638  0.3977   0.4221
FS_hPAM         0.2946  0.3320  0.3423  0.3399   0.3329
ILS-HTCM_hPAM   0.3320  0.3770  0.4201  0.4514†  0.4732†

Comparison of the proposed summarization method with five baselines on summaries of 4, 8, 12, 16, and 20 sentences, in terms of ROUGE-1 F1 score. † denotes significant differences (t-test, p-value < 0.05) over all baselines.

5.3. Standard Evaluation

In our experiments, we evaluated summaries of 4, 8, 12, 16, and 20 sentences on all three corpora: Wikipedia pages, DUC 2007, and the noninverted pyramid writing style dataset. Tables IV, V, and VI present the average ROUGE-1 F1 scores for our proposed model and five baselines on the Wikipedia, DUC 2007, and noninverted pyramid writing style datasets, respectively. By comparing the results of these models, we make the following observations:

(1) Our methods generally perform better than the baseline systems over the whole range of summary lengths, on both single- and multiple-document summarization, and on datasets of various writing styles. For DUC 2007 and Wikipedia, which mainly consist of articles in the inverted pyramid writing style, Leading performs very well among the baselines, especially when the summary length is small (up to eight sentences). This happens because Leading benefits greatly from the inverted pyramid writing style in these corpora: by selecting the first several sentences, it is highly likely to cover the most important and general information. Thus, at different length cuts, Leading can be as good as a human summarizer at generating an incremental-length summary of a topic. However, our methods, which make no use of this writing style, are also capable of extracting summaries of the same high quality. For the corpus of noninverted pyramid writing style, as expected, Leading degrades severely and yields the worst performance.
Here, in contrast, our model prevails over all baselines, making it more suitable for general scenarios where the inverted pyramid writing style is not adopted.

(2) The Wikipedia topic structure and the hPAM-generated topic structure affect the models based on topic structure, ILS-HTCM and FS, differently. On the Wikipedia data, using Wikipedia's literal hierarchical topic structure, ILS-HTCM and FS are significantly better than the other methods except Leading when the summary length is small (up to eight sentences). When based on the topic hierarchy constructed by hPAM, both ILS-HTCM and FS lose performance. This indicates that the original Wikipedia article structure is better than the one generated by hPAM, which is not surprising: the Wikipedia structure is manually built, and the contents of the sections of a Wikipedia page comply well with its literal topic structure. On DUC 2007 and the corpus of noninverted pyramid writing style, FS with hPAM is one of the worst methods because of its deep reliance on structure-based sentence scoring: the term's heading feature is invalid, and the sentence's position feature is not indicative of importance.

(3) ILS-HTCM's global perspective is better than FS's independent perspective. When determining the exact amounts of general and detailed information to be covered, ILS-HTCM adopts a global perspective by maximizing the overall topic hierarchy coverage with linear data reconstruction. For FS, by contrast, neither the inheritance of a node's quota according to the weights of all its children nor the selection of the most salient sentences under a node is carried out with global consideration: during FS's summarization process, a sentence selected under one node has no influence on sentence selection at other nodes of the topic hierarchy.

(4) ILS-HTCM's performance rises as the summary length increases, while both Leading and FS degrade. In a Wikipedia page, the most important and general information is expressed first, but the less general information is distributed roughly uniformly throughout the remaining article, so the Leading method covers only a little of it. For FS, as more sentences are selected, detailed sentences replace the most general ones. This deviates from the human extractor's habit, in which the most general sentences are kept and the summary is appended with details.

(5) Among the models that do not use topic structures, DSDR takes the lead because it maximizes coverage of the whole document and considers the links among all sentences during summarization. Sum-LDA classifies sentences and picks the most related sentence for each class; however, the relations between sentences are not considered in this process. Integrated with DSDR, Sum-LDA-DSDR performs better than Sum-LDA because it is able to capture the relations between sentences under the same latent topic, but it still cannot compete with pure DSDR. Sum-LDA-DSDR's performance shows that the benefits of sentence classification cannot compensate for its disadvantages.

5.3.1. Compared with DUC Submissions. We test the performance of our method when applied to traditional fixed-length summarization. Although this is not the focus of our work, it may help to evaluate the general strength of our method.
Because the summary length is fixed in DUC 2007, we adopt the ROUGE recall score here instead of the F1 score. Table VII shows the average recall scores of ROUGE-1, ROUGE-2, and ROUGE-3 on summaries of 250 words for all peer models and the best system in DUC 2007.

Table VII. DUC 2007

Method                           ROUGE-1  ROUGE-2  ROUGE-3
Best system in DUC 2007 (ID 15)  0.4451   0.1245   0.0460
Leading                          0.3174   0.0612   0.0174
DSDR                             0.3733   0.0737   0.0206
Sum-LDA                          0.3467   0.0687   0.0178
Sum-LDA-DSDR                     0.3524   0.0699   0.0182
FS_hPAM                          0.2862   0.0428   0.0139
ILS-HTCM_hPAM                    0.4212†  0.1084†  0.0381†

Comparison of the proposed summarization method with five baselines and the best system in the DUC 2007 main task on summaries of 250 words, in terms of recall scores for ROUGE-1, ROUGE-2, and ROUGE-3. † denotes significant differences (t-test, p-value < 0.05) over all baselines.

As in the incremental-length evaluation, our model is still much better than the five baselines; however, it cannot compete with the best system in DUC 2007. This is not surprising. Our target applications differ from traditional ones in that our model is adapted for applications in which the summary length is dynamically decided by the user, for example, to help the user efficiently identify interesting articles by reading only a summary. In such applications, the user's reading burden, network traffic, judgment time, and judgment accuracy are the more important indicators. Although these properties make our model fit such applications well, they also put it at a disadvantage when generating a summary of a predefined fixed length, as in the DUC 2007 main summarization task.

5.3.2. Per Category Results. To see whether our method is stable across different categories, we break down the results and present them in Figure 4. From the ROUGE-1 F1 scores for all 11 categories in our Wikipedia corpus, we can see that the performance is relatively stable across categories, which indicates the stability of our method. On the other hand, we also checked the average document lengths of the different categories, shown in Table I, and find no obvious association between document length and performance.

Fig. 4. Wikipedia. Per category ROUGE-1 F1 score of ILS-HTCM_hPAM. The summary length is set to 20 sentences; the numbers of super- and subtopics are set to 20 and 50, respectively.

5.3.3. Example Summaries. Figures 5 and 6 show the summaries generated by our proposed model ILS-HTCM on a specific topic for DUC 2007 and for Wikipedia. In these two figures, "Exclusive/Subsumed Data Info" is the information (Equation (5)) of the exclusive data and the subsumed data (cf. Section 3.1) for a topic, respectively. From these figures, we can see that the overall order of the selected sentences runs from general topics to detailed ones, which is usually the order a human extractor follows.

Fig. 5. DUC 2007, a summary of 20 sentences generated by ILS-HTCM_hPAM for the [D0701A] topic. The numbers of super- and subtopics are set to 20 and 50, respectively, for a three-level hPAM.

Fig. 6. Wikipedia, a summary of 20 sentences generated by ILS-HTCM_wiki for Isaac Newton.
5.4. Proposed Topic Coverage Distribution-Based Evaluation

5.4.1. Sentence Order in Summaries. The incremental nature of incremental-length summarization requires us to pay explicit attention to sentence order. However, ROUGE score-based evaluation makes it hard to differentiate the quality of summaries that order their sentences differently. Although there is some work on sentence order optimization, most of it focuses on making the summary more coherent by examining consecutive sentences; few attempts try to make the sentence order follow the general-to-detailed scheme or the inverted pyramid writing style.

We take a novel perspective on this evaluation issue. Because the target content has been organized in a topic hierarchy, we can easily obtain the general-detailed relations among the subtopics. On the other hand, we assume that a human abstractor generates a summary by covering the subtopics from general to detailed. Therefore, by comparing the subtopics covered by the system summaries and by the human summaries, we can determine which system summaries follow the human abstraction order. Note that our proposed methods naturally generate properly ordered summaries whether or not the input document (set) is written in the inverted pyramid style. Although we could design a classifier to distinguish the inverted pyramid style from general noninverted styles, a noninverted pyramid writing style document (set) would still need to be handled.

5.4.2. Topic Coverage Distribution Definition. To find good system summaries that are consistent with the human summaries in terms of subtopic covering order, we first define the topic coverage distribution based on the topic hierarchy structure. This results in a novel evaluation method proposed specifically for the incremental-length summarization task.

We first define the importance, or weight, of a subtopic in a topic hierarchy. Given a subtopic $st$, let $SD_{st}$ be the subsumed data for $st$ in the topic hierarchy (cf. Section 3.1). We define the weight of $st$ as the information contained in $SD_{st}$; adopting Equation (5) to measure the information of a sentence set, this is $W_{st} = \sum_{v \in SD_{st}} \|v\|_2^2$. For the $n$ subtopics in a topic hierarchy, we obtain the topic weight vector $\langle \log(1 + W_{st_1}), \ldots, \log(1 + W_{st_n}) \rangle$. In practice, we use $\log(1 + W_{st})$ to moderate the extremely large weights of low-level subtopic nodes (those near the top of the hierarchy), whose subsumed data are usually large.

For a summary $S$, we obtain the topic coverage ratio vector $\langle CR^S_{st_1}, \ldots, CR^S_{st_n} \rangle$, where the $RR$ defined in Equation (8) is adopted as $CR^S_{st}$.

The topic coverage distribution $TCD^S_{TH}$ of a summary $S$ on a topic hierarchy $TH$ is then defined as the elementwise product of the topic weight vector and the topic coverage ratio vector:

$TCD^S_{TH} = \langle CR^S_{st_1} \cdot \log(1 + W_{st_1}), \ldots, CR^S_{st_n} \cdot \log(1 + W_{st_n}) \rangle$.

Finally, the similarity of two summaries $S_1$ and $S_2$ in terms of their topic coverage distribution is defined as the cosine similarity of $TCD^{S_1}_{TH}$ and $TCD^{S_2}_{TH}$:

$SIM\_TCD(S_1, S_2, TH) = \dfrac{TCD^{S_1}_{TH} \cdot TCD^{S_2}_{TH}}{\|TCD^{S_1}_{TH}\|_2 \, \|TCD^{S_2}_{TH}\|_2}$.
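Putting these definitions together, the following short sketch computes SIM_TCD for two summaries. The array order fixes the subtopic indexing, and the function names are our own, chosen for exposition.

    import numpy as np

    def tcd(coverage_ratios, subsumed_info):
        """Topic coverage distribution over a fixed subtopic ordering.
        coverage_ratios: CR_st per subtopic (the RR of Equation (8));
        subsumed_info: W_st per subtopic (Equation (5) on the subsumed data)."""
        w = np.log1p(np.asarray(subsumed_info, dtype=float))   # log(1 + W_st)
        return np.asarray(coverage_ratios, dtype=float) * w    # elementwise product

    def sim_tcd(cr_s1, cr_s2, subsumed_info):
        """Cosine similarity of two summaries' topic coverage distributions."""
        d1, d2 = tcd(cr_s1, subsumed_info), tcd(cr_s2, subsumed_info)
        denom = np.linalg.norm(d1) * np.linalg.norm(d2)
        return 0.0 if denom == 0 else float(d1 @ d2 / denom)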
5.4.3. Evaluation Based on Topic Coverage Distribution. In this suite of experiments, we again evaluate summaries of 4, 8, 12, 16, and 20 sentences for Wikipedia pages, DUC 2007, and the noninverted writing style dataset. Figures 7–9 show the similarity of topic coverage distribution between the automatically generated summaries and the gold standard for our proposed model and the five baselines on the three datasets, respectively. The gold standards in these experiments are the same as those used in the standard evaluation.

Fig. 7. Wikipedia corpus. The similarity of topic coverage distribution between system summaries and the gold standard.

Fig. 8. DUC 2007 corpus. The similarity of topic coverage distribution between system summaries and the gold standard.

Fig. 9. Noninverted pyramid writing style corpus. The similarity of topic coverage distribution between system summaries and the gold standard.

From the results, we can see that the overall picture for all competing models measured with SIM_TCD is similar to that based on the ROUGE-1 score: ILS-HTCM takes the lead on all three corpora, and Leading performs well only on document sets of inverted pyramid writing style when the summary length is small. The agreement between our proposed SIM_TCD and the ROUGE score indicates that SIM_TCD can also be used in general summarization evaluation tasks.

Some observations particular to the new metric follow. First, as the summary length increases, the performance of Leading and FS drops much more sharply here than in the ROUGE score-based evaluation. This suggests that our proposed measure is sensitive to the topic coverage distribution: a deviation in topic coverage distribution between two summaries can be detected effectively.

Second, among the remaining three baselines, Sum-LDA-DSDR performs best. DSDR is the weakest because it works in the term space without considering any topic information. Sum-LDA's topic coverage distribution is closer to the gold standard than DSDR's because it selects sentences based on the latent topics. Sum-LDA-DSDR inherits the advantage of Sum-LDA and additionally incorporates sentence relations when selecting sentences from the same latent topic.

Third, the summary quality measured by topic coverage distribution ranks Sum-LDA-DSDR > Sum-LDA > DSDR. This differs from the standard evaluation, where DSDR > Sum-LDA-DSDR > Sum-LDA. Combined with the analysis in Section 5.3, we see that accounting for the topic-level information of sentences plays the main role in SIM_TCD-based evaluation, while capturing the links between sentences is more important in ROUGE score-based evaluation.

5.5. Analysis on Topic Hierarchy

5.5.1. Number of Levels of the hPAM Model. In practice, we fix the number of levels of the hPAM model at three, according to the following empirical observations: (1) For all three kinds of datasets used in our experiments, a three-level hierarchical topic structure is the most common. Both the section outlines that we directly adopt as topic hierarchies for Wikipedia articles and the hierarchical topic structures manually constructed by our volunteers show that the three-level structure prevails over all others. It should be pointed out that a node can have no child node in hPAM; thus the structure of hPAM is quite flexible.
(2) Our summarization model prefers to select sentences at nodes of low levels (the root node being at level 0) in a topic hierarchy. In our experiments, even for the longest summary of 20 sentences, almost all sentences are selected from nodes located at the top two levels of a three-level hPAM, which demonstrates that three levels are enough for our experiments. On the other hand, if hPAM had only two levels, it is quite possible that almost all sentences would be allocated to a single level, in which case our model would degenerate to one of the baselines. Because the numbers of super- and subtopics in a three-level hPAM determine how many sentences are allocated to each level, these two parameters are much more worth optimizing than the number of levels.

5.5.2. Tuning the Super- and Subtopic Numbers. For all three kinds of datasets, we adopt the three-level hPAM model 2 to construct the topic hierarchy. In this model, apart from the root topic at level zero, we need to specify the number of supertopics at level one and the number of subtopics at level two.

Table VIII. Wikipedia

#sub \ #super   5       10       20      30      40       100
10              0.4761  0.4836   0.4854  0.4903  0.4853   0.4778
20              0.4803  0.4872   0.4952  0.4937  0.4929   0.4789
50              0.4824  0.4938   0.4964  0.4955  0.4968*  0.4835
100             0.4880  0.4968*  0.4957  0.4966  0.4936   0.4851
200             0.4850  0.4861   0.4881  0.4881  0.4860   0.4782
1,000           0.4799  0.4679   0.4613  0.4542  0.4483   0.4419

The effect of the numbers of super- and subtopics of the three-level hPAM on ILS-HTCM_hPAM. Results reported are ROUGE-1 F1 scores for summaries of 20 sentences (highest scores marked with *).

During the Gibbs sampling process, with all dimensions of the hPAM hyperparameters fixed, increasing the number of supertopics means that fewer words are sampled directly by the root topic; likewise, increasing the number of subtopics decreases the number of words sampled directly by the supertopics. As a result, some sentences flow down from the root topic to the supertopics and from the supertopics to the subtopics. From Table VIII, we can see that performance first improves and later stabilizes as the numbers of super- and subtopics increase. However, when both are extremely large (100 and 1,000 in our experiment), performance drops severely. A small number of supertopics leads to excessive sentences being allocated to the root topic. Because sentences allocated to the root topic have the largest coverage scope and enjoy some priority in being selected into a summary, too many such sentences result in a summary consisting of sentences from the root topic only; in this case, our model degrades to DSDR. The first column of Table VIII (five supertopics) shows this situation. On the other hand, excessive super- and subtopics together result in almost all sentences being allocated to subtopics, and our model degrades to Sum-LDA-DSDR; the bottom right of Table VIII presents this situation. However, many supertopics combined with few subtopics do little harm to performance (the top right of Table VIII): although many sentences are allocated to the supertopics in this case, more sentences still sit in the subtopics. This reduces the three-level hPAM to two levels, and a two-level hPAM can still reveal the hierarchical relations among the topics in the topic hierarchy.
Because only a small number of sentences will be selected into the summary, having fewer sentences allocated to the root topic and the supertopics is acceptable to a certain extent. From the preceding analysis, we suggest that, in a real application, setting the numbers of super- and subtopics slightly high is preferable for good performance. We can draw similar conclusions on the other two kinds of datasets and from the topic coverage distribution evaluation, but we do not report those results here due to space limitations.

5.5.3. Sentence Distribution on Topic Hierarchy. We now take a closer look at the sentence distribution on the topic hierarchies. In Table IX, we present the average number of sentences at each level of the topic hierarchies, organized by category. To make sense of these statistics, consider Figure 4, where differences among categories exist but are not significant. We see that the slight differences in performance are closely related to the number of sentences distributed over the various levels. Too many sentences allocated to the root topic and the supertopics can degrade our model to DSDR, which results in poor performance, as in the "Animal" category. On the other hand, in theory, few sentences will be located at the top two levels when the numbers of super- and subtopics are set extremely high; as in Table VIII, when the super- and subtopic numbers are set to 100 and 1,000 (the largest in the table), the performance is the worst, and the model degenerates to Sum-LDA-DSDR. A moderate number of sentences across the levels results in good performance. We thus conclude that a proper topic hierarchy structure benefits incremental-length summarization. Here, a proper structure means that the numbers of super- and subtopics are balanced without going to extremes. In such a structure, our method distributes the sentences properly over the topics, which guides the summarization process to include the right amounts of general and detailed content.

Table IX. Wikipedia

Category          Sentences in level 0  Sentences in level 1  Sentences in level 2
Animal            218.5                 436.1                 58
Botany            152                   295.4                 335.7
Transport         144.9                 311.2                 1,107.2
Geophysics        133.4                 215.6                 283.9
Geography         166.2                 319.3                 795.5
Company           75.6                  161.3                 182.4
Publication       95.3                  223.1                 322.3
Person            87.1                  155.7                 820.5
Event             107.3                 224                   842.5
Computer Science  45.5                  140.3                 385.6
Building          16.7                  65.8                  426.9

For all 11 categories, the number of sentences allocated to each level in the three-level hierarchy generated by hPAM model 2. The numbers of super- and subtopics are set to 20 and 50, respectively.

5.5.4. Topic Number and Efficiency. Figure 10 shows the trend of the running time as the subtopic number varies, for summaries of 20 sentences generated by ILS-HTCM with the supertopic number set to 20. We observe a linear correlation between the overall running time and the number of subtopics. According to the analysis in Section 4.4, the summarization algorithm's running time is independent of the topic hierarchy structure, so the linear trend cannot come from the summarization algorithm itself.

Fig. 10. DUC 2007. The effect of the number of subtopics in the three-level hPAM on the running time of ILS-HTCM_hPAM. The summary length is set to 20 sentences, and the supertopic number is set to 20.
Rather, the number of subtopics affects the time complexity of topic hierarchy construction: it plays a part both in the Gibbs sampling process of hPAM and in sentence allocation. Analyzing the hPAM implementation in the Mallet toolkit, we find that the time complexity of hPAM is directly proportional to the number of subtopics. The complexity of sentence allocation (cf. Equation (1)) also grows linearly with the number of subtopics. This analysis is consistent with our observation in Figure 10 that the overall running time is proportional to the number of subtopics. In summary, small super- and subtopic numbers are good for efficiency. Finally, considering both performance and efficiency, we set the numbers of super- and subtopics to 20 and 50, respectively, for all our datasets.

6. USER STUDY

To study the usability of the proposed method, we designed a user study to see how it improves efficiency in information consumption and judgment making. We designed an application scenario in which users use incremental summaries to identify how interesting an article is; an example is depicted in Figure 1.

6.1. Evaluation Indexes

In particular, we are interested in new requirements related to the real application of the incremental-length summary. In general, the goals are that (i) the summary accurately reflects the topic information of the original article; (ii) only the short summary needs to be read instead of longer ones, and the user may stop reading at an early point; (iii) the set of incremental summaries supports efficient judgment making in terms of time; and (iv) the amount of content (the incremental summary) transmitted over the network is minimized. In detail, we have the following four evaluation indexes:

Judgment accuracy: To evaluate whether a judgment is accurate, we set a main topic for each article. This gold-standard main topic is selected after reading the whole article. The annotator selects a topic based on the summary, which is compared against the gold-standard topic.

Reading burden: This is measured by the number of words read up to the point at which the user makes her decision (the cutoff point). To lower the user's reading burden, the summary must deliver the most important and general information first.

Judgment efficiency: The total time spent making the decision is measured. Assuming a constant reading speed, the reading time is linearly correlated with the length of the summary. However, the coherence and understandability of the summary also play a crucial role in determining the time spent, because users may have to read a passage multiple times if the content is badly presented.

Network traffic: To measure the network traffic incurred in sending the summary, an objective measure is the number of bytes sent by the server. Since the data format (JSON, XML, etc.) and transfer protocol take up some space, in this work we adopt the number of words sent by the server to represent network traffic. This is not the same as the reading burden: if the server sends more sentences than the user actually reads, only the network traffic increases while the user's reading burden remains untouched.

Among these four indexes, given the purpose of the target application, judgment accuracy is the prerequisite.
The other three indexes are meaningless if there is a large deviation between a user's judgments based on the incremental-length summary and on the full article. In summary, a good incremental summarization method should produce high judgment accuracy, low reading burden, fast judgment making, and low network traffic.

Table X. Statistics on the Inverted Pyramid Writing Style and General Non-Inverted Document Sets for the User Study

Writing Style        #Articles  Avg Sentences  Source
Inverted pyramid     100        187.4          BBC, CNN, Wikipedia
General non-inverted 100        329.7          BBC, CNN, Discover Magazine, ESPN, Google Blog

Table XI. Information Recorded When a User Makes a Judgment After Reading Some Sentences of an Incremental-Length Summary for an Article, and the Corresponding Evaluation Index

Recorded Information             Corresponding Evaluation Index
Article's main topic             Judgment accuracy
Article's interestingness value  Judgment accuracy
Number of words read by user     Reading burden
Total time for making judgment   Judgment efficiency
Number of words sent by server   Network traffic

6.2. User Study Design

For the target document styles, we carried out our user study on both inverted pyramid and general noninverted writing style documents. Because our method is not sensitive to the sentence order of the original document(s), it can be applied to articles with arbitrary structure. The inverted pyramid writing style lets news articles adapt to readers who spend varying amounts of time reading, and it is also the preferred sentence ordering in an incremental summary. However, a portion of news articles do not follow the inverted pyramid style, as pointed out by Yang and Nenkova [2014]; for example, some news articles have openings that are creative rather than informative. Table X briefly summarizes the two document sets adopted in this user study. In particular, for the inverted pyramid writing style document set, we chose only articles that strictly follow this writing style.

For a specific article, a user is supplied with an incremental-length summary, as depicted in Figure 1. The user keeps reading sentences from the summary one after another until he or she can confidently identify both the main topic of the article and its interest value (supposing that he or she is interested in this topic). The interest value is an integer ranging from 0 to 5 (inclusive); a higher value indicates a higher level of interest. Once a user can make a judgment, he or she stops reading more sentences. Table XI lists all recorded information and the corresponding evaluation indexes. Of the four evaluation indexes, three are directly observed; we only need to manually generate the ground truth for judgment accuracy, which is based on reading the full article.

Five volunteers took part in our user study. Each was responsible for 20 different articles in the inverted pyramid and noninverted styles, respectively. We carried out the user study in a good network environment, where the delay between the user's one-more-sentence request and the sentence appearing on the mobile phone screen is negligible. Therefore, the user's judgment time roughly equals the time spent reading the summary. To measure the number of words sent by the server more accurately, the server returns only one new sentence for each request it receives. (In a real application, if the network condition is poor, the server can send back several sentences per request to reduce the number of interactions.)
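The delivery protocol just described is simple enough to state as a toy model: the server holds a precomputed incremental-length summary and releases one new sentence per request while accumulating a traffic counter. The class below is illustrative only; all names are our own, and transport details are omitted entirely.

    class IncrementalSummaryServer:
        """Toy model of one-sentence-per-request summary delivery."""

        def __init__(self, ordered_sentences):
            self.sentences = list(ordered_sentences)  # general-to-detailed order
            self.cursor = 0
            self.words_sent = 0                       # drives the network-traffic index

        def next_sentence(self):
            """Return the next unread sentence, or None when exhausted."""
            if self.cursor >= len(self.sentences):
                return None
            sentence = self.sentences[self.cursor]
            self.cursor += 1
            self.words_sent += len(sentence.split())
            return sentence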
6.3. User Study Results

In this user study, we compared our method with the same methods as in the quantitative experiments (cf. Section 5.2.3). Among the five baseline methods, FS is unable to generate an incremental-length summary, so we adapt it as follows: when the user requests one more sentence, we deliver the most salient sentence newly selected into the FS summary. In this way, users consume an incremental-length FS summary in the same way that they consume the other incremental-length summaries.

The interest of an article to a user is valid only after he or she correctly identifies the main topic. Therefore, for the judgment accuracy evaluation, we consider both the main topic and the interest value. Because an article's interest value falls in the range {0, 1, 2, 3, 4, 5}, the similarity of the judgments based on the summary and on the full article is defined as

$SIM\_J(J_S, J_A) = \begin{cases} 0 & \text{if } MT_S \neq MT_A \\ 1 - \frac{|I_S - I_A|}{5} & \text{if } MT_S = MT_A \end{cases}$

where $J_S$ and $J_A$ are the judgments based on the summary and on the full article, respectively; $MT_S$ and $I_S$ are the main topic and the interest value identified in $J_S$; and $MT_A$ and $I_A$ are those identified in $J_A$. For example, if the main topics agree and the interest values are 4 and 2, then $SIM\_J = 1 - 2/5 = 0.6$. The main topic accuracy is the ratio of articles whose main topics in $J_S$ and $J_A$ are the same; the judgment accuracy is $SIM\_J(J_S, J_A)$ averaged over all articles.
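Stated as code, the judgment-similarity computation is only a few lines; the function name and argument names below are ours, not part of any released artifact.

    def sim_j(main_topic_s, interest_s, main_topic_a, interest_a):
        """SIM_J: 0 when the main topics differ, else 1 - |I_S - I_A| / 5,
        with interest values in {0, 1, 2, 3, 4, 5}."""
        if main_topic_s != main_topic_a:
            return 0.0
        return 1.0 - abs(interest_s - interest_a) / 5.0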
Table XII. Comparison of the Proposed Summarization Method with the Five Baselines on Judgment Accuracy, on the Inverted Pyramid and General Noninverted Writing Style Document Sets

                Inverted Pyramid document set           General Noninverted document set
Method          Main Topic Acc.   Judgment Acc.         Main Topic Acc.   Judgment Acc.
Leading         0.990             0.694                 0.560             0.340
DSDR            0.810             0.528                 0.730             0.452
Sum-LDA         0.830             0.568                 0.780             0.490
Sum-LDA-DSDR    0.840             0.588                 0.750             0.470
FS_hPAM         0.710             0.404                 0.730             0.374
ILS-HTCM_hPAM   0.940             0.630                 0.870             0.580

Judgment accuracy results are presented in Table XII for both the inverted pyramid and the general noninverted writing style document sets. From the table, we make the following observations. First, ILS-HTCM performs well on both main topic accuracy and judgment accuracy, independent of the document's writing style; thus, ILS-HTCM is able to convey the most important information first for articles of any writing style. Second, for inverted pyramid documents, the Leading method is the best; however, its performance drops severely on general noninverted documents, where it becomes the worst of the six methods. Third, all methods perform better on inverted pyramid documents than on general noninverted ones. Compared with an inverted pyramid article, the subtopics of a general noninverted one are less strongly related to its main topic, which makes it harder for a user to correctly identify the main topic and thus leads to performance degradation.

For the user's reading burden, we measure the number of words read by a user until a judgment is made. From Figure 11(a) we can see, first, that ILS-HTCM brings about little reading burden on both inverted and noninverted documents. Second, for inverted pyramid documents, the Leading method performs best at reducing the user's reading burden; however, when applied to noninverted documents, its reading burden increases sharply and becomes much worse than that of ILS-HTCM. What is more, the Leading method leads users to make many wrong judgments on noninverted documents: when a user has consumed only trivial information from an article, the Leading method is prone to mislead her into believing that the main topic has already been caught, so she stops requesting more sentences. Third, FS brings about the largest reading burden, because it delivers detailed information even before sending adequate general information.

Fig. 11. Helping identify article interest. Comparison of the proposed summarization method with the five baselines on four aspects, on the inverted pyramid writing style and general document sets.

Judgment efficiency is evaluated by the overall time spent by the user before making a judgment. From Figure 11(b), we can draw conclusions similar to those of the reading burden evaluation. Because our user study was conducted in a good network environment, the judgment time is closely related to the user's reading burden; it also depends on the user's reading speed. Figure 11(c) shows that users read summaries generated by the Leading method fastest, with no significant difference among the remaining methods.

Finally, network traffic is evaluated by the number of words sent by the server in Figure 11(d). As Figure 11(a) already reports the number of words that users actually read, here we focus on the wasted network traffic, measured by the number of words sent by the server but not read by users. For all six methods, users always consume an incremental-length summary; therefore, the waste of network traffic is small regardless of the document writing style, and Figure 11(d) supports this point well. However, some waste still exists for all methods, and in our understanding it occurs in two ways. First, when reading a summary on a mobile phone screen, the user is used to scrolling the sentence being read up to the middle of the screen; to keep this process smooth for the user's reading experience, subsequent sentences of the summary are sent to fill the space in the lower part of the screen. Second, the last sentence of the summary is always only partially read.

In addition, we conducted a study on the network traffic introduced by a nonincremental-length summarization method.
Table XIII. Wasted Network Traffic of DSDR with a Nonincremental Summarization Approach on the Inverted Pyramid Writing Style and General Noninverted Document Sets

Document Set                      #Words sent by server - #Words read by user
Inverted pyramid document set     47.86
General noninverted document set  56.39

Table XIII shows the wasted network traffic for a nonincremental-length summarization method: DSDR with non-negative reconstruction. (For FS, we deliver only one of the newly selected summary sentences to the user for each request for more sentences.) The wasted network traffic is extremely high compared with the results in Figure 11(d) for the incremental-length summarization methods. This is evidence that a nonincremental-length summary is not suitable for the application in our user study.

7. CONCLUSION

In this work, we proposed a model to generate an incremental-length summary whose length is dynamically decided by the user. Making use of a topic hierarchy constructed from the original document(s), our model generates an incremental-length summary by maximizing hierarchical topic coverage. We define a coverage scope for each sentence according to the position of its corresponding subtopic in the topic hierarchy. Based on hierarchical topic coverage maximization under the limitation of each sentence's coverage scope, the overall order of sentence selection in our model is to cover more general information before details.

Our proposed summarization model answers the two questions raised in the introduction. First, the next sentence to be appended to a summary is the one that expresses the highest-level, novel information. Second, to generate incremental-length summaries of good quality, a summarization model should generate summaries of varying lengths by continuously maximizing coverage under the scope limits imposed by the hierarchical topic structure.

We utilized two metrics, the ROUGE-1 score and our proposed similarity of topic coverage distribution, to evaluate the performance of the incremental-length summaries generated by our model. Our experiments on the DUC 2007 main summarization task data, Wikipedia pages, and a general noninverted writing style multisource dataset demonstrated the effectiveness of the proposed model. We also carried out a user study on a handheld device-based application that aimed to help users identify the interestingness of articles. The user study further indicated that our model is able to improve the accuracy and speed of judgment, as well as to reduce the reading burden and network traffic, for articles in both inverted pyramid and general writing styles.

For future work, we will explore more topic hierarchy construction models. The quality of the underlying topic hierarchy has an important influence on our summarization results, and with the development of language processing techniques, our model will benefit from more accurate topic hierarchy modeling.

REFERENCES

Rachit Arora and Balaraman Ravindran. 2008. Latent Dirichlet allocation based multi-document summarization. In Proceedings of the 2nd Workshop on Analytics for Noisy Unstructured Text Data. ACM, 91–97.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, 1 (2003), 993–1022.
Ronald Brandow, Karl Mitze, and Lisa F. Rau. 1995. Automatic condensation of electronic publications by sentence selection. Information Processing & Management 31, 5 (1995), 675–685.
Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30, 1 (1998), 107–117.
Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 335–336.
Trevor Cohn and Mirella Lapata. 2013. An abstractive approach to sentence compression. ACM Transactions on Intelligent Systems and Technology 4, 3 (2013), 41.
Jean-Yves Delort and Enrique Alfonseca. 2012. DualSum: A topic-model based approach for update summarization. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. ACL, 214–223.
Harold P. Edmundson. 1969. New methods in automatic extracting. Journal of the ACM 16, 2 (1969), 264–285.
Brigitte Endres-Niggemeyer, Elisabeth Maier, and Alexander Sigel. 1995. How to implement a naturalistic model of abstracting: Four core working steps of an expert abstractor. Information Processing & Management 31, 5 (1995), 631–674.
Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22, 1 (2004), 457–479.
Yihong Gong and Xin Liu. 2001. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 19–25.
Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. ACL, 362–370.
Sanda Harabagiu and Finley Lacatusu. 2005. Topic themes for multi-document summarization. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 202–209.
Zhanying He, Chun Chen, Jiajun Bu, Can Wang, Lijun Zhang, Deng Cai, and Xiaofei He. 2012. Document summarization based on data reconstruction. In Proceedings of the 26th AAAI Conference on Artificial Intelligence. AAAI, 620–626.
Marti A. Hearst and Christian Plaunt. 1993. Subtopic structuring for full-length document access. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 59–68.
A. Kogilavani and P. Balasubramanie. 2012. Update summary generation based on semantically adapted vector space model. International Journal of Computer Applications 42, 16 (2012).
Julian Kupiec, Jan Pedersen, and Francine Chen. 1995. A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 68–73.
Dawn J. Lawrie and W. Bruce Croft. 2003. Generating hierarchical summaries for web searches. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 457–458.
Liangda Li, Ke Zhou, Gui-Rong Xue, Hongyuan Zha, and Yong Yu. 2009. Enhancing diversity, coverage and balance for summarization through structure learning. In Proceedings of the 18th International Conference on World Wide Web. ACM, 71–80.
Wei Li and Andrew McCallum. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 577–584.
Xuan Li, Liang Du, and Yi-Dong Shen. 2011. Graph-based marginal ranking for update summarization. In Proceedings of the SIAM International Conference on Data Mining (SDM). SIAM, 486–497.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. ACL, 74–81.
Inderjeet Mani and Eric Bloedorn. 1998. Machine learning of generic and user-focused summarization. In Proceedings of the 15th National Conference on Artificial Intelligence. AAAI, 821–826.
Rada Mihalcea. 2004. Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions. ACL, 20.
David Mimno, Wei Li, and Andrew McCallum. 2007. Mixtures of hierarchical topics with pachinko allocation. In Proceedings of the 24th International Conference on Machine Learning. ACM, 633–640.
Zhao-Yan Ming, Tat-Seng Chua, and Gao Cong. 2010a. Exploring domain-specific term weight in archived question search. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, 1605–1608.
Zhao-Yan Ming, Kai Wang, and Tat-Seng Chua. 2010b. Prototype hierarchy based clustering for the categorization and navigation of web collections. In Proceedings of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 2–9.
Zhao-Yan Ming, Jintao Ye, and Tat-Seng Chua. 2014. A dynamic reconstruction approach to topic summarization of user-generated-content. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. ACM, 311–320.
Taesun Moon and Katrin Erk. 2013. An inference-based model of word meaning in context as a paraphrase distribution. ACM Transactions on Intelligent Systems and Technology 4, 3 (2013), 42.
Jahna Otterbacher, Dragomir Radev, and Omer Kareem. 2006. News to go: Hierarchical text summarization for mobile devices. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 589–596.
Dragomir R. Radev. 2000. A common theory of information fusion from multiple text sources step one: Cross-document structure. In Proceedings of the 1st SIGdial Workshop on Discourse and Dialogue. ACL, 74–83.
Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. 2000. Centroid-based summarization of multiple documents: Sentence extraction, utility-based evaluation, and user studies. In Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization. ACL, 21–30.
Josef Steinberger and Karel Ježek. 2009. Update summarization based on novel topic distribution. In Proceedings of the 9th ACM Symposium on Document Engineering. ACM, 205–213.
Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2005. Sharing clusters among related groups: Hierarchical Dirichlet processes. In Advances in Neural Information Processing Systems 18. NIPS, 271–278.
Xiaojun Wan and Jianwu Yang. 2006. Improved affinity graph based multi-document summarization. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers. ACL, 181–184.
Chi Wang, Xiao Yu, Yanen Li, Chengxiang Zhai, and Jiawei Han. 2013. Content coverage maximization on word networks for hierarchical topic summarization. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 249–258.
Dingding Wang and Tao Li. 2010. Document update summarization using incremental hierarchical clustering. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, 279–288.
Dingding Wang, Tao Li, Shenghuo Zhu, and Chris Ding. 2008. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 307–314.
Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong Gong. 2009. Multi-document summarization using sentence-based topic models. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. ACL, 297–300.
Mark Wasson. 1998. Using leading text for news summaries: Evaluation results and implications for commercial summarization applications. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics. ACL, 1364–1368.
Li Wenjie, Wei Furu, Lu Qin, and He Yanxiang. 2008. PNR2: Ranking sentences with positive and negative reinforcement for query-oriented update summarization. In Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1. ACL, 489–496.
Christopher C. Yang and Fu Lee Wang. 2003. Fractal summarization for mobile devices to access large documents on the web. In Proceedings of the 12th International Conference on World Wide Web. ACM, 215–224.
Yinfei Yang and Ani Nenkova. 2014. Detecting information-dense texts in multiple news domains. In Proceedings of the 28th AAAI Conference on Artificial Intelligence. AAAI, 1650–1656.
Dongsong Zhang. 2007. Web content adaptation for mobile handheld devices. Communications of the ACM 50, 2 (2007), 75–79.

Received December 2014; revised May 2015; accepted July 2015