Filtering Speaker-Specific Words from Electronic Discussions
Project Overview
Context-sensitive automated help-desk systems
Context: memory of previous interactions
client-based
server-based: community of users
Strategy:
cluster a corpus of interactions
use centroids as memories
guide future interactions using memories
Yuval Marom
MULT Seminar
14 May, 2004
Project Overview (2)
Clustering features:
word-based features
bag-of-words, bigrams
extract topics
dialogue/conversational features:
initial request
number of turns
dialogue acts
outcome
extract “user/dialogue types”
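As a rough illustration of the word-based features listed above, the following sketch builds bag-of-words and bigram counts for a single posting. The tokenizer and the function names (`tokenize`, `bag_of_words`, `bigrams`) are assumptions made for this example; the slides do not say how the text was tokenized.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters (an assumed, deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Unigram counts; word order is discarded."""
    return Counter(tokenize(text))

def bigrams(text):
    """Counts of adjacent word pairs, which retain a little local word order."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))

posting = "Do I need to download it? You need to download it."
print(bag_of_words(posting))
print(bigrams(posting))
```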
Project Overview (3)
1st stage: clustering
[Diagram: corpus of discussions → clustering → clusters]
Project Overview (3)
1st stage: clustering
[Diagram: corpus of discussions → clustering → clusters]
2nd stage: interaction
[Diagram labels: user, query, match, clusters, retrieve, centroid / set of documents, response, append]
representative features:
answer
summary
dialogue type
Document Clustering
Useful for focusing and speeding up retrieval
Bag-of-words features with tf·idf scoring
ignore stop/function words
ignore infrequent and highly-frequent words
Data presented to a k-means clustering algorithm
k centroids
document-to-cluster assignments
Evaluation difficult
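The pipeline on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the idf formula `log(N / df)`, the document-frequency cut-offs (`min_df`, `max_df_ratio`), and the farthest-first initialisation of the centroids are choices made for this example and are not taken from the slides.

```python
import math
from collections import Counter

def tfidf_vectors(docs, stop_words=frozenset(), min_df=1, max_df_ratio=1.0):
    """Return (vocabulary, vectors) of tf·idf weights for tokenized docs.

    Words are ignored if they are stop words, occur in fewer than min_df
    documents, or occur in more than max_df_ratio of all documents.
    """
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    vocab = sorted(w for w, c in df.items()
                   if w not in stop_words and min_df <= c <= max_df_ratio * n)
    idf = {w: math.log(n / df[w]) for w in vocab}  # assumed idf variant
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(vectors, k):
    """Deterministic initialisation (an assumption): start from the first
    vector, then repeatedly add the vector farthest from the chosen ones."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors,
                       key=lambda v: min(squared_distance(v, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids

def kmeans(vectors, k, iterations=20):
    """Plain k-means: returns (k centroids, document-to-cluster assignments)."""
    centroids = farthest_first(vectors, k)
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        assignments = [min(range(k), key=lambda c: squared_distance(v, centroids[c]))
                       for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

docs = [
    "latex compile error miktex".split(),
    "miktex install latex package".split(),
    "photoshop layer rgb colour".split(),
    "rgb colour photoshop filter".split(),
]
vocab, vectors = tfidf_vectors(docs)
centroids, assignments = kmeans(vectors, k=2)
print(assignments)
```

On this toy input the two TeX postings and the two Photoshop postings end up in separate clusters. With real newsgroup data the result depends heavily on `k` and on the initialisation.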
Test Domain
Newsgroups
good approximation to help-desk discussions
readily available
rich variety of topics
Evaluation:
1. coarse-level clustering
2. simple retrieval
An Example
You need to download it. It doesn't come fixed with any version.
Shoff1945 wrote:
> I just purchased an academic version of PS7. Do I need to download 7.0.1 or has
> my copy been fixed. How would I know? Is there a PS 7.0.1 that I should order
> instead? I haven't opened this copy yet.
> Many thanks.
-Comic book sketches and artwork:
http://www.sover.net/~hannigan/edjh.html
An Example
The fastest software company is Borland. When I called them to buy
JBuilder 5, I was told the current version was 6. But what I got from
mail is 7. A month later, I learned 9 was scheduled to release.
Tony G. Smith
Vizros – Realistic 3D page curl plug-ins and more
Demo at http://www.vizros.com/gallery.html
A Filtering Mechanism
1st pass:
Maintain a speaker-by-word frequency matrix
Count number of postings for each speaker
2nd pass:
Convert each word in each posting to a proportion
If the proportion is statistically higher than a threshold,
then filter the word (“signature words”)
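The two passes above can be sketched as follows. The slides do not specify the statistical test, so this example assumes one: a word is flagged when the lower end of a one-sided Wilson score interval for the proportion of a speaker's postings that contain it exceeds the threshold. The names `find_signature_words`, `filter_postings`, `threshold`, and `z` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def wilson_lower_bound(successes, n, z=1.645):
    """Lower end of the Wilson score interval (assumed test, z=1.645 ~ 95% one-sided)."""
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)

def find_signature_words(postings, threshold=0.5, z=1.645):
    """postings: list of (speaker, tokens). Returns {speaker: set of words}."""
    # 1st pass: speaker-by-word counts and number of postings per speaker.
    word_counts = defaultdict(Counter)
    posting_counts = Counter()
    for speaker, tokens in postings:
        posting_counts[speaker] += 1
        word_counts[speaker].update(set(tokens))  # count a word once per posting

    # 2nd pass: turn counts into proportions and test them against the threshold.
    signature = defaultdict(set)
    for speaker, counts in word_counts.items():
        n = posting_counts[speaker]
        for word, count in counts.items():
            if wilson_lower_bound(count, n, z) > threshold:
                signature[speaker].add(word)
    return signature

def filter_postings(postings, signature):
    """Remove each speaker's signature words from that speaker's postings."""
    return [(speaker, [t for t in tokens if t not in signature.get(speaker, ())])
            for speaker, tokens in postings]

postings = [
    ("tony", "the fastest software company is borland vizros demo gallery".split()),
    ("tony", "try the plugin page vizros demo gallery".split()),
    ("tony", "it works here vizros demo gallery".split()),
    ("tony", "no idea sorry vizros demo gallery".split()),
    ("ann", "you need to download it".split()),
]
signature = find_signature_words(postings)
filtered = filter_postings(postings, signature)
print(sorted(signature["tony"]))
print(filtered[0])
```

Using an interval bound rather than the raw proportion means a speaker with a single posting is not stripped of every word, which is one way to limit undesirable filtering.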
Coarse-Level Clustering
[Diagram: documents (doc 1, doc 2, ...) from newsgroups 1..n are combined into a single dataset (doc 1-1, doc 1-2, ..., doc 2-1, doc 2-2, ...)]
Coarse-Level Clustering
[Diagram: newsgroups 1..n → dataset → clustering → k clusters]
Issues to resolve:
k≠n
cluster-newsgroup correspondence
Coarse-Level Clustering
[Diagram: newsgroups 1..n → dataset → clustering → k clusters, compared with the n newsgroups]
precision: $P_{ij} = \frac{|\text{cluster } i \cap \text{newsgroup } j|}{|\text{cluster } i|}$
recall: $R_{ij} = \frac{|\text{cluster } i \cap \text{newsgroup } j|}{|\text{newsgroup } j|}$
F-score: $\left\{ \tfrac{1}{2} \left( \tfrac{1}{P_{ij}} + \tfrac{1}{R_{ij}} \right) \right\}^{-1}$
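A small worked example of the three measures on this slide, treating a cluster and a newsgroup as sets of document identifiers. The function name and the toy document ids are made up for illustration; the F-score is the harmonic mean of precision and recall, as in the slide's formula.

```python
def precision_recall_f(cluster, newsgroup):
    """cluster, newsgroup: sets of document ids. Returns (P, R, F)."""
    overlap = len(cluster & newsgroup)
    if overlap == 0:
        # Convention assumed here: all three scores are 0 when nothing overlaps.
        return 0.0, 0.0, 0.0
    p = overlap / len(cluster)      # share of the cluster that belongs to the newsgroup
    r = overlap / len(newsgroup)    # share of the newsgroup captured by the cluster
    f = 1.0 / (0.5 * (1.0 / p + 1.0 / r))
    return p, r, f

cluster_i = {"d1", "d2", "d3", "d4"}
newsgroup_j = {"d1", "d2", "d3", "d5", "d6", "d7"}
p, r, f = precision_recall_f(cluster_i, newsgroup_j)
print(p, r, f)
```

Here three of the four clustered documents come from the newsgroup (P = 3/4) and the cluster covers three of its six documents (R = 1/2), giving an F-score of 0.6 up to floating-point rounding.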
Coarse-Level Clustering
[Diagram: newsgroups 1..n → dataset → clustering → k clusters → pool → n pooled clusters → evaluate against the n newsgroups]
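The slides do not say how the k clusters are pooled into n. One plausible rule, assumed here purely for illustration, is to merge each cluster into the newsgroup that contributes most of its documents. The function name `pool_clusters` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def pool_clusters(assignments, labels):
    """assignments[d]: cluster id of document d; labels[d]: its newsgroup.

    Returns {newsgroup: set of document indices}, merging every cluster into
    the newsgroup that is most frequent among its members (assumed rule).
    """
    members = defaultdict(list)
    for doc, cluster in enumerate(assignments):
        members[cluster].append(doc)
    pooled = defaultdict(set)
    for cluster, docs in members.items():
        majority = Counter(labels[d] for d in docs).most_common(1)[0][0]
        pooled[majority].update(docs)
    return dict(pooled)

# Eight documents from n = 2 newsgroups, clustered into k = 3 clusters.
labels = ["hp", "hp", "hp", "hp", "tex", "tex", "tex", "tex"]
assignments = [0, 0, 0, 1, 1, 1, 2, 2]
pooled = pool_clusters(assignments, labels)
print(pooled)
```

Under this rule a newsgroup that is never the majority of any cluster gets no pooled cluster at all, so fewer than n pooled clusters can result; a real evaluation would need to decide how to score that case.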
Results (1)
Dataset 1:
lp.hp
comp.text.tex
comp.graphics.apps.photoshop
[Plots: filter off, filter on]
Results (2)
Dataset 2:
talk.politics.mideast
talk.politics.guns
talk.religion.misc
(from 20-newsgroups corpus)
[Plots: filter off, filter on]
Results (3)
Dataset 3:
talk.politics.mideast
rec.sport.hockey
sci.space
(from 20-newsgroups corpus)
[Plots: filter off, filter on]
Summary of Results
Quantitative benefit depends on:
topical similarity
clustering granularity
existence of dominant “signature” words
[Plots for: hp/tex/photoshop, mideast/guns/religion, mideast/hockey/space]
Simple Retrieval
[Diagram: query → match against n pooled clusters → retrieve documents containing query words]
Evaluation: test ability to find all documents
$\frac{\text{relevant retrieved documents}}{\text{all relevant documents}}$
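The retrieval step and its evaluation measure (usually called recall) can be sketched as below. Several details are assumptions made for this example: the query is matched to the cluster whose documents mention the query words most often, a document is retrieved only if it contains all query words, and the relevant set is every document in the dataset containing all query words. The function names and toy postings are hypothetical.

```python
from collections import Counter

def contains_all(tokens, query):
    return set(query) <= set(tokens)

def best_cluster(clusters, query):
    """Pick the cluster whose documents mention the query words most often."""
    def score(name):
        counts = Counter(w for tokens in clusters[name].values() for w in tokens)
        return sum(counts[w] for w in query)
    return max(clusters, key=score)

def retrieve(clusters, query):
    """Return ids of documents in the matched cluster that contain all query words."""
    name = best_cluster(clusters, query)
    return [doc_id for doc_id, tokens in clusters[name].items()
            if contains_all(tokens, query)]

def recall(retrieved, relevant):
    """relevant retrieved documents / all relevant documents."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)

# clusters: {cluster name: {document id: tokens}}
clusters = {
    "tex": {1: "compile miktex error".split(),
            2: "miktex compile latex".split(),
            3: "latex letter class".split()},
    "photoshop": {4: "rgb colour layer".split(),
                  5: "compile miktex plugin".split()},  # a mis-clustered posting
}
query = ["compile", "miktex"]
retrieved = retrieve(clusters, query)
relevant = [doc_id for docs in clusters.values()
            for doc_id, tokens in docs.items() if contains_all(tokens, query)]
print(retrieved, recall(retrieved, relevant))
```

Document 5 is relevant but sits in the wrong cluster, so only two of the three relevant documents are found. This is the kind of loss that cleaner clustering is meant to reduce.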
Retrieval Results
[Plots: one panel per query]
query 1: “letter backend” (25 relevant documents)
query 2: “compile miktex” (21 relevant documents)
query 3: “rgb colour” (22 relevant documents)
Conclusions
Generally filtering has quantitative benefit
depends on topical similarity and granularity
Qualitative benefit – always!
Approach is general – outperforms more naïve
ones (e.g. email/URL filtering)
Risk of undesirable filtering
high threshold
comparing with other speakers
Future Work
Different clustering algorithms
more realistic
automatically suggest number of clusters
(granularity), based on resources
perhaps hierarchical
MML-based: e.g. “SNOB”
Analyse real help-desk data
need different evaluation strategies
Future Work (cont'd)
Extract dialogue features
Develop summarization and generation components
Integrate the full system