Filtering Speaker-Specific Words from Electronic Discussions
Yuval Marom, MULT Seminar, 14 May, 2004

Project Overview
- Context-sensitive automated help-desk systems
- Context: memory of previous interactions
  - client-based
  - server-based: community of users
- Strategy:
  - cluster a corpus of interactions
  - use centroids as memories
  - guide future interactions using memories

Project Overview (2)
- Clustering features:
  - word-based features (bag-of-words, bigrams): extract topics
  - dialogue/conversational features (initial request, number of turns, dialogue acts, outcome): extract "user/dialogue types"

Project Overview (3)
- 1st stage: clustering
  - corpus of discussions, clustered into clusters
- 2nd stage: interaction
  - [Diagram labels: user, query, match, clusters, retrieve, centroid / set of documents, append, representative features (answer, summary, dialogue type), response]

Document Clustering
- Useful for focusing and speeding up retrieval
- Bag-of-words features with tf·idf scoring
  - ignore stop/function words
  - ignore infrequent and highly frequent words
- Data presented to a k-means clustering algorithm
  - k centroids
  - document-to-cluster assignments
- Evaluation difficult

Test Domain
- Newsgroups
  - good approximation to help-desk discussions
  - readily available
  - rich variety of topics
- Evaluation:
  1. coarse-level clustering
  2. simple retrieval

An Example

  You need to download it. It doesn't come fixed with any version.

  Shoff1945 wrote:
  > I just purchased an academic version of PS7. Do I need to download 7.0.1 or has
  > my copy been fixed. How would I know? Is there a PS 7.0.1 that I should order
  > instead? I haven't opened this copy yet.
  > Many thanks.

  -Comic book sketches and artwork: http://www.sover.net/~hannigan/edjh.html

An Example

  The fastest software company is Borland. When I called them to buy JBuilder 5, I was told the current version was 6. But what I got from mail is 7. A month later, I learned 9 was scheduled to release.

  Tony G. Smith
  Vizros – Realistic 3D page curl plug-ins and more
  Demo at http://www.vizros.com/gallery.html

A Filtering Mechanism
- 1st pass:
  - Maintain a speaker-by-word frequency matrix
  - Count number of postings for each speaker
- 2nd pass:
  - Convert each word in each posting to a proportion
  - If proportion statistically higher than threshold, then filter the word
- Filtered words: "signature words"

Coarse-Level Clustering
- [Diagram: documents (doc 1, doc 2) from newsgroup 1, newsgroup 2, newsgroup n are combined into one dataset (doc 1-1, doc 1-2, doc 1-n, doc 2-1, doc 2-2)]
- Clustering the dataset produces k clusters
- Issues to resolve:
  - k ≠ n
  - cluster-newsgroup correspondence
- For cluster i and newsgroup j:
  - precision: P_{ij} = \frac{|\text{cluster } i \cap \text{newsgroup } j|}{|\text{cluster } i|}
  - recall: R_{ij} = \frac{|\text{cluster } i \cap \text{newsgroup } j|}{|\text{newsgroup } j|}
  - F-score: \left\{ \tfrac{1}{2} \left( P_{ij}^{-1} + R_{ij}^{-1} \right) \right\}^{-1}
- Pool the k clusters into n pooled clusters, then evaluate the pooled clusters against the n newsgroups

Results (1)
- Dataset 1: lp.hp, comp.text.tex, comp.graphics.apps.photoshop
- [Figure: clustering results, filter off vs. filter on]

Results (2)
- Dataset 2 (from 20-newsgroups corpus): talk.politics.mideast, talk.politics.guns, talk.religion.misc
- [Figure: clustering results, filter off vs. filter on]

Results (3)
- Dataset 3 (from 20-newsgroups corpus): talk.politics.mideast, rec.sport.hockey, sci.space
- [Figure: clustering results, filter off vs. filter on]

Summary of Results
- Quantitative benefit depends on:
  - topical similarity
  - clustering granularity
  - existence of dominant "signature" words
- [Figure: comparison across hp/tex/photoshop, mideast/guns/religion, mideast/hockey/space]

Simple Retrieval
- n pooled clusters
- query, match, retrieve documents containing query words
- Evaluation: test ability to find all relevant documents
  - \frac{\text{retrieved documents}}{\text{all relevant documents}}

Retrieval Results
- query 1: "letter backend" (25 relevant documents)
- query 2: "compile miktex" (21 relevant documents)
- query 3: "rgb colour" (22 relevant documents)
- [Figure: retrieval results for each query]

Conclusions
- Generally filtering has quantitative benefit
  - depends on topical similarity and granularity
- Qualitative benefit: always!
- Approach is general: outperforms more naïve ones (e.g. email/URL filtering)
- Risk of undesirable filtering
  - high threshold
  - comparing with other speakers

Future Work
- Different clustering algorithms
  - more realistic
  - automatically suggest number of clusters (granularity), based on resources
  - perhaps hierarchical
  - MML-based: e.g. "SNOB"
- Analyse real help-desk data
  - need different evaluation strategies

Future Work (cont'd)
- Extract dialogue features
- Develop summarization and generation components
- Integrate the full system
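
The two-pass procedure on the "A Filtering Mechanism" slide can be illustrated with a minimal Python sketch. The slides do not say how the proportion is computed or which statistical test is used, so everything specific here is an assumption: the proportion is taken to be the fraction of a speaker's postings that contain the word, the test is a one-sided normal-approximation z-test against the threshold, and the names `find_signature_words`, `filter_postings`, `threshold`, and `z_crit` are hypothetical. The sample postings are invented for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def find_signature_words(postings, threshold=0.5, z_crit=1.645):
    """Return {speaker: set of signature words}.

    postings: list of (speaker, text) pairs.
    threshold: assumed proportion (strictly between 0 and 1) that a word
        must significantly exceed to count as a signature word.
    z_crit: assumed one-sided critical value (1.645 is roughly the 5% level).
    """
    # Pass 1: speaker-by-word frequency matrix and postings per speaker.
    postings_per_speaker = Counter()
    postings_with_word = defaultdict(Counter)
    for speaker, text in postings:
        postings_per_speaker[speaker] += 1
        # Count each word once per posting (assumption).
        for word in set(text.lower().split()):
            postings_with_word[speaker][word] += 1

    # Pass 2: convert counts to proportions and test them against the threshold.
    signature = defaultdict(set)
    for speaker, counts in postings_with_word.items():
        n = postings_per_speaker[speaker]
        std_err = sqrt(threshold * (1 - threshold) / n)
        for word, count in counts.items():
            proportion = count / n
            if (proportion - threshold) / std_err > z_crit:
                signature[speaker].add(word)
    return dict(signature)


def filter_postings(postings, signature):
    """Remove each speaker's signature words from that speaker's postings."""
    filtered = []
    for speaker, text in postings:
        banned = signature.get(speaker, set())
        kept = [w for w in text.split() if w.lower() not in banned]
        filtered.append((speaker, " ".join(kept)))
    return filtered


# Invented sample data: "tony" ends every posting with the same signature.
postings = [
    ("tony", "try the new plugin vizros demo"),
    ("tony", "borland ships fast vizros demo"),
    ("tony", "use layers vizros demo"),
    ("tony", "check the histogram vizros demo"),
    ("ann", "how do I compile miktex"),
]
signature = find_signature_words(postings)
filtered = filter_postings(postings, signature)
```

In this sketch a speaker with a single posting is never filtered, because one observation cannot push the z-statistic past the critical value. That is one way to reduce the risk of undesirable filtering noted in the Conclusions, although the original system may handle it differently.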