Informa`on Retrieval (IR)

Transcription

Informa`on Retrieval (IR)
Applica'onsintextmining
•  Informa'onRetrieval(IR)
•  Webaccessfiltering
•  Documentsummariza'on
•  Informa'onExtrac'on(IE)
•  Ques'on-Answering(QA)
•  Textclassifica'on
•  Spamdetec'on
•  Technology,Economicswatch
Etc.
Applica'onsintextmining
MasterDataMining
JulienVelcin
2
Informa'onRetrieval(IR)
Informa'onRetrieval(IR)
•  Mainlybasedondocumentindexing
=findtheimportantmeaningsandcreatean
internalrepresenta'onofthewebsites
•  Factorstoconsider:
–  Accuracytorepresentmeanings(seman'cs)
–  Exhaus'veness(coverallthecontents)
–  Facilityforcomputertomanipulate
•  Whatisthebestrepresenta'onofcontents?
– 
– 
– 
– 
3
Char.string(chartrigrams):notpreciseenough
Word:goodcoverage,notprecise
Phrase:poorcoverage,moreprecise
Concept:poorcoverage,precise
4
Informa'onRetrieval(IR)
Informa'onRetrieval(IR)
•  Matchingscoremodel
•  PageRankinGoogle:
–  DocumentD=asetofweightedkeywords
–  QueryQ=asetofnon-weightedkeywords
–  R(D,Q)=Σiw(ti,D)
I1
I2
wheretiisinQ.
A
B
PR( A) = (1 − d ) + d ∑
i
PR( I i )
C(Ii )
•  Assignanumericvaluetoeachpage
•  Themoreapageisreferredtobyimportantpages,themore
thispageisimportant
•  d:dampingfactor(0.85)
•  Manyothercriteria:e.g.proximityofquerywords
•  Booleanmodel
•  VSMmodel
•  Probabilis'cmodels
–  “…informa'onretrieval…”be_erthan“…informa'on…retrieval…”
5
ResultClustering
6
ResultClustering
•  Useanysearchenginetogetsnippets:
•  Textclusteringtoorganizesnippetsintoatree
•  A_achmeaningfullabelstothecategories
–  Frequentpa_erns
–  Nameden''es
•  Someexis'ngsystems:Clusty,Carrot2,Kartoo…
7
8
Webaccessfiltering
Webaccessfiltering
•  Websiteclassifica'onforparentalcontrol
–  2EuropeanResarchProjects:NetProtectI&II
–  EADS,OpteNet,MGN,Hypertech
•  Developpementofanewtextclassifier
–  Basedoncombina'onsofSupportVectorMachines
(SVM)
–  Designedtodealwith4classes(pornography,
violence,bombmaking,drugs),and8languages
(english,french,italian,portuguese,dutch,german,
spanish,greek)
[Grilheresetal.,2004]
Bombmaking
⌃Bcounter-examples
Drugs
⌃Dcounter-examples
Pornography
⌃Pcounter-examples
Violence
⌃Vcounter-examples
Childorientedwebsites(E)
Genericwebsites
9
Webaccessfiltering
10
Webaccessfiltering
•  Resultsonthededicatedcategories:
TEX:Ar'ficialneuronalnetworkofNetProtectI(text)
SVMi:(Biclass)SupportVectorMachine(text)
(takethenstrongestwordgiventheTF-IDFscore)
ADR:basedonregularexpressiononthename/address
IMG:machinelearningsystemusingonlypicture’sfeatures
•  Resultsonthe4categories:
11
12
Webaccessfiltering
Webaccessfiltering
combina'on
Blocking
Rateofharmfulpages
correctlyblocked
OR
AND
MajorityVo'ng(MV)
LogicalRules(RUL)
kNN
DSkNN
NB
MLP
SVM
Overblocking
Rateofharmless
pagesblocked
GCR
GoodClassifica'on
Rate
I(GCR,5%)
Confidenceinterva
at5%
13
Documentsummariza'on
14
Documentsummariza'on
•  Task:thetaskistoproduceshorter,summaryversionof
anoriginaldocument
•  Twomainapproachestotheproblem:
–  SelecEonbased–summaryisselec'onofsentencesfroman
originaldocument
–  Knowledgerich–performingseman'canalysis,represen'ng
themeaningandgenera'ngthetextsa'sfyinglengthrestric'on
15
•  Threemainphases:
–  Analyzingthesourcetext
–  Determiningitsimportantpoints(units)
–  Synthesizinganappropriateoutput
•  Mostmethodsadoptlinearweigh'ngmodel–eachtextunit
(sentence)isassessedbythefollowingformula:
–  Weight(U)=LocaEonInText(U)+CuePhrase(U)+StaEsEcs(U)+
AddiEonalPresence(U)
–  …lotofheuris'csandtuningofparameters(alsowithML)
•  …outputconsistsfromtopmosttextunits(sentences)
16
Exampleofsummariza'on
CracksAppearinU.N.TradeEmbargoAgainstIraq.
Human written
summary
CracksappearedTuesdayintheU.N.tradeembargoagainstIraqasSaddamHusseinsoughttocircumventtheeconomicnoosearoundhiscountry.Japan,meanwhile,announcedit
wouldincreaseitsaidtocountrieshardesthitbyenforcingthesanc'ons.Hopingtodefusecri'cismthatitisnotdoingitssharetoopposeBaghdad,Japansaidupto$2billionin
aidmaybesenttona'onsmostaffectedbytheU.N.embargoonIraq.PresidentBushonTuesdaynightpromisedajointsessionofCongressandana'onwideradioandtelevision
audiencethat``SaddamHusseinwillfail''tomakehisconquestofKuwaitpermanent.``Americamuststanduptoaggression,andwewill,''saidBush,whoaddedthattheU.S.
militarymayremainintheSaudiArabiandesertindefinitely.``IcannotpredictjusthowlongitwilltaketoconvinceIraqtowithdrawfromKuwait,''Bushsaid.Morethan150,000
U.S.troopshavebeensenttothePersianGulfregiontodeterapossibleIraqiinvasionofSaudiArabia.Bush'saidessaidthepresidentwouldfollowhisaddresstoCongresswitha
televisedmessagefortheIraqipeople,declaringtheworldisunitedagainsttheirgovernment'sinvasionofKuwait.SaddamhadofferedBush'meonIraqiTV.ThePhilippinesand
Namibia,thefirstofthedevelopingna'onstorespondtoanofferMondaybySaddamoffreeoil_inexchangeforsendingtheirowntankerstogetit_saidnototheIraqileader.
Saddam'sofferwasseenasanone-too-subtlea_empttobypasstheU.N.embargo,ineffectsincefourdaysawerIraq'sAug.2invasionofKuwait,bygexngpoorcountriestodock
theirtankersinIraq.ButaccordingtoaStateDepartmentsurvey,CubaandRomaniahavestruckoildealswithIraqandcompanieselsewherearetryingtocon'nuetradewith
Baghdad,allindefianceofU.N.sanc'ons.Romaniadeniestheallega'on.Thereport,madeavailabletoTheAssociatedPress,saidsomeEasternEuropeancountriesalsoaretrying
tomaintaintheirmilitarysalestoIraq.Awell-informedsourceinTehrantoldTheAssociatedPressthatIranhasagreedtoanIraqirequesttoexchangefoodandmedicineforupto
200,000barrelsofrefinedoiladayandcashpayments.TherewasnoofficialcommentfromTehranorBaghdadonthereportedfood-for-oildeal.Butthesource,whorequested
anonymity,saidthedealwasstruckduringIraqiForeignMinisterTariqAziz'svisitSundaytoTehran,thefirstbyaseniorIraqiofficialsincethe1980-88gulfwar.Awerthevisit,the
twocountriesannouncedtheywouldresumediploma'crela'ons.Well-informedoilindustrysourcesintheregion,contactedbyTheAP,saidthatalthoughIranisamajoroil
exporteritself,itcurrentlyhastoimportabout150,000barrelsofrefinedoiladayfordomes'cusebecauseofdamagestorefineriesinthegulfwar.Alongsimilarlines,ABCNews
reportedthatfollowingAziz'svisit,IraqisapparentlypreparedtogiveIranalltheoilitwantstomakeupforthedamageIraqinflictedonIranduringtheirconflict.SecretaryofState
JamesA.BakerIII,meanwhile,metinMoscowwithSovietForeignMinisterEduardShevardnadze,twodaysawertheU.S.-SovietsummitthatproducedajointdemandthatIraq
withdrawfromKuwait.Duringthesummit,BushencouragedMikhailGorbachevtowithdraw190SovietmilitaryspecialistsfromIraq,wheretheyremaintofulfillcontracts.
ShevardnadzetoldtheSovietparliamentTuesdaythespecialistshadnotrenegedonthosecontractsforfearitwouldjeopardizethe5,800Sovietci'zensinIraq.Inhisspeech,Bush
saidhisheartwentouttothefamiliesofthehundredsofAmericansheldhostagebyIraq,buthedeclared,``Ourpolicycannotchange,anditwillnotchange.Americaandthe
worldwillnotbeblackmailed.''Thepresidentadded:``Vitalissuesofprincipleareatstake.SaddamHusseinisliterallytryingtowipeacountryoffthefaceoftheEarth.''Inother
developments:_AU.S.diplomatinBaghdadsaidTuesdayupto800AmericansandBritonswillflyoutofIraqi-occupiedKuwaitthisweek,mostofthemwomenandchildrenleaving
theirhusbandsbehind.Saddamhassaidheiskeepingforeignmenashumanshieldsagainsta_ack.OnMonday,aplaneloadof164WesternersarrivedinBal'morefromIraq.
EvacueesspokeoffoodshortagesinKuwait,nighxmegunfireandIraqiroundupsofyoungpeoplesuspectedofinvolvementintheresistance.``Thereisnolawandorder,''said
Thuraya,19,whowouldnotgiveherlastname.``Asoldiercanrapeafather'sdaughterinfrontofhimandhecan'tdoanythingaboutit.''_TheStateDepartmentsaidIraqhadtold
U.S.officialsthatAmericanmalesresidinginIraqandKuwaitwhowereborninArabcountrieswillbeallowedtoleave.IraqgenerallyhasnotletAmericanmalesleave.Itwasnot
knownhowmanymentheIraqimovecouldaffect._APentagonspokesmansaid``someincreaseinmilitaryac'vity''hadbeendetectedinsideIraqnearitsborderswithTurkeyand
Syria.Hesaidtherewasli_leindica'onhos'li'esareimminent.DefenseSecretaryDickCheneysaidthecostoftheU.S.militarybuildupintheMiddleEastwasrisingabovethe$1
billion-a-monthes'mategenerallyusedbygovernmentofficials.Hesaidthetotalcost_ifnoshoo'ngwarbreaksout_couldtotal$15billioninthenextfiscalyearbeginningOct.
1.Cheneypromiseddisgruntledlawmakers``asignificantincrease''inhelpfromArabna'onsandotherU.S.alliesforOpera'onDesertShield.Japan,whichhasbeenaccusedof
respondingtooslowlytothecrisisinthegulf,saidTuesdayitmaygive$2billiontoEgypt,JordanandTurkey,hithardestbytheU.N.prohibi'onontradewithIraq.``Thepressure
fromabroadisgexngsostrong,''saidHiroyasuHorio,anofficialwiththeMinistryofInterna'onalTradeandIndustry.Localnewsreportssaidtheaidwouldbeextendedthrough
theWorldBankandInterna'onalMonetaryFund,and$600millionwouldbesentasearlyasmid-September.OnFriday,TreasurySecretaryNicholasBradyvisitedTokyoonaworld
tourseeking$10.5billiontohelpEgypt,JordanandTurkey.Japanhasalreadypromiseda$1billionaidpackageformul'na'onalpeacekeepingforcesinSaudiArabia,including
food,water,vehiclesandprefabricatedhousingfornon-militaryuses.Butcri'csintheUnitedStateshavesaidJapanshoulddomorebecauseitseconomydependsheavilyonoil
fromtheMiddleEast.Japanimports99percentofitsoil.Japan'scons'tu'onbanstheuseofforceinse_linginterna'onaldisputesandJapaneselawrestrictsthemilitaryto
Japaneseterritory,exceptforceremonialoccasions.OnMonday,Saddamoffereddevelopingna'onsfreeoiliftheywouldsendtheirtankerstopickitup.Thefirsttwocountriesto
respondTuesday_thePhilippinesandNamibia_saidno.Manilasaidithadalreadyfulfilleditsoilrequirements,andNamibiasaiditwouldnot``sellitssovereignty''forIraqioil.
VenezuelanPresidentCarlosAndresPerezdismissedSaddam'sofferoffreeoilasa``propagandaploy.''Venezuela,anOPECmember,hasledadriveamongoil-producingna'ons
toboostproduc'ontomakeupfortheshor{allcausedbythelossofIraqiandKuwai'oilfromtheworldmarket.Theiroilmakesup20percentoftheworld'soilreserves.Only
SaudiArabiahashigherreserves.ButaccordingtotheStateDepartment,Cuba,whichfacesanoildeficitbecauseofreducedSovietdeliveries,hasreceivedashipmentofIraqi
petroleumsinceU.N.sanc'onswereimposedfiveweeksago.AndRomania,itsaid,expectstoreceiveoilindirectlyfromIraq.Romania'sambassadortotheUnitedStates,Virgil
Constan'nescu,deniedthatclaimTuesday,callingit``absolutelyfalseandwithoutfounda'on.''.
CracksappearedintheU.N.tradeembargoagainstIraq.TheState
DepartmentreportsthatCubaandRomaniahavestruckoildeals
withIraqasothersa_empttotradewithBaghdadindefianceofthe
sanc'ons.Iranhasagreedto
exchangefoodandmedicineforIraqioil.Saddamhasoffered
developingnaEonsfreeoiliftheysendtheirtankerstopickitup.
Thusfar,nonehasaccepted.
Japan,accusedofrespondingtooslowlytotheGulfcrisis,has
promised$2billioninaidtocountrieshithardestbytheIraqitrade
embargo.PresidentBushhaspromisedthatSaddam'saggression
willnotsucceed.
Textclassifica'on
•  Automa'callyclassifydocumentsinto
predefinedclasses
Sport
Poli'cs
e.g.,digitallibraries
Entertain.
Economics
etc.
•  Applica'onareas:
–  EmailSPAMfiltering
–  Internetdirectoryconstruc'on(ex.:Yahoo!)
–  Automa'cindexing…
7800 chars, 1300 words
Averyclassicaldataset
SPAMdetec'on
Top 20 categories of Reuter news in 1987-91
•  SPAM=Junkemails
•  In2009:97%ofemailsareSPAM!
4000
h_p://news.bbc.co.uk/2/hi/technology/7988579.stm
3000
•  An'-spamisabout95%accuratetoday,but
canachieveabout99%ifcorrectlytrained
•  Numeroussystems:
2000
1000
ea
rn
on acq
ey
cr -fx
ud
gr e
a
t in
in rad
te e
r
whe st
ea
sh t
i
co p
rn
m
on o d
e y ils e lr
-s ed
up
p
s u ly
ga
g r
co n p
f
ve fee
goi
na go l l
t
so -g d
yb as
ea
n
bo
p
0
m
Number of Documents
17
Category
SpamAssassin(MessageLabs),BitdefenderAn'Spam,
POPFile,Spamihilator,CactusSpamFilter…
•  Alotofheuris'cs,partlyusingML
EmailFormat
ImageSPAM
•  From:Thee-mailaddress,andop'onallythenameofthesender
•  To:Thee-mailaddress[es],andop'onallyname[s]ofthemessage’s
recipient[s]
•  Subject:Abriefsummaryofthecontentsofthemessage
•  Date:Thelocal'meanddatewhenthemessagewaswri_en
•  Cc:Carboncopy
•  Bcc:BlindCarbonCopy
•  Received:Trackinginforma'ongeneratedbymailserversthathave
previouslyhandledamessage
•  Content-Type:Informa'onabouthowthemessagehastobedisplayed,
usuallyaMIMEtype
•  Reply-To:Addressthatshouldbeusedtoreplytothesender.
•  References:Message-IDofthemessagethatthisisareplyto,andthe
message-idofthismessage,etc.
•  In-Reply-To:Message-IDofthemessagethatthisisareplyto.
•  X-Face:Smallicon.
An'spamtechnologiesofBitdefender
• 
• 
• 
• 
• 
• 
• 
• 
• 
Bayesianfilters
Heuris'csfilters
NeuronalNetworks(ART,ARTMAP,NeuNet,ART+)
URLfilters
KNN
ASSL(scriptlanguage)
SURBL
Blacklistsandwhitelists
Imagefilters
NaiveBayesforemailclassifica'on
•  Basedonasimple(evensimplis'c)assump'on,NB
performsinteres'ngperformancesforthetaskof
emailclassifica'on[Koprinskaetal.,06]
•  Forthetaskoffilingemailsintofolders(4classes):
Informa'onExtrac'on(IE)
Ques'onAnswering
•  Iden'fyphrasesinlanguagethatrefertospecific
typesofen''esandrela'onsintext.
•  Nameden'tyrecogni'onistaskofiden'fyingnames
ofpeople,places,organiza'ons,etc.intext.
peopleorganiza'onsplaces
–  MichaelDellistheCEOofDellComputerCorpora'onand
livesinAus'nTexas.
•  Rela'onextrac'oniden'fiesspecificrela'ons
betweenen''es.
•  Directlyanswernaturallanguageques'ons
basedoninforma'onpresentedinacorpora
oftextualdocuments(e.g.theWeb).
–  WhenwasBarackObamaborn?
•  August4,1961
–  WhowaspresidentwhenBarackObamawas
born?
•  JohnF.Kennedy
–  Howmanypresidentshavetherebeensince
BarackObamawasborn?
•  9
25
26
Maineventsrelatedtotextmining
Na'onalandinterna'onalcontest
•  Ar'ficialIntelligence
•  Con'nueddevelopmentofcorporaand
compe''onsonshareddata:
–  IJCAI,AAAI,UAI,UM,ECAI
•  NaturalLanguageProcessing
TRECQ/A-SENSEVAL/SEMEVAL-CONLLSharedTasks(NER,
SRL…)-KDDcontest-etc.
–  ACL,CoNLL,EACL,EMNLP,IJCNLP,LREC
•  MachineLearning
•  Forinstance:SIAMTMCompe''on2007
–  ICML,ECML,ALT,COLT,NIPS,ICALT
•  DataMining/Database
–  Objec've:developTMalgofordocclassifica'on
–  Avia'onsafetyreportsdocumen'ngoneormore
problemsthatoccurredoncertainflights
–  Valida'onmeasures:precision/recall
–  ICDM,SIGKDD,PKDD,VLDB,ICDE,SDM,PAKDD,DAWAK
•  Informa'onRetrieval
–  SIGIR,TREC,ECIR,CIKM
•  Others
–  DocEng,ICWSM…
27
28