Composite Retrieval

Composite Retrieval applied to movie recommendation
Sihem Amer-Yahia, Eric Gaussier, Vincent Leroy (LIG/CNRS)
ALICIA meeting, Oct 2nd, 2014

Motivation
•  Search results beyond flat ranked lists
   –  Some items complement each other
   –  Opportunity to create rich results comprising several items: composite items
[Figure: ranked list of items vs. composite items]

Composite Item (CI)
•  A set I of items (e.g. movies, artists…)
•  A composite item C is a subset of I that satisfies:
   –  Complementarity: each item contributes to a different aspect of the CI, no redundancy
   –  Consistency: all items have some similarities and fit together
   –  Budget: total cost within bounds
•  Ex: CI of movies
   –  Complementarity attribute
      •  director, release year, genre, actors, …
   –  Consistency
      •  same raters, similar tags, …
   –  Budget
      •  #movies, rental cost, movie length, …

Personalized Composite Retrieval (TKDE 2014)
Retrieve a set of k CIs $\{S_1, \dots, S_k\}$ that maximize:

\[
\gamma \sum_{i} \sum_{a,b \in S_i} \mathrm{sim}(a,b) \;+\; (1-\gamma) \sum_{i<j} \Bigl(1 - \max_{x \in S_i,\, y \in S_j} \mathrm{sim}(x,y)\Bigr)
\]

•  First term: similarity of items in each CI $S_i$
•  Second term: diversity between CIs $S_i$ and $S_j$
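For small inputs, the objective above can be evaluated directly. The sketch below is only an illustration: the function name `composite_objective`, the nested-list similarity matrix, and the toy values are assumptions, and the inner sum is read as running over unordered pairs of distinct items, which the slide does not specify.

```python
from itertools import combinations


def composite_objective(cis, sim, gamma=0.5):
    """Score a list of composite items (each a list of item indices).

    Assumption: the intra-CI sum counts each unordered pair {a, b} once.
    """
    # First term: pairwise similarity of items inside each CI.
    intra = sum(sim[a][b] for ci in cis for a, b in combinations(ci, 2))
    # Second term: for each pair of CIs, 1 minus their most similar cross pair.
    inter = sum(
        1.0 - max(sim[x][y] for x in s_i for y in s_j)
        for s_i, s_j in combinations(cis, 2)
    )
    return gamma * intra + (1.0 - gamma) * inter


# Toy symmetric similarity matrix over 4 hypothetical items.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
good = composite_objective([[0, 1], [2, 3]], sim, gamma=0.5)
bad = composite_objective([[0, 2], [1, 3]], sim, gamma=0.5)
print(good, bad)
```

Grouping the two similar pairs together should score higher than mixing them, since both the consistency term and the diversity term improve.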

Existing approaches (TKDE 2014)
•  NP-hard, reduction from MAXIMUM EDGE SUBGRAPH
•  Heuristics: CI generation expressed as 2 different steps in 2 different orders
   –  Cluster to achieve consistency (e.g. k-means variants)
   –  Apply constraints to enforce budget and complementarity
→ Not fully integrated
→ Potentially sub-optimal results

Integrated Algorithm
•  State of the art
   –  K-means with must-link and cannot-link constraints
•  Objective
   –  Support queries such as "Generate k clusters comprising similar objects, 3 of which are of type A and 2 of type B"
→ Focus on Complementarity (special case) and Consistency
→ No budget constraint

Data Model
•  Given: items, users, user actions
•  Item similarity (consistency):
   –  Based on user actions
   –  a and b are similar if many users rated them similarly
•  Item complementarity:
   –  Rely on item attributes
   –  Each item belongs to a single category

Sketch of the Algorithm
•  Build 3 clusters with 2 red points and 3 green
   –  Initiate with fuzzy clustering, i.e. all points considered when repositioning centroids
   –  Transition to only considering content of CI

Datasets
•  VK: scrape scrape … and not sure it's complete
   –  15683422 movie_comments.tsv
   –  18350 movie_directors.tsv
   –  42417 movies_all.tsv
   –  2460027 user_ratings.tsv
   –  18736 users_all.txt
•  MovieLens (we used the 1M dataset)
   –  3883 movies.dat
   –  1000209 ratings.dat (at least 20 ratings per user)
   –  6040 users.dat

(Very) Preliminary Output
•  On VK datasets:
   –  2 Comédie, 2 Action, 2 Drame
   –  2 movies <= 1980, 2 movies >1980 <=2000, 2 movies >2000
   –  Runs in less than 1 min on my laptop (time is not an issue) but we need to evaluate quality and validate the constraints with you
•  On MovieLens

CIs in VK
CI1: 2 Comédie, 2 Action, 2 Drame
name=Les Vieux chats, genre=Comédie, year=2010, originalId=796202
name=Trois soeurs, genre=Comédie, year=2012, originalId=828379
name=Miss Bala, genre=Action, year=2010, originalId=664350
name=Rampart, genre=Action, year=2012, originalId=747997
name=La Nuit d'en face, genre=Drame, year=2012, originalId=812498
name=Rêve et silence, genre=Drame, year=2012, originalId=854345

CIs in VK
CI2: 2 Comédie, 2 Action, 2 Drame
name=Paparazzi, genre=Comédie, year=1997, originalId=59455
name=Mensonges et trahisons et plus si affinités..., genre=Comédie, year=2004, originalId=57814
name=Des Serpents dans l'avion, genre=Action, year=2005, originalId=277082
name=Predator, genre=Action, year=1987, originalId=60897
name=La Fievre du samedi soir, genre=Drame, year=1977, originalId=8093
name=Cocktail, genre=Drame, year=1988, originalId=5590

CIs in MovieLens
CI1: 2 Comédie, 2 Action, 2 Drame
name=Hav Plenty (1997), genre=Comedy, year=1997, originalId=1903
name=Red Dwarf, The (Le Nain rouge) (1998), genre=Comedy, year=1998, originalId=2685
name=Montana (1998), genre=Action, year=1998, originalId=3184
name=Bait (2000), genre=Action, year=2000, originalId=3898
name=Went to Coney Island on a Mission From God... Be Back by Five (1998), genre=Drama, year=1998, originalId=3887
name=Price of Glory (2000), genre=Drama, year=2000, originalId=3482

CIs in MovieLens
CI2: 2 Comédie, 2 Action, 2 Drame
name=Meet the Parents (2000), genre=Comedy, year=2000, originalId=3948
name=Bamboozled (2000), genre=Comedy, year=2000, originalId=3943
name=Highlander: Endgame (2000), genre=Action, year=2000, originalId=3889
name=Get Carter (2000), genre=Action, year=2000, originalId=3946
name=Two Family House (2000), genre=Drama, year=2000, originalId=3951
name=Contender, The (2000), genre=Drama, year=2000, originalId=3952

CIs in VK
CI1: 2 movies <= 1980, 2 movies >1980 <=2000, 2 movies >2000
name=2001 : L'Odyssée de l'espace, genre=Aventure, year=1968, originalId=4148
name=Les Dents de la Mer, genre=Aventure, year=1975, originalId=10709
name=Jurassic Park, genre=Aventure, year=1993, originalId=7637
name=Toy Story, genre=Animation, year=1996, originalId=53406
name=Vol spécial, genre=Documentaire, year=2011, originalId=777986
name=La Nuit d'en face, genre=Drame, year=2012, originalId=812498

CIs in VK
CI2: 2 movies <= 1980, 2 movies >1980 <=2000, 2 movies >2000
name=Marathon Man, genre=Thriller, year=1976, originalId=84942
name=Star Wars : Episode IV - Un nouvel espoir (La Guerre des étoiles), genre=Action, year=1977, originalId=64669
name=Predator, genre=Action, year=1987, originalId=60897
name=Piège de cristal, genre=Action, year=1988, originalId=60151
name=Les Brigades du Tigre, genre=Action, year=2005, originalId=108145
name=Des Serpents dans l'avion, genre=Action, year=2005, originalId=277082

CIs in MovieLens
CI1: CI on years, had to change to <= 1980, 1981-1990 and 1991-2000 because this dataset is older
name=Last Time I Saw Paris, The (1954), genre=Drama, year=1954, originalId=972
name=Mass Appeal (1984), genre=Drama, year=1984, originalId=2397
name=Trick or Treat (1986), genre=Horror, year=1986, originalId=2464
name=Snows of Kilimanjaro, The (1952), genre=Adventure, year=1952, originalId=3207
name=Price of Glory (2000), genre=Drama, year=2000, originalId=3482
name=Went to Coney Island on a Mission From God... Be Back by Five (1998), genre=Drama, year=1998, originalId=3887

CIs in MovieLens
CI2: CI on years, had to change to <= 1980, 1981-1990 and 1991-2000 because this dataset is older
name=Phantom of the Opera, The (1943), genre=Drama, year=1943, originalId=3936
name=Sorority House Massacre (1986), genre=Horror, year=1986, originalId=3941
name=Sorority House Massacre II (1990), genre=Horror, year=1990, originalId=3942
name=Get Carter (1971), genre=Thriller, year=1971, originalId=3947
name=Two Family House (2000), genre=Drama, year=2000, originalId=3951
name=Contender, The (2000), genre=Drama, year=2000, originalId=3952

Future Work
•  Set it up remotely to try it out
•  Apply on
   –  POIs from Wikipedia
   –  Retail data from Intermarché
•  Extend with
   –  Personalized composition (add user's interest as a weight or form user groups to provide best feedback on CI)
   –  Support constraints on multiple orthogonal attributes at once
   –  Support budget constraints (cumulative on objects) such as price, length …
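
To make the query type discussed earlier concrete (k clusters of similar objects with a fixed number of items per category), here is a minimal sketch of a greedy k-means-style heuristic with per-category quotas. It is not the algorithm from the slides: it skips the fuzzy initialization and only implements the later phase, where centroids are repositioned from the content of each CI. The function name `quota_clusters`, the 2-D points, the naive choice of the first k points as initial centroids, and the toy data are all assumptions.

```python
import math


def quota_clusters(points, categories, k, quota, n_iter=10):
    """Build k clusters, each holding at most quota[c] items of category c.

    Items that fit no remaining quota are left unassigned.
    """
    dim = len(points[0])
    centroids = [list(points[i]) for i in range(k)]  # naive init (assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Rank every (item, cluster) pair by distance to the cluster centroid.
        pairs = sorted(
            (math.dist(points[i], centroids[c]), i, c)
            for i in range(len(points))
            for c in range(k)
        )
        remaining = [dict(quota) for _ in range(k)]
        clusters = [[] for _ in range(k)]
        assigned = set()
        # Greedily take the closest pairs that still respect the quotas.
        for _, i, c in pairs:
            cat = categories[i]
            if i not in assigned and remaining[c].get(cat, 0) > 0:
                clusters[c].append(i)
                remaining[c][cat] -= 1
                assigned.add(i)
        # Reposition each centroid using only the items in its cluster.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [
                    sum(points[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]
    return clusters


# Toy data: two well-separated groups, 1 red and 2 green items wanted per cluster.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10), (1, 1), (9, 9)]
categories = ["red", "red", "green", "green", "green", "green", "red", "green"]
clusters = quota_clusters(points, categories, k=2, quota={"red": 1, "green": 2})
print(clusters)
```

In this sketch the complementarity constraint is enforced during assignment instead of as a separate post-processing step, which is the point of the integrated approach. The greedy assignment carries no optimality guarantee.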