MovieRec - CS 410 Project Report Team : Pattanee

MovieRec - CS 410 Project Report
Team :
Pattanee Chutipongpattanakul - chutipo2
Swapnil Shah - sshah219
Abstract
MovieRec is a unique movie search engine that lets users search for any type of movie they
want to watch without any prior knowledge of films. MovieRec does not use pre-existing
categories as most websites do; it relies entirely on the reviews voted most helpful by
ordinary moviegoers on popular sites such as IMDB and Amazon, which helps the user find
unbiased results. The user can search with any keyword that describes the kind of movie they
want to watch and give us feedback on the results, which can help improve them. In addition,
after finding relevant results, the user can click on the name of a movie to go to its Amazon
Instant Video page, where anyone can watch the trailer or buy or rent the movie.
Introduction
The idea for MovieRec comes from the daily inconveniences that users face with the available
movie search engines. Everyone loves watching movies, yet the popular movie search engines
have very limited functionality. The most popular movie database, IMDB, only allows users to
search by movie title. But what if the user just wants to search for the type of movie he or she
wants to watch, for example a “fairytale” or a movie with a “twisted ending”? Existing movie
recommendation search engines do not provide reliable results for such queries because most
of them are organized into predefined categories such as genres. However, while browsing
these websites, we can read many user reviews for a movie, sorted by helpfulness. The most
helpful reviews are the ones that most of the general population agrees with. We decided to
use them as our dataset because in their reviews people discuss what they liked or didn’t like
in a movie, spoilers, and important plot points. Using the most helpful reviews allows us to
find useful keywords describing a movie that cannot be predefined.
Goals and Challenges
MovieRec takes reviews for each movie from multiple websites such as Amazon and IMDB and
stores them as documents in text files. The system takes queries from the user and,
depending on the mode the project is running in, offers the user a relevance feedback
mechanism.
We faced many challenges throughout the project; each phase had problems of its own:
Populating the database:
Our first challenge was to find an efficient way to crawl for movies while keeping our database
as wide as possible. The initial plan was to crawl the review pages and extract both the reviews
and the ratings, and to weight the ratings according to the usefulness of each review. This is
where we encountered our first problem: the structure of the website we chose made it
difficult to extract specific elements such as ratings, since the Ruby crawler we used extracted
only the inner text of the page. We did manage to extract the most useful reviews, but we were
forced to adjust our plans for weighting the ratings and for the score band described in the
proposal. Another challenge was combining the crawled data from multiple websites without
merging the text files manually. After crawling we had two sets of reviews per movie, which
was inconvenient: because we use the document structure provided by Lucene, we need one
text file per document so that the documents can be created during indexing. After some
effort we managed to populate our database and moved on to our next goal.
Implementing search and ranking function:
Before turning to assignment 3, we researched Solr and decided against using it, choosing to
build on the resources already available to us instead. Implementing the search engine and
ranking function went smoothly since we took advantage of the code from assignment 3. We
analyzed what each method does and tried modifying it; dissecting the assignment took time,
but it enabled us to understand the toolkit better. We wanted to contribute something of our
own, so we set out to create our own ranking function. This is where our challenge came in.
The function used in the assignment was BM25, and we experimented with known functions
such as simple TF-IDF and BM25L. After studying multiple functions we tried combining them,
which is how we drafted our function. The results were less than satisfactory, but we used the
function anyway since we plan to build on it as we develop the project further.
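For reference, the BM25 baseline mentioned above can be sketched as follows. This is a minimal illustration in Python, not the code from assignment 3 or our own combined function; the parameter values k1 = 1.2 and b = 0.75 are common defaults assumed here, and the three toy "reviews" are invented for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long reviews are penalized relative to the average.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy review text (invented): the title alone does not decide the ranking.
reviews = {
    "How to Train Your Dragon": "a boy befriends a dragon and the dragon learns to trust him".split(),
    "Green Dragon": "refugees arrive at a camp after the war".split(),
    "Shrek": "an ogre rescues a princess guarded by a dragon".split(),
}
titles = list(reviews)
scores = bm25_scores(["dragon"], [reviews[t] for t in titles])
ranked = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
```

Because only review text is scored, a movie whose reviews never mention the query term receives a score of zero even if the term appears in its title.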
Implementing relevance feedback:
As mentioned in the proposal, we planned a feature that would differentiate our search engine
from the existing projects we often come across: adjusting the query according to feedback
from the users so that better results can be returned. We planned three feedback options:
‘helpful’, ‘not helpful’, and ‘watched’. We wanted the third option so that documents for
movies the user had already watched would not be treated as irrelevant to the query. We then
found that Lucene does not offer relevance feedback, which created another challenge. We
searched for other toolkits related to Lucene and found LucQE, a toolkit for query expansion
based on Rocchio. We tried to incorporate it into the search engine and returned to it multiple
times throughout the development of the project, but we kept running into bugs and so far
have not been able to use the toolkit. For the time being, we implemented feedback using data
structures in Java that keep track of the queries and document scores, find the most
frequently recurring words in the results deemed useful, and use those words for query
expansion. This had limited success.
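The Rocchio update that LucQE's query expansion is based on can be sketched as below. This is an illustrative Python version, not LucQE's API or our Java code; the weights alpha, beta and gamma are conventional textbook values assumed here, the term vectors are plain term counts, and leaving 'watched' documents out of both sets is one assumed way to keep them from counting as irrelevant.

```python
from collections import Counter, defaultdict

def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant documents and away from non-relevant ones."""
    new_query = defaultdict(float)
    for term, weight in query_vec.items():
        new_query[term] += alpha * weight
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                # Each set contributes its centroid (mean vector), scaled by its coefficient.
                new_query[term] += coeff * weight / len(docs)
    # Terms whose weight ends up non-positive are dropped from the expanded query.
    return {term: w for term, w in new_query.items() if w > 0}

query = {"gory": 1.0}
helpful = [Counter("gory bloody horror".split())]       # marked 'helpful'
not_helpful = [Counter("gory cartoon comedy".split())]  # marked 'not helpful'
# Documents marked 'watched' are passed to neither list in this sketch.
expanded = rocchio(query, helpful, not_helpful)
```

In this toy run the expanded query gains the terms "bloody" and "horror" from the helpful review, while "cartoon" and "comedy" are dropped.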
Combining the user interface and the search engine:
We faced multiple challenges right off the bat because we could not get the interface up and
running on our computers. The EWS machines didn’t allow us to download certain packages
needed for a working interface, so we ended up installing Linux on our own systems and
trying it there. There was a lot of confusion about which Apache Java version to use because
the project wasn't compiling. In addition, JAVAPATH wasn’t being exported correctly. After a
lot of research on where to set JAVAPATH, and some changes to the pom file, the basic
interface started running on localhost. Everything went smoothly after that. After
researching how to combine the two, we designed our own user interface and joined it with
the search engine.
Work Summarization
We built MovieRec as a standard text retrieval engine. Instead of a conventional database
search that categorizes movies by tags, as other sites use, we use crawled movie reviews as
our dataset and do not tag our movies, which removes the limitation of deciding which
category a movie fits into. We utilized available resources and toolkits, modified parts of them,
and researched more efficient implementations. The search results from our site are not as
accurate as we had hoped, but we plan to expand our dataset by crawling more reviews and to
develop our ranking function so that the returned results become more accurate.
Related Work
We are quite confident that there are websites that offer functions similar to ours; however,
to the best of our knowledge none of the existing projects uses feedback from the users for
query expansion. Furthermore, we do not know of any website that bases its search on
reviews from the general public, which enables searching by the content of the movies
instead of the usual ‘genre filtering’ or conventional tagging. We surveyed websites that
might be similar to our work.
Jinni
Jinni is one of the most popular and highly recommended movie recommendation services
that we surveyed on the web. It returned the most relevant results among the services we
surveyed. However, its results are still based on the predefined tags assigned to each movie,
so we often ended up with completely irrelevant results for some of the simplest queries. For
queries that match no tag, we found that Jinni only searches the names of the movies and
returns those whose names contain the query. We find this ineffective because, more often
than not, a movie's name is not related to its content.
Suggestmemovie
Suggestmemovie is another popular movie recommendation website. It has very useful
filtering features. However, it fails for many queries such as “twisted plot”, “gory”, and
“fairytale”: the results were very poor and sometimes nonexistent. When we entered queries
to find movies with a certain theme, the returned movies were usually not related to that
theme. The search and recommendation system fails more often than it succeeds.
IMDB
IMDB is the most popular movie database website on the internet, and it relies entirely on
movie titles when searching for movies. However, the website lets users write reviews for the
movies and upvote or downvote them, and it even collects the most useful reviews on one
page. We took advantage of this: to make our database as rich and accurate as possible, we
used only the IMDB reviews that were deemed most useful by the general public.
Methods
Populating database
We listed the movies we wanted to populate our database with and obtained the URLs
manually. We used the Ruby crawler and the PhantomJS text scraper to extract the text of
those pages. We decided to crawl reviews from multiple websites because we believe that, to
get a good description of each movie and a fair dataset, we should include reviews from the
general public, Amazon customers, and critics. Only then does the dataset represent the
movies fairly enough that keywords related to a movie's content will be found in its reviews.
The crawled text files are then processed by merging the files with similar titles into one.
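The merging step can be sketched as follows. This is an assumed Python illustration rather than our actual processing script; the title-normalization rule and the helper names normalize_title and merge_reviews are hypothetical, and the review strings are placeholders.

```python
import re
from collections import defaultdict

def normalize_title(title):
    """Lowercase a title and collapse punctuation and spacing so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def merge_reviews(crawled):
    """Group (title, text) pairs by normalized title and join their text into one document."""
    merged = defaultdict(list)
    for title, text in crawled:
        merged[normalize_title(title)].append(text.strip())
    return {key: "\n".join(parts) for key, parts in merged.items()}

# One entry per crawled file; the same movie appears once per source site.
crawled = [
    ("The Sixth Sense", "review text crawled from IMDB"),
    ("the sixth sense!", "review text crawled from Amazon"),
    ("Shrek", "review text crawled from IMDB"),
]
documents = merge_reviews(crawled)
```

Each key in documents then corresponds to a single text file, which matches the one-file-per-document layout needed for indexing.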
Search Engine
We initially planned to use Solr, which we researched heavily; however, during
implementation we ran into problems importing the packages, so we decided to use Lucene
for the time being. MovieRec therefore uses a backend search engine implemented with
Lucene. The text files were indexed as Document objects, a class provided by Lucene, with the
fields ‘name’, ‘url’ and ‘contents’. We modified the program so that the url does not lead to the
page the review was crawled from but to the movie's Amazon.com instant video page, so that
the user can directly buy or rent a movie from the returned results. The search uses our
ranking function and returns the results as an array list of ResultDocs. The snippet shown for
each result is taken from the portion of the review with the most highlighted query words;
only the few lines of the review containing those words are displayed in our interface.
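One simple way to choose such a snippet is to slide a fixed-size window over the review and keep the window with the most query-word matches. The Python sketch below illustrates that idea only; it is not the highlighter used in our Lucene backend, and the function name best_snippet and the window size are assumptions.

```python
def best_snippet(text, query_terms, window=12):
    """Return the run of `window` words that contains the most query-term occurrences."""
    words = text.split()
    terms = {t.lower() for t in query_terms}
    best_start, best_hits = 0, -1
    for start in range(max(1, len(words) - window + 1)):
        # Compare words case-insensitively and ignore surrounding punctuation.
        hits = sum(
            1 for w in words[start:start + window]
            if w.lower().strip(".,!?\"'") in terms
        )
        if hits > best_hits:  # strict comparison keeps the earliest best window
            best_start, best_hits = start, hits
    return " ".join(words[best_start:best_start + window])

review = ("The acting is fine. But the twisted plot keeps you guessing "
          "until the very end of the film.")
snippet = best_snippet(review, ["twisted", "plot"], window=5)
```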
Our ranking function
Relevance feedback and Query expansion
We researched implementations of relevance feedback in Lucene. We studied LucQE and the
module it provides (Rocchio Query Expansion) and decided to use it. After a failed attempt,
we wrote a Java program that takes the contents of the documents, finds the words with the
highest frequencies, and performs normalization and smoothing with a general language
model built from all the documents. We use this when the user runs the interactive search
through the terminal. Our program asks whether each document is relevant or not. If it is
relevant, the topic language model obtained from the program is used with the original query
to expand the search. We also tried storing the queries in a data structure together with the
documents the user deemed relevant, keeping scores according to the feedback. Our
relevance feedback is far from complete as of now; further work will include going back and
trying to implement LucQE’s query expansion module again.
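The language-model step can be sketched roughly as below. This Python version is an assumed reading of the description above, not our Java program: the Jelinek-Mercer mixing weight lam, the log-ratio scoring rule, and the name expansion_terms are illustrative choices, and the sketch assumes the relevant documents are a subset of all_docs.

```python
import math
from collections import Counter

def expansion_terms(relevant_docs, all_docs, query, k=1, lam=0.2):
    """Pick the k terms that best separate the relevant documents from the collection."""
    feedback = Counter(w for d in relevant_docs for w in d)
    background = Counter(w for d in all_docs for w in d)
    f_total = sum(feedback.values())
    b_total = sum(background.values())
    scores = {}
    for word, count in feedback.items():
        if word in query:
            continue
        p_background = background[word] / b_total
        # Jelinek-Mercer smoothing of the topic model with the general (collection) model.
        p_topic = (1 - lam) * count / f_total + lam * p_background
        # Words common everywhere (e.g. "the") get a ratio near or below 1 and score low.
        scores[word] = p_topic * math.log(p_topic / p_background)
    # Sort by score, breaking ties alphabetically so the output is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

all_docs = [
    "the dragon breathes fire on the village".split(),
    "the dragon guards the fire mountain".split(),
    "the detective solves the case in the city".split(),
    "the lawyer wins the case".split(),
]
relevant_docs = all_docs[:2]  # documents the user marked as relevant
query = ["dragon"]
expanded_query = query + expansion_terms(relevant_docs, all_docs, query)
```

Here the frequent word "the" is discounted by the general model, so the topical word "fire" is chosen for expansion instead.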
Interface
After connecting the back end with the sample interface provided in assignment 3, we
designed our own interface. Initially we were going to include a feature in which the picture
beside each result changes according to the relevance feedback given. However, we could not
carry out that plan since our relevance feedback and query expansion are not yet complete.
After finishing the first version of our web interface, we joined it with the search engine
backend, yielding MovieRec as seen below.
Usage and Evaluation
The main function of MovieRec is that the words searched will match the content of the movie
or be an accurate description of it. It does not only match the query words against the title; it
also finds those words in the reviews that were deemed most useful. We tested the website
with the example queries below. The feedback was not implemented in the user interface;
however, it was used in the interactive terminal search.
For the above example, we entered the query “dragon” in MovieRec and Jinni. Our search
engine returns movies that have dragons in them, such as “How to Train Your Dragon”,
“The Hobbit: The Desolation of Smaug”, and “Shrek”, while most of the results in Jinni
have the word dragon in the title, such as “Green Dragon” and “The Girl with the Dragon
Tattoo”.
In the screenshot below, we searched for the keyword “twisted plot”. MovieRec returns
some of the most relevant movies that have twisted plots or endings, such as “Now You
See Me”, “The Usual Suspects”, and “The Sixth Sense”, which is quite useful, while
Suggestmemovie does not return anything for the same query.
While MovieRec returns highly relevant results most of the time, the results sometimes
suffer because user reviews can be very ambiguous. When we searched for the word
“magic”, as seen in the screenshot below, one of the movies returned was “Jurassic Park”
because of the ambiguous nature of natural language. That is the reason we developed
the feedback system: so that we can improve our search results for such queries.
Conclusions and Future works
We ended up diverging from many of our initial plans due to technical difficulties; however,
the final product worked as we intended. MovieRec is a movie search engine that does not
restrict users to searching by traditional keywords, which mostly describe themes, moods,
genres, or other filters. MovieRec allows the user to search by any keyword they can think of,
from broad keywords like ‘gory’ to narrow keywords like ‘cops’ or ‘womanizers’, or even
names of movie characters or certain elements of the movie. We have implemented and
tweaked crawling, indexing, searching and ranking using knowledge obtained from class and
through assignments. Our project has flaws, but we intend to address them in further
iterations.
Fixing limitations
We can go back and finish a proper implementation of relevance feedback and integrate it
into the frontend of the application as well. We can also improve our ranking function to
make it more accurate. We can attempt to overcome the limitations we found in crawling,
extract the ratings, and display the range of ratings with the movies returned.
Expansions
We can further expand our database by crawling reviews from more websites, and even by
allowing users to add their own movie reviews to our dataset. Lastly, we could add functions
that summarize the reviews and remove redundant reviews from the dataset.
Individual Contributions
Swapnil Shah (sshah219)
- Research on Solr
- Crawling and populating the database
- Modification of text files
- Frontend user interface and backend search engine connection
- Ranking function
- System setup and installation work
Pattanee Chutipongpattanakul (chutipo2)
- Research on LucQE
- Crawling and populating the database
- Query expansion and relevance feedback using Java data structures
- Java program to process reviews using unigram language models to find topic models
- Smoothing topic models
- Ranking function
- Web application (user interface frontend)
References
[1] Rubens, Neil. "LucQE [lucky] Lucene Query Expansion Module." LucQE. Computer Modelling and New Technologies, 2006. Web. 13 May 2014.
[2] "Apache Solr." Apache Lucene. The Apache Software Foundation, 2011. Web. 13 May 2014.
[3] "Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & More." Amazon.com. N.p., n.d. Web. 11 May 2014.
[4] IMDb. IMDb.com, n.d. Web. 11 May 2014.
[5] "Suggest Me Movies." Suggest Me Movie RSS. N.p., n.d. Web. 13 May 2014.
[6] "Jinni: Find Movies, TV Shows Matching Your Taste & Watch Online." Find Movies, TV Shows Matching Your Taste and Watch Online. N.p., n.d. Web. 13 May 2014.