MovieRec - CS 410 Project Report
Team:
Pattanee Chutipongpattanakul - chutipo2
Swapnil Shah - sshah219

Abstract
MovieRec is a movie search engine that lets users search for any type of movie they want to watch without any prior knowledge of films. Unlike most websites, MovieRec does not use pre-existing categories. It relies entirely on the most helpful reviews, as voted by ordinary moviegoers on popular sites such as IMDb and Amazon, which helps the user find unbiased results. The user can search with any keyword that describes the kind of movie they want to watch and can give us feedback on the results, which helps make the search results better. In addition, after finding relevant results, the user can click on the name of the desired movie and be taken to its Amazon Instant Video page, where anyone can watch the trailer or buy or rent the movie.

Introduction
The idea for MovieRec comes from the daily inconveniences that many users face with the available movie search engines. Everyone loves watching movies, yet the movie search engines that are popular among people have very limited functionality. The most popular movie database, IMDb, only allows users to search by movie title. The question is what happens when users only know what type of movie they want to watch, for example a “fairytale” or a movie with a “twisted ending”. The existing movie recommendation search engines do not provide reliable results, as most of them are organized into predefined categories such as genres. However, while surfing these websites we can read a lot of user reviews for each movie, sorted by helpfulness. The most helpful reviews are the ones that most of the general population agrees with.
So we decided to use the most helpful reviews as our dataset, because in their reviews people discuss what they liked or disliked in a movie, spoilers, and important plot points. Using the most helpful reviews lets us find useful keywords describing a movie that cannot be predefined.

Goals and Challenges
MovieRec takes movie reviews from multiple websites, such as Amazon and IMDb, for each movie and stores them as documents in text files. The system takes queries from the user and, depending on the mode the project is running in, offers the user a relevance feedback mechanism. We faced many challenges throughout the project; each phase had problems of its own.

Populating the database: Our first challenge was to find an efficient way to crawl for movies while keeping our database as wide as possible. The initial plan was to crawl the review pages and extract both the reviews and the ratings. We planned to weight the ratings according to the usefulness of each review; however, this is where we encountered our first problem. The structure of the website we chose made it difficult to extract specific elements such as ratings, since the Ruby crawler we used extracted only the inner text of the page. We did manage to extract the most useful reviews, but we were forced to adjust our plans for weighting the ratings and for the score band stated in the proposal. Another challenge was combining the crawled data from multiple websites without merging the text files manually. After crawling we had two sets of reviews per movie, which is inconvenient: because we use the document structure provided by Lucene, we need one text file per document so that the documents can be created during indexing. After some effort we managed to populate our database and moved on to our next goal.

Implementing the search and ranking function: Before turning to assignment 3, we researched Solr and decided against using it.
We turned to available resources instead. Implementing the search engine and ranking function went smoothly since we took advantage of the assignment 3 code. We analyzed what each method does and tried to modify it; dissecting the assignment took time, but it helped us understand the toolkit better. We wanted to contribute something of our own, so we set out to create our own ranking function. This is where the challenge came in. The function used in the assignment was BM25, and we experimented with two other known functions, simple TF-IDF and BM25L. After studying multiple functions we tried combining them, which is how we drafted our function. The results were less than satisfactory, but we used the function anyway since we plan to build on it as we develop the project further.

Implementing relevance feedback: As mentioned in the proposal, we planned a new feature to differentiate our search engine from the existing projects we often come across: adjusting the query according to user feedback so that better results can be returned. We planned three options: ‘helpful’, ‘not helpful’, and ‘watched’. We wanted the third option so that movies the user has already watched are not deemed irrelevant to the query. We found that Lucene does not offer relevance feedback, which was another challenge. We searched for other toolkits related to Lucene and found LucQE, a toolkit for query expansion. Query expansion in LucQE is based on Rocchio. We tried to incorporate it into the search engine but were not successful; we went back to LucQE several times throughout development, but so far we have run into multiple bugs and could not use the toolkit.
For the time being, we tried implementing feedback with Java data structures that keep track of the queries and document scores, find the most recurring words in the results deemed useful, and use those words for query expansion. This had limited success.

Combining the user interface and the search engine: We faced multiple challenges right off the bat, since we simply could not get the interface up and running on our computers. The EWS machines did not allow us to download certain packages needed for a working interface, so we installed Linux on our own systems and tried there. There was a lot of confusion about which Apache Java version to use because the project was not compiling. In addition, the JAVAPATH was not being exported correctly. After a lot of research on where to add the JAVAPATH, and some changes to the pom file, the basic interface started running on localhost. Everything went smoothly once the interface was up and running. After researching how to combine the two, we designed our own user interface and joined it with the search engine.

Work Summarization
We built MovieRec as a standard text retrieval engine. Instead of a conventional database search that categorizes movies by tags, as other sites use, we use crawled movie reviews as our dataset and do not tag our movies, which removes the limiting factor of which category a movie fits into. We used available resources and toolkits, modified parts of them, and researched more efficient implementations. The search results from our site are not as accurate as we had hoped, but we plan to expand our dataset by crawling more reviews and to develop our ranking function so that the returned results become more accurate.

Related Work
We are fairly confident that there are websites offering functions similar to ours; however, to the best of our knowledge, none of the existing projects uses feedback from users for query expansion.
Furthermore, we do not know of any website that bases its search on reviews from the general public, which enables searching by the content of the movies instead of the usual ‘genre filtering’ or conventional tagging. We did try researching websites that might be similar to our work.

Jinni
Jinni is one of the most popular and highly recommended movie recommendation services that we surveyed on the web. It returned the most relevant results of the sites we tried. However, its results are still based on the predefined tags assigned to each movie, so we often ended up with totally irrelevant results for some of the simplest queries. For queries that do not match a tag, we found that Jinni only searches movie names and returns the movies whose names contain the query. This is not effective because, more often than not, a movie's name is not associated with its content.

Suggestmemovie
Suggestmemovie is another popular movie recommendation website. It has very useful filtering features. However, it fails for many queries such as “twisted plot”, “gory”, and “fairytale”: the results were very poor and sometimes nonexistent. When we entered queries to find movies with a certain theme, the returned movies were usually not related to that theme. The search and recommendation system fails more often than it succeeds.

IMDb
IMDb is the most popular movie database website on the internet, and it relies completely on movie titles for search. However, the website provides the useful function of letting users review movies, and it allows upvotes and downvotes on those reviews. It even collects the most useful reviews on one page. We took advantage of this: to make our database as rich and accurate as possible, we used only the IMDb reviews that were deemed most useful by the general public.
Methods

Populating database
We listed the movies we wanted to populate our database with and obtained the URLs manually. We used the Ruby crawler and the PhantomJS text scraper to extract the text from those pages. We decided to crawl reviews from multiple websites because we believe that, to get a good description of each movie and a fair dataset, we need reviews from the general public, Amazon customers, and critics. Only then does the dataset represent the movies fairly enough that keywords related to the movie content can be found in the reviews. The crawled text files are then processed by merging the files with similar titles into one.

Search Engine
We initially planned on using Solr, which we researched heavily; however, during implementation we ran into problems importing the packages, so we decided to use Lucene for the time being. MovieRec therefore uses a backend search engine implemented with Lucene. The text files were indexed into document objects, a class provided by Lucene, with the fields ‘name’, ‘url’, and ‘contents’. We modified the program so that the url leads not to the page the review was crawled from but to the movie's Amazon Instant Video page, so that the user can go directly from a result to buying or renting the movie. The search uses our ranking function and returns the results as an array list of ResultDocs. The snippets shown with each result were taken from the most highlighted portion of the review; only a few lines of the review containing the query words were displayed in our interface.

Our ranking function

Relevance feedback and Query expansion
We researched implementations of relevance feedback in Lucene. We studied LucQE and the module it provides (Rocchio Query Expansion) and decided to implement it.
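For readers unfamiliar with Rocchio, the idea behind this kind of query expansion can be sketched in plain Java. This is only an illustration of the textbook formula q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant); it is not LucQE's code and not the code used in MovieRec. The class name RocchioSketch, the sparse term-weight maps, the mapping of ‘helpful’ and ‘not helpful’ feedback to the two document sets, and the weights 1.0, 0.75, and 0.15 are all assumptions made for the example. It requires Java 9 or later for Map.of and List.of.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative Rocchio query expansion over sparse term-weight vectors (hypothetical class). */
public class RocchioSketch {

    /**
     * Returns alpha*query + beta*centroid(relevant) - gamma*centroid(nonRelevant),
     * dropping every term whose final weight is not positive.
     */
    public static Map<String, Double> expand(
            Map<String, Double> query,
            List<Map<String, Double>> relevant,
            List<Map<String, Double>> nonRelevant,
            double alpha, double beta, double gamma) {
        Map<String, Double> expanded = new HashMap<>();
        // Start from the original query, scaled by alpha.
        for (Map.Entry<String, Double> e : query.entrySet()) {
            expanded.merge(e.getKey(), alpha * e.getValue(), Double::sum);
        }
        // Move toward the average relevant document and away from the average non-relevant one.
        addCentroid(expanded, relevant, beta);
        addCentroid(expanded, nonRelevant, -gamma);
        // Negative or zero weights are usually discarded in practice.
        expanded.values().removeIf(w -> w <= 0.0);
        return expanded;
    }

    /** Adds factor * (average of docs) to target; does nothing for an empty list. */
    private static void addCentroid(Map<String, Double> target,
                                    List<Map<String, Double>> docs,
                                    double factor) {
        if (docs.isEmpty()) {
            return;
        }
        double scale = factor / docs.size();
        for (Map<String, Double> doc : docs) {
            for (Map.Entry<String, Double> e : doc.entrySet()) {
                target.merge(e.getKey(), scale * e.getValue(), Double::sum);
            }
        }
    }

    public static void main(String[] args) {
        // Assumed toy data: term weights for the query and for the reviews the user marked.
        Map<String, Double> query = Map.of("dragon", 1.0);
        List<Map<String, Double>> helpful = List.of(
                Map.of("dragon", 2.0, "viking", 1.0),
                Map.of("dragon", 1.0, "fire", 1.0));
        List<Map<String, Double>> notHelpful = List.of(
                Map.of("tattoo", 3.0, "dragon", 1.0));
        // The expanded query gains "viking" and "fire"; "tattoo" ends up negative and is dropped.
        System.out.println(expand(query, helpful, notHelpful, 1.0, 0.75, 0.15));
    }
}
```

In this toy run the term "dragon" is reinforced, the terms from the helpful reviews enter the query with smaller weights, and the term that appears only in the unhelpful review is removed. A real integration would still have to turn the expanded weights into a boosted search query.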
After a failed attempt, we wrote a Java program that takes the contents of the documents, finds the words with the highest frequencies, and performs normalization and smoothing with a general language model built from all the documents. We use this when the user runs the interactive search in the terminal. The program asks whether each document is relevant. If it is, the topic language model obtained from the program is used together with the original query to expand the search. We also tried storing the queries in a data structure together with the documents the user deemed relevant, keeping scores according to the feedback. Our relevance feedback is currently far from complete; further work will include going back and trying to implement LucQE's query expansion module again.

Interface
After connecting the back end with the sample interface provided in assignment 3, we designed our own interface. Initially we were going to include a feature in which the picture beside each result changes according to the relevance feedback given. However, we could not carry out that plan since our relevance feedback and query expansion are not yet complete. After finishing the first version of our web interface, we joined it with the search engine backend, yielding MovieRec as seen below.

Usage and Evaluation
The main function of MovieRec is that the searched words actually match the content of a movie or are an accurate description of it. MovieRec does not only match the title against the query words; it also finds those words in the reviews that were deemed most useful. We tested the website with sample queries. The feedback was not implemented in the user interface; however, it was used in the terminal interactive search. For the above example, we entered the query “dragon” in MovieRec and in Jinni. Our search engine returns movies that have dragons in them, such as “How to Train Your Dragon”, “The Hobbit: The Desolation of Smaug”, and “Shrek”.
In contrast, most of the results from Jinni are movies that have the word dragon in the title, such as “Green Dragon” and “The Girl with the Dragon Tattoo”. In the screenshot below, we searched for the keywords “twisted plot”. MovieRec returns some of the most relevant movies with twisted plots or endings, such as “Now You See Me”, “The Usual Suspects”, and “The Sixth Sense”, which is quite useful, while Suggestmemovie does not return anything for the same query. Although MovieRec returns highly relevant results most of the time, the results sometimes suffer because user reviews can be very ambiguous. When we searched for the word “magic”, as seen in the screenshot below, one of the movies returned was “Jurassic Park”, because of the ambiguous nature of natural language. That is the reason we developed the feedback system: so that we can improve our search results for such queries.

Conclusions and Future works
We ended up diverging from many of our initial plans due to technicalities; however, the final product worked as we intended. MovieRec is a movie search engine that does not restrict users to traditional keywords that mostly describe themes, moods, genres, or other filters. MovieRec lets the user search by any keyword they can think of, from broad keywords like ‘gory’ to narrow keywords like ‘cops’ or ‘womanizers’, or even the names of movie characters or particular elements of a movie. We implemented and tweaked crawling, indexing, searching, and ranking using knowledge obtained from class and through the assignments. Our project has flaws, but we intend to address them in further iterations.

Fixing limitations
We can go back and finish a proper implementation of relevance feedback and integrate it into the frontend of the application as well. We can also improve our ranking function to make it more accurate.
We can attempt to overcome the limiting factors we found in crawling, implement the ratings, and display the range of ratings with the movies returned.

Expansions
We can further expand our database by crawling reviews from more websites, and even by allowing users to add their own movie reviews to our dataset. Lastly, we could add functions that summarize the reviews and remove redundant reviews from the dataset.

Individual Contributions
Swapnil Shah (sshah219)
- Research on Solr
- Crawling and populating the database
- Modification of text files
- Frontend user interface and backend search engine connection
- Ranking function
- System setup and installation work

Pattanee Chutipongpattanakul (chutipo2)
- Research on LucQE
- Crawling and populating the database
- Query expansion and relevance feedback using Java data structures
- Java program to process reviews using unigram language models to find topic models
- Smoothing topic models
- Ranking function
- Web application (user interface frontend)

References
[1] Rubens, Neil. "LucQE [lucky] Lucene Query Expansion Module." LucQE. Computer Modelling and New Technologies, 2006. Web. 13 May 2014.
[2] "Apache Solr." Apache Lucene. The Apache Software Foundation, 2011. Web. 13 May 2014.
[3] "Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & More." Amazon.com. N.p., n.d. Web. 11 May 2014.
[4] IMDb. IMDb.com, n.d. Web. 11 May 2014.
[5] "Suggest Me Movies." Suggest Me Movie RSS. N.p., n.d. Web. 13 May 2014.
[6] "Jinni: Find Movies, TV Shows Matching Your Taste & Watch Online." Find Movies, TV Shows Matching Your Taste and Watch Online. N.p., n.d. Web. 13 May 2014.