COURSES - MOOC Search Engine
COURSES: A MOOC Search Engine
Ruilin Xu
[email protected]

Abstract

There are hundreds of thousands of massive open online courses (MOOCs) on the internet, with more added every day, so it is cumbersome for users to search efficiently for the specific courses they want. People usually need to repeat the same search on multiple MOOC websites, such as Coursera [4] and Udacity [5], to find what they are looking for. COURSES was developed to address this problem. COURSES is a MOOC search engine that gathers around 4,000 courses from major MOOC websites: Coursera [4], Udacity [5], Khan Academy [6], Udemy [7], and Edx [8]. COURSES also implements faceted search: all search results are grouped into categories such as price, length, workload, instructor, and course category. By combining a search query with multiple filters, users can narrow down to specific courses very efficiently.

Introduction

Nowadays, people love to enroll in massive open online courses (MOOCs) because of their convenience. As MOOCs become more and more popular, their number has increased dramatically, and numerous MOOC websites have emerged. Unfortunately, with the number of MOOC websites growing rapidly, it is harder for users to find the courses they want. They often need to repeat their searches again and again on different MOOC websites, trying to find the perfect result. This is both cumbersome and inefficient. Is there a way to search only once and easily find the courses we want across all the MOOC websites? With that question in mind, and inspired by Assignment 3, I found that the idea was feasible to realize. After discussing it with the professor, I decided to create COURSES, a vertical MOOC search engine powered by Apache Solr, to solve this problem and to apply what I learned in class to real life.
This search engine greatly simplifies the search for suitable online courses because it combines courses from five sites (Coursera [4], Udacity [5], Khan Academy [6], Udemy [7], and Edx [8]), eliminating the need for users to hop from site to site. Users can also filter based on what they are looking for. Similar websites exist, such as RedHoop [1], MOOC-List [2], and Class-Central [3], but they all have downsides: they either do not incorporate as much data as COURSES or do not display as much information about each course as users need. They only show the course title and a brief course introduction; the user cannot see any other attributes of the course, such as its instructors, price, or length. As a result, COURSES is both meaningful and useful. COURSES has a sizeable database of around 4,000 courses. It also provides many useful faceted filters that let users target search results efficiently, and it displays all of the useful attributes of each course within the search results, so users can see everything in one place, saving significant time in deciding which courses to take.

Related Work

COURSES was inspired by Assignment 3 from our CS 410 course. Assignment 3 is essentially a tutorial on building a simple search engine from a few data sources. COURSES uses a similar technique when parsing data: both use a combination of Ruby and JavaScript files. The difference is that COURSES uses JavaScript parsers that are much more complicated than the one in Assignment 3. COURSES' parsers are site-specific, meaning they can extract different data from different MOOC websites. After parsing, the data is converted to XML files that are compatible with Apache Solr. There is another well-implemented MOOC search engine called RedHoop [1].
RedHoop [1] and COURSES are very similar in that both are multi-site search engines, gathering course data from various MOOC websites, and both offer faceted search. However, COURSES improves greatly on the display of search results. RedHoop [1] only shows the course title with a brief introduction, whereas COURSES displays much more useful information, such as price, course length, estimated workload, course language, and instructor information. COURSES also has more facet filters, such as course language and instructor, which are very important for international users who want to take courses in their own language and for users with a strong preference for certain instructors.

Problem Definition

The challenge I solved was to create an online course search engine that draws from multiple sources and helps users search more efficiently using different facets. The input is the user's search query plus any filters they want to apply; the expected output is a list of search results sorted by relevance. Building this search engine involved four sub-challenges/stages, which I enumerate below.

1. Data Crawling & Parsing
Because I needed to aggregate data from various sources, the first problem was how best to crawl and parse it. Each website has its own data format, which may differ greatly from the others. For example, courses from Coursera [4] have no price field, since all of them are free, whereas courses from Udacity [5] do have prices listed.

2. Data Processing & Consolidation
With all the data correctly parsed, another problem immediately emerged. Since I was building a single search engine that handles data from numerous different websites, I needed to design a data structure that could easily hold and consolidate all of it.
I needed to decide what key attributes a course should have so that the same structure could apply to every website COURSES gets data from.

3. Data Formatting & Outputting
After consolidating the data, it was time to feed it into Apache Solr. To do this, I needed a way to easily convert the raw data into a standard format that Apache Solr can read, such as XML or JSON files.

4. User Interface Design & Implementation
The last step, after inserting all the data into Apache Solr, was to design and implement a user interface that is sufficiently informative and displays all the important data users need in an aesthetically pleasing way.

Methods

Following the problems described in the previous section, I present my solutions in detail here. COURSES is based on Apache Solr, a great framework for building vertical search engines, although personally I found Solr not that easy to use, since it does not have much detailed introductory documentation. To get the data I wanted, I wrote my own parsers, based on the crawler and parser from Assignment 3, and modified them to generate the XML data files that Solr can read. For the front-end user interface, I needed to make sure the data is displayed correctly and the interface is pleasant to the eye.

1. Data Crawling & Parsing
As mentioned above, since I needed to get data from many kinds of MOOC websites, I designed a site-specific parser for each website so that I could extract all the useful data correctly.
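Each site-specific parser boils down to running DOM lookups against the course page and falling back to a default when a field is missing. The following is a minimal sketch of that pattern, not the actual parser code; the stub document object and the chosen defaults are assumptions for illustration:

```javascript
// Sketch of the per-field extraction pattern used by the site-specific parsers.
// Each field is read from the page DOM; if the lookup fails, a default is used.
// The stub object below only simulates a browser `document` for illustration.
function extractField(doc, getter, fallback) {
  try {
    return getter(doc).trim();
  } catch (err) {
    return fallback; // field missing on this page
  }
}

// Simulated page: a Coursera-style title "Course Name | Coursera"
var fakeDocument = { title: "Machine Learning | Coursera" };

var title = extractField(fakeDocument, function (d) { return d.title; }, "Undefined");
var website = extractField(fakeDocument, function (d) {
  return d.title.substring(d.title.indexOf("|") + 2);
}, "Undefined");
var price = extractField(fakeDocument, function (d) {
  return d.price.innerText; // no price element on Coursera pages, so this throws
}, "FREE");
// website === "Coursera", price === "FREE"
```

The same try/catch-with-default idea is what lets one code shape cover five sites whose pages expose different subsets of the course attributes.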
By reading deeply into each website with the browser inspection tool, I came up with the following tables of commands, which correctly parse the data from each website ("iterate i" means the command is evaluated for each index i):

Coursera:
  Title:             document.title
  Website:           document.title.substring(document.title.indexOf("|")+2)
  Length:            document.body.getElementsByClassName("iconcalendar")[0].parentNode.childNodes[1].innerText
  Workload:          document.body.getElementsByClassName("icontime")[0].parentNode.childNodes[1].innerText + document.body.getElementsByClassName("icontime")[0].parentNode.childNodes[2].innerText
  Language:          document.body.getElementsByClassName("iconglobe")[0].parentNode.childNodes[1].innerText
  Instructor:        document.body.getElementsByClassName("coursera-course2-instructorsprofile")[i].childNodes[2].childNodes[0].getElementsByTagName("span")[0].innerText (iterate i)
  Instructor intro:  document.body.getElementsByClassName("coursera-course2-instructorsprofile")[i].childNodes[2].childNodes[1].getElementsByTagName("span")[0].innerText (iterate i)
  Course categories: document.body.getElementsByClassName("coursera-coursecategories")[0].getElementsByTagName("a")[i].innerText (iterate i)
  Course intro:      document.body.getElementsByClassName("span6")[0].innerText
  Course body:       document.body.getElementsByClassName("span7")[0].innerText

Edx:
  Title:             document.title
  Website:           document.title.substring(document.title.indexOf("|")+2)
  Length:            document.body.getElementsByClassName("course-detaillength")[0].innerText.substring("Course Length: ".length)
  Workload:          document.body.getElementsByClassName("course-detaileffort")[0].innerText.substring("Estimated effort: ".length)
  Instructor:        document.body.getElementsByClassName("stafflist")[0].getElementsByTagName("li")[i].childNodes[3].childNodes[1].innerText (iterate i)
  Instructor intro:  document.body.getElementsByClassName("stafflist")[0].getElementsByTagName("li")[i].childNodes[3].childNodes[3].innerText (iterate i)
  Course intro:      document.body.getElementsByClassName("course-detail-subtitle copylead")[0].innerText
  Course body:       document.body.getElementsByClassName("course-section course-detailabout")[0].innerText + document.body.getElementsByClassName("view-display-id-errata")[0].innerText (second part might not exist)

Khan:
  Title:             document.title
  Website:           document.title.substring(document.title.indexOf("|")+2)
  Course intro:      document.body.getElementsByClassName("topic-desc")[0].innerText
  Course body:       document.getElementById("page-container-inner").innerText

Udacity:
  Title:             document.title
  Website:           document.title.substring(document.title.indexOf("|")+2)
  Price:             document.body.getElementsByClassName("price-information")[0].innerText (if it contains "null", the course is free)
  Length:            document.body.getElementsByClassName("durationinformation")[0].getElementsByClassName("col-md10")[0].getElementsByTagName("strong")[0].innerText.substring("Approx. ".length)
  Workload:          document.body.getElementsByClassName("durationinformation")[0].getElementsByClassName("col-md10")[0].getElementsByTagName("small")[0].getElementsByTagName("p")[0].innerText.substring("Assumes ".length)
  Instructor:        document.body.getElementsByClassName("row row-gap-medium instructorinformation-entry")[i].childNodes[2j-1].childNodes[1].getElementsByTagName("h3")[0].innerText (iterate i; j in {1, 2})
  Instructor intro:  document.body.getElementsByClassName("row row-gap-medium instructorinformation-entry")[i].childNodes[2j-1].childNodes[3].getElementsByTagName("p")[0].innerText (iterate i; j in {1, 2})
  Course intro:      document.body.getElementsByClassName("col-md-8 col-md-offset2")[1].getElementsByClassName("pretty-format")[0].innerText
  Course body:       document.body.getElementsByClassName("col-md-8 col-md-offset2")[i].innerText (iterate i)

Udemy:
  Title:             document.title
  Website:           document.title.substring(document.title.indexOf("|")+2)
  Price:             document.body.getElementsByClassName("pb-p")[0].getElementsByClassName("pbpr")[0].innerText
  Length:            document.body.getElementsByClassName("wi")[0].getElementsByClassName("wili")[1].innerText.replace(" of high quality content", "")
  Instructor:        document.body.getElementsByClassName("tbli")[i].childNodes[1].getElementsByClassName("tbr")[0].getElementsByTagName("a")[0].innerText (iterate i)
  Instructor intro:  document.body.getElementsByClassName("tbli")[i].childNodes[3].getElementsByTagName("p")[0].innerText (iterate i)
  Course intro:      document.body.getElementsByClassName("ci-d")[0].innerText
  Course body:       document.body.getElementsByClassName("mc")[0].innerText

2. Data Processing & Consolidation

With the above data parsed, I next designed a structure that can hold all attributes of a course and fit data from every website. The following table shows the result ("Given" means the value is supplied as input, "Parsed" means it is extracted by the parser, and "DEFAULT" is the value used when the site does not provide the attribute):

Attribute          Coursera         Edx                Khan               Udacity            Udemy
URL                Given            Given              Given              Given              Given
Title              Parsed           Parsed             Parsed             Parsed             Parsed
Website            Parsed           Parsed             Parsed             Parsed             Parsed
Price              DEFAULT: FREE    DEFAULT: FREE      DEFAULT: FREE      Parsed             Parsed
Length             Parsed           Parsed             DEFAULT: Undefined Parsed             Parsed
Workload           Parsed           Parsed             DEFAULT: Undefined Parsed             DEFAULT: Undefined
Language           Parsed           DEFAULT: Undefined DEFAULT: Undefined DEFAULT: Undefined DEFAULT: Undefined
Instructor         Parsed           Parsed             DEFAULT: Undefined Parsed             Parsed
Instructor intro   Parsed           Parsed             DEFAULT: Undefined Parsed             Parsed
Course categories  Parsed           DEFAULT: Undefined DEFAULT: Undefined DEFAULT: Undefined DEFAULT: Undefined
Course intro       Parsed           Parsed             Parsed             Parsed             Parsed
Course body        Parsed           Parsed             Parsed             Parsed             Parsed

3. Data Formatting & Outputting

Apache Solr has its own rules for data files. I chose to use its XML format. With the consolidated data above, I was able to create the XML data files (one file per website). After trying several options, I decided to output the data while reading it in and processing it.
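The consolidation behavior described in the table above can be sketched as a simple defaulting step. The helper below is illustrative, not the actual implementation; the field names and default values follow the table:

```javascript
// Sketch of consolidating a parsed course into the unified structure.
// Any attribute a site does not provide falls back to its default,
// mirroring the DEFAULT: FREE / DEFAULT: Undefined cells in the table.
var DEFAULTS = {
  price: "FREE",
  length: "Undefined",
  workload: "Undefined",
  language: "Undefined",
  instructor: "Undefined",
  instructor_intro: "Undefined",
  course_categories: "Undefined"
};

function consolidate(parsed) {
  var course = {};
  for (var key in DEFAULTS) {
    course[key] = parsed[key] !== undefined ? parsed[key] : DEFAULTS[key];
  }
  // URL, title, website, course intro, and course body exist for every site.
  course.url = parsed.url;
  course.title = parsed.title;
  course.website = parsed.website;
  course.course_intro = parsed.course_intro;
  course.course_body = parsed.course_body;
  return course;
}

// Example: a Khan Academy course, which only yields title/website/intro/body,
// so price becomes "FREE" and the remaining attributes become "Undefined".
var khan = consolidate({
  url: "https://www.khanacademy.org/math/algebra",
  title: "Algebra",
  website: "Khan Academy",
  course_intro: "Intro text",
  course_body: "Body text"
});
```

One structure for all five sites is what makes the later XML output step uniform: every course record has the same fields, whether parsed or defaulted.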
The following code snippet is excerpted from one of the parsers:

var length;
try {
    length = "<field name=\"course_length\">" +
        document.body.getElementsByClassName("iconcalendar")[0]
            .parentNode.childNodes[1].innerText.trim()
            .replace(/&/g, '&amp;')
            .replace(/</g, '&lt;')
            .replace(/>/g, '&gt;')
            .replace(/"/g, '&quot;')
            .replace(/'/g, '&apos;') +
        "</field>\n\t";
} catch (err) {
    length = "<field name=\"course_length\">Undefined</field>\n\t";
}

The above code snippet shows the method of getting parsed data, processing it, and outputting it in the correct XML format. The data is obtained using the command shown in the tables in the "Data Crawling & Parsing" section above. COURSES trims unnecessary characters, replaces special characters such as "<" with their entity references (here, "&lt;"), since those characters cannot appear inside an XML field, and puts the processed string into the correct XML field. If the data does not exist (i.e., an error occurred while executing the command), the field is set to its default value.

4. User Interface Design & Implementation

Apache Solr ships with a default user interface based on the Velocity Search UI. I went into the "./config/velocity" folder to modify all the ".vm" files, which was a very involved job. The page is laid out in "layout.vm", which then includes other files such as "header.vm", "tabs.vm", "content.vm", and "footer.vm", as well as the CSS file "main.css". I commented out the unused parts and focused on "content.vm". After trying countless combinations, I arrived at the main user interface, which appears as follows:

[Screenshot: COURSES main user interface]

From the above screenshot, we can see COURSES' logo along with a simple search field and a submit button. On the left side of the page there are many facet filters. What is great about these filters is that they put all of the search results into different categories and display them concisely to users.
Users can then easily filter out unwanted results and find what they want more quickly and efficiently. The search results are displayed on the right side, showing the attributes of the courses found, such as price, course length, estimated workload, course categories, instructors, and a brief introduction of each course.

Evaluation/Sample Results

1. Experiment & Sample Results

After completing the above stages, I tested the system with some sample data. Since the full data set contains around 4,000 URLs, testing on all of it is very time-consuming, so I extracted 10 URLs from each website and ran the search engine on that data. I then entered queries such as "teaching", "machine learning", and "java" to see what was displayed. To evaluate the search engine, we need to answer the following two questions:

1. Are the search results correctly displayed? That is, if a search returns results, we need to be able to find the following information:
   a. Total number of results
   b. Search time
   c. Search results with the following information:
      i. Course title, which is a link to the course webpage
      ii. Course website's logo
      iii. Course URL
      iv. Price
      v. Length
      vi. Workload
      vii. Language
      viii. Category
      ix. Instructor
      x. Instructor intro
      xi. Course intro
      xii. Course body

2. Do the faceted filters on the left side work? When we click on a filter, does it appear and does it narrow down the search results?

The following shows the results after searching for "machine learning":

[Screenshot: search results for "machine learning"]

COURSES found 2 results in 31 milliseconds, one course from Edx [8] and the other from Udacity [5]. We can also see that the course title, URL, price, length, workload, language, category, instructor, and other information about each course are all correctly displayed (with default field values such as "Free" and "Undefined" where applicable).
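Under the hood, each facet click adds a filter query (fq) parameter to the Solr select request alongside the original query. The following is a minimal sketch of how such a request URL could be assembled; the endpoint, core name, and field names are assumptions for illustration:

```javascript
// Sketch: building a Solr select URL from a query plus applied facet filters.
// The base URL, core name ("courses"), and field names are illustrative.
function buildSolrQuery(baseUrl, query, filters) {
  var params = ["q=" + encodeURIComponent(query), "facet=true"];
  for (var field in filters) {
    // Each applied facet becomes a filter query (fq) parameter,
    // and the field is also requested as a facet field for display.
    params.push("fq=" + encodeURIComponent(field + ':"' + filters[field] + '"'));
    params.push("facet.field=" + encodeURIComponent(field));
  }
  return baseUrl + "/select?" + params.join("&");
}

var url = buildSolrQuery("http://localhost:8983/solr/courses",
                         "machine learning",
                         { course_instructor: "Dave Holtz" });
// url contains q=machine%20learning and
// fq=course_instructor%3A%22Dave%20Holtz%22
```

Because each filter is a separate fq parameter, Solr intersects it with the main query, which is exactly the narrowing behavior the facet panel exposes.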
I then applied faceted filters on the left, such as the following:

[Screenshot: search results with facet filters applied]

The filters "Dave Holtz" and "$150/month" were successfully applied, and the search results were correctly reduced to one result.

2. User Opinion

I demoed the project to one of our TAs. He suggested that I further improve the user interface of the search engine. Another issue he mentioned is that the highlighting within the search results does not seem to work well.

3. More Improvements

1. Implement the keyword highlighting feature
2. Tweak the user interface

Conclusions

In creating COURSES, a single-source search engine for online courses, I learned how to parse and collect data from multiple sources, how to design data structures to best represent and present that data, and how to design a better UI. COURSES has potential, since it greatly simplifies the process of searching for a suitable course by combining course sources and providing filter features for more accurate search.

Future Work

In the future, I will try to implement the entire search engine without Solr, which will give me more control over data processing, result ranking, and so on. Without Solr, I can also freely adjust user interface elements as needed. Another important improvement would be URL auto-discovery and retrieval. There are probably hundreds or thousands of new courses created every day; being able to monitor them and automatically pick up updates from their URLs is very important for this kind of MOOC search engine.

References

[1] RedHoop. https://www.redhoop.com
[2] MOOC-List. https://www.mooc-list.com
[3] Class-Central. https://www.class-central.com
[4] Coursera. https://www.coursera.org
[5] Udacity. https://www.udacity.com
[6] Khan Academy. https://www.khanacademy.org
[7] Udemy. https://www.udemy.com
[8] Edx. https://www.edx.org
[9] Sanjeev Mishra. "HowTo? Solr: building a custom search engine".
http://pixedin.blogspot.com/2012/05/howto-solr-building-custom-search.html
[10] Apache Solr Reference Guide. https://cwiki.apache.org/confluence/display/solr/Velocity+Search+UI
[11] Solr Tutorial. https://lucene.apache.org/solr/4_8_0/tutorial.html