COURSES - MOOC Search Engine

COURSES
A MOOC Search Engine
Ruilin Xu
[email protected]
Abstract
With hundreds of thousands of massive open online courses (MOOCs) on the internet and more appearing
every day, it is cumbersome for users to find the specific courses they want. People usually need to
repeat the same search on multiple MOOC websites, such as Coursera [4] and Udacity [5], to find what
they are looking for. COURSES was developed to address this problem. COURSES is a MOOC search engine
that gathers around 4,000 courses from major MOOC websites: Coursera [4], Udacity [5], Khan Academy [6],
Udemy [7], and Edx [8]. COURSES also implements faceted search, in which all search results are grouped
by facets such as course price, length, workload, instructors, and course category. By combining a search
query with multiple filters on these facets, users can target specific courses efficiently.
Introduction
Nowadays, people love to enroll in massive open online courses (MOOCs) because of their convenience. As
MOOCs become more popular, the number of MOOCs increases dramatically, and numerous MOOC websites have
emerged as a result. Unfortunately, as the number of MOOC websites grows, it becomes more difficult for
users to find the courses they want. They often need to repeat their searches on different MOOC
websites, trying to find the perfect result. This is both cumbersome and inefficient. Is there a way to
search only once and easily find the courses we want from all the MOOC websites?
With that question in mind, and inspired by Assignment 3, I found that the idea was feasible. After
discussing it with the professor, I decided to create COURSES, a vertical MOOC search engine powered by
Apache Solr, to solve the problem and to apply what I learned in class to a real-world task. This search
engine greatly simplifies the search for suitable online courses because it combines courses from five
sites (Coursera [4], Udacity [5], Khan Academy [6], Udemy [7], and Edx [8]), eliminating the need for
users to hop from site to site. Users can also filter the results based on what they are looking for.
There are similar websites such as RedHoop [1], MOOC-List [2], and Class-Central [3], but they all have
downsides. They either do not incorporate as much data as COURSES or do not display as much information
about the courses as the user needs: they display only the title of the course and a brief course
introduction. The user cannot see other attributes of the course, such as its instructors, its price,
or its length.
COURSES addresses these shortcomings. It has a large database of around 4,000 courses. It offers many
faceted filters that let users narrow down search results efficiently. It also displays all of the
useful attributes of each course within the search results, so that users can see everything in one
place, which can save them a significant amount of time in deciding which courses to take.
Related Work
COURSES is inspired by Assignment 3 from our CS 410 course. Assignment 3 is a short tutorial on building
a simple search engine from a few data sources. COURSES is similar to Assignment 3 in that it uses a
similar technique for parsing data: both use a combination of Ruby and JavaScript files. The difference
is that COURSES uses JavaScript parsers that are much more complicated than the one used in
Assignment 3. The parsers in COURSES are website-specific, meaning that each one parses the data of a
particular MOOC website. After parsing, the data is converted to XML files, which are compatible with
Apache Solr.
There is another well-implemented MOOC search engine called RedHoop [1]. RedHoop [1] and COURSES are
very similar in that both are multi-site search engines, meaning that both get course data from various
MOOC websites. In addition, both support faceted search. However, COURSES improves greatly on the
display of search results. RedHoop [1] displays only the course title with a brief introduction of the
course, whereas COURSES displays much more useful information, such as price, course length, estimated
workload, course language, and instructor information. COURSES also has more facet filters, such as
course language and instructors, which are very important for international users who want to take
courses in their own language and for users who have a strong preference for certain instructors.
Problem Definition
The challenge that I solved was to create an online search engine for courses that draws from multiple
sources and helps the user search more efficiently using different facets. The input is the user's
search query and any filters the user wants to apply. The expected output is the list of search
results, sorted by relevance.
Building this search engine has four sub-challenges/stages, which I will now enumerate.
1. Data Crawling & Parsing
Because I needed to aggregate data from various sources, the first problem I faced was how best to
obtain and parse the data. Each website has its own data format, which may be very different from the
others. For example, courses from Coursera [4] don’t have a price field, since all of them are free,
whereas courses from Udacity [5] do have prices listed.
2. Data Processing & Consolidation
With all the data correctly parsed, another problem immediately emerged. Since I was building a single
search engine that handles data from several different websites, I needed to design a good data
structure that could easily take in and consolidate all the data. I needed to decide which key
attributes a course should have, so that I could apply this structure to all the websites COURSES gets
data from.
3. Data Formatting & Outputting
After consolidating the data, the next step was to load the data into Apache Solr. To do this, I needed
a way to easily convert the raw data into a standard format that Apache Solr can read, such as XML or
JSON files.
4. User Interface Design & Implementation
The last step, after inserting all the data into Apache Solr, was to design and implement a great user
interface which is sufficiently informative and correctly displays all the important data users need in an
aesthetically pleasing way.
Methods
This section describes my solution to each of the problems listed above. COURSES is built on Apache
Solr, a strong framework for building vertical search engines, although I personally found Solr hard to
get started with because it lacks detailed introductory documentation. To get the data I wanted, I
needed to write my own parser, which I based on the crawler and parser from Assignment 3. I also
modified it so that it generates XML data files that Solr can read. For the front-end user interface, I
needed to make sure that the data is displayed correctly and that the interface is pleasant to the eye.
1. Data Crawling & Parsing
As mentioned above, since I needed to get data from several kinds of MOOC websites, I designed a
website-specific parser for each website, so that I could extract all the useful data correctly.
By examining each website closely with the browser's inspection tool, I came up with the following
tables of commands, which correctly parse the data from each website:
Coursera:
Title: document.title
Website: document.title.substring(document.title.indexOf("|")+2)
Length: document.body.getElementsByClassName("iconcalendar")[0].parentNode.childNodes[1].innerText
Workload: document.body.getElementsByClassName("icontime")[0].parentNode.childNodes[1].innerText + document.body.getElementsByClassName("icontime")[0].parentNode.childNodes[2].innerText
Language: document.body.getElementsByClassName("iconglobe")[0].parentNode.childNodes[1].innerText
Instructor: document.body.getElementsByClassName("coursera-course2-instructorsprofile")[i].childNodes[2].childNodes[0].getElementsByTagName("span")[0].innerText (iterate over i)
Instructor intro: document.body.getElementsByClassName("coursera-course2-instructorsprofile")[i].childNodes[2].childNodes[1].getElementsByTagName("span")[0].innerText (iterate over i)
Course categories: document.body.getElementsByClassName("coursera-coursecategories")[0].getElementsByTagName("a")[i].innerText (iterate over i)
Course intro: document.body.getElementsByClassName("span6")[0].innerText
Course body: document.body.getElementsByClassName("span7")[0].innerText
Edx:
Title: document.title
Website: document.title.substring(document.title.indexOf("|")+2)
Length: document.body.getElementsByClassName("course-detaillength")[0].innerText.substring("Course Length: ".length)
Workload: document.body.getElementsByClassName("course-detaileffort")[0].innerText.substring("Estimated effort: ".length)
Instructor: document.body.getElementsByClassName("stafflist")[0].getElementsByTagName("li")[i].childNodes[3].childNodes[1].innerText (iterate over i)
Instructor intro: document.body.getElementsByClassName("stafflist")[0].getElementsByTagName("li")[i].childNodes[3].childNodes[3].innerText (iterate over i)
Course intro: document.body.getElementsByClassName("course-detail-subtitle copylead")[0].innerText
Course body: document.body.getElementsByClassName("course-section course-detailabout")[0].innerText + document.body.getElementsByClassName("view -display-iderrata")[0].innerText (the second part might not exist)
Khan:
Title: document.title
Website: document.title.substring(document.title.indexOf("|")+2)
Course intro: document.body.getElementsByClassName("topic-desc")[0].innerText
Course body: document.getElementById("page-container-inner").innerText
Udacity:
Title: document.title
Website: document.title.substring(document.title.indexOf("|")+2)
Price: document.body.getElementsByClassName("price-information")[0].innerText (if the text contains “null”, the course is free)
Length: document.body.getElementsByClassName("durationinformation")[0].getElementsByClassName("col-md10")[0].getElementsByTagName("strong")[0].innerText.substring("Approx. ".length)
Workload: document.body.getElementsByClassName("durationinformation")[0].getElementsByClassName("col-md10")[0].getElementsByTagName("small")[0].getElementsByTagName("p")[0].innerText.substring("Assumes ".length)
Instructor: document.body.getElementsByClassName("row row-gap-medium instructorinformation-entry")[i].childNodes[2j1].childNodes[1].getElementsByTagName("h3")[0].innerText (iterate over i and over j in (1, 2))
Instructor intro: document.body.getElementsByClassName("row row-gap-medium instructorinformation-entry")[i].childNodes[2j1].childNodes[3].getElementsByTagName("p")[0].innerText (iterate over i and over j in (1, 2))
Course intro: document.body.getElementsByClassName("col-md-8 col-md-offset2")[1].getElementsByClassName("pretty-format")[0].innerText
Course body: document.body.getElementsByClassName("col-md-8 col-md-offset2")[i].innerText (iterate over i)
Udemy:
Title: document.title
Website: document.title.substring(document.title.indexOf("|")+2)
Price: document.body.getElementsByClassName("pb-p")[0].getElementsByClassName("pbpr")[0].innerText
Length: document.body.getElementsByClassName("wi")[0].getElementsByClassName("wili")[1].innerText.replace(" of high quality content", "")
Instructor: document.body.getElementsByClassName("tbli")[i].childNodes[1].getElementsByClassName("tbr")[0].getElementsByTagName("a")[0].innerText (iterate over i)
Instructor intro: document.body.getElementsByClassName("tbli")[i].childNodes[3].getElementsByTagName("p")[0].innerText (iterate over i)
Course intro: document.body.getElementsByClassName("ci-d")[0].innerText
Course body: document.body.getElementsByClassName("mc")[0].innerText
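The per-site commands listed above can be wrapped in one small extractor per website. The following is a minimal sketch under assumptions, not the project's actual parser: the helper names websiteFromTitle, safe, collect, and parseKhan are hypothetical, and only the Khan commands are shown. It guards every command so that a missing element yields a default value instead of an exception.

```javascript
// Extract the website name from a title of the form "Course | Site",
// mirroring document.title.substring(document.title.indexOf("|") + 2).
function websiteFromTitle(title) {
  return title.substring(title.indexOf("|") + 2);
}

// Run one extraction command; return the fallback when the element is
// missing (the command throws) or the result is not a non-empty string.
function safe(command, fallback = "Undefined") {
  try {
    const value = command();
    return typeof value === "string" && value.trim() !== "" ? value.trim() : fallback;
  } catch (err) {
    return fallback;
  }
}

// "Iterate over i": collect one value per matching element, skipping failures.
function collect(elements, command) {
  const values = [];
  for (let i = 0; i < elements.length; i++) {
    const value = safe(() => command(elements[i]), null);
    if (value !== null) values.push(value);
  }
  return values;
}

// Hypothetical extractor for Khan pages, using the Khan commands listed above.
// `doc` is a browser document (or any object with the same shape).
function parseKhan(doc) {
  return {
    title: safe(() => doc.title),
    website: safe(() => websiteFromTitle(doc.title)),
    courseIntro: safe(() => doc.body.getElementsByClassName("topic-desc")[0].innerText),
    courseBody: safe(() => doc.getElementById("page-container-inner").innerText),
  };
}
```

Note that when a title contains no "|", indexOf returns -1 and the website command returns the title minus its first character, so a real parser may want to check for that case.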
2. Data Processing & Consolidation
With the above data parsed, I next designed a structure which can hold all attributes of a course and fit
data from all websites. The following table is the result:
Attribute           Coursera            Edx                 Khan                Udacity             Udemy
URL                 Given               Given               Given               Given               Given
Title               Parsed              Parsed              Parsed              Parsed              Parsed
Website             Parsed              Parsed              Parsed              Parsed              Parsed
Price               DEFAULT: FREE       DEFAULT: FREE       DEFAULT: FREE       Parsed              Parsed
Length              Parsed              Parsed              DEFAULT: Undefined  Parsed              Parsed
Workload            Parsed              Parsed              DEFAULT: Undefined  Parsed              DEFAULT: Undefined
Language            Parsed              DEFAULT: Undefined  DEFAULT: Undefined  DEFAULT: Undefined  DEFAULT: Undefined
Instructor          Parsed              Parsed              DEFAULT: Undefined  Parsed              Parsed
Instructor intro    Parsed              Parsed              DEFAULT: Undefined  Parsed              Parsed
Course categories   Parsed              DEFAULT: Undefined  DEFAULT: Undefined  DEFAULT: Undefined  DEFAULT: Undefined
Course intro        Parsed              Parsed              Parsed              Parsed              Parsed
Course body         Parsed              Parsed              Parsed              Parsed              Parsed
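The consolidation step can be sketched as a function that maps whatever a site-specific parser returns onto one fixed record. This is an illustration only: the attribute names, the consolidate function, and the FREE_BY_DEFAULT set are assumptions made for this sketch, while the default values “Free” and “Undefined” follow the defaults described in this section.

```javascript
// Attributes of the unified course record (names are illustrative).
const COURSE_ATTRIBUTES = [
  "url", "title", "website", "price", "length", "workload", "language",
  "instructor", "instructorIntro", "categories", "courseIntro", "courseBody",
];

// Sites whose courses default to free when no price is parsed.
const FREE_BY_DEFAULT = new Set(["Coursera", "Edx", "Khan"]);

// Build one consolidated record from the fields a site parser produced.
function consolidate(site, url, parsed) {
  const record = {};
  for (const attribute of COURSE_ATTRIBUTES) {
    const value = parsed[attribute];
    const missing = value === undefined || value === null || value === "";
    record[attribute] = missing ? "Undefined" : value;
  }
  record.url = url; // the URL is given to the parser, not parsed from the page
  if (record.price === "Undefined" && FREE_BY_DEFAULT.has(site)) {
    record.price = "Free";
  }
  return record;
}
```

Because every record has the same twelve attributes, the later formatting step does not need to know which website a course came from.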
3. Data Formatting & Outputting
Apache Solr defines its own formats for data files, and I chose its XML format. With the consolidated
data above, I was then able to create the XML data files (one file for each website). After trying
several options, I decided to write out each field while reading and processing the data. The following
code snippet is excerpted from one of the parsers:
var length;
try
{
    // Read the course length, trim it, and escape XML special characters.
    length = "<field name=\"course_length\">" +
        document.body.getElementsByClassName("iconcalendar")[0].parentNode.childNodes[1].innerText
            .trim()
            .replace(/&/g, '&amp;')
            .replace(/</g, '&lt;')
            .replace(/>/g, '&gt;')
            .replace(/"/g, '&quot;')
            .replace(/'/g, '&#39;') +
        "</field>\n\t";
}
catch (err)
{
    // The element does not exist on this page: fall back to the default value.
    length = "<field name=\"course_length\">Undefined</field>\n\t";
}
The above code snippet shows how COURSES gets the parsed data, processes it, and outputs it in the
correct XML format. The data is obtained by using the commands shown in the tables in the “Data
Crawling & Parsing” section. COURSES trims surrounding whitespace, replaces special characters such as
“<” with their entity references (“&lt;”), since those characters cannot appear unescaped inside an XML
field, and puts the processed string into the correct XML field. If the data does not exist (i.e., an
error occurred while executing the commands), the field is set to its default value.
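The escaping and field-building logic can be factored into reusable helpers. The sketch below is one possible arrangement, not the project's code: the function names escapeXml, toSolrField, and toSolrAddXml are hypothetical, and it assumes records are plain objects whose keys are Solr field names (only course_length is a field name taken from the snippet above). The output uses the <add>/<doc>/<field> layout of Solr's XML update format.

```javascript
// Replace the five XML special characters with entity references.
// "&" must be handled first so the other entities are not escaped twice.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Build one <field> element for a Solr document.
function toSolrField(name, value) {
  return `<field name="${escapeXml(name)}">${escapeXml(value)}</field>`;
}

// Wrap a list of records in Solr's <add><doc>...</doc></add> structure.
function toSolrAddXml(records) {
  const docs = records.map((record) => {
    const fields = Object.entries(record)
      .map(([name, value]) => "\t\t" + toSolrField(name, value))
      .join("\n");
    return `\t<doc>\n${fields}\n\t</doc>`;
  });
  return `<add>\n${docs.join("\n")}\n</add>\n`;
}
```

For example, toSolrField("course_length", "6 weeks") produces the same kind of field string that the excerpt above builds by hand.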
4. User Interface Design & Implementation
Apache Solr ships with a default user interface, the Velocity Search UI. I modified the “.vm” template
files in the “./config/velocity” folder, which was a complicated job. The page layout is defined in
“layout.vm”, which in turn calls other files such as “header.vm”, “tabs.vm”, “content.vm”, and
“footer.vm”, as well as the CSS file “main.css”. I commented out the unused cases and focused on
“content.vm”. After trying many combinations, I arrived at the main user interface, which appears as
follows:
In the above screenshot, we can see the COURSES logo along with a simple search field and a submit
button. On the left side of the web page, there are many facet filters. These filters group all of the
search results into categories and display the categories concisely, so users can easily filter out
unwanted results and find what they want more quickly. The search results are displayed on the right
side. Each result shows the attributes of the course found, such as price, course length, estimated
workload, course categories, instructors, and a brief introduction of the course.
Evaluation/Sample Results
1. Experiment & Sample Results
After completing the above stages, I was able to test the system with some sample data. Since the entire
data set contains around 4,000 URLs, testing on all of it is very time-consuming. Therefore, I
extracted 10 URLs from each website and ran the search engine on that data. I then entered queries such
as “teaching”, “machine learning”, and “java” in order to see what was displayed.
To evaluate the search engine, we need to answer the following two questions:
1. Are the search results correctly displayed? That is, if there exists a search result, we need to be
able to find the following information:
a. Total number of results
b. Search time
c. Search results with the following information:
i. Course title, which is a link to the course webpage
ii. Course webpage’s logo
iii. Course URL
iv. Price
v. Length
vi. Workload
vii. Language
viii. Category
ix. Instructor
x. Instructor intro
xi. Course intro
xii. Course body
2. Does the faceted filter on the left side work? When we click on any filter, does the filter show up
and does it narrow down the search results?
The following shows the results after searching for “machine learning”:
COURSES found 2 results in 31 milliseconds, as shown: one course is from Edx [8] and the other is from
Udacity [5]. We can also see that the course title, URL, price, course length, course workload, course
language, course category, instructor, and other information about each course are all correctly
displayed (with default field values such as “Free” and “Undefined” where data is missing).
I then applied faceted filters on the left, such as the following:
The filters, “Dave Holtz” and “$150/month,” were successfully applied and the search result was
correctly reduced to one result.
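Behind the interface, a faceted search like this one corresponds to a Solr request that combines the query (q), the facet parameters (facet, facet.field), and one filter query (fq) per selected filter. The following sketch only illustrates how such a request URL could be assembled; it is not taken from the project. The function name buildSolrQuery, the field names instructor and price, and the base URL are assumptions.

```javascript
// Assemble a Solr select URL with facets and filter queries.
// facetFields: field names to facet on; filters: { fieldName: selectedValue }.
function buildSolrQuery(baseUrl, query, facetFields, filters) {
  const params = new URLSearchParams();
  params.append("q", query);
  params.append("wt", "json");
  params.append("facet", "true");
  for (const field of facetFields) {
    params.append("facet.field", field);
  }
  for (const [field, value] of Object.entries(filters)) {
    // Quote the value so it is matched as a single phrase;
    // escape embedded quotes and backslashes first.
    const escaped = String(value).replace(/(["\\])/g, "\\$1");
    params.append("fq", `${field}:"${escaped}"`);
  }
  return `${baseUrl}/select?${params.toString()}`;
}

// Hypothetical usage: the query and the two filters from the example above.
const exampleUrl = buildSolrQuery(
  "http://localhost:8983/solr/collection1", // assumed local Solr core
  "machine learning",
  ["instructor", "price"],
  { instructor: "Dave Holtz", price: "$150/month" }
);
```

Each additional fq parameter further restricts the result set, which is why applying both filters reduces the two results to one.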
2. User Opinion
I demoed the project to one of our TAs. He suggested that I further improve the user interface of the
search engine. He also pointed out that keyword highlighting within the search results does not work
well.
3. More Improvements
1. Implement keyword highlighting feature
2. User interface tweaking
Conclusions
In creating COURSES, a single search engine for online courses from multiple websites, I learned how to
parse and collect data from multiple sources and how to design data structures that present the data
well. I also learned how to design a better UI through this experience. COURSES has potential, since it
greatly simplifies the search for a suitable course by combining course sources and by providing
filters for more precise searching.
Future Work
In the future, I will try to implement the entire search engine without Solr, which would give me more
control over data processing, result ranking, and so on. Without Solr, I could also freely adjust user
interface elements as needed. Another important improvement would be automatic URL discovery and
retrieval. There are probably hundreds or even thousands of new courses published every day. Being able
to monitor them and automatically pick up updates from their URLs is very important for this kind of
MOOC search engine.
References
[1] RedHoop. https://www.redhoop.com
[2] MOOC-List. www.mooc-list.com
[3] Class-Central. https://www.class-central.com
[4] Coursera. https://www.coursera.org
[5] Udacity. https://www.udacity.com
[6] Khan Academy. https://www.khanacademy.org
[7] Udemy. https://www.udemy.com
[8] Edx. https://www.edx.org
[9] Sanjeev Mishra. “HowTo? Solr: building a custom search engine”.
http://pixedin.blogspot.com/2012/05/howto-solr-building-custom-search.html
[10] Apache Solr Reference Guide. https://cwiki.apache.org/confluence/display/solr/Velocity+Search+UI
[11] Solr Tutorial. https://lucene.apache.org/solr/4_8_0/tutorial.html