NETS 212: Scalable & Cloud Computing
Fall 2014
Assignment 4: SocialRank via Spark
Due November 13th at 10:00pm EST
Your fourth homework assignment involves revisiting SocialRank using a different platform, Apache Spark.
Spark models state as resilient distributed datasets (RDDs), which appear in Java as variables, e.g., of type
JavaRDD (non-keyed values) or JavaPairRDD (keyed values). You manipulate these RDDs by calling
methods on them; most methods take a callback function (not so different from Node.js). This basic
programming model is also shared by a variety of other platforms, such as Apache Crunch (an open-source version of Google's FlumeJava).
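As a rough, hypothetical illustration (not part of the HW4 skeleton), here is what manipulating RDDs through callbacks looks like in the Java API. The file name and variable names are made up, and Java 8 lambdas are shown for brevity; on older Java you would pass anonymous Function/PairFunction instances instead.

    import scala.Tuple2;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RDDExample {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("rdd-example"));

            // A non-keyed RDD (JavaRDD) of text lines such as "1 2" (source destination).
            JavaRDD<String> lines = sc.textFile("hdfs:///example/edges.txt");

            // Turn it into a keyed RDD (JavaPairRDD) by passing a callback to mapToPair.
            JavaPairRDD<Integer, Integer> edges = lines.mapToPair(line -> {
                String[] parts = line.split("\\s+");
                return new Tuple2<Integer, Integer>(
                    Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
            });

            // Another pair of callbacks: count the out-edges of each source node.
            JavaPairRDD<Integer, Integer> outDegree =
                edges.mapValues(dest -> 1)
                     .reduceByKey((a, b) -> a + b);

            System.out.println(outDegree.collect());
            sc.stop();
        }
    }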
The SocialRank Algorithm in Spark
Recall from the previous assignment that we had to implement iteration as a series of map-reduce
steps, and that each time we had to carry state from one iteration to the next by embedding it in the
data itself. In Spark, this is simpler.
The Spark equivalent of SocialRank can work as follows:
1. Initialize an RDD that contains all of the edges.
2. Initialize an RDD that contains all of the nodes plus their SocialRanks (initialized to 1 or whatever).
3. Iterate n times:
   a. Create an RDD with all of the weight propagations from nodes to their neighbors.
   b. Recompute the ranks-RDD by summing up the incoming weights (plus the decay factor).
This can be described in essentially a sequential fashion, even though it involves parallel stages.
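To make this concrete, below is a minimal sketch of how steps 1-3 might be expressed with Spark's Java API. The method and variable names (socialRank, edges, ranks, outDegree, contribs) are assumptions for illustration; the provided skeleton organizes things differently, and this version still makes the simplifying assumptions listed under "Structure of code" below (every node has an out-edge, ranks start at 1.0, no decay factor).

    import scala.Tuple2;
    import org.apache.spark.api.java.JavaPairRDD;

    public class SocialRankSketch {

        // edges holds (source, destination) pairs parsed from the input file.
        static JavaPairRDD<Integer, Double> socialRank(
                JavaPairRDD<Integer, Integer> edges, int numIterations) {

            // Step 2: every source node starts with rank 1.0.
            JavaPairRDD<Integer, Double> ranks =
                edges.keys().distinct()
                     .mapToPair(n -> new Tuple2<Integer, Double>(n, 1.0));

            // Out-degree per node, so a node's rank can be split evenly over its neighbors.
            JavaPairRDD<Integer, Integer> outDegree =
                edges.mapValues(dest -> 1).reduceByKey((a, b) -> a + b);

            // Step 3: iterate n times.
            for (int i = 0; i < numIterations; i++) {
                // Step 3a: each node sends rank / out-degree along each of its out-edges.
                JavaPairRDD<Integer, Double> contribs =
                    edges.join(ranks.join(outDegree))        // (src, (dest, (rank, deg)))
                         .mapToPair(t -> new Tuple2<Integer, Double>(
                             t._2()._1(),
                             t._2()._2()._1() / t._2()._2()._2()));

                // Step 3b: new rank = sum of incoming weights (no decay factor here).
                ranks = contribs.reduceByKey((a, b) -> a + b);
            }
            return ranks;
        }
    }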
Tasks
We have given you the broad skeleton of the SocialRank algorithm (check out
home1/n/nets212/nets212/HW4 from the svn repo). Take a look at the class
edu.upenn.cis.nets212.hw4.SocialRank, which operates as we sketched above.
First let’s actually run things. You’ll need to go to Terminal and get Spark running (over HDFS and Hadoop
MapReduce/YARN) as follows.
1. [Each time you reboot] Make sure HDFS and YARN are running (recall how you ran start-dfs.sh and
   start-yarn.sh from HW2).
2. Run (one time) sh ~/workspace/HW4/setup-spark.sh, which should make some directories and
   copy some files to ensure Spark is able to run on your VM. You'll need to type in your nets212
   administrator password.
3. If necessary, copy the sample graph data from HW3 (we've re-included it in the project, as
   data/input and specifically data/input/part-0000) to HDFS. (In general, you should still have this
   in HDFS from HW3.)
4. [Each time you reboot] Run source ~/workspace/HW4/set-classpath.sh
5. [Each time you reboot] Launch Spark via /opt/spark/sbin/start-master.sh
Now go back to Eclipse. Right-click on your project, hit Export, and make a JAR file. You can call it
something like hw4.jar. Save the jar into the workspace directory. Make sure “Export generated class
files and resources” is selected. Don’t worry if it says “jar finished with warnings.”
Run your program by first changing to the workspace directory, then:
spark-submit --class edu.upenn.cis.nets212.hw4.SocialRank hw4.jar input 10
where hw4.jar should be replaced with the name of your JAR, input is the name of your sample
input data folder in HDFS, and 10 is the number of SocialRank iterations.
If you look in your HDFS directory you should see the final output and the initial-ranks for the nodes.
You can use hdfs dfs -get or hdfs dfs -copyToLocal to copy them to your hard drive for inspection. Or
you can run hdfs dfs -cat with the filename to see it on the screen.
Structure of code
The basic code provided computes SocialRank under the following assumptions:
1. Every node is assumed to have at least one out-edge.
2. We should initialize the rank of every node to 1.0.
3. We are not using the decay function.
4. There is no existing output file in the way!
Tasks
Your tasks will require you to look at some of the functions you can apply on RDDs. (You can find these in
detail at
https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/api/java/JavaPairRDD.html).
Let's start by removing the 1st assumption from above (out-edges are in fact not always present! You'll
see there's no initial rank value for Node 4), and also rescale the initial rank value so it's a probability.
(A rough sketch illustrating these ideas appears after the list.)
1. Think about how you can create the ranks RDD in a way that ensures every node shows up if it is
   connected. Pay special attention to the union operator, which enables you to merge two RDDs,
   and (hint) also think about how you can transpose source and destination nodes.
2. Think about how you can scale the initial ranks to 1/n as opposed to 1.0, where n is the number of
   nodes. Do keep in mind that there are a variety of functions for RDDs like aggregateByKey,
   count, etc.
3. Find the function doing the SocialRank computation, and revise it to incorporate the decay factor,
   with value 0.15 for the random-jump and 0.85 for out-link traversal.
4. Remove the initial-ranks and output files if they exist.
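As a hint-level sketch only (not a complete solution), here is roughly how these pieces might look, reusing the hypothetical edges and contribs names from the earlier sketch and one common formulation of the decay factor:

    // Task 1: a node with no out-edges only ever appears as a destination, so union
    // the sources with the transposed side (the destinations) before taking distinct().
    JavaRDD<Integer> nodes =
        edges.keys().union(edges.values()).distinct();

    // Task 2: count the nodes once, and start every rank at 1/n instead of 1.0.
    final double n = nodes.count();
    JavaPairRDD<Integer, Double> ranks =
        nodes.mapToPair(node -> new Tuple2<Integer, Double>(node, 1.0 / n));

    // Task 3: inside the iteration, fold the decay factor into the rank update,
    // e.g. rank(v) = 0.15 * (1/n) + 0.85 * (sum of incoming weights).
    ranks = contribs.reduceByKey((a, b) -> a + b)
                    .mapValues(sum -> 0.15 / n + 0.85 * sum);

    // Task 4: before writing, delete any pre-existing output with the Hadoop API, e.g.
    // FileSystem.get(new Configuration()).delete(new Path("output"), true);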
Submitting Your Work
Submit via Canvas the main edu.upenn.cis.nets212.hw4.SocialRank source code. Verify that:
- Your code contains a reasonable amount of useful documentation (required for style points).
- You have checked your final code into your SVN repository.
Extra Credit
For extra credit, get Spark running on Amazon Elastic MapReduce following the guide at:
https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923
Run it over the LiveJournal data you used for HW3, using the same cluster size as in HW3. Create a file
called extra-credit.txt explaining how you did this, and showing the full running times for Spark. Which
was faster, Hadoop from HW3 or Spark from HW4?