GitHub - prabhuvashwin/PageRank-Algorithm-Implementation: Implementation of Google's PageRank algorithm using Java, Hadoop, and MapReduce

Page Rank Algorithm Implementation

This is a read me document for the assignment 3 and explains all the files involved.

The input files used for this dataset is the Wiki-Micro Corpus.

While running the program, two directories are generated - one is an output directory, OUT_DIR, which will host the final output, and the other is an intermediate directory, INT_DIR, which will host the outputs of WebPageCount, GraphLink, and PageRankComputation. The input is present in INPUT_DIR.

The program works in the following way.

PageRankDriver.java drives the program. It is responsible for calling the required class in the sequential order, and setting the right input and output formats for different classes. It also takes care of the cleaning up the intermediate directories which were generated during the program execution.
WebPageCount.java is the first MapReduce job called. It is responsible for taking into account all the pages and its count. This class takes the input from INPUT_DIR. The output of this MapReduce job will be stored at INT_DIR/page_count. PageRankDriver uses the output of this file to calculate "N", which is the number of web pages present in the corpus. This value is used for page rank calculation.
Next MapReduce job being called is GraphLink.java. GraphLink uses the INPUT_DIR as input and "N" calculated using the output of WebPageCount to generate a link graph of all the web pages. The output of this MapReduce job will be stored at INT_DIR/graph_link
Next MapReduce job being called is PageRankComputation.java. This job will be called 10 times sequentially. The initial input of this job is INT_DIR/graph_link. The output of 1st iteration is used as the input of the 2nd iteration, and so on. The outputs of the respective iterations will be stored at INT_DIR/iteration_<iteration_id>.
The next MapReduce job will be PageRankSorting.java. This job uses the out of the last iteration of PageRankComputation stored at INT_DIR/iteration_10, as input. It sorts the pages on page rank in descending order using a comparator. The output of this computation is stored in OUT_DIR.

REFERENCES USED:

Pattern matcher group - https://examples.javacodegeeks.com/core-java/util/regex/matcher-group-example/
Comparator class for sorting - https://vangjee.wordpress.com/2012/03/30/implementing-rawcomparator-will-speed-up-your-hadoop-mapreduce-mr-jobs-2/
http://hadoop.apache.org/core/docs/current/api/

NOTE: I HAVE NOT RUN THIS PROGRAM ON CLUSTER AND THE BELOW INSTRUCTIONS ARE FOR CLOUDERA VM

Copy all the source code files onto the cluster at path ~/org/myorg/
Use "hadoop fs -copyFromLocal " to copy files to hadoop. Input files need to be under /input directory
Compile the program using "java -cp /usr/lib/hadoop/:/usr/lib/hadoop-mapreduce/ *.java -d build -Xlint
Build the jar file using "jar -cvf page rank.jar -C build/ ."
Run the program using "hadoop jar page rank.jar org.myorg.PageRankDriver /user/cloudera/input /user/cloudera/output"
Copy the output to the local file system using "hadoop fs -copyToLocal /user/cloudera/output/* output"

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
code		code
.DS_Store		.DS_Store
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Page Rank Algorithm Implementation

REFERENCES USED:

About

Releases

Packages

Languages

prabhuvashwin/PageRank-Algorithm-Implementation

Folders and files

Latest commit

History

Repository files navigation

Page Rank Algorithm Implementation

REFERENCES USED:

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages