A distributed matrix factorization Spark Scala program that can be used for recommendation systems that use collaborative filtering

Anisalexvl/Distributed-Matrix-Factorization


Matrix Factorization using Alternating Least Squares (ALS)

A Spark Scala program that takes a sparse ratings matrix and factorizes it into two smaller dense matrices using Alternating Least Squares (ALS). For our purposes, the Netflix Prize dataset (available on Kaggle: https://www.kaggle.com/netflix-inc/netflix-prize-data) was used; it contains a record for each rating given by each user to each movie. Future work would use the two dense factor matrices to predict ratings and recommend movies to users, similar to the Netflix recommendation system.
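To illustrate the factorization idea on a single machine (this is a toy sketch, not the distributed Spark job in this repository; the object name, toy data, and rank-1 restriction are all illustrative assumptions), the alternating closed-form updates of ALS can be written as:

```scala
// Rank-1 ALS sketch: factor a small dense ratings matrix R (users x items)
// into an outer product u * v^T by alternating closed-form least-squares updates.
object AlsSketch {
  // Returns (userFactors, itemFactors) after `iters` alternating sweeps.
  def factorize(r: Array[Array[Double]], iters: Int = 20,
                lambda: Double = 0.1): (Array[Double], Array[Double]) = {
    var u = Array.fill(r.length)(1.0)     // user factors, initialized to 1
    var v = Array.fill(r(0).length)(1.0)  // item factors, initialized to 1
    for (_ <- 1 to iters) {
      // Fix v, solve for each user factor: u_i = (sum_j v_j r_ij) / (sum_j v_j^2 + lambda)
      u = r.map(row => row.zip(v).map { case (a, b) => a * b }.sum /
            (v.map(x => x * x).sum + lambda))
      // Fix u, solve for each item factor symmetrically.
      v = v.indices.toArray.map(j =>
            u.indices.map(i => u(i) * r(i)(j)).sum /
            (u.map(x => x * x).sum + lambda))
    }
    (u, v)
  }

  // Root-mean-square reconstruction error of the rank-1 approximation.
  def rmse(r: Array[Array[Double]], u: Array[Double], v: Array[Double]): Double =
    math.sqrt((for (i <- r.indices; j <- r(i).indices)
      yield math.pow(r(i)(j) - u(i) * v(j), 2)).sum / (r.length * r(0).length))
}
```

The actual program distributes these per-user and per-item solves across Spark partitions and works with a higher rank, but the alternating structure is the same.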

Installation

The following components need to be installed:

  • JDK 1.8 (OpenJDK 8)
  • Scala 2.11.12
  • Hadoop 2.9.2
  • Spark 2.3.1 (without bundled Hadoop)
  • Maven
  • AWS CLI (for EMR execution)

Environment

  1. Example ~/.bash_aliases:
     export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
     export HADOOP_HOME=/usr/local/hadoop/hadoop-2.9.2
     export SCALA_HOME=/usr/local/scala
     export SPARK_HOME=/usr/local/spark
     export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
     export SPARK_DIST_CLASSPATH=$(hadoop classpath)
     export PATH=$PATH:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$SPARK_HOME/bin

  2. Explicitly set JAVA_HOME in $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
     export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Execution

All of the build & execution commands are organized in the Makefile.

Initial

  1. Open command prompt.
  2. Navigate to the directory where the project files were unzipped.
  3. Add or move the input file (edges.csv) to the input folder of the project.
  4. Edit the Makefile to customize the environment settings at the top. For standalone execution it is sufficient to set: hadoop.root, jar.name, local.input, job.name; the other defaults are acceptable. To run on AWS EMR Hadoop, also customize: aws.emr.release, aws.bucket, aws.num.nodes, aws.instance.type
  5. Standalone Hadoop:
     make switch-standalone -- set standalone Hadoop environment (execute once)
     make local
  6. Pseudo-Distributed Hadoop (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation):
     make switch-pseudo -- set pseudo-clustered Hadoop environment (execute once)
     make pseudo        -- first execution
     make pseudoq       -- later executions, since the namenode and datanode are already running
  7. AWS EMR Hadoop (you must configure the aws.* parameters at the top of the Makefile):
     make upload-input-aws    -- only before the first execution
     make aws                 -- check for successful execution with the web interface (aws.amazon.com)
     make download-output-aws -- after successful execution & termination
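Put together, a typical session following the steps above might look like the following (a sketch only; the Makefile targets are the ones listed above, and whether each succeeds depends on your local Hadoop/Spark setup and AWS configuration):

```shell
# Standalone run on the local machine
make switch-standalone    # point the build at the standalone Hadoop environment (once)
make local                # build the jar and run the job against local.input

# AWS EMR run (after setting the aws.* variables in the Makefile)
make upload-input-aws     # copy the input to S3 (first execution only)
make aws                  # launch the EMR job; monitor it in the AWS console
make download-output-aws  # fetch the results once the cluster has terminated
```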
