
tuantran0910/Spotify-Analysis-with-PySpark

 
 


Spotify Analysis 🎵

HCMUS License: MIT

This project is aimed at analyzing data from the Spotify platform, utilizing the Spotify API and MongoDB for data extraction, Apache Hadoop for ELT processes, PySpark for transformation, and leveraging Dremio and Power BI for visualization and in-depth data analysis.

Data Pipeline Terraform Prefect Docker Spotify Apache Hadoop Apache Spark Dremio MongoDB Power Bi

Table of contents 📌

Overview

Project Structure

Structure

Data Schema

We start our data collection by scraping a list of artist names from Spotify Artists. Using this list, we then call the Spotify API to extract comprehensive data about each artist. The raw data then goes through a series of ETL processes.
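As a rough illustration of the extraction step, the sketch below requests a token with Spotify's Client Credentials flow and looks up one artist by name. It uses only the standard library; the helper names (`basic_auth_header`, `get_access_token`, `artist_search_url`, `search_artist`) are hypothetical and are not taken from this repository, which may well use a client library instead.

```python
import base64
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://accounts.spotify.com/api/token"
SEARCH_URL = "https://api.spotify.com/v1/search"


def basic_auth_header(client_id: str, client_secret: str) -> str:
    """Build the HTTP Basic header used by the Client Credentials flow."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def get_access_token(client_id: str, client_secret: str) -> str:
    """Exchange the app credentials for a short-lived bearer token."""
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    request = urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={
            "Authorization": basic_auth_header(client_id, client_secret),
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]


def artist_search_url(name: str, limit: int = 1) -> str:
    """Build the search URL for an artist name."""
    query = urllib.parse.urlencode({"q": name, "type": "artist", "limit": limit})
    return f"{SEARCH_URL}?{query}"


def search_artist(name: str, token: str) -> dict:
    """Return the first matching artist object, or an empty dict."""
    request = urllib.request.Request(
        artist_search_url(name),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        items = json.load(response)["artists"]["items"]
    return items[0] if items else {}
```

With valid credentials, `search_artist("Daft Punk", get_access_token(client_id, client_secret))` would return the raw artist document that a pipeline like this one could store before any transformation.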

Demo Video

Our demo video is on YouTube; you can watch it via this Link.

Prerequisite

Getting started 🚀

Set up your MongoDB Atlas

There are several ways to do this step, but we will use Terraform to deploy the Atlas cluster. Please follow this Instruction.

Set up environment

Clone this project to your machine by running the following command:

git clone https://github.com/PhongHuynh0394/Spotify-Analysis-with-PySpark.git
cd Spotify-Analysis-with-PySpark

Then create a .env file based on env_template:

cp env_template .env

Now fill in the blank values in the .env file. You obtain them in the Prerequisite and Set up your MongoDB Atlas sections:

# Spotify
SPOTIFY_CLIENT_ID=<your-api-key>
SPOTIFY_CLIENT_SECRET=<your-api-key> 

# Mongodb
MONGODB_USER=<your-user-name>
MONGODB_PASSWORD=<your-user-password>
MONGODB_SRV=<your-srv-link> # Get this from running terraform set up
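Before building the images, it can help to confirm that no placeholder is left in `.env`. The following is a minimal sketch, not part of the repository: `parse_env` and `missing_keys` are hypothetical helpers, and the parser only handles the simple `KEY=value` lines shown above (a value that itself contains ` #` would be cut short).

```python
REQUIRED_KEYS = (
    "SPOTIFY_CLIENT_ID",
    "SPOTIFY_CLIENT_SECRET",
    "MONGODB_USER",
    "MONGODB_PASSWORD",
    "MONGODB_SRV",
)


def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blanks and comment lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "# Get this from ..."
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def missing_keys(values: dict) -> list:
    """Return required keys that are empty or still hold a <placeholder>."""
    return [
        key
        for key in REQUIRED_KEYS
        if not values.get(key) or values[key].startswith("<")
    ]
```

For example, `missing_keys(parse_env(open(".env").read()))` returns an empty list once every value has been filled in.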

OK, now it's Docker's job! Build the Docker images for this project by typing make build in your terminal.

This process might take a few minutes, so just chill and take a cup of coffee ☕

Note: if this step fails, just remove the image or restart Docker and try again.

Once the Docker images are built, it's time to run your system. Just type make run.

Then check your services to make sure everything works correctly:

  1. Hadoop
  2. Prefect: localhost:4200
  3. Data Warehouse (Dremio): localhost:9047
  4. Dashboard (Streamlit): localhost:8501
  5. Notebook
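One quick way to check the services is to test whether their ports accept TCP connections. The sketch below is an assumption-laden helper rather than project code: it only covers the three ports that this README mentions (4200, 9047, 8501), since the Hadoop and notebook ports are not stated here, and an open port does not guarantee that the service behind it is healthy.

```python
import socket

# Ports taken from this README; Hadoop and notebook ports are not listed in it.
SERVICES = {
    "Prefect": 4200,
    "Dremio": 9047,
    "Streamlit": 8501,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost") -> dict:
    """Map each service name to whether its port is reachable."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}
```

Calling `check_services()` after `make run` gives a dictionary of booleans; any `False` entry points at a container worth inspecting.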

Run your data pipeline

We use Prefect to build our data pipelines. When you open port 4200, you'll see the Prefect UI. Go to the Deployment section, where you'll see 2 deployments that correspond to the 2 data pipelines.

Pipeline 1 (Ingest MongoDB Atlas flow)

This data flow (pipeline) scrapes data from the Spotify API in batches and ingests it into MongoDB Atlas. It runs automatically every 2 minutes and 5 seconds.


Tip: The purpose of this flow is to prepare your raw data in MongoDB; afterwards you should see 4 collections in your database on MongoDB Atlas. Run this flow a few times before running pipeline 2.
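To make the batching idea concrete, here is a minimal sketch of how a list of artist names could be split so that each scheduled run handles one batch. The batch size and the helper names (`batches`, `batch_for_run`) are assumptions for illustration; the actual flow may track its progress differently.

```python
from datetime import timedelta
from typing import Iterator, List

# Interval stated in this README: every 2 minutes and 5 seconds.
RUN_INTERVAL = timedelta(minutes=2, seconds=5)


def batches(names: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of names, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(names), batch_size):
        yield names[start:start + batch_size]


def batch_for_run(names: List[str], batch_size: int, run_index: int) -> List[str]:
    """Pick the batch for the run_index-th scheduled run (0-based)."""
    start = run_index * batch_size
    return names[start:start + batch_size]
```

With 5 names and a batch size of 2, three runs are needed to cover the list, which is one reason to let the flow run a few times before starting pipeline 2.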

Pipeline 2 (ETL flow)

This data flow does the ETL job. It extracts raw data from MongoDB and performs a full load into the bronze layer in HDFS, then transforms it with PySpark into the silver and gold layers. You can trigger this flow manually by pressing the Run button in the top right corner.

The bronze, silver, and gold layers are simply directories in HDFS, one per data quality level, that keep a backup of the data at each stage.
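The layering can be sketched without a cluster. The example below uses plain Python lists as a stand-in for the PySpark DataFrames the project works with, and the field names (`track_id`, `artist`, `popularity`) and the per-artist aggregate are hypothetical; it only illustrates the idea that bronze holds raw rows, silver holds cleaned rows, and gold holds analytics-ready aggregates.

```python
from collections import defaultdict
from statistics import mean


def to_silver(bronze_rows):
    """Clean raw rows: drop rows without an id, cast types, deduplicate by id."""
    seen = {}
    for row in bronze_rows:
        track_id = row.get("track_id")
        if not track_id:
            continue
        seen[track_id] = {
            "track_id": track_id,
            "artist": str(row.get("artist", "")).strip(),
            "popularity": int(row.get("popularity", 0)),
        }
    return list(seen.values())


def to_gold(silver_rows):
    """Aggregate cleaned rows into one analytics-ready row per artist."""
    by_artist = defaultdict(list)
    for row in silver_rows:
        by_artist[row["artist"]].append(row["popularity"])
    return [
        {"artist": artist, "tracks": len(scores), "avg_popularity": mean(scores)}
        for artist, scores in sorted(by_artist.items())
    ]


# Bronze: raw rows exactly as extracted, including a duplicate and a bad row.
bronze = [
    {"track_id": "t1", "artist": " A ", "popularity": "80"},
    {"track_id": "t1", "artist": " A ", "popularity": "80"},
    {"track_id": None, "artist": "B", "popularity": 10},
    {"track_id": "t2", "artist": "A", "popularity": 60},
    {"track_id": "t3", "artist": "B", "popularity": 50},
]
gold = to_gold(to_silver(bronze))
```

Keeping each stage as its own output is what allows a later stage to be rerun from the previous layer instead of extracting from MongoDB again.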


Warehouse and UI

localhost:9047

We use Dremio to analyze the data in HDFS directly. The username is dremio and the password is dremio123. Then follow these steps:

Log in to Dremio > Add Source > Choose HDFS

The connecting window will appear, please fill as following:

  • Name: HDFS
  • NameNode Host: namenode

Then press Save to save your connection. The connection will appear in your main window. Go to the gold_layer directory and format all .parquet directories. Then run your SQL statements and start analyzing.

You can use our SQL statements in warehouse.sql. These SQL statements create the analytic views that Power BI uses to draw the dashboard. You can also see it in PowerBI Dashboard.

UI

Streamlit

localhost:8501

Finally, you can open Streamlit to see the dashboard. It can also use a Machine Learning model to recommend the most popular songs for you.

PowerBI Dashboard


You can also find it in powerbi_dashboard or in our Streamlit app.

And more

In the future, we plan to update this repo with:

  • Utilizing Deep Learning model: In the future, we plan to leverage a Deep Learning model, specifically an NLP model, to analyze the lyrics of tracks.
  • Using Flask or other frameworks: Our goal is to switch to Flask or other frameworks, replacing the Streamlit Dashboard for improved functionality.
  • Using MongoDB locally: To streamline deployment and allow for personalized configuration, we'll be transitioning to using MongoDB locally.

Contributors

  • Huỳnh Lưu Vĩnh Phong: Data Engineer, Team Lead
  • Trần Ngọc Tuấn: Data Engineer
  • Phạm Duy Sơn: Data Science
  • Mai Chiến Vĩ Thiên: Data Analyst

Finally

Feel free to use 😄
