This project uses the 1 million playlist data set: https://recsys-challenge.spotify.com/
Since the data set is ~70 million rows, and cannot all be easily managed on our local machines, so we've decided to use GCP (Google Cloud Developer) platform to help us manage our data. First, we uploaded the CSVs into a private GCP storage bucket, and from there sent the data to BigQuery, GCP's distributed database, which will allow us to query the data however we like.
Here's the interface (you need access): https://console.cloud.google.com/bigquery?folder=&organizationId=&project=spotted-d&p=spotted-d&d=playlist_songs&t=playlist_songs&page=table
From here, we can retrieve, query, join, and manipulate the entirety of the dataset however we like. Please follow the instructions below in this notebook to set up your Jupyter so that it can query directly from GCP.
The following setup should take just a few minutes, and allow you to import all the data into an EDA notebook.
Go to https://console.cloud.google.com/iam-admin/serviceaccounts?project=spotted-d. Once you are on this page, go to the "actions" tab, where there will be a drop-down indicated by three dots on the right-most part of the corresponding account. Click "Create Key" which will download a key for you somewhere on your local machine. Save it somewhere safe. :-)
If you plan to store it on this git project, make sure to put in a folder that is git-ignored so that we don't push it up to Github. If you create a directory called "config" under the top-level directory of this repository and stick your service key in there, it should be automatically git-ignored.
If you are using Mac, just run this on your command line:
export GOOGLE_APPLICATION_CREDENTIALS="[PATH]"
If you are using Windows:
$env:GOOGLE_APPLICATION_CREDENTIALS="[PATH]"
For more info, see below:
https://cloud.google.com/docs/authentication/getting-started#auth-cloud-implicit-python
It might be worth it to add this export line to your bash profile.
Run this on your terminal:
pip install --upgrade google-cloud-bigquery[pandas]
For more on the above two steps, you can check out the GCP documentation: https://cloud.google.com/bigquery/docs/visualize-jupyter
%load_ext google.cloud.bigquery