Skip to content

spotted-d/spotify-project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

58 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

spotify-project

This project uses the 1 million playlist data set: https://recsys-challenge.spotify.com/

Since the data set is ~70 million rows, and cannot all be easily managed on our local machines, so we've decided to use GCP (Google Cloud Developer) platform to help us manage our data. First, we uploaded the CSVs into a private GCP storage bucket, and from there sent the data to BigQuery, GCP's distributed database, which will allow us to query the data however we like.

Here's the interface (you need access): https://console.cloud.google.com/bigquery?folder=&organizationId=&project=spotted-d&p=spotted-d&d=playlist_songs&t=playlist_songs&page=table

From here, we can retrieve, query, join, and manipulate the entirety of the dataset however we like. Please follow the instructions below in this notebook to set up your Jupyter so that it can query directly from GCP.

Setup Instructions

The following setup should take just a few minutes, and allow you to import all the data into an EDA notebook.

1. Create a service account.

Go to https://console.cloud.google.com/iam-admin/serviceaccounts?project=spotted-d. Once you are on this page, go to the "actions" tab, where there will be a drop-down indicated by three dots on the right-most part of the corresponding account. Click "Create Key" which will download a key for you somewhere on your local machine. Save it somewhere safe. :-)

If you plan to store it on this git project, make sure to put in a folder that is git-ignored so that we don't push it up to Github. If you create a directory called "config" under the top-level directory of this repository and stick your service key in there, it should be automatically git-ignored.

2. Set up implicit authentication with gCloud

If you are using Mac, just run this on your command line:

export GOOGLE_APPLICATION_CREDENTIALS="[PATH]"

If you are using Windows:

$env:GOOGLE_APPLICATION_CREDENTIALS="[PATH]"

For more info, see below:

https://cloud.google.com/docs/authentication/getting-started#auth-cloud-implicit-python

It might be worth it to add this export line to your bash profile.

3. Install Google Cloud Big Query Pandas python package.

Run this on your terminal:

pip install --upgrade google-cloud-bigquery[pandas]

For more on the above two steps, you can check out the GCP documentation: https://cloud.google.com/bigquery/docs/visualize-jupyter

4. Insert this cell into your EDA notebook:

%load_ext google.cloud.bigquery

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •