In this project, we build a model that automatically tags restaurants with multiple labels using a dataset of user-submitted photos. Currently, restaurant labels are manually selected by Yelp users when they submit a review. Selecting the labels is optional, which leaves some restaurants uncategorized or only partially categorized. In an age of food selfies and photo-centric social storytelling, it may be no surprise that Yelp's users upload an enormous number of photos every day alongside their written reviews.
- Numpy - For handling the datasets (`pip install numpy`)
- Pandas - For handling the datasets (`pip install pandas`)
- Scikit Learn - To use classification algorithms like SVM (`pip install -U scikit-learn`)
- Python
The following dependencies are required only if you wish to extract the image and business features from scratch. We have already done that for you, so you only need to download the feature files from the links provided in the table below. Make sure that you put these files in the "features" directory.
- H5Py - To store the features extracted from CNN (`pip install h5py`)
- Caffe - To extract features from the images (Refer to the link)
- `code/` - contains programs to extract features and perform the final classification.
- `data/` - contains training and testing images + metadata from the Yelp dataset (We have already extracted and stored the features for ease of project execution).
- `features/` - contains the extracted features from images and restaurants (For ease of project execution).
- `models/` - contains the trained SVM model, which can be used for future predictions without retraining (Will be generated automatically when `classify.py` is run for the first time; for ease of project execution, we have included this model as well).
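
The "generate on first run, reuse afterwards" behaviour described for the models directory can be sketched as a small load-or-train helper. This is only an illustration of the pattern, not the actual logic of `classify.py`: the helper name `load_or_train`, the file name `svm.pkl`, and the dummy training function are all hypothetical.

```python
import os
import pickle
import tempfile


def load_or_train(model_path, train_fn):
    """Return the cached model if it exists; otherwise train, save, and return it."""
    if os.path.exists(model_path):
        with open(model_path, "rb") as f:
            return pickle.load(f)
    model = train_fn()
    # Create the models directory on first use.
    os.makedirs(os.path.dirname(model_path), exist_ok=True)
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return model


calls = []


def train_fn():
    """Hypothetical stand-in for the real (expensive) SVM training step."""
    calls.append(1)
    return {"weights": [0.1, 0.2]}


# A temporary directory stands in for the project root.
model_path = os.path.join(tempfile.mkdtemp(), "models", "svm.pkl")
first = load_or_train(model_path, train_fn)   # trains and writes the file
second = load_or_train(model_path, train_fn)  # loads the cached file
print(len(calls))
```

Deleting the cached file forces retraining on the next run, which is useful after the features change.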
Again, if you choose to extract the image and business features from scratch, you will need this dataset. It is available here. A dataset description is also available. Download the dataset and extract its files and folders into the "data" directory.
For ease of project execution, we have already extracted the features and stored them in the following files:
Filename | Size | Description | Command that was used for generation |
---|---|---|---|
train_features.h5 | 3.59 GB | Format: [PhotoId, ImageFeatures] This file contains ImageNet features of training dataset | python extract_image_features_train.py |
test_features.h5 | 18.2 GB | Format: [PhotoId, ImageFeatures] This file contains ImageNet features of test dataset | python extract_image_features_test.py |
train_business_features.csv | 91.7 MB | Format: [BusinessId, BusinessFeatures, ClassLabels] This file contains features extracted for businesses in training dataset. These features are extracted using train_features.h5. | python extract_business_features_train.py |
test_business_features.csv | 460 MB | Format: [BusinessId, BusinessFeatures] This file contains features extracted for businesses in test dataset. These features are extracted using test_features.h5. | python extract_business_features_test.py |
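
The table says that business features are derived from the per-photo image features, but not how. One plausible approach, shown here purely as an assumption, is to average the feature vectors of all photos that belong to a business. The function name, the `photo_id` and `business_id` column names, and the tiny synthetic arrays are illustrative and are not taken from the extraction scripts.

```python
import numpy as np
import pandas as pd


def aggregate_business_features(photo_ids, image_features, photo_to_biz):
    """Average the feature vectors of all photos belonging to each business."""
    feats = pd.DataFrame(image_features, index=pd.Index(photo_ids, name="photo_id"))
    # Attach the business id to every photo row.
    joined = photo_to_biz.set_index("photo_id").join(feats, how="inner")
    # One row per business: the mean of its photo feature vectors.
    return joined.groupby("business_id").mean()


# Synthetic stand-ins: three photos with 2-dimensional features.
photo_ids = [10, 11, 12]
image_features = np.array([[1.0, 3.0], [3.0, 5.0], [10.0, 20.0]])
photo_to_biz = pd.DataFrame({"photo_id": [10, 11, 12], "business_id": [1, 1, 2]})

business_features = aggregate_business_features(photo_ids, image_features, photo_to_biz)
print(business_features)
```

Mean pooling yields a fixed-length vector per business regardless of how many photos it has, which is what a classifier such as an SVM expects as input.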
$ cd code
$ python classify.py
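
For readers who want a feel for the classification step before running it, here is a minimal sketch of multi-label classification with an SVM in scikit-learn. It assumes a one-vs-rest setup with a linear SVM and uses synthetic data with two labels. The real `classify.py` may use a different SVM variant, different parameters, and the feature files listed above.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)

# Synthetic stand-in for business features: 40 businesses, 5 features each.
X = rng.randn(40, 5)
# Synthetic labels: a business gets label j when its feature j is positive.
labels = [[j for j in (0, 1) if row[j] > 0] for row in X]

# Convert the label lists into a binary indicator matrix (one column per label).
binarizer = MultiLabelBinarizer(classes=[0, 1])
Y = binarizer.fit_transform(labels)

# Train one linear SVM per label.
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, Y)

# Predict label sets for the first three businesses.
predicted = binarizer.inverse_transform(clf.predict(X[:3]))
print(predicted)
```

One-vs-rest treats each label as an independent binary problem, so a restaurant can receive any combination of labels, including none.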