Amazon Product Success Prediction

Overview

The purpose of this repo is to consolidate and track code related to our Milestone 2 Project.

We plan to accomplish 3 main things:

Predict the likelihood of a product's success (defined by a metric based on the total or average star rating, or on sales rank) using product metadata as features.
Feature Engineering - Use topic modeling and NLP techniques to derive new target variables that capture keywords and sentiment from review text. The goal is to see whether we can predict these targets, so that a merchant can understand how buyers on the platform would react to a newly launched product with similar features.
Clustering - Use clustering to improve product search, so that customers can find products with similar features, and potentially use methods such as DBSCAN for outlier detection to see whether we can pick up fake reviews.

Key Links

Pipeline


1. load_data

In this first stage of the pipeline, we take the metadata and the reviews data directly from the category URL (in this case, ‘Appliances’), then parse, clean, and convert them into Pandas dataframes. To load a category other than ‘Appliances’, replace the URL in load_data.py with the desired category’s URL.
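The loading step might look roughly like the sketch below. It assumes the category file is gzip-compressed with one JSON object per line, which this README does not state. The function name `read_json_lines_gz`, the field names, and the sample records are hypothetical and only serve the illustration.

```python
import gzip
import io
import json


def read_json_lines_gz(raw_bytes):
    """Decode gzip-compressed JSON-lines bytes into a list of dicts.

    Assumption: one JSON object per line; blank lines are skipped.
    """
    records = []
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Hypothetical sample standing in for a downloaded category file.
sample_lines = [
    {"asin": "A1", "overall": 5.0, "reviewText": "Works great"},
    {"asin": "A2", "overall": 1.0, "reviewText": "Broke in a week"},
]
payload = gzip.compress(
    "\n".join(json.dumps(rec) for rec in sample_lines).encode("utf-8")
)
records = read_json_lines_gz(payload)
```

In a real run the bytes would come from the category URL rather than an in-memory sample, and `pandas.DataFrame(records)` would turn the parsed records into a dataframe.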

2. preprocess_data

After the dataframes have been processed by the previous stage, this stage turns the raw text into a machine-readable, vectorized feature space that can be passed to different machine learning algorithms. Preprocessing includes further cleaning, lemmatizing, removing stopwords, and vectorizing each product and review feature. We also aggregate additional features: individual word weights from pre-trained word2vec models, tf-idf representations, and descriptive statistics of products and reviews (length of text, number of tables or images describing a product, and whether a review is verified). The output is saved in pickle format and passed on to both our supervised and unsupervised models.
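To make the vectorizing step concrete, here is a minimal standard-library sketch of stopword removal followed by a plain tf-idf weighting. It is not the project's code: the stopword list is a tiny illustrative set, the weighting is the textbook `tf * log(N / df)` form, and lemmatizing and word2vec features are left out entirely.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one the project uses.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "in", "of", "to"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if tok not in STOPWORDS]


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using tf * log(N / df) weighting."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical review snippets.
reviews = ["The fridge is quiet", "Quiet and cold", "This dryer is loud"]
vocabulary, vectors = tfidf_vectors(reviews)
```

A term that appears in fewer reviews (such as "fridge") gets a larger weight than one shared across reviews (such as "quiet"). scikit-learn's `TfidfVectorizer` applies a smoothed idf and L2 normalization by default, so its numbers would differ from this sketch.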

3. Success Metric Creation

Each review is scored on a point system ranging from -2 to 2. There is also an option to remove reviews identified as fake.
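One way such a scoring scheme could work is sketched below. The README only gives the -2 to 2 range, so the mapping used here (star rating minus 3) is an assumption, as are the field names `stars` and `is_fake` and both function names.

```python
def review_score(stars):
    """Map a 1-5 star rating onto the -2..2 scale (assumed mapping: stars - 3)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    return stars - 3


def product_score(reviews, drop_fake=False):
    """Average the review scores for one product.

    `reviews` is a list of dicts with hypothetical keys `stars` and `is_fake`.
    Returns None when no reviews remain after filtering.
    """
    kept = [r for r in reviews if not (drop_fake and r.get("is_fake", False))]
    if not kept:
        return None
    return sum(review_score(r["stars"]) for r in kept) / len(kept)


# Hypothetical reviews for a single product.
reviews = [
    {"stars": 5, "is_fake": True},
    {"stars": 2, "is_fake": False},
    {"stars": 4, "is_fake": False},
]
score_all = product_score(reviews)                    # (2 - 1 + 1) / 3
score_clean = product_score(reviews, drop_fake=True)  # (-1 + 1) / 2
```

Comparing `score_all` with `score_clean` shows how a single flagged five-star review can shift a product's score.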

4. Supervised Model 1

Train algorithms to predict a product’s success rate on combinations of word embeddings, handcrafted features, tf-idf vectors, and cluster labels.

5. Supervised Model 2

Train algorithms to identify potential fake reviews based on review word embeddings, tf-idf vectors, review length, and whether a review is from a verified user. The aim is to test whether detecting and eliminating fake reviews improves product success predictions.
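The simpler, non-embedding inputs mentioned above (review length and verified status) could be derived along these lines. This is only a sketch: the field names `text` and `verified` and the function name are hypothetical, and the embedding and tf-idf features are omitted.

```python
def fake_review_features(review):
    """Build a small numeric feature row for one review.

    The keys `text` and `verified` are hypothetical field names.
    """
    tokens = review["text"].split()
    return {
        "n_chars": len(review["text"]),
        "n_tokens": len(tokens),
        "is_verified": 1 if review["verified"] else 0,
    }


row = fake_review_features({"text": "Best toaster ever!!!", "verified": False})
```

Rows like this can be placed alongside the embedding and tf-idf columns before training a classifier.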

6. Evaluate Supervised Models

Report train, validation, test, and dummy-baseline accuracy scores, F1 scores, and the correlation coefficient. A confusion matrix, ROC curve, and precision-recall curve are also provided.
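For readers unfamiliar with the dummy baseline, the sketch below shows how accuracy, F1, and a majority-class baseline relate to one another. It uses only the standard library, and the labels are made-up illustrative values, not results from this project.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """F1 score for the positive class; 0.0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def dummy_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority] * len(y_test))


# Hypothetical labels: 1 = successful product, 0 = not successful.
y_train = [0, 0, 0, 1, 1]
y_test = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

model_acc = accuracy(y_test, y_pred)
model_f1 = f1_binary(y_test, y_pred)
baseline_acc = dummy_accuracy(y_train, y_test)
```

A model is only informative if it clearly beats the dummy score. scikit-learn offers equivalent building blocks in `accuracy_score`, `f1_score`, and `DummyClassifier`.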

7. Clustering and Evaluation

Cluster the reviews data to identify trends, insights, and anomalies; display graphs; compute the optimal parameter for each clustering model; and report silhouette, Calinski-Harabasz, and Davies-Bouldin scores.
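As an illustration of what the silhouette score measures, the following standard-library sketch computes it for a toy set of points under two different labelings. The points and labels are hypothetical, and Euclidean distance is assumed; the project itself may compute these scores differently.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes 0 by convention
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        total += (b - a) / denom if denom else 0.0
    return total / len(points)


# Hypothetical 2-D embeddings forming two tight groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
```

The labeling that matches the two visible groups scores close to 1, while the mixed labeling scores below 0. Choosing a clustering parameter then amounts to scoring each candidate and keeping the best one. scikit-learn provides `silhouette_score`, `calinski_harabasz_score`, and `davies_bouldin_score` in `sklearn.metrics`.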

stuartong/amazon-product-prediction
