assignment update #13

Closed · wants to merge 1 commit
34 changes: 24 additions & 10 deletions README.md
100755 → 100644
@@ -13,26 +13,40 @@ This model is then compared to an Azure AutoML run.


## Summary
**In 1-2 sentences, explain the problem statement: e.g "This dataset contains data about... we seek to predict..."**
This project examines a marketing dataset of banking customers in order to create a model that predicts whether a particular customer is likely to respond positively to a marketing campaign.

**In 1-2 sentences, explain the solution: e.g. "The best performing model was a ..."**
The best-performing model, by accuracy, was generated by AutoML and leveraged a VotingEnsemble algorithm.

## Scikit-learn Pipeline
**Explain the pipeline architecture, including data, hyperparameter tuning, and classification algorithm.**

1. Create a tabular dataset for the bank marketing data using TabularDatasetFactory.
2. Preprocess and clean the data, then split it into training and test sets.
3. Define a random hyperparameter sampler for LogisticRegression over two hyperparameters: the inverse regularization strength ('C') and the maximum number of solver iterations ('max_iter').
4. Define an early-termination policy (BanditPolicy).
5. Configure a HyperDriveConfig to automate model generation (a sketch of this configuration follows below).

**What are the benefits of the parameter sampler you chose?**

RandomParameterSampling was selected because it covers the hyperparameter search space with far fewer runs than exhaustive grid search and supports early termination of low-performing runs.

**What are the benefits of the early stopping policy you chose?**

BanditPolicy was selected as the early-stopping methodology. It aborts runs whose primary metric falls outside a slack threshold of the best-performing run so far, rather than letting under-performing configurations run to completion, thus improving overall computational efficiency.
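A minimal sketch of this HyperDrive configuration, assuming the Azure ML SDK v1 (`azureml.train.hyperdrive`) and that `compute_target` and `sklearn_env` were created earlier in the notebook; the sampling ranges, slack settings, and run cap are illustrative rather than the exact values used:

```python
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import (
    BanditPolicy,
    HyperDriveConfig,
    PrimaryMetricGoal,
    RandomParameterSampling,
    choice,
    uniform,
)

# Randomly sample the two tuned hyperparameters (ranges are illustrative).
param_sampling = RandomParameterSampling({
    "--C": uniform(0.01, 10.0),          # inverse regularization strength
    "--max_iter": choice(50, 100, 200),  # maximum solver iterations
})

# Abort runs whose accuracy falls outside a 10% slack of the best run so far,
# evaluating every 2 logged intervals after an initial grace period of 5.
early_termination = BanditPolicy(slack_factor=0.1, evaluation_interval=2, delay_evaluation=5)

# Assumed: compute_target and sklearn_env are defined earlier in the notebook.
src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    compute_target=compute_target,
    environment=sklearn_env,
)

hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=param_sampling,
    policy=early_termination,
    primary_metric_name="Accuracy",  # must match the metric name logged by train.py
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,               # illustrative cap
)
```

Submitting this configuration to an `Experiment` (`experiment.submit(hyperdrive_config)`) then launches the tuning runs and tracks the best child run.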
## AutoML
**In 1-2 sentences, describe the model and hyperparameters generated by AutoML.**
1. Create a tabular dataset for the bank marketing data using TabularDatasetFactory.
2. Preprocess and clean the data with the same methodology as the scikit-learn pipeline.
3. Configure an AutoMLConfig to automate model generation.

The AutoML pipeline was optimized for accuracy, to be comparable with the scikit-learn pipeline, and given a timeout of 30 minutes due to environment restrictions; a sketch of this configuration is shown below.
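A minimal sketch of such an AutoMLConfig, assuming the SDK v1 `azureml.train.automl` package and that the cleaned features and labels were rejoined into a single `training_data` dataset whose label column is `y`; the cross-validation setting is illustrative:

```python
from azureml.train.automl import AutoMLConfig

# Optimize for accuracy so the result is comparable with the HyperDrive run,
# and cap the experiment at 30 minutes due to environment restrictions.
automl_config = AutoMLConfig(
    task="classification",
    primary_metric="accuracy",
    training_data=training_data,    # assumed: cleaned data rejoined with its "y" label column
    label_column_name="y",
    experiment_timeout_minutes=30,
    n_cross_validations=5,          # illustrative
    compute_target=compute_target,  # assumed: defined earlier in the notebook
)
```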


## Pipeline comparison
**Compare the two models and their performance. What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one?**
The accuracy of the scikit-learn pipeline was 0.9091.
The accuracy of the AutoML pipeline was 0.9188.
Therefore, the AutoML pipeline outperformed the scikit-learn pipeline.

The AutoML pipeline identified a VotingEnsemble algorithm as the most accurate. VotingEnsemble takes the models from previous AutoML iterations and implements soft voting, wherein class predictions are determined from a weighted average of the base models' predicted probabilities. It is also interesting to note that the AutoML pipeline reported a balanced accuracy of 0.783; disparate accuracy and balanced-accuracy metrics often indicate class imbalance in the dataset.
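For intuition, soft voting can be reproduced directly in scikit-learn. This is a standalone illustration on synthetic data, not the exact ensemble AutoML produced:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x, y = make_classification(n_samples=1000, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Soft voting averages the base estimators' predicted class probabilities
# (optionally weighted) and predicts the class with the highest average.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=200)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    voting="soft",
    weights=[1, 2],  # illustrative weights
)
ensemble.fit(x_train, y_train)
print("test accuracy:", ensemble.score(x_test, y_test))
```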


## Future work
**What are some areas of improvement for future experiments? Why might these improvements help the model?**
Given the imbalance of the dataset, it would be interesting to repeat the experiment optimizing for balanced_accuracy instead of accuracy. The AutoML pipeline might also be improved with more compute time, since this experiment timed out at 30 minutes: VotingEnsemble would then have additional AutoML runs to draw on, along with more competition from other algorithms that may classify these data better.

## Proof of cluster clean up
**If you did not delete your compute cluster in the code, please complete this section. Otherwise, delete this section.**
**Image of cluster marked for deletion**
Please see the cluster clean-up performed in the notebook.
7 changes: 5 additions & 2 deletions train.py
100755 → 100644
@@ -54,14 +54,17 @@ def main():
# TODO: Create TabularDataset using TabularDatasetFactory
# Data is located at:
# "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"
data_path = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"

ds = TabularDatasetFactory.from_delimited_files(path=data_path)

x, y = clean_data(ds)

# Split data into train and test sets (80/20 split, fixed seed for reproducibility).
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

model = LogisticRegression(C=args.C, max_iter=args.max_iter).fit(x_train, y_train)
