assignment update #13

Closed · wants to merge 1 commit
34 changes: 24 additions & 10 deletions README.md
100755 → 100644
@@ -13,26 +13,40 @@ This model is then compared to an Azure AutoML run.


## Summary
**In 1-2 sentences, explain the problem statement: e.g "This dataset contains data about... we seek to predict..."**
This project examines a marketing dataset of banking customers in order to create a model that predicts whether a particular customer is likely to respond positively to a marketing campaign.

**In 1-2 sentences, explain the solution: e.g. "The best performing model was a ..."**
The best-performing model, by accuracy, was generated by AutoML and leveraged a VotingEnsemble algorithm.

## Scikit-learn Pipeline
**Explain the pipeline architecture, including data, hyperparameter tuning, and classification algorithm.**

1. Create a tabular dataset for the bank marketing data using TabularDatasetFactory.
2. Preprocess and clean the data, then split it into training and test sets.
3. Define a random hyperparameter sampler for LogisticRegression over two hyperparameters: the inverse regularization strength ('C') and the maximum number of solver iterations ('max_iter').
4. Define an early-termination policy (BanditPolicy).
5. Configure a HyperDriveConfig to automate model generation (a sketch of this configuration follows below).

**What are the benefits of the parameter sampler you chose?**

RandomParameterSampling was selected because it covers the hyperparameter search space with far fewer runs than exhaustive grid search and supports early termination of low-performing runs.

**What are the benefits of the early stopping policy you chose?**

BanditPolicy was selected as the early-stopping methodology. It aborts runs whose primary metric falls outside a slack threshold of the best-performing run so far, rather than letting under-performing configurations run to completion, thus improving overall computational efficiency.
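A minimal sketch of this HyperDrive configuration, assuming the Azure ML SDK v1 (`azureml.train.hyperdrive`) and that `compute_target` and `sklearn_env` were created earlier in the notebook; the sampling ranges, slack settings, and run cap are illustrative rather than the exact values used:

```python
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import (
    BanditPolicy,
    HyperDriveConfig,
    PrimaryMetricGoal,
    RandomParameterSampling,
    choice,
    uniform,
)

# Randomly sample the two tuned hyperparameters (ranges are illustrative).
param_sampling = RandomParameterSampling({
    "--C": uniform(0.01, 10.0),          # inverse regularization strength
    "--max_iter": choice(50, 100, 200),  # maximum solver iterations
})

# Abort runs whose accuracy falls outside a 10% slack of the best run so far,
# evaluating every 2 logged intervals after an initial grace period of 5.
early_termination = BanditPolicy(slack_factor=0.1, evaluation_interval=2, delay_evaluation=5)

# Assumed: compute_target and sklearn_env are defined earlier in the notebook.
src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    compute_target=compute_target,
    environment=sklearn_env,
)

hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=param_sampling,
    policy=early_termination,
    primary_metric_name="Accuracy",  # must match the metric name logged by train.py
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,               # illustrative cap
)
```

Submitting this configuration to an `Experiment` (`experiment.submit(hyperdrive_config)`) then launches the tuning runs and tracks the best child run.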
## AutoML
**In 1-2 sentences, describe the model and hyperparameters generated by AutoML.**
1. Create a tabular dataset for the bank marketing data using TabularDatasetFactory.
2. Preprocess and clean the data with the same methodology as the scikit-learn pipeline.
3. Configure an AutoMLConfig to automate model generation.

The AutoML pipeline was optimized for accuracy, to be comparable with the scikit-learn pipeline, and given a timeout of 30 minutes due to environment restrictions; a sketch of this configuration is shown below.
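A minimal sketch of such an AutoMLConfig, assuming the SDK v1 `azureml.train.automl` package and that the cleaned features and labels were rejoined into a single `training_data` dataset whose label column is `y`; the cross-validation setting is illustrative:

```python
from azureml.train.automl import AutoMLConfig

# Optimize for accuracy so the result is comparable with the HyperDrive run,
# and cap the experiment at 30 minutes due to environment restrictions.
automl_config = AutoMLConfig(
    task="classification",
    primary_metric="accuracy",
    training_data=training_data,    # assumed: cleaned data rejoined with its "y" label column
    label_column_name="y",
    experiment_timeout_minutes=30,
    n_cross_validations=5,          # illustrative
    compute_target=compute_target,  # assumed: defined earlier in the notebook
)
```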


## Pipeline comparison
**Compare the two models and their performance. What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one?**
The accuracy of the scikit-learn pipeline was 0.9091.
The accuracy of the AutoML pipeline was 0.9188.
Therefore, the AutoML pipeline outperformed the scikit-learn pipeline.

The AutoML pipeline identified a VotingEnsemble algorithm as the most accurate. VotingEnsemble takes the models from previous AutoML iterations and implements soft voting, wherein class predictions are determined from a weighted average of the base models' predicted probabilities. It is also interesting to note that the AutoML pipeline reported a balanced accuracy of 0.783; disparate accuracy and balanced-accuracy metrics often indicate class imbalance in the dataset.
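For intuition, soft voting can be reproduced directly in scikit-learn. This is a standalone illustration on synthetic data, not the exact ensemble AutoML produced:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x, y = make_classification(n_samples=1000, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Soft voting averages the base estimators' predicted class probabilities
# (optionally weighted) and predicts the class with the highest average.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=200)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    voting="soft",
    weights=[1, 2],  # illustrative weights
)
ensemble.fit(x_train, y_train)
print("test accuracy:", ensemble.score(x_test, y_test))
```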


## Future work
**What are some areas of improvement for future experiments? Why might these improvements help the model?**
Given the imbalance of the dataset, it would be interesting to repeat the experiment optimizing for balanced_accuracy instead of accuracy. The AutoML pipeline might also be improved with more compute time, since this experiment timed out at 30 minutes: VotingEnsemble would then have additional AutoML runs to draw on, along with more competition from other algorithms that may classify these data better.

## Proof of cluster clean up
**If you did not delete your compute cluster in the code, please complete this section. Otherwise, delete this section.**
**Image of cluster marked for deletion**
Please see the cluster clean-up performed in the notebook.
7 changes: 5 additions & 2 deletions train.py
100755 → 100644
@@ -54,14 +54,17 @@ def main():
# TODO: Create TabularDataset using TabularDatasetFactory
# Data is located at:
# "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"
data_path = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"

ds = TabularDatasetFactory.from_delimited_files(path=data_path)

x, y = clean_data(ds)

# Split data into train and test sets (80/20 split, fixed seed for reproducibility).
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

model = LogisticRegression(C=args.C, max_iter=args.max_iter).fit(x_train, y_train)
