Predicting which customers will most likely purchase a type of financial product, achieving a time savings of over 93%.


Term Deposit Marketing - An Apziva Project (#2)

By Samuel Alter
Apziva: G3SuQYZYrFt9dwF3

Summary

This project used supervised and unsupervised learning techniques, with Pandas, NumPy, Matplotlib, Seaborn, Optuna, and Scikit-learn (including Principal Component Analysis), to analyze a bank's phone call dataset and train models that help the bank save time. The dataset contains demographic and banking data on the bank's customers. By showing the model only certain columns, we can simulate the model learning which customers are most likely to purchase one of the bank's financial products.

  • There are three phases to the project:

    • Layer 1 simulates not knowing which customers were eventually approved and bought the product, using only demographic and banking data
    • Layer 2 uses the full dataset to simulate deciding which customers the bank should continue to call to secure the sale
    • Layer 3 uses unsupervised learning to identify the groups of customers
  • Important conclusions:

    • Layer 1: A model achieved over 403 hours of time savings, or 14.25% of their total time. The model only missed 6% of their eventual subscribers.
    • Layer 2: I was able to train a model that saved over 2,600 hours, or 93% of their total time. The model only missed 11% of their total subscribers.
    • Layer 3: An elbow plot helped determine that the optimal number of customer clusters is five.
      • You can see the summary statistics here
      • The ideal customer for the bank's term deposit loan is a blue-collar or management professional in their mid- to late 30s with a secondary or tertiary level of education

Overview

I produced two notebooks for this project, one for the EDA and one for the Modeling. This README summarizes both; you can jump to those sections below.

Table of Contents

  • Summary
  • Overview
  • The dataset
  • Goals
  • EDA
    • Figure 1: Barplots of customer counts for successful and failed campaigns
    • Figure 2: Boxplots of the numerical columns in the dataset, separated by successful and failed campaigns
    • Figure 3: Correlation of feature variables with the target
    • What about Scatterplots?
  • Modeling
    • Notes on project setup
    • Layer 1: Using only the demographic and banking data to simulate customers that haven't been contacted by the bank yet
      • Results of Layer 1
      • Other metrics to optimize: the F1 Score
    • Layer 2: Training a model to predict customers on the full dataset
      • Feature Importances: Using the tree-based model to answer the question: which features in the dataset are most important for predicting a likely customer?
    • Layer 3: Performing unsupervised learning to understand the grouping patterns of the bank's customers
      • Three Clusters: Using PCA to create three clusters in the successful customer dataset
      • Five Clusters: Using PCA to create five clusters in the successful customer dataset
      • Conclusion: Five clusters were chosen as they were able to tell a richer story of the successful customers in the dataset

The dataset

Back to TOC

I am working with a phone call dataset that also has demographic information about the recipients:

Column Data Type Comments
age Numeric The age of the customer
job Categorical The job category of the customer
marital Categorical The customer's marital status
education Categorical The customer's level of education
default Binary If the customer has credit in default or not
balance Numeric Average yearly balance in Euros
housing Binary If the customer has a housing loan or not
loan Binary If the customer has a personal loan
contact Categorical The type of contact communication
day Numeric Last contact day of the month
month Categorical Last contact month of the year
duration Numeric Duration of the last phone call with the customer
campaign Numeric The number of contacts performed during this campaign and for this client, including the last contact

The final column, y, is the target of the dataset and shows whether the client subscribed to a term deposit.

Goals

Back to TOC

The startup is hoping that I can achieve ≥81% accuracy using a 5-fold cross validation strategy, taking the average performance score.
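
As a reference point, here is a minimal sketch of how that benchmark can be checked with scikit-learn's cross_val_score (the CSV path and preprocessing here are assumptions for illustration, not the exact notebook code):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical file name; the repository's actual data file may differ.
df = pd.read_csv("term_deposit_marketing.csv")
X = pd.get_dummies(df.drop(columns=["y"]))   # one-hot encode the categorical columns
y = (df["y"] == "yes").astype(int)           # 1 = subscribed, 0 = did not

# 5-fold cross validation, averaging the accuracy scores, per the >=81% goal.
scores = cross_val_score(RandomForestClassifier(random_state=4769), X, y, cv=5, scoring="accuracy")
print(f"Mean 5-fold accuracy: {scores.mean():.3f}")
```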

Bonus goals are:

  • Determine which customers are most likely to buy the term deposit loan
    • Which segments of customers should the client prioritize?
  • Determine what makes the customer buy the loan
    • Which feature should the startup focus on?

EDA

Back to TOC

There are 40000 rows and 14 columns in the dataset, and it arrived clean, with no null values.

Of all 40000 customers, a little more than 7% subscribed to the term deposit. This points to a very large class imbalance in the dataset.

With 13 feature columns, there was a lot of data to go through. We'll look at barplots of the number of customers within each categorical column, separated into successful and failed campaigns (Figure 1), boxplots of the continuous columns (Figure 2), and a figure showing the correlation between each OneHotEncoded column and the target, y (Figure 3). Note: the columns were OneHotEncoded so that each column shown in the figure refers to one category within a column. For example, there are four categories for highest level of education attained: primary, secondary, tertiary, and unknown. The OneHotEncoded version of this column has a separate column for education_primary, with customers at that education level encoded as 1 and the rest as 0.
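
To make the OneHotEncoding step concrete, here is a minimal sketch using pandas' get_dummies on a toy frame (the notebooks may use scikit-learn's OneHotEncoder instead):

```python
import pandas as pd

# Toy example of the education column described above.
df = pd.DataFrame({"education": ["primary", "secondary", "tertiary", "unknown", "primary"]})

# One-hot encode: each category becomes its own 0/1 column, e.g. education_primary
# is 1 only for rows whose education is 'primary'.
encoded = pd.get_dummies(df, columns=["education"])
print(encoded)
```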

For the continuous columns, here's a statistical summary table:

age balance day duration campaign
count 40000.000000 40000.000000 40000.000000 40000.000000 40000.000000
mean 40.544600 1274.277550 16.017225 254.824300 2.882175
std 9.641776 2903.769716 8.278127 259.366498 3.239051
min 19.000000 -8019.000000 1.000000 0.000000 1.000000
25% 33.000000 54.000000 8.000000 100.000000 1.000000
50% 39.000000 407.000000 17.000000 175.000000 2.000000
75% 48.000000 1319.000000 21.000000 313.000000 3.000000
max 95.000000 102127.000000 31.000000 4918.000000 63.000000

We can glean the following insights from this table:

  • The mean values for the age, day, and campaign columns are about equal to the 50th percentile
    • The distribution of the data may be symmetric
  • The max value in each column besides age and day is much larger than the column's 75th percentile
    • This suggests there could be outliers
    • age and day are naturally bounded, so it makes sense that the max age is 95 and the max day is 31

Figure 1

Back to TOC
Barplots of the count of customers in successful and failed campaigns. Although the raw numbers differ drastically between successful and failed campaigns, the patterns are similar for most of the features. Also notable is that no calls were made to customers in the month of September.

Figure 2

Back to TOC
Boxplots of the numerical columns in the dataset, separated by successful and failed campaigns. Duration does indeed seem different, though recall that this feature describes how long the last phone call with the customer was. It may not tell us that much.

Figure 3

Back to TOC
Correlation of feature variables with the target. Duration has the highest correlation with the target variable, at over 0.4.

Scatterplots?

Back to TOC
What about scatterplots? you may ask. My response: scatterplots did not give us much insight. The data are very dispersed and no pattern readily emerges, so scatterplots are not helpful for this project.

Modeling

Back to TOC
For the modeling, I used a random seed of 4769.

AutoSklearn to Optuna to scikit-learn: the Modeling Workflow

I first used AutoSklearn to help me explore the ML algorithm landscape and identify the best-performing models for this particular dataset. Next, to find the best hyperparameters for our modeling, I used Optuna. It is similar to other hyperparameter search frameworks like Hyperopt, which are designed to quickly and efficiently find the best hyperparameters for your dataset. Finally, I used scikit-learn to build the final, optimized model.
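
Here is a minimal sketch of that Optuna step, assuming a prepared, resampled training split X_train, y_train (0/1 target) and a much smaller search space than the one actually used in the notebooks:

```python
import optuna
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Sample a few SGDClassifier hyperparameters; the real study tunes many more.
    params = {
        "loss": trial.suggest_categorical("loss", ["hinge", "modified_huber"]),
        "penalty": trial.suggest_categorical("penalty", ["l2", "l1", "elasticnet"]),
        "alpha": trial.suggest_float("alpha", 1e-7, 1e-2, log=True),
    }
    model = SGDClassifier(random_state=4769, **params)
    # Optimize mean cross-validated recall on the training data.
    return cross_val_score(model, X_train, y_train, cv=5, scoring="recall").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params, study.best_value)
```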

Notes on project setup

Back to TOC
We want to help the bank understand which customers are most likely to purchase the financial product. Knowing this would save the bank time and money. The dataset that we were given consists of demographic and banking data (like age, job, marital, and balance) as well as campaign-specific information (like contact, day, and duration).

Demographic and Banking Data Campaign-Specific Data Target Feature
age contact y
job day
marital month
education duration
default campaign
balance
housing
loan

We want to build a three-layered ML system that helps answer the project goals:

  1. Understand which kinds of customers the bank should call
    • The model will not have access to the campaign call data
  2. After the initial calls, understand which customers the company should keep calling
    • The model will have access to the campaign call data
  3. Build a model using unsupervised learning to learn about clusters of customers in the dataset

Layer 1:
Use X_1 to model which customers to call. We are training a model that has no access to call data, simulating the situation before any calls are made.

Layer 2:
Use the full X dataset (for clarity in the layer flow, we'll call it X_2) to model which customers the company should keep calling.

Layer 3:
Use unsupervised learning to uncover how the customers are grouped.
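
A minimal sketch of how those two feature sets can be split, assuming the raw dataframe df with the columns listed earlier:

```python
# Assumes df is the raw dataset with the columns listed in the table above.
demographic_cols = ["age", "job", "marital", "education", "default", "balance", "housing", "loan"]
campaign_cols = ["contact", "day", "month", "duration", "campaign"]

X_1 = df[demographic_cols]                  # Layer 1: no call data
X_2 = df[demographic_cols + campaign_cols]  # Layer 2: full feature set
y = (df["y"] == "yes").astype(int)          # target: subscribed or not
```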

Layer 1

Back to TOC
I wrote a function that used AutoSklearn to spend 60 minutes fitting and evaluating models. The function then returned a list of the models that achieved high accuracy.

However, with our imbalanced dataset, we needed more control, as we had to tune for recall. I decided that the best course of action was to do the following:

  1. Run a grid search of sorts. I created a list of scaling techniques, like StandardScaler, a list of sampling techniques, like RandomOverSampler or SMOTETomek, and a list of classifiers to test, like RandomForestClassifier or LGBMClassifier.
  2. Using nested for loops, I ran through each combination and saved the results to a dictionary (a condensed sketch of this loop appears below, after the untuned-model results).
  3. I extracted the best metric from the results dictionary.
  • A best recall score of over 87% was found using no scaler, the SMOTE resampling method, and the SGDClassifier model:
Class Precision Recall F1-Score Support
0 0.95 0.19 0.31 7414
1 0.08 0.87 0.14 586
Accuracy 0.24 8000
Macro Avg 0.51 0.53 0.23 8000
Weighted Avg 0.89 0.24 0.30 8000

Untuned, ideal model:

(Jump to the tuned model below)

Confusion Matrix #1
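
A condensed sketch of the nested loop from steps 1 and 2, assuming encoded train/test splits X_train, X_test, y_train, y_test (the real lists of scalers, samplers, and classifiers are longer):

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import recall_score
from sklearn.preprocessing import StandardScaler

scalers = {"none": None, "standard": StandardScaler()}
samplers = {"ros": RandomOverSampler(random_state=4769), "smote": SMOTE(random_state=4769)}
classifiers = {"sgd": SGDClassifier(random_state=4769), "rf": RandomForestClassifier(random_state=4769)}

results = {}
for s_name, scaler in scalers.items():
    for r_name, sampler in samplers.items():
        for c_name, clf in classifiers.items():
            X_tr = scaler.fit_transform(X_train) if scaler else X_train
            X_te = scaler.transform(X_test) if scaler else X_test
            X_res, y_res = sampler.fit_resample(X_tr, y_train)  # resample only the training data
            clf.fit(X_res, y_res)
            results[(s_name, r_name, c_name)] = recall_score(y_test, clf.predict(X_te))

best = max(results, key=results.get)
print(best, results[best])
```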

  4. The results pointed me in the direction of which scaler, sampling technique, and model I should use to optimize with Optuna.
  • After 100 trials, I found these parameters, which gave a training recall score of almost 95%:
Hyperparameter Name Hyperparameter Value
penalty elasticnet
l1_ratio 0.9665372247163372
loss modified_huber
tol 75.52719927740569
learning_rate invscaling
eta0 0.7274942852090539
power_t 647.2058587404654
early_stopping True
validation_fraction 0.3765902841689254
alpha 7.181611953044439e-07
fit_intercept False
max_iter 1344
  5. Running a new model with these tuned hyperparameters gave the following results:
Class Precision Recall F1-Score Support
0 0.95 0.10 0.18 7414
1 0.08 0.94 0.14 586
Accuracy 0.16 8000
Macro Avg 0.51 0.52 0.16 8000
Weighted Avg 0.89 0.16 0.17 8000

Tuned, ideal model

Confusion Matrix #2

Interpreting the results

Back to TOC

  • When precision for class 0 is 95%, that means when the model predicts a customer as a non-subscriber, it is correct 95% of the time.
  • A precision of 8% for class 1 indicates that when the model predicts a customer will subscribe, it is correct only 8% of the time. There are many false positives.
  • The recall for class 0 is 10%, which means that the model is only identifying 10% of the non-subscribers correctly.
  • A very high recall of 94% shows that the model identifies almost all of the actual subscribers correctly.
  • The take-home message: this model is very good at catching subscribers, but at the cost of many unnecessary calls.

Now let's figure out how much time the company would save.

  • The mean call time is about 4.25 minutes
  • With 8000 customers in the test set, calling every customer would take a total of 566.67 hours
  • TP + FP = Total calls with model
    • TP = 548
    • FP = 6703
    • Total calls with model = 7251
  • Total calls with model * Mean call time = Total minutes with model
    • Total minutes with model = 30,795
  • Without the model, the company would have to call all 8000 customers:
    • 8000 * ~4.25 = 33,976 minutes without the model
  • 33,976 call minutes without the model - 30,795 call minutes with the model = 3,181 minutes saved, or about 53 hours, which is 9.36% of the total call time (see the short calculation below). While 53 hours is a fine result, it's not that meaningful a savings. How did the untuned model with the ideal techniques perform? It saved the company 103 hours, or over 18%, but it missed 74 subscribers rather than just 38.
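
The same bookkeeping as a short calculation, using the confusion-matrix counts above and the mean call duration from the summary table:

```python
# Time-savings arithmetic for the tuned Layer 1 model on the 8,000-customer test set.
mean_call_minutes = 254.82 / 60   # mean 'duration' in minutes (~4.25)
tp, fp = 548, 6703                # predicted-positive counts from the confusion matrix

calls_with_model = tp + fp                           # 7,251 customers the model says to call
minutes_with_model = calls_with_model * mean_call_minutes
minutes_without_model = 8000 * mean_call_minutes     # call everyone

saved = minutes_without_model - minutes_with_model
print(f"Saved ~{saved:.0f} minutes (~{saved / 60:.0f} hours, "
      f"{saved / minutes_without_model:.1%} of total call time)")
```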

Other metrics to optimize: F1 Score

Back to TOC
It was at this point that I thought about other metrics to optimize. Sure, we want to focus on recall for class 1, because that captures the subscribers, but we also want to be mindful of saving the company time overall. The F1 Score could be helpful here: it is the harmonic mean of precision and recall. Would this be the best way to balance the tradeoff between precision and recall? The F1 Score may also better accommodate the class distribution.

Equations for precision and recall

Equation for F1-Score

Remember that TP = True positive, FP = False positive, TN = True negative, and FN = False negative.
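
For reference, written out in terms of those counts:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)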

Using the same nested for loops as before, I extracted the best F1 Score this time and found that a MinMaxScaler with SMOTE resampling on the LGBMClassifier performed best, giving the following result:

Class Precision Recall F1-Score Support
0 0.94 0.57 0.71 7414
1 0.09 0.56 0.16 586
Accuracy 0.57 8000
Macro Avg 0.52 0.57 0.43 8000
Weighted Avg 0.88 0.57 0.67 8000

Confusion matrix for F1 Score on Untuned Model

This model saved the company almost 316 hours, or 55.8% of their time, but it missed about half of the true subscribers. A threshold value of 0.005 produced the best recall score of this entire project (96%), missing only 25 customers. This model, however, would only save the company just under 40 hours, or almost 7% of their total time.

What about using a threshold value to tune the decision cutoff for the predicted probabilities? If we can choose a better threshold, we may be able to control the cutoff point and settle on a tradeoff that we're comfortable with.

To illustrate this relationship, I generated a list of threshold values and saved the results, tracking the recall score for both class 0 and class 1 as well as the number of true and false positives for class 1. The vertical line is at a threshold value that would produce a 90% recall score for class 1:

Recall and Amount of TP and NP vs. Threshold Values

Running the model on the test set with the threshold that yields a 90% recall score for class 1 saves the company just over 74 hours, or almost 13.1% of their time, while missing only 56 subscribers.
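
A minimal sketch of that threshold sweep, assuming a fitted classifier clf with predict_proba and the encoded test split X_test, y_test:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

proba = clf.predict_proba(X_test)[:, 1]   # predicted probability of class 1 (subscriber)

sweep = []
for t in np.linspace(0.0, 1.0, 101):
    preds = (proba >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, preds, labels=[0, 1]).ravel()
    sweep.append({
        "threshold": t,
        "recall_1": recall_score(y_test, preds, pos_label=1, zero_division=0),
        "recall_0": recall_score(y_test, preds, pos_label=0, zero_division=0),
        "tp": tp,
        "fp": fp,
    })

# The largest threshold that still keeps class-1 recall at or above 90%.
best = max((row for row in sweep if row["recall_1"] >= 0.90), key=lambda row: row["threshold"])
print(best)
```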

Layer 2

Back to TOC

Now we come to the last step: running the model on the full dataset. I have gone through several models, tuned to a couple of metrics, and iterated to find an optimal threshold value. This last step is crucial because it will show the company how much time they can save overall. We will now train a new model on a training set that has been resampled, then test it on the full dataset, which has been encoded and scaled but not resampled.

When running the following specifications on the full dataset, we were able to get an incredible time savings of almost 2,634 hours (or 109 days, 18 hours), which equates to more than 93% of the time it would take to call every single individual in the dataset.

Specifications:

  • OneHotEncode
  • StandardScaler
  • RandomOverSampler
  • RandomForestClassifier
  • 5-fold StratifiedKFold
Class Precision Recall F1-Score Support
0 0.99 0.99 0.99 37104
1 0.92 0.89 0.90 2896
Accuracy 0.99 40000
Macro Avg 0.96 0.94 0.95 40000
Weighted Avg 0.99 0.99 0.99 40000

RandomForestClassifier on full dataset

This model will save the company almost 2,634 hours, or over 93% of their time, while missing only 11% of the actual subscribers.
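
A compressed sketch of that setup, assuming X_full is the encoded and scaled full feature matrix, y is the 0/1 target, and X_train, y_train is a training split drawn from it:

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Check the 5-fold StratifiedKFold accuracy goal on the training split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=4769)
rf = RandomForestClassifier(random_state=4769)
print(cross_val_score(rf, X_train, y_train, cv=skf, scoring="accuracy").mean())

# Resample only the training split, fit, then score on the full (unresampled) dataset.
X_res, y_res = RandomOverSampler(random_state=4769).fit_resample(X_train, y_train)
rf.fit(X_res, y_res)
print(classification_report(y, rf.predict(X_full)))
```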

Extracting the feature importances from the model shows that call duration, balance, day, and age are very important decision points for the tree-based model. Duration has almost 47% of the total importance, far exceeding the other categories:

Feature importance plot

December has the lowest importance, which makes sense given that it falls during the holiday season.

Feature Importance
Duration 46.77%
Balance 7.54%
Day 7.36%
Age 6.31%

The bank would do well to focus on these important features for selecting customers or preparing to sell them the loan.
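
Extracting those importances from the fitted forest is short, assuming the rf model and the X_full dataframe from the sketch above:

```python
import pandas as pd

# Pair each encoded column with its share of the forest's total importance.
importances = pd.Series(rf.feature_importances_, index=X_full.columns).sort_values(ascending=False)
print(importances.head(10).mul(100).round(2))   # top features as percentages
```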

Layer 3

Back to TOC

Now we have come to the final section of this project. We have trained models to predict which customers are likely to buy the term deposit loan, and we have trained models that help the company understand whom they should continue to call. One last question still needs to be answered: can we group the customers into clusters? This would help the bank understand which kind of customer to target, the kind that would very likely purchase the loan.

I prepared a new version of the dataset that was scaled using scikit-learn's Normalizer rather than its StandardScaler, as the resulting correlation plot showed better correlations with the continuous features like Duration and Balance:

Correlation matrix with encoded dataset

The dendrogram showed results similar to those we got from the feature importances, highlighting the importance of Age and Balance, among others:

Normalized dataset dendrogram

Using KMeans to construct an elbow plot to determine the optimal number of clusters showed that three or five clusters gave a good representation of the total within-cluster sum of squares, or inertia:

Elbow plot for optimal number of clusters using KMeans
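
A minimal sketch of the elbow computation, assuming X_norm is the normalized, encoded matrix:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=4769).fit(X_norm)
    inertias.append(km.inertia_)   # within-cluster sum of squared distances

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow plot for KMeans")
plt.show()
```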

Three Clusters

Back to TOC

Using PCA, I reduced the dimensions to three. Then, comparing PCA, t-SNE, and UMAP projections, UMAP gave the most convincing plot of the 3 clusters:

UMAP 2D representation of groups
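
A minimal sketch of that reduce-cluster-project step, assuming X_norm from above and the umap-learn package:

```python
import matplotlib.pyplot as plt
import umap
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Reduce to 3 principal components, cluster, then project to 2D with UMAP for plotting.
X_pca = PCA(n_components=3, random_state=4769).fit_transform(X_norm)
labels = KMeans(n_clusters=3, n_init=10, random_state=4769).fit_predict(X_pca)

embedding = umap.UMAP(random_state=4769).fit_transform(X_pca)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=5, cmap="viridis")
plt.title("UMAP projection of the 3 clusters")
plt.show()
```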

Calculating some summary statistics for the clusters gave the following table:

Cluster Number Attribute Value
1 Median age: 36
Median balance (Euro) 22
Education level Secondary
Job category Blue-Collar (21.5%)
Marriage status Married
2 Median age: 39
Median balance (Euro) 2326
Education level Tertiary
Job category Management (28.3%)
Marriage status Married
3 Median age: 37
Median balance (Euro) 556
Education level Secondary
Job category Blue-Collar (21.6%)
Marriage status Married

It might make sense to focus on those in cluster 2, as they have more money to spend and are thus probably earning more, too. That said, these clusters are derived from successful customers, so it would make the most sense to appeal to all three of these clusters.

Five Clusters

Back to TOC

Using PCA, I reduced the dimensions to five. Then, comparing PCA, t-SNE, and UMAP projections, UMAP again gave the most convincing plot of the 5 clusters:

UMAP 2D representation of groups

Calculating some summary statistics for the clusters gave the following table:

Cluster Number Attribute Value
1 Median age: 36
Median balance (Euro) 36
Education level Secondary
Job category Management (21.4%)
Marriage status Married
2 Median age: 38
Median balance (Euro) 2850
Education level Tertiary
Job category Management (29.7%)
Marriage status Married
3 Median age: 36
Median balance (Euro) 420
Education level Secondary
Job category Blue-Collar (22.3%)
Marriage status Married
4 Median age: 38
Median balance (Euro) 934
Education level Secondary
Job category Blue-Collar (22.2%)
Marriage status Married
5 Median age: 39
Median balance (Euro) -394
Education level Tertiary
Job category Management (25.0%)
Marriage status Married

What is most interesting to me here is the fifth grouping, cluster 5, with a negative balance in their bank account. They work in management and have a tertiary level of education, suggesting they are high earners, but they are not prioritizing the health of their bank account. It would make sense that they would need a loan, and they are probably able to pay it back given their employment situation. Being married and having the oldest median age further suggests their financial stability.

Conclusion

Back to TOC

Given the summary statistics of the three- and five-cluster groupings, it seems like five groups tell a more complete story of the customers. There are important things to highlight within the groups, namely that the fifth and second clusters might be the most worthwhile customers to focus on, given their jobs and educational background probably give them higher-than-average incomes, which they could use to purchase the term deposit loans from the bank.

That being said, taken as a whole, these customers are in their mid- to late 30s and have achieved either a secondary or tertiary education level. Their bank accounts hold different amounts of money (hence why the bank should target those in cluster two), and they work mostly in management (though some are in blue-collar jobs).
