This is how you can use Data Science to evaluate your Marketing Campaign

Mar 22, 2021

Project Overview

A Portuguese bank ran several direct marketing campaigns designed to attract clients. However, the institution is not able to evaluate whether the campaigns were successful. There is also no analysis of which clients have a higher probability of subscribing.

The goal is to use the available data to predict whether a client will subscribe to a term deposit (represented by the target variable y). This prediction will allow us to evaluate whether the marketing campaigns were successful.

Problem Statement

The project aims to answer the following questions:
1) Can we predict whether a client will subscribe to a term deposit based on the data collected during the marketing campaigns? How accurate can we be?
2) What are the common features of clients who are more likely to subscribe?
3) Can we say that the former marketing campaigns were successful?

Data Visualization

Each column in the dataset was analysed with plots in order to understand the relationship of each variable with the target (0: no subscription, 1: subscription). Below are a few examples:

Plot of the Education categorical variable

Plot of the Duration continuous variable
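
The comparison behind plots like these can be sketched with pandas. This is only an illustration: the small DataFrame below is invented toy data, and only the column names education, duration and y come from the dataset.

```python
import pandas as pd

# Toy data invented for illustration; only the column names come from the dataset.
df = pd.DataFrame({
    "education": ["primary", "secondary", "secondary", "tertiary",
                  "tertiary", "secondary", "primary", "tertiary"],
    "duration":  [120, 300, 95, 640, 210, 80, 150, 720],
    "y":         [0, 1, 0, 1, 0, 0, 0, 1],
})

# Categorical variable: share of each target class within every education level.
education_vs_target = pd.crosstab(df["education"], df["y"], normalize="index")

# Continuous variable: summary statistics of duration for each target class.
duration_by_target = df.groupby("y")["duration"].describe()

print(education_vs_target)
print(duration_by_target)
```

With matplotlib installed, calling education_vs_target.plot(kind="bar") would turn the first table into a bar chart.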

Data Preprocessing

Possible outliers and abnormalities were identified with boxplots:

The distributions were once again analysed:

Spearman and Pearson correlations were used to analyse possible correlated variables:

Pearson Correlation

Spearman

The information in the variables day and month is already contained in pdays. In addition, the variables pdays and previous are strongly correlated. As pdays brings more meaning to the problem, the variable previous was discarded.
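
A correlation check of this kind can be sketched as follows. The data is invented, the 0.8 threshold is an assumption chosen for the example, and only the column names pdays and previous come from the dataset.

```python
import pandas as pd

# Toy data invented for illustration; only the column names come from the dataset.
df = pd.DataFrame({
    "age":      [30, 45, 52, 58, 39, 31, 60, 47],
    "pdays":    [-1, 90, 180, -1, 95, 200, -1, 120],
    "previous": [0, 2, 4, 0, 2, 5, 0, 3],
})

# Pearson measures linear association; Spearman measures monotonic (rank) association.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Hypothetical threshold: flag pairs that are strongly correlated under both measures.
THRESHOLD = 0.8
strong_pairs = [
    (a, b)
    for i, a in enumerate(pearson.columns)
    for b in pearson.columns[i + 1:]
    if abs(pearson.loc[a, b]) > THRESHOLD and abs(spearman.loc[a, b]) > THRESHOLD
]

# Keep pdays and drop previous, mirroring the decision described above.
df_reduced = df.drop(columns=["previous"])
```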

Before the missing data treatment we had:

The poutcome variable had too many missing values and was discarded. The missing values in contact were replaced by ‘unknown’, and the rows with missing values in job and education were removed.
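
A minimal pandas sketch of this treatment is shown below. The rows are invented for illustration; the column names follow the dataset, and the actual project code may differ.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; only the column names come from the dataset.
df = pd.DataFrame({
    "job":       ["admin.", np.nan, "technician", "services", "management"],
    "education": ["secondary", "tertiary", np.nan, "primary", "tertiary"],
    "contact":   ["cellular", np.nan, "telephone", np.nan, "cellular"],
    "poutcome":  [np.nan, np.nan, "success", np.nan, np.nan],
})

# Share of missing values per column, used to decide how to treat each one.
missing_share = df.isna().mean()

cleaned = (
    df.drop(columns=["poutcome"])              # mostly missing, so the column is dropped
      .fillna({"contact": "unknown"})          # keep the rows, mark the missing contact type
      .dropna(subset=["job", "education"])     # remove rows lacking job or education
      .reset_index(drop=True)
)
```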

Categorical variables were encoded using Label Encoder, dummy variables, and an explicit mapping for the variables with a hierarchical (ordered) relationship.
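
The three encodings can be sketched as follows. Which column received which encoding is not stated here, so the assignment below (education mapped, y label-encoded, job as dummies) is an assumption, and the data is invented.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows invented for illustration.
df = pd.DataFrame({
    "education": ["primary", "tertiary", "secondary", "primary"],
    "job":       ["services", "management", "technician", "services"],
    "y":         ["no", "yes", "no", "yes"],
})

# 1) Mapping: ordered categories get integers that preserve the order (assumed order).
education_order = {"primary": 0, "secondary": 1, "tertiary": 2}
df["education"] = df["education"].map(education_order)

# 2) Label Encoder: classes are sorted alphabetically, so "no" -> 0 and "yes" -> 1.
df["y"] = LabelEncoder().fit_transform(df["y"])

# 3) Dummies: one 0/1 column per category for variables without a natural order.
df = pd.get_dummies(df, columns=["job"], dtype=int)
```

Dummies avoid implying an order between unordered categories, at the cost of one extra column per category.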

Other commonly used steps:

Train Test Split: the data was split into test (30%) and train (70%) sets.
Feature Scaling: Standard Scaling was used.
Dimensionality Reduction: there are two scenarios in this project. The first uses the regular data and the second uses PCA, with 83% of explained variance, to reduce the data dimensionality.
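
These three steps can be sketched with scikit-learn as below. The data is synthetic, and the stratified split and the random seeds are assumptions added for the example; passing a fraction to PCA asks it to keep just enough components to reach that share of explained variance.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data standing in for the preprocessed bank dataset.
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=6,
    weights=[0.88, 0.12], random_state=42,
)

# 70% train / 30% test; stratify (an assumption) keeps the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Scenario 2: keep enough principal components to explain at least 83% of the variance.
pca = PCA(n_components=0.83, svd_solver="full").fit(X_train_s)
X_train_pca = pca.transform(X_train_s)
X_test_pca = pca.transform(X_test_s)
explained = pca.explained_variance_ratio_.sum()
```

Fitting the scaler and PCA on the training set alone keeps information from the test set out of the model.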

Model Considerations

The Random Forest Classifier was the model used as the baseline technique.
Randomized Grid Search was used to find the initial set of parameters.
Regular Grid Search was used to narrow down the set of parameters.
The optimized metric is accuracy, given that the main goal of the project is to build a model able to correctly classify whether a client will subscribe.
The best model was found in the scenario without PCA, with an accuracy of almost 90%.
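
The two-stage search can be sketched as follows, assuming scikit-learn's RandomizedSearchCV and GridSearchCV. The data is synthetic, and the parameter ranges, the number of iterations and the 3-fold cross-validation are hypothetical choices, not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data standing in for the training set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Stage 1: sample a few combinations from a broad (hypothetical) parameter space.
param_distributions = {
    "n_estimators": [20, 40, 60],
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 2, 4],
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5, scoring="accuracy", cv=3, random_state=0,
)
random_search.fit(X, y)
best = random_search.best_params_

# Stage 2: exhaustive search on a narrow grid around the best randomized result.
param_grid = {
    "n_estimators": sorted({max(10, best["n_estimators"] - 10),
                            best["n_estimators"],
                            best["n_estimators"] + 10}),
    "max_depth": [best["max_depth"]],
    "min_samples_leaf": [best["min_samples_leaf"]],
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid, scoring="accuracy", cv=3,
)
grid_search.fit(X, y)

print(grid_search.best_params_, grid_search.best_score_)
```

The randomized stage is cheap because it evaluates only n_iter combinations; the grid stage then spends its budget on the region that already looked promising.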

Model Evaluation

Concerning the final model, the following metrics were analysed:
1. Precision
2. Recall
3. Accuracy
4. AUC
5. F1
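
For reference, all five metrics are available in scikit-learn. The labels and scores below are invented purely to show the calls; they are not the project's results.

```python
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score,
)

# Invented labels, hard predictions and predicted probabilities for class 1.
y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_score = [0.10, 0.20, 0.15, 0.30, 0.25, 0.60, 0.90, 0.80, 0.70, 0.40]

metrics = {
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "recall":    recall_score(y_true, y_pred),     # TP / (TP + FN)
    "accuracy":  accuracy_score(y_true, y_pred),   # correct predictions / all predictions
    "auc":       roc_auc_score(y_true, y_score),   # uses scores, not hard predictions
    "f1":        f1_score(y_true, y_pred),         # harmonic mean of precision and recall
}
```

Note that AUC is computed from the predicted probabilities (for example the second column of predict_proba), while the other four use the predicted classes.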

Discussion

The mean accuracy (our main metric) is considerably high. It shows us that our goal, in terms of the model, was achieved.
The precision and recall are balanced. The resulting F1 score is suitable for the project.
The AUC is considerably high, another good indicator of the quality of the model.

Reflection

Question: Can we predict whether a client will subscribe to a term deposit? Based on past data, how accurate can we be?

To answer this question the model was evaluated in the test set.

The metrics obtained are similar to the metrics obtained in the validation set, indicating that there is no overfitting. The accuracy is consistent and shows that we can properly make predictions about new customers and prospects, with an accuracy of almost 90%.

What are the common features of clients who are more likely to subscribe to a term deposit?

Clients with secondary education are more likely to subscribe.

Clients aged 30 to 40 are more likely to subscribe.

Clients with no default are more likely to subscribe.

Clients with a balance from 0 to 10000 are more likely to subscribe.

Clients with a housing loan are more likely to subscribe.

Clients without a personal loan are more likely to subscribe.

From an overall perspective, can we say that the former marketing campaigns were successful?

Based on the distribution below, we can’t say the campaigns were successful.
The number of clients who didn’t subscribe is much larger than the number who subscribed.

Improvements

Possible improvements are:

Treat the problem as imbalanced, using a technique such as SMOTE to increase the number of observations in the minority class (target 1, subscription). This would likely improve recall, precision and F1.
Obtain more data to take into account the pandemic’s effect on clients’ subscriptions.
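
To illustrate the idea behind SMOTE, here is a simplified sketch written with NumPy and scikit-learn. It is not the reference implementation (the imbalanced-learn package provides a SMOTE class for real use); the function name smote_like_oversample, the data and the class sizes are all hypothetical. New points are created for the minority class, which per the distribution discussed above is the clients who subscribed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours: the query points are the fitted points,
    # so the first neighbour of each point is the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)            # starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Invented two-feature data: 90 majority rows (class 0) and 10 minority rows (class 1).
rng = np.random.default_rng(42)
X_majority = rng.normal(0.0, 1.0, size=(90, 2))
X_minority = rng.normal(3.0, 1.0, size=(10, 2))

# Add 80 synthetic minority rows so that both classes have 90 observations.
X_synth = smote_like_oversample(X_minority, n_new=80)
X_balanced = np.vstack([X_majority, X_minority, X_synth])
y_balanced = np.array([0] * 90 + [1] * (10 + 80))
```

Oversampling of this kind should be applied to the training set only, so that the test set still reflects the real class distribution.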

Conclusion

Using machine learning techniques to predict whether a client will subscribe to a banking service is a reliable option, provided that data covering a suitable period of time is available for analysis. The project presented here supports that hypothesis.

GitHub

Find below the link to my GitHub project:

https://github.com/crdealme/Bank-Marketing-Capstone