Every time a company promotes a product through tactics like offering discounts or running a digital ad campaign, there is a cost as well as a potential revenue opportunity associated with it. If the company is not careful in choosing the right set of customers to receive the promotion, it can end up losing a lot of money without earning much in return.
The dataset that I have used in this project was originally used as a take-home assignment provided by Starbucks for their job candidates. The data for this exercise consists of about 120,000 data points split in a 2:1 ratio among training and test files. In the experiment simulated by the data, an advertising promotion was tested to see if it would bring more customers to purchase a specific product priced at $10. Since it costs the company $0.15 to send out each promotion, it would be best to limit that promotion only to those who are most receptive to it. Each data point includes one column indicating whether or not an individual was sent a promotion for the product, and one column indicating whether or not that individual eventually purchased that product. Each individual also has seven additional features associated with them, which are provided abstractly as V1-V7.
Our goal is to maximize the following metrics:
IRR depicts how many more customers purchased the product with the promotion, as compared to if they didn't receive the promotion. Mathematically, it's the ratio of the number of purchasers in the promotion (treatment) group to the total number of customers in that group, minus the ratio of the number of purchasers in the non-promotion (control) group to the total number of customers in that group.
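Using the same symbols as the NIR formula further below, this can be written as:

$$ IRR = \frac{N_{Treat\_Purchase}}{N_{Treat}} - \frac{N_{Control\_Purchase}}{N_{Control}} $$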
NIR depicts how much is made (or lost) by sending out the promotion. Mathematically, this is $10 times the number of purchasers who received the promotion, minus $0.15 times the number of promotions sent out, minus $10 times the number of purchasers who were not given the promotion.
In this case,
$$ NIR = (10 \cdot N_{Treat\_Purchase} - 0.15 \cdot N_{Treat}) - (10 \cdot N_{Control\_Purchase}) $$

We can use the variables V1 to V7, available in the training dataset for each person, to decide whether to send the promotion to that person. We can use various approaches that model the problem differently and predict the likelihood of a person purchasing the product after receiving the promotion.
From past data, we know there are four possible outcomes:
Table of actual promotion vs. predicted promotion customers:
| Predicted \ Actual | Yes | No |
|---|---|---|
| Yes | I | II |
| No | III | IV |
The metrics are only compared for the individuals we predict should receive the promotion, that is, quadrants I and II. Since the individuals in the training set received the promotion at random, we can expect quadrants I and II to contain approximately equal numbers of participants. Comparing quadrant I to quadrant II then gives an idea of how well our promotion strategy will work in the future.
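To make the two metrics concrete, here is a small worked example with made-up quadrant counts (the numbers below are illustrative, not from the dataset):

```python
# Hypothetical counts for the individuals we predict should be promoted:
# quadrant I (actually promoted) vs. quadrant II (actually not promoted)
n_treat = 5000        # customers in quadrant I
n_treat_purch = 120   # purchasers in quadrant I
n_control = 5000      # customers in quadrant II
n_ctrl_purch = 60     # purchasers in quadrant II

# IRR: uplift in purchase rate between the two quadrants
irr = n_treat_purch / n_treat - n_ctrl_purch / n_control
# NIR: $10 per incremental purchase minus $0.15 per promotion sent
nir = 10 * n_treat_purch - 0.15 * n_treat - 10 * n_ctrl_purch
print(irr, nir)
```

Here the promotion doubles the purchase rate (IRR of 1.2 percentage points), yet the campaign still loses money because the $0.15 cost is paid for all 5,000 promotions sent.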
When we feel we have a good optimization strategy, we can complete the `promotion_strategy` function to be passed to the `test_results` function.
# load in packages
from itertools import combinations
from test_results import test_results, score
import numpy as np
import time
import pandas as pd
import scipy as sp
import sklearn as sk
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import f1_score
import hyperopt
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from hyperopt import space_eval
import pickle
import gc
import math
def f1_eval(y_pred, dtrain):
    # Custom xgboost eval metric: return 1 - F1 so that a *lower*
    # value is better and early stopping minimizes the error
    y_true = dtrain.get_label()
    err = 1 - f1_score(y_true, np.round(y_pred))
    return 'f1_err', err
import decimal
from sklearn.model_selection import KFold
def float_range(start, stop, step):
    # Use Decimal arithmetic to avoid floating-point drift when stepping
    start = decimal.Decimal(start)
    stop = decimal.Decimal(stop)
    step = decimal.Decimal(step)
    while start < stop:
        yield float(start)
        start += step
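As a quick sanity check of the generator (repeating its definition here so the snippet is self-contained):

```python
import decimal

def float_range(start, stop, step):
    # Decimal-based range: accepts strings, avoids float accumulation error
    start = decimal.Decimal(start)
    stop = decimal.Decimal(stop)
    step = decimal.Decimal(step)
    while start < stop:
        yield float(start)
        start += step

print(list(float_range("0.0", "0.5", "0.1")))  # [0.0, 0.1, 0.2, 0.3, 0.4]
```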
# load in the data
train_data = pd.read_csv('./training.csv')
train_data.head()
data_dir = "./data"
train_data.describe()
# Cells for you to work and document as necessary -
# definitely feel free to add more cells as you need
train_data["Promotion"].value_counts()
If we treat the act of giving the promotion as a treatment applied by the company to its customers, and those who were not given the promotion as the control group, then we can see that a nearly equal number of customers belong to each group.
train_data["purchase"].value_counts()
It is clear from these numbers that there is a high imbalance between the customers who chose to purchase the product and those who didn't. We need to account for this while using this dataset to train the machine learning algorithm, for example by oversampling the under-represented (minority) value 1 of the target variable `purchase`. SMOTE is one useful technique that generates a balanced training dataset while also introducing some variation in the input variables as it oversamples the minority class.
We are given a dataset that includes customers who did and did not receive the promotion. As it costs the company $0.15 to promote to each customer, it will try to avoid promoting to customers who would purchase the product anyway, and to customers who would not purchase it even after receiving the promotion.
The company is interested in giving the promotion to customers who are likely to make the purchase only after receiving the promotion. The job of the predictive model is to predict whether a given customer falls into this category. If yes, then our algorithm will suggest giving the promotion to that customer; otherwise it won't.
A statistical model can be trained to decide whether to give a customer the promotion by labeling each customer 1 in the output variable if they were shown the promotion and purchased the product, and 0 for all other scenarios. We can name this new variable `response`, as it indicates whether the customer responded positively to our promotion.
train_data_1 = train_data.copy()
train_data_1["response"] = (train_data_1["Promotion"] == "Yes") & (train_data_1["purchase"] == 1)
features = ["V"+str(x) for x in range(1,8)]
X = train_data_1[features]
Y = train_data_1["response"]
Y.value_counts()
X_train, X_valid, Y_train, Y_valid = train_test_split(X, Y, test_size=0.2, random_state=42)
sm = SMOTE(random_state=42, sampling_strategy=1.0)  # resample minority class to a 1:1 ratio
X_balanced_train, Y_balanced_train = sm.fit_resample(X_train, Y_train)
Converting back to dataframe and series
X_balanced_train = pd.DataFrame(X_balanced_train, columns=features)
X_balanced_train.columns
Y_balanced_train = pd.Series(Y_balanced_train)
cv = GridSearchCV(estimator=XGBClassifier(),
                  param_grid={
                      "max_depth": range(5, 8, 1),
                      "min_child_weight": [5, 10, 20, 50],
                      "gamma": [0, 0.1, 0.2],
                      "random_state": [42],
                      "n_estimators": [1000]
                  },
                  scoring="f1", cv=3)
start_time = time.time()
fit_params = {
    "eval_set": [(X_valid, Y_valid)],
    "eval_metric": f1_eval,
    "early_stopping_rounds": 20,
    "verbose": 0
}
cv.fit(X_balanced_train, Y_balanced_train, **fit_params)
elapsed_time = (time.time() - start_time) / 60
print('Elapsed computation time: {:.3f} mins'.format(elapsed_time))
cv.best_params_
This will help us decide the number of estimators.
xgb = XGBClassifier(n_estimators=1000)
best_params_xgb = cv.best_params_
xgb.set_params(**best_params_xgb)
xgb.fit(X=X_balanced_train, y=Y_balanced_train.values.ravel(), eval_set=[(X_valid, Y_valid)], eval_metric=f1_eval, early_stopping_rounds=10, verbose=10)
optimal_n_estimators = xgb.best_ntree_limit
We have found the optimal max_depth and number of estimators for the XGBoost algorithm for our case. Now we train XGBoost on the entire training dataset for use in the promotion strategy.
X_balanced, Y_balanced = sm.fit_resample(X, Y)
X_balanced = pd.DataFrame(X_balanced, columns=features)
Y_balanced = pd.Series(Y_balanced)
xgb = XGBClassifier(max_depth=best_params_xgb["max_depth"],
gamma=best_params_xgb["gamma"],
min_child_weight=best_params_xgb["min_child_weight"],
n_estimators=optimal_n_estimators,
random_state=42)
xgb.fit(X_balanced, Y_balanced)
pickle.dump(xgb, open(data_dir + '/xgb_best_approach_1.pkl', 'wb'))
model = pickle.load(open(data_dir + "/xgb_best_approach_1.pkl", 'rb'))
def promotion_strategy(df):
    '''
    INPUT
    df - a dataframe with *only* the columns V1 - V7 (same as train_data)

    OUTPUT
    promotion_df - np.array with the values
                   'Yes' or 'No' related to whether or not an
                   individual should receive a promotion
                   should be the length of df.shape[0]

    Ex:
    INPUT: df

    V1  V2    V3  V4  V5  V6  V7
    2   30  -1.1   1   1   3   2
    3   32  -0.6   2   3   2   2
    2   30  0.13   1   1   4   2

    OUTPUT: promotion

    array(['Yes', 'Yes', 'No'])
    indicating the first two users would receive the promotion and
    the last should not.
    '''
    # Predict the response label and map 1/0 to 'Yes'/'No'
    preds = model.predict(df)
    promotion = np.where(preds, 'Yes', 'No')
    return promotion
test_results(promotion_strategy)
In the second approach, we train a single model on all customers, with `Promotion` included as an input feature and `purchase` as the target. For each person we can then predict the probability of purchase twice, setting the promotion input variable to 1 or 0 respectively, and calculate the difference between the two probabilities. If the difference turns out to be greater than some threshold value, we send the promotion to that person.

train_data_1 = train_data.copy()
train_data_1["response"] = train_data_1["purchase"] == 1
train_data_1["response"].unique()
features = ["V"+str(x) for x in range(1,8)] + ["Promotion"]
# X = pd.concat([train_data_1[features],pd.get_dummies(train_data_1["Promotion"])], axis=1)
X = pd.get_dummies(train_data_1[features])
X.shape
features=X.columns
Y = train_data_1["response"]
Y.value_counts()
X_train, X_valid, Y_train, Y_valid = train_test_split(X, Y, test_size=0.2, random_state=42)
sm = SMOTE(random_state=42, sampling_strategy=1.0)  # resample minority class to a 1:1 ratio
X_balanced_train, Y_balanced_train = sm.fit_resample(X_train, Y_train)
Converting back to dataframe and series
X_balanced_train = pd.DataFrame(X_balanced_train, columns=features)
X_balanced_train.columns
Y_balanced_train = pd.Series(Y_balanced_train)
cv = GridSearchCV(estimator=XGBClassifier(),
                  param_grid={
                      "max_depth": range(5, 8, 1),
                      "min_child_weight": [5, 10, 20, 50],
                      "gamma": [0, 0.1, 0.2],
                      "random_state": [42],
                      "n_estimators": [1000]
                  },
                  scoring="f1", cv=3)
start_time = time.time()
fit_params = {
    "eval_set": [(X_valid, Y_valid)],
    "eval_metric": "logloss",
    "early_stopping_rounds": 20,
    "verbose": 0
}
cv.fit(X_balanced_train, Y_balanced_train, **fit_params)
elapsed_time = (time.time() - start_time) / 60
print('Elapsed computation time: {:.3f} mins'.format(elapsed_time))
cv.best_params_
# This will help us decide the number of estimators
xgb = XGBClassifier(n_estimators=1000)
best_params_xgb = cv.best_params_
xgb.set_params(**best_params_xgb)
xgb.fit(X=X_balanced_train, y=Y_balanced_train.values.ravel(), eval_set=[(X_valid, Y_valid)], eval_metric="logloss", early_stopping_rounds=10, verbose=10)
optimal_n_estimators = xgb.best_ntree_limit
We have found the optimal max_depth and number of estimators for the XGBoost algorithm for our case. Now we train XGBoost on the entire training dataset for use in the promotion strategy.
X_balanced, Y_balanced = sm.fit_resample(X, Y)
X_balanced = pd.DataFrame(X_balanced, columns=features)
Y_balanced = pd.Series(Y_balanced)
xgb = XGBClassifier(max_depth=best_params_xgb["max_depth"],
gamma=best_params_xgb["gamma"],
min_child_weight=best_params_xgb["min_child_weight"],
n_estimators=optimal_n_estimators,
random_state=42)
xgb.fit(X_balanced, Y_balanced)
pickle.dump(xgb, open(data_dir + '/xgb_best_approach_2.pkl', 'wb'))
model = pickle.load(open(data_dir + "/xgb_best_approach_2.pkl", 'rb'))
We define `diff` as the difference between the probabilities of a person purchasing the product with and without receiving the promotion. We have to choose a threshold value; if `diff` is higher than that threshold, we choose to send the promotion to that person. To decide the threshold that maximizes NIR for the given prediction model, I evaluate candidate threshold values by calculating the mean NIR over 10 folds of the validation dataset, and choose the threshold value with the maximum NIR.
def evaluate(X, Y, diff_threshold, after_promotion_purchase_prob_threshold):
    def score(df, promo_pred_col='Promotion'):
        # NIR over the rows we chose to promote
        n_treat = df.loc[df[promo_pred_col] == 'Yes', :].shape[0]
        n_treat_purch = df.loc[df[promo_pred_col] == 'Yes', 'purchase'].sum()
        n_ctrl_purch = df.loc[df[promo_pred_col] == 'No', 'purchase'].sum()
        nir = 10 * n_treat_purch - 0.15 * n_treat - 10 * n_ctrl_purch
        return nir

    nir_scores = []
    kf = KFold(n_splits=10, shuffle=True, random_state=42)
    for train_index, test_index in kf.split(X):
        # KFold yields positional indices, so index with iloc
        X_valid = X.iloc[test_index]
        Y_valid = Y.iloc[test_index]
        # As we have already tuned the hyperparameters for XGBoost, we need not train it again here;
        # we can use the trained model to calculate the score for the given threshold values
        model = pickle.load(open(data_dir + "/xgb_best_approach_2.pkl", 'rb'))
        X_valid_with_promo = X_valid.copy()
        # predict probability of purchase with promotion
        X_valid_with_promo["Promotion_Yes"] = 1
        X_valid_with_promo["Promotion_No"] = 0
        probs_with_promotion = model.predict_proba(X_valid_with_promo)[:, 1]
        # predict probability of purchase without promotion
        X_valid_with_promo["Promotion_Yes"] = 0
        X_valid_with_promo["Promotion_No"] = 1
        probs_without_promotion = model.predict_proba(X_valid_with_promo)[:, 1]
        # calculate the difference as diff
        diff = probs_with_promotion - probs_without_promotion
        # promote only if both diff and the with-promotion probability clear their thresholds
        promos = (probs_with_promotion > after_promotion_purchase_prob_threshold) & (diff > diff_threshold)
        val_data = X_valid.copy()
        val_data["Promotion"] = "No"
        val_data.loc[val_data["Promotion_Yes"] == 1, "Promotion"] = "Yes"
        val_data["purchase"] = Y_valid.copy()
        score_df = val_data.iloc[np.where(promos)[0]]
        nir_scores.append(score(score_df))
    return float(np.mean(nir_scores))
(X_valid.index == Y_valid.index).all()
evaluated_point_scores = {}

def objective_threshold(params):
    # Cache scores so hyperopt does not re-evaluate the same grid point
    if str(params) in evaluated_point_scores:
        return evaluated_point_scores[str(params)]
    print(params)
    diff_threshold = params["diff_threshold"]
    after_promotion_purchase_prob_threshold = params["after_promotion_purchase_prob_threshold"]
    nir_score = evaluate(X=X_valid, Y=Y_valid,
                         diff_threshold=diff_threshold,
                         after_promotion_purchase_prob_threshold=after_promotion_purchase_prob_threshold)
    print("nir: " + str(nir_score))
    # hyperopt minimizes the objective, so negate NIR
    evaluated_point_scores[str(params)] = -nir_score
    return -nir_score
param_space = {
"diff_threshold": hp.choice("diff_threshold", list(float_range("0.02", "0.04", "0.001"))),
"after_promotion_purchase_prob_threshold": hp.choice("after_promotion_purchase_prob_threshold", list(float_range("0.0", "1.0", "0.1")))
}
start_time = time.time()
best_params_threshold = space_eval(
param_space,
fmin(objective_threshold,
param_space,
algo=hyperopt.tpe.suggest,
max_evals=200))
print(best_params_threshold)
elapsed_time = (time.time() - start_time) / 60
print('Elapsed computation time: {:.3f} mins'.format(elapsed_time))
best_diff_threshold = best_params_threshold["diff_threshold"]
best_after_promotion_purchase_prob_threshold = best_params_threshold["after_promotion_purchase_prob_threshold"]
def promotion_strategy(df):
    '''
    INPUT
    df - a dataframe with *only* the columns V1 - V7 (same as train_data)

    OUTPUT
    promotion_df - np.array with the values
                   'Yes' or 'No' related to whether or not an
                   individual should receive a promotion
                   should be the length of df.shape[0]

    Ex:
    INPUT: df

    V1  V2    V3  V4  V5  V6  V7
    2   30  -1.1   1   1   3   2
    3   32  -0.6   2   3   2   2
    2   30  0.13   1   1   4   2

    OUTPUT: promotion

    array(['Yes', 'Yes', 'No'])
    indicating the first two users would receive the promotion and
    the last should not.
    '''
    X = df.copy()
    # predict probability of purchase with promotion
    X["Promotion_No"] = 0
    X["Promotion_Yes"] = 1
    probs_with_promotion = model.predict_proba(X)[:, 1]
    # predict probability of purchase without promotion
    X["Promotion_No"] = 1
    X["Promotion_Yes"] = 0
    probs_without_promotion = model.predict_proba(X)[:, 1]
    # calculate the difference as diff
    diff = probs_with_promotion - probs_without_promotion
    # promote only when both the purchase-probability and uplift thresholds are cleared
    promote = (probs_with_promotion > best_after_promotion_purchase_prob_threshold) & (diff > best_diff_threshold)
    return np.where(promote, 'Yes', 'No')
test_results(promotion_strategy)
We can also try the two-models approach that is commonly recommended in the literature on uplift measurement. In this approach, we create one model for people who received the promotion and another model for those who didn't. Each model predicts whether the person would purchase the product, and the difference between the probabilities predicted by the first and second model decides whether to promote to that person. A caveat here is that the prediction error can compound since we are using two separate models, and the scales of the probabilities predicted by the two models may not be the same.
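A minimal sketch of this two-model approach, using synthetic stand-in data and logistic regression instead of the tuned XGBoost models (all variable names and the simulated purchase rates here are illustrative, not from the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the V1-V7 features; in the real exercise these
# rows would come from train_data, split by the Promotion column
rng = np.random.RandomState(42)
n = 1000
X = pd.DataFrame(rng.normal(size=(n, 7)), columns=["V" + str(i) for i in range(1, 8)])
promo = rng.rand(n) < 0.5
# Simulate a small uplift: promoted customers purchase more often
purchase = (rng.rand(n) < np.where(promo, 0.3, 0.1)).astype(int)

# Model 1: trained only on customers who received the promotion
model_treat = LogisticRegression().fit(X[promo], purchase[promo])
# Model 2: trained only on customers who did not receive it
model_ctrl = LogisticRegression().fit(X[~promo], purchase[~promo])

# Estimated uplift: difference between the two predicted purchase probabilities
uplift = model_treat.predict_proba(X)[:, 1] - model_ctrl.predict_proba(X)[:, 1]
# The threshold would be tuned, e.g. with the same NIR-based search as above
send_promotion = np.where(uplift > 0.02, "Yes", "No")
```

Because the two models are fit on disjoint samples, their probability scales can differ, which is exactly the caveat noted above; calibrating both models before taking the difference is one common mitigation.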