Cup of Machine Learning From Starbucks

Project Overview

This dataset contains simulated data about customer behavior on the Starbucks rewards mobile app. The company sends offers to customers through the app, and these offers range from BOGO (“buy one, get one free”) to discount and informational offers. Not every customer who receives an offer opens it or acts on it, so we need to explore how effective the offers are and how customers interact with them.

Problem statement

We want to predict how customers respond to offers (offer received, offer viewed, offer completed) based on the demographic attributes of the customer and the attributes of the company's offers. To do this we apply supervised machine learning classifiers (KNeighborsClassifier, RandomForestClassifier and DecisionTreeClassifier). Before applying a model, the datasets need to be assessed and cleaned, followed by some exploratory data analysis to learn about our customers and the effectiveness of the offers, and to answer some questions about them.

Project Metrics

I will use accuracy and F-score as metrics to compare the models and test their performance.
Based on the F1 scores on the training and test sets for the three classifiers we use, we judge which model is the best.
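
To make the two metrics concrete, here is a minimal from-scratch sketch of accuracy and F1. The labels are toy values made up for illustration, and the macro average over the three event classes is an assumption, since the averaging mode is not stated here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1_for_class(y_true, y_pred, label):
    """F1 is the harmonic mean of precision and recall for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging mode)."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, l) for l in labels) / len(labels)

# Toy labels, made up for illustration only.
y_true = ["received", "viewed", "completed", "viewed", "received", "completed"]
y_pred = ["received", "viewed", "viewed", "viewed", "received", "completed"]

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Accuracy only counts correct predictions, while F1 also penalizes a class that is over-predicted or missed, which matters when the event classes are not equally frequent.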

Data Cleaning and Implementation

We have three datasets: portfolio.json, profile.json and transcript.json. Together they contain all the information about the offers (portfolio), the customers (profile), and the events recorded for each customer, such as offers received and transactions (transcript).

Cleaning up the portfolio dataset

1- Rename id to offer_id, since we need it as the primary key and will use it to join the other datasets.
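
A minimal sketch of this renaming step, using plain dictionaries. The two rows, their field names other than id, and the one-JSON-object-per-line layout are assumptions made for illustration, not the real file contents.

```python
import json

# Two made-up rows; one JSON object per line is an assumed file layout.
raw = '''{"id": "offer_a", "offer_type": "bogo", "difficulty": 10}
{"id": "offer_b", "offer_type": "discount", "difficulty": 7}'''

def rename_key(record, old, new):
    """Return a copy of record with key `old` renamed to `new`."""
    return {(new if k == old else k): v for k, v in record.items()}

portfolio = [rename_key(json.loads(line), "id", "offer_id")
             for line in raw.splitlines()]

# Index by the new primary key so a later join is a dictionary lookup.
offers_by_id = {row["offer_id"]: row for row in portfolio}

print(offers_by_id["offer_b"])
```

Giving the key an explicit name like offer_id avoids a clash with the id columns of the other datasets once they are joined.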

Cleaning up the profile dataset

1- Remove outliers from age.
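
The exact outlier rule is not spelled out, so the sketch below assumes a simple cutoff: any age above 101 (the maximum age reported in the statistics after cleaning) is dropped. The rows and the constant MAX_PLAUSIBLE_AGE are hypothetical.

```python
# Assumption: cutoff taken from the maximum age reported after cleaning.
MAX_PLAUSIBLE_AGE = 101

# Made-up profile rows for illustration.
profile = [
    {"customer": "c1", "age": 42},
    {"customer": "c2", "age": 118},  # made-up implausible value
    {"customer": "c3", "age": 66},
]

# Keep only rows whose age is at or below the cutoff.
cleaned = [row for row in profile if row["age"] <= MAX_PLAUSIBLE_AGE]
removed = len(profile) - len(cleaned)

print(f"kept {len(cleaned)} rows, removed {removed}")
```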

Cleaning up the transcript dataset

1- Rename person to Customer_id.

Some Analysis Outcomes

Customer age statistics are as follows:

count    14825.000000
mean        54.393524
std         17.383705
min         18.000000
25%         42.000000
50%         55.000000
75%         66.000000
max        101.000000

Customer income statistics:

count     14825.000000
mean      65404.991568
std       21598.299410
min       30000.000000
25%       49000.000000
50%       64000.000000
75%       80000.000000
max      120000.000000

Number of records per event type in the transcript:

transaction        138953
offer received      76277
offer viewed        57725
offer completed     33579
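
For readers who want to reproduce this kind of summary, here is a small sketch using only the standard library. The describe helper is hypothetical and the sample values are made up. It assumes the sample standard deviation and linearly interpolated quartiles, which is the convention such count/mean/std tables usually follow.

```python
import statistics
from collections import Counter

def describe(values):
    """Summary statistics laid out like the tables above."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }

ages = [18, 42, 55, 66, 101]  # made-up sample, not the real data
summary = describe(ages)

# Counting records per event type is a one-liner with Counter.
events = ["transaction", "offer received", "offer viewed",
          "transaction", "offer completed"]
event_counts = Counter(events)

for name, value in summary.items():
    print(f"{name:<6}{value}")
print(event_counts.most_common())
```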

Data Exploration & Data Visualization

Here we try to pose questions and answer them from the data's point of view.

Data Preprocessing and Model Implementation

We use classification to predict the event (offer received, offer viewed, offer completed), so first we need to prepare the data before applying a model.
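
As an illustration of what such a classifier does, below is a from-scratch sketch of the k-nearest-neighbours idea behind KNeighborsClassifier. It is not the scikit-learn implementation. The two features (standing in for age and income scaled to the 0 to 1 range) and all rows are made up.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority label among the k training rows closest to x."""
    distances = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up rows: (scaled age, scaled income) and the event to predict.
train_X = [(0.20, 0.10), (0.25, 0.15), (0.80, 0.90), (0.85, 0.80), (0.50, 0.50)]
train_y = ["offer viewed", "offer viewed",
           "offer completed", "offer completed", "offer received"]

print(knn_predict(train_X, train_y, (0.22, 0.12)))
print(knn_predict(train_X, train_y, (0.90, 0.90)))
```

Because the prediction depends on distances, features on very different scales (age in years, income in dollars) need to be scaled first. That is one reason the data has to be prepared before a model like this is applied.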

The models were run with random state 0 and with random state 42.
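
The role of the random state can be sketched with a seeded train/test split. The helper split_rows below is hypothetical (it is not scikit-learn's train_test_split), and the 25% test fraction is an assumption.

```python
import random

def split_rows(rows, test_fraction=0.25, random_state=0):
    """Shuffle a copy of rows with a seeded RNG, then cut off the test part."""
    rng = random.Random(random_state)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))  # stand-ins for prepared feature rows

# Same seed: the split is reproduced exactly.
train_a, test_a = split_rows(rows, random_state=0)
train_b, test_b = split_rows(rows, random_state=0)

# Different seed: generally a different partition of the same rows.
train_c, test_c = split_rows(rows, random_state=42)

print(test_a, test_c)
```

Fixing the seed makes a run repeatable, and repeating the experiment with a second seed gives a rough idea of how much the scores depend on one particular split.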

Refinement

Feeding more balanced data to the model should bring it closer to the real case. I think careful data cleaning and including all features with the correct weights will make the model more powerful and more realistic.
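
One possible way to balance the classes is random undersampling, sketched below. This is an assumed technique chosen for illustration, not necessarily what was done here. The helper undersample is hypothetical and the rows are made up.

```python
import random
from collections import Counter, defaultdict

def undersample(rows, label_of, random_state=0):
    """Randomly keep as many rows per class as the smallest class has."""
    rng = random.Random(random_state)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)  # avoid leaving the classes in blocks
    return balanced

# Made-up, imbalanced rows of the form (event label, row number).
rows = ([("offer received", i) for i in range(8)]
        + [("offer viewed", i) for i in range(6)]
        + [("offer completed", i) for i in range(3)])

balanced = undersample(rows, label_of=lambda row: row[0])
counts = Counter(label for label, _ in balanced)

print(counts)
```

Undersampling throws data away, so it trades some training signal for balance. Oversampling the smaller classes or using class weights are alternatives with the opposite trade-off.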

Model Evaluation and Validation

We evaluated the models using the F1 score on the training and test sets. The model with the highest test score is judged the most suitable.


The validation set (test dataset) is used to evaluate the models. Both models are better than the benchmark. The best score comes from the DecisionTreeClassifier model, whose validation F1 score of 85.10 is much higher than the benchmark. The RandomForestClassifier model also scores well compared to the benchmark, with a test F1 score of 75.87. Our problem is not sensitive enough to require a very high F1 score, so these scores are sufficient for the classification task of predicting whether a customer will respond to an offer.

Reflection


What is the age distribution across gender with income?

Future Improvements

  • There is more potential in this data: it can be used to answer many further questions about customer interaction based on age and income as a whole.
  • Try additional machine learning models.


