Identify Customer Segmentation

Mohamed Gamal
5 min read · Aug 21, 2021



In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You’ll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you’ll apply what you’ve learned to a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to become customers of the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

Logo of the company the datasets belong to

Motivation for this project

In business, not all customers share the same characteristics, so not all customers will value a product equally or be equally keen to interact with it. Companies therefore try to learn more about their customers through different kinds of analysis and technologies such as deep learning and statistics.

Customer segmentation: Customer segmentation is the process of dividing customers into groups based on common characteristics so companies can market to each group effectively and appropriately.

We will therefore try to group customers by their shared characteristics, using machine learning techniques such as clustering and PCA.

Data Pre-processing:

For the first dataset (Azdias):

1- check the data and encode it

2- check for outliers and remove them

3- manipulate data from the other datasets

4- apply feature scaling: feature scaling is a technique to standardize the independent features present in the data to a fixed range

5- apply PCA and plot the variance explained by each principal component, then re-fit a PCA instance for the transformation, retaining 30 principal components, as they account for around 80% of the variance
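Steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook code from the project: the random matrix `X`, its shape, and the 80% target are assumptions standing in for the cleaned Azdias table, and the number of components it selects will differ from the 30 reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption) for the cleaned, fully numeric demographics table.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 60)) * rng.uniform(1, 50, size=60)

# Step 4: standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Step 5a: fit PCA with all components to inspect the variance per component.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target.
target = 0.80
n_components = int(np.searchsorted(cumulative, target) + 1)

# Step 5b: re-fit a PCA instance keeping only those components and transform the data.
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X_scaled)

print(n_components, round(float(pca.explained_variance_ratio_.sum()), 3))
```

Plotting `cumulative` against the component index gives the variance-per-component curve from which a cut-off such as 30 components can be read.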

PCA with and without feature scaling
  • To investigate the features, you should map each weight to their corresponding feature name, then sort the features according to weight. The most interesting features for each principal component, then, will be those at the beginning and end of the sorted list. Use the data dictionary document to help you understand these most prominent features, their relationships, and what a positive or negative value on the principal component might indicate.

You should investigate and interpret feature associations from the first three principal components in this substep. To help facilitate this, you should write a function that you can call at any time to print the sorted list of feature weights, for the i-th principal component. This might come in handy in the next step of the project, when you interpret the tendencies of the discovered clusters.
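One possible shape for such a helper is sketched below. The function name `component_weights`, the 0-based index `i`, and the toy feature names are all hypothetical; in the project the names would come from the columns of the cleaned demographics data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def component_weights(pca, feature_names, i):
    """Return the feature weights of the i-th component (0-based), largest first."""
    weights = pd.Series(pca.components_[i], index=feature_names)
    return weights.sort_values(ascending=False)


# Toy data with hypothetical feature names, only to exercise the helper.
rng = np.random.default_rng(0)
features = ["age", "income", "household_size", "online_affinity"]
X = rng.normal(size=(200, 4))
X[:, 1] += 2 * X[:, 0]  # make two features correlated so one component dominates

pca = PCA(n_components=3).fit(X)
sorted_weights = component_weights(pca, features, 0)

# The most interesting features sit at the two ends of the sorted list.
print(sorted_weights.head(2))
print(sorted_weights.tail(2))
```

Features at the top and bottom of the printed list pull the component in opposite directions, which is what the data dictionary lookup is meant to help interpret.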

Applying Models

After scaling and transforming the data, it’s time to see how it clusters in the principal components space. In this substep, we will apply k-means clustering to the dataset and use the average within-cluster distance from each point to its assigned cluster’s centroid to decide on a number of clusters to keep.

Unsupervised Learning

Step 3.1: Apply Clustering to General Population

You’ve assessed and cleaned the demographics data, then scaled and transformed them. Now, it’s time to see how the data clusters in the principal components space. In this substep, you will apply k-means clustering to the dataset and use the average within-cluster distances from each point to their assigned cluster’s centroid to decide on a number of clusters to keep.

  • Use sklearn’s KMeans class to perform k-means clustering on the PCA-transformed data.
  • Then, compute the average distance from each point to its assigned cluster’s center. Hint: The KMeans object’s .score() method might be useful here, but note that in sklearn, scores tend to be defined so that larger is better. Try applying it to a small, toy dataset, or use an internet search to help your understanding.
  • Perform the above two steps for a number of different cluster counts. You can then see how the average distance decreases with an increasing number of clusters. However, each additional cluster provides a smaller net benefit. Use this fact to select a final number of clusters in which to group the data. Warning: because of the large size of the dataset, it can take a long time for the algorithm to resolve. The more clusters to fit, the longer the algorithm will take. You should test for cluster counts through at least 10 clusters to get the full picture, but you shouldn’t need to test for a number of clusters above about 30.
  • Once you’ve selected a final number of clusters to use, re-fit a KMeans instance to perform the clustering operation. Make sure that you also obtain the cluster assignments for the general demographics data, since you’ll be using them in the final Step 3.3.
Top components and features with their weights
Relation between the number of clusters and the average distance score
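The clustering steps above might look roughly like this. It is a sketch on toy blobs rather than the PCA-transformed population data, and the helper name `average_distance_score`, the range of cluster counts, and the final choice of 4 clusters are assumptions made for the toy data only. Note that `KMeans.score()` returns the negative sum of squared distances, so the value plotted here is an average squared distance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy stand-in (assumption) for the PCA-transformed general population data.
X_pca, _ = make_blobs(n_samples=600, centers=4, n_features=5, random_state=0)


def average_distance_score(data, k, seed=0):
    """Average squared distance from each point to its assigned centroid."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
    # score() is the negative sum of squared distances (larger is better),
    # so flip the sign and divide by the number of points.
    return -model.score(data) / data.shape[0]


# Evaluate a range of cluster counts to look for the elbow.
cluster_counts = list(range(2, 11))
scores = [average_distance_score(X_pca, k) for k in cluster_counts]
for k, s in zip(cluster_counts, scores):
    print(k, round(float(s), 2))

# Re-fit with the chosen number of clusters and keep the assignments.
final_k = 4  # read off the elbow for this toy data only
kmeans = KMeans(n_clusters=final_k, n_init=10, random_state=0).fit(X_pca)
labels = kmeans.predict(X_pca)
```

On the real dataset each fit is slow, so a coarser grid of cluster counts or a subsample can be used to find the elbow before the final re-fit.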

Supervised Learning

Now that you’ve found which parts of the population are more likely to be customers of the mail-order company, it’s time to build a prediction model. Each of the rows in the “MAILOUT” data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The “MAILOUT” data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the “TRAIN” partition, which includes a column, “RESPONSE”, that states whether or not a person became a customer of the company following the campaign. In the next part, you’ll need to create predictions on the “TEST” partition, where the “RESPONSE” column has been withheld.

We apply several supervised learning techniques, compare them with each other, and select the one that works best for our dataset.
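A comparison of this kind could be sketched as below. The article does not say which models or metric were used, so the three classifiers, the cross-validated ROC AUC metric, and the synthetic data standing in for the MAILOUT “TRAIN” partition are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in (assumption) for the cleaned TRAIN features and RESPONSE column.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Candidate models (assumed; the article does not name the ones it compared).
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Stratified folds keep the share of positive responses similar in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
results = {
    name: float(cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean())
    for name, model in candidates.items()
}

for name, auc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(name, round(auc, 3))

best_name = max(results, key=results.get)
```

The model stored under `best_name` would then be re-fit on the full “TRAIN” partition before predicting on the “TEST” partition.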


Most of the time and effort goes into data cleaning and data manipulation; only after that can we apply a model. We had to fill in missing values, remove outliers, encode and re-encode features, and apply feature scaling and feature reduction before applying the model.

By applying unsupervised learning, we obtained customer segments according to their feature weights.

By applying supervised learning, we can also predict customer conversion and compare the results of the models with each other to check that the final model fits our case.