Effectively target customers: Use data for customer segmentation.

Vaibhav Malhotra
7 min read · Jun 16, 2020

Investigating Customer Segmentation for Arvato Financial Services


Introduction

In this project, I analyzed demographics data for customers of Bertelsmann Arvato Analytics in Germany, compared it against demographics information for the general population, and used that information to build a model that predicts which individuals are most likely to convert into customers.

Customer segmentation is the process of dividing customers into groups based on common characteristics so companies can market to each group effectively and appropriately.

Problem statements

The analysis is divided into 3 major parts:

  • Part 0: Get to Know the Data: I examined the data and its structure, understood the data values, and performed the necessary preprocessing steps.
  • Part 1: Customer Segmentation Report: Used unsupervised learning techniques, PCA (Principal Component Analysis) and K-means clustering, to perform customer segmentation and identify the core traits of the company’s customers.
  • Part 2: Supervised Learning Model: Finally, with demographics information for targets of a marketing campaign for the company, I used different models to predict which individuals are most likely to convert into customers. Link to GitHub repo.

Evaluation Metric

The metric used to evaluate model performance is ROC AUC, the area under the ROC curve. It is a score ranging from 0 to 1: a score close to 1 means the model separates the classes well, while 0.5 corresponds to random guessing. Because it measures how well the model ranks positives above negatives rather than raw accuracy, it is a good fit for imbalanced data like ours.
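As a quick illustration of the metric (using scikit-learn’s `roc_auc_score` on toy labels and scores, not the project data):

```python
from sklearn.metrics import roc_auc_score

# Toy true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ranked correctly -> AUC = 0.75
print(roc_auc_score(y_true, y_scores))  # -> 0.75
```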

Part 0: Get to Know the Data

There are four datasets, all of which share largely the same demographics features:

  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns)
  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns)
  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

In addition to the above data, there are two additional meta-data files:

  • DIAS Information Levels — Attributes 2017.xlsx: Top-level list of attributes and descriptions, organized by informational category
  • DIAS Attributes — Values 2017.xlsx: Detailed mapping of data values for each feature in alphabetical order

Handling missing values

One of the first steps in data exploration is to collect information about missing data in the dataset. From the plot of missing values per column, we can see that both datasets have similar missing-value patterns: apart from a few outlier columns, about 30% of the data is missing.

Let’s find the columns that have more than 30% missing values.
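A sketch of how that check might look with pandas (the tiny DataFrame below is illustrative, not the real AZDIAS data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the demographics data
df = pd.DataFrame({
    "ALTER_KIND": [np.nan, 3, np.nan, np.nan],
    "EXTSEL992":  [10, np.nan, 36, 50],
    "VERS_TYP":   [1, 2, 1, 2],
})

missing_frac = df.isnull().mean()              # fraction of NaNs per column
high_missing = missing_frac[missing_frac > 0.30]
print(high_missing)
```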

The column ALTER_KIND gives the age of the child in the household. This could be an important parameter, and a missing value could simply mean that there is no child, so we can replace all missing values with 0. Similarly, we can fill EXTSEL992 with its median value, i.e. 36.
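The two fills above can be sketched like this (toy values; in the real data the EXTSEL992 median works out to 36):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ALTER_KIND": [np.nan, 3, np.nan],
    "EXTSEL992":  [10, np.nan, 50],
})

# Missing ALTER_KIND -> no child in the household, so encode as 0
df["ALTER_KIND"] = df["ALTER_KIND"].fillna(0)

# Missing EXTSEL992 -> fall back to the column median
df["EXTSEL992"] = df["EXTSEL992"].fillna(df["EXTSEL992"].median())
```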

Handling mixed datatypes

In addition to the categorical variables, there are six mixed features: CAMEO_INTL_2015, CAMEO_DEUG_2015, LP_LEBENSPHASE_GROB, PLZ8_BAUMAX, PRAEGENDE_JUGENDJAHRE, WOHNLAGE.

For example, in CAMEO_DEUG_2015 all the values were integers, with the exception of ‘X’. I replaced all the NaNs and ‘X’ values with 0. The other mixed features were handled similarly; you can have a look at the function that handles this here.
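A minimal sketch of this cleanup for CAMEO_DEUG_2015 (toy values; the full cleaning function lives in the repo):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"CAMEO_DEUG_2015": ["1", "X", np.nan, "5"]})

# Replace the stray 'X' marker and NaNs with 0, then cast back to int
df["CAMEO_DEUG_2015"] = (
    df["CAMEO_DEUG_2015"].replace("X", 0).fillna(0).astype(int)
)
print(df["CAMEO_DEUG_2015"].tolist())  # -> [1, 0, 0, 5]
```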

Finally, I used StandardScaler to scale all the features.
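The scaling step is the standard scikit-learn pattern (shown here on a small toy matrix):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix standing in for the cleaned demographics data
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and unit variance
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```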

Part 1: Customer Segmentation Report

I used unsupervised learning techniques to describe the relationship between the demographics of the company’s existing customers and the general population of Germany, and to identify parts of the general population that are more likely to be part of the company’s main customer base.

PCA — Principal Component Analysis

Principal Component Analysis (PCA) is one of the most useful techniques in Exploratory Data Analysis to understand the data, reduce dimensions of data, and for unsupervised learning in general. We will decide the number of transformed features to retain based on the cumulative variance.

The target was to reduce the number of components by maintaining 80% of the variance.

In order to maintain 80% of the variance, we need a minimum of 116 PCA components. I think this is good enough to move forward at this point; we can always increase the number and retrain the PCA model if the performance later turns out to be unsatisfactory.
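The component count can be read off the cumulative explained-variance curve. A sketch on random stand-in data (on the real scaled demographics matrix this selection gives 116 components):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))  # stand-in for the scaled demographics matrix

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components retaining at least 80% of the variance
n_components = int(np.argmax(cumvar >= 0.80)) + 1
print(n_components)
```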

K-means clustering

After we decide how many principal components to retain, the next step is to see how the data clusters in the principal components space. We will apply K-means clustering to the dataset and decide the number of clusters to keep.

In the plot above, the decrease in error rate flattens out after 10 clusters. As a result, we choose 10 as our final number of clusters.
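The elbow curve behind that decision comes from fitting K-means over a range of k values and recording the inertia (within-cluster sum of squared errors). A sketch on random stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))  # stand-in for the PCA-transformed data

# Inertia for each candidate number of clusters
inertias = []
for k in range(2, 12):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia keeps decreasing with k; the "elbow" is where the curve flattens
```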

Compare Customer Data to Demographics Data

In this part, we will describe the relationship between the demographics of the company’s existing customers and the general population of Germany and try to identify parts of the general population that are more likely to be part of the mail-order company’s main customer base.

Let’s look at the features that contribute most to Principal Components 1 and 2 and compare them between the customers (CUSTOMERS) and the general population (AZDIAS).

trans_TOTAL_24 represents transactions made within the last 2 years. Both types of users follow the same pattern here.

LP_LEBENSPHASE_GROB represents life stage, i.e. income and age. We can see that the majority of customers come from single households, of both young and older ages. Income seems to have less impact.

VERS_TYP represents insurance typology. Customers seem to be individuals who accept risk.

PLZ8_ANTG3 represents the number of 6–7 family houses in the cell. We can clearly see that individuals from areas with a low to average share have a higher chance of becoming customers.

POPULATION_DENSITY_KM represents population density. People living in high-density areas tend to be better customers. This could also be because customers are young people living in the city center.

consumption_type_MAX represents consumption type. Groups with high consumption are very likely to be potential customers.

Now let’s look at the clusters:

Clusters formed by Customer and Azdias data

Based on the cluster distribution histograms for both datasets, it is clear that Cluster 6 stands out for customers, which means people in this group are more likely to be part of the mail-order company’s main customer base than other groups.
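One simple way to quantify which cluster stands out is to compare each cluster’s share among customers with its share in the general population. A sketch with toy cluster labels (not the project’s actual assignments):

```python
import pandas as pd

# Toy cluster assignments standing in for the K-means labels
azdias_clusters = pd.Series([0, 1, 1, 2, 2, 2, 0, 1])
customer_clusters = pd.Series([2, 2, 2, 2, 1, 0])

pop_share = azdias_clusters.value_counts(normalize=True)
cust_share = customer_clusters.value_counts(normalize=True)

# Clusters with a ratio > 1 are over-represented among customers
ratio = (cust_share / pop_share).sort_values(ascending=False)
print(ratio)
```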


Part 2: Supervised Learning Model

Data

1. Udacity_MAILOUT_052018_TRAIN.csv
2. Udacity_MAILOUT_052018_TEST.csv

Each of the files has columns similar to the general population and customer dataset. I used the same function as in Part 0 to perform data preprocessing.

Let’s look at the label we have to predict:

It is clear from the above figure that the data is highly imbalanced.

Training

Following are the classification models compared and their roc_auc:

  1. LogisticRegression — 67.31%
  2. RandomForestClassifier — 60.59%
  3. GradientBoostingClassifier — 76.35%
  4. CatBoostClassifier — 77.31%

As we can see, the boosting algorithms perform better on this dataset. This could be because they are less affected by dataset imbalance.
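The comparison loop follows the standard cross-validation pattern with `scoring="roc_auc"`. A sketch on an imbalanced toy dataset (CatBoost is left out here since it is a third-party package; the scores will of course differ from the ones above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Imbalanced toy data standing in for the MAILOUT training set
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    scores[name] = auc
    print(f"{name}: {auc:.4f}")
```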

CatBoostClassifier was selected as the final model. GridSearch was used to obtain the best-performing model shown below:
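The tuning step follows the usual `GridSearchCV` pattern. A sketch with a purely illustrative parameter grid, using GradientBoostingClassifier as a stand-in for CatBoost (the post’s actual grid and parameters are not shown here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, weights=[0.9], random_state=0)

# Hypothetical grid for illustration only
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```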

Kaggle Competition

The prediction CSV file was submitted to Kaggle, where it scored an ROC AUC value of 0.80028.

Conclusion

In this project, provided by Udacity’s partners at Bertelsmann Arvato Analytics, real-life demographics data for the German population and a customer segment was analyzed.

  • In the first part, the assessment and preprocessing of the data were performed. This was one of the most difficult steps, because there were 366 columns to analyze and not all of them had a description. A lot of missing values and missing attribute information were identified. A column-transformation pipeline was created and reused in both the supervised and unsupervised parts.
  • In the unsupervised part, PCA was used to reduce the dimensionality to 125 latent features that describe 80% of the explained variance. K-means clustering into 10 clusters identified the cluster that should contain the target customers of the company. These individuals belong to single households, of both young and older ages, live in high-density areas, and tend to accept risk.
  • Lastly, the CatBoost classifier was selected and tuned to build a supervised model and make predictions on the testing dataset on Kaggle. We achieved an ROC AUC of 0.80028.

What to do next?

There is a lot of room for improvement in this project.

  • Handling dataset imbalance using sampling techniques.
  • Further in-depth feature analysis to refine the feature set.
  • Tuning the number of PCA components and making more use of the unsupervised learning results.
