Creating Customer Segments

Ronny Restrepo


Being able to categorize customers into meaningful groups (customer segmentation) based on spending habits is valuable for businesses. Doing so can give a business insight regarding how to best meet the specific needs of different groups of customers. It can also allow a business to perform A/B testing more effectively, by discovering how particular changes will affect different groups of customers differently.

This project made use of the Wholesale Customers Dataset from the UCI Machine Learning Repository which contains the annual spending habits of 440 clients of a whole sale distributor. The goal was to find the best way to describe the variation of diffrerent types of customers that a wholesale distributor interacts with.

An unsupervised learning algorithm was used to create create meaningfully different groups of customers. Specifically, Principle Component Analysis was used for dimensionality reduction of the features down to just two. Reducing the number of features to just two was done in order to keep the analysis interpretable, and capable of being visualized. A clustering algorith (Gaussian Mixture Model) was then used for clustering the customers into separate groups. Different numbers of clusters were attempted, and evaluated using a Silhouette score. It was discovered that the best way to cluster the customers was to split them into two separate groups. This resulted in a Silhouette Score 0.429.

The Data

The Wholesale Customers Dataset from the UCI Machine Learning Repository contains the annual spending habits of 440 clients of a whole sale distributor, covering a diverse set of product categories.

Since the purpose of this project is to discover if meaningful clusters can be discovered from only the spending habits of customers, two variables were removed from the original dataset. These included the 'Channel' and 'Region' features. What remains are six product categories.

Loading the Data

In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames
from collections import Counter 

# Import plotting library 
import matplotlib.pyplot as plt

# Show plots inline (nicely formatted in the notebook)
%matplotlib inline
In [2]:
# Load the wholesale customers dataset
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print("Wholesale customers dataset has {} samples with {} features each.".format(*data.shape))
    print("Dataset could not be loaded.")
Wholesale customers dataset has 440 samples with 6 features each.

Data Exploration

We can observe some basic statistical descripitons of the data such as the mean standard deviation, quartiles, etc. Each column represents each of the six different product categories.

In [3]:
# Description of the dataset
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
count 440.000000 440.000000 440.000000 440.000000 440.000000 440.000000
mean 12000.297727 5796.265909 7951.277273 3071.931818 2881.493182 1524.870455
std 12647.328865 7380.377175 9503.162829 4854.673333 4767.854448 2820.105937
min 3.000000 55.000000 3.000000 25.000000 3.000000 3.000000
25% 3127.750000 1533.000000 2153.000000 742.250000 256.750000 408.250000
50% 8504.000000 3627.000000 4755.500000 1526.000000 816.500000 965.500000
75% 16933.750000 7190.250000 10655.750000 3554.250000 3922.000000 1820.250000
max 112151.000000 73498.000000 92780.000000 60869.000000 40827.000000 47943.000000

Implementation: Selecting Samples

A few data points are selected in order to maintain an intuition for how the data is being transformed through the analysis. Three samples were selected manually, specifically chosen to be very different from each other.

In [4]:
# Indices of three selected samples from data
indices = [43,181, 122]

# DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Chosen samples of wholesale customers dataset:")
Chosen samples of wholesale customers dataset:
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 630 11095 23998 787 9529 72
1 112151 29627 18148 16745 4948 8550
2 12212 201 245 1991 25 860

In order to compare how the different clients differ more easily, the following table shows the product category purchses converted into z-scores, which tells us the purchasing habits in terms of how many standard deviations it is from the average client.

In [5]:
(samples - data.mean()) / data.std()
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 -0.899028 0.717949 1.688567 -0.470666 1.394234 -0.515183
1 7.918724 3.228932 1.072982 2.816475 0.433425 2.491087
2 0.016739 -0.758127 -0.810917 -0.222658 -0.599115 -0.235761

We can see that client 0 spent above average on Milk, Grocery, and Detergents, but little on perishables like fresh food and delicatessen. This would suggest that this client could potentially be a convenience store.

Client 1 spent above average on all product categories, in many cases many standard deviations above average. This implies that it is a client with very large purchasing power, and large clientelle, and is therefore likely to be a supermarket that itself re-sells all of those items to the general public.

Client 2 spent below average on all product categories except for fresh food (which was close to average). It's greatest expenditure was on Fresh food, Frozen food and delicatessen. This could potentially be the spending habits of a restaurant.

Feature Relevance

An interesting question to ask ourselves is: can we predict how much a customer will spend annualy on a particular product category based on how much they spend on the other categories of items?.

Answering this question will allow us to discover which variables are most relevant. It will also allow us to see which ones are redundant or can be explained in terms of the other variables.

We can discover the answer to this by training six separate supervised regression models. For each of these models, we set aside a different feature to be used as a target labels to be predicted, and use the remaining features as the inputs of the model. Each model is then evaluated (on a subset set aside for testing) to see how well it can predict the target labels.

In [10]:
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Seeing how well each feature in the dataset can be explained by the rest of 
# the data 
for target_feature_name in data.columns:
    # Create a new dataframe with one feature removed
    new_data = data.drop(target_feature_name, axis = 1)

    # Split the new data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(new_data, 

    # Create a decision tree regressor and fit it to the training set
    regressor = DecisionTreeRegressor(max_depth=5, random_state=23)
    regressor =, y_train)

    # Report the score of the prediction using the testing set
    score = regressor.score(X_test, y_test)
    print("R^2 score when using '{}' "\
          "as the target: {}".format(target_feature_name, score))
R^2 score when using 'Fresh' as the target: -0.19565796675907654
R^2 score when using 'Milk' as the target: 0.4110453005339162
R^2 score when using 'Grocery' as the target: 0.7494868833592407
R^2 score when using 'Frozen' as the target: 0.06810539292275941
R^2 score when using 'Detergents_Paper' as the target: 0.4766868720269922
R^2 score when using 'Delicatessen' as the target: -0.9907653155153816

As can be seen from the above output, when we use 'Grocery' as the variable to be predicted, we get an $R^2$ score of approximately 0.75. This indicates that 75% of its variability can be explained by the variability of the rest of the features. This indicates that it is the least important feature in our dataset.

If we had to reduce the number of features in the dataset, then 'Grocery' would be the first variable that we should consider removing. However, there is still 25% of its variability that is not explained by any of the other features, so it might be useful to keep it around in order to boost the performance of our model.

If we contrast this with 'Delicatessen', we see that the regression model used to predict that feature using the remaining features does incredibly poorly. This suggests that 'Delicatessen' is an important feature that we should definitely keep around.

Feature Distributions

To get a deeper understanding of how the variables in the dataset relate to each other, we can construct a scatter matrix of each of the six product features.

What we should hopefully find is that the most relevant variables should be uncorrelated with any of the other variables. And since we identified that some Groceries, and Detergents_paper were the two least relevant variables, it should not be surprising if they are correlated with other variables.

In [11]:
# Produce a scatter matrix for each pair of features in the data
pd.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

We can see that there appears to be a clear correlation between the two least relevant variables Grocery and Detergents_Paper. We can get a more precise understanding of this correlation by creating a correlations matrix heatmap.

In [12]:
import seaborn as sns
mask = np.triu(np.ones((6,6)), k=0)
sns.heatmap(data.corr(), annot=True, square=True, mask=mask, linewidths=2)
<matplotlib.axes._subplots.AxesSubplot at 0x7fa1c0931a20>

We can see that there is a very strong linear correlation between the 'Grocery' and the 'Detergents_Paper' features. Previously, we saw that the Grocery feature could be predicted using the remaining variables with an $R^2$ score of aproximately 0.75, the scatterplot and correlations matrix heatmap show us that 'Detergents_Paper' is the variable that contributes the most towards that prediction.

To a lesser extent there is also a correlation between 'Milk' and 'Grocery' and also between 'Milk' and 'Detergents_Paper'. This also coincides with the previous observation that the 'Milk' and 'Detergents_Paper' could be explained by the remaining variables with an $R^2$ score of greater than 0.4 and less than 0.5.

We can also see from the scatte plots matrix that the features for this data are heavilly skewed. Most of the data points are within the lower region of values. It would therefore be beneficial to perform some transformation to normalize the data somewhat.

Feature Scaling

Financial data can often be heavily skewed, and it is most often appropriate to apply a non-linear scaling. We could achieve this through a Box-Cox test, which calculates the best power transformation of the data to reduce skewness. Alternatively, a simpler approach, which can work in most cases is to simply apply the natural logarithm.

In [16]:
import matplotlib'ggplot')
In [17]:
# Scale the data using the natural log
log_data = np.log(data)

# Scale the three selected clients data in the same way
log_samples = np.log(samples)

# Produce a scatter matrix for each pair of newly-transformed features
pd.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

After scaling the data using the natural logarithm, we see that the distribution of each feature is much more normalized. We can also see that the features that appeared to be correlated still remain correlated after transformation. We can get a more precise calculation of those correlations by again running a correlation matrix heatmap on the transformed data.

In [18]:
mask = np.triu(np.ones((6,6)), k=0)
sns.heatmap(log_data.corr(), annot=True, square=True, mask=mask, linewidths=2)
<matplotlib.axes._subplots.AxesSubplot at 0x7fa1c09515f8>

Although the specific values of the correlations have changed, we see that the same ones still remain high.

Outlier Detection

In order to detect the presence of outliers which could skew the results of analysis, Tukey's Method for identfying outliers will be used. Namely, an outlier will be defined as any data point that lies beyond 1.5 times the interquartile range (IQR) for a given variable.

In [19]:
# For each feature find the data points with extreme high or low values
outliers_tally = Counter()
for feature in log_data.keys():
    # Calculate Q1, and Q3 (25th, and 75th percentiles) for the given feature
    Q1 = log_data[feature].quantile(0.25)
    Q3 = log_data[feature].quantile(0.75)
    # Use the IQR to calculate an outlier boundary
    step = 1.5 * (Q3 - Q1)
    # Get the outliers for this feature
    new_outliers = log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))]
    # Add these outlier indices to the set of all outlier indices

# Separate the outliers
outlier_indices = outliers_tally.keys()
outliers_df = log_data.loc[outlier_indices]

# Show the outliers
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
128 4.941642 9.087834 8.248791 4.955827 6.967909 1.098612
65 4.442651 9.950323 10.732651 3.583519 10.095388 7.260523
66 2.197225 7.335634 8.911530 5.164786 8.151333 3.295837
203 6.368187 6.529419 7.703459 6.150603 6.860664 2.890372
325 10.395650 9.728181 9.519735 11.016479 7.148346 8.632128
193 5.192957 8.156223 9.917982 6.865891 8.633731 6.501290
264 6.978214 9.177714 9.645041 4.110874 8.696176 7.142827
137 8.034955 8.997147 9.021840 6.493754 6.580639 3.583519
75 9.923192 7.036148 1.098612 8.390949 1.098612 6.882437
142 10.519646 8.875147 9.018332 8.004700 2.995732 1.098612
109 7.248504 9.724899 10.274568 6.511745 6.728629 1.098612
81 5.389072 9.163249 9.575192 5.645447 8.964184 5.049856
338 1.098612 5.808142 8.856661 9.655090 2.708050 6.309918
86 10.039983 11.205013 10.377047 6.894670 9.906981 6.805723
343 7.431892 8.848509 10.177932 7.283448 9.646593 3.610918
289 10.663966 5.655992 6.154858 7.235619 3.465736 3.091042
420 8.402007 8.569026 9.490015 3.218876 8.827321 7.239215
218 2.890372 8.923191 9.629380 7.158514 8.475746 8.759669
233 6.871091 8.513988 8.106515 6.842683 6.013715 1.945910
412 4.574711 8.190077 9.425452 4.584967 7.996317 4.127134
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
95 1.098612 7.979339 8.740657 6.086775 5.407172 6.563856
96 3.135494 7.869402 9.001839 4.976734 8.262043 5.379897
353 4.762174 8.742574 9.961898 5.429346 9.069007 7.013016
98 6.220590 4.718499 6.656727 6.796824 4.025352 4.882802
355 5.247024 6.588926 7.606885 5.501258 5.214936 4.844187
356 10.029503 4.897840 5.384495 8.057377 2.197225 6.306275
357 3.610918 7.150701 10.011086 4.919981 8.816853 4.700480
38 8.431853 9.663261 9.723703 3.496508 8.847360 6.070738
145 10.000569 9.034080 10.457143 3.737670 9.440738 8.396155
161 9.428190 6.291569 5.645447 6.995766 1.098612 7.711101
171 5.298317 10.160530 9.894245 6.478510 9.079434 8.740337
429 9.060331 7.467371 8.183118 3.850148 4.430817 7.824446
175 7.759187 8.967632 9.382106 3.951244 8.341887 7.436617
304 5.081404 8.917311 10.117510 6.424869 9.374413 7.787382
305 5.493061 9.468001 9.088399 6.683361 8.271037 5.351858
285 10.602965 6.461468 8.188689 6.948897 6.077642 2.890372
439 7.932721 7.437206 7.828038 4.174387 6.167516 3.951244
184 5.789960 6.822197 8.457443 4.304065 5.811141 2.397895
57 8.597297 9.203618 9.257892 3.637586 8.932213 7.156177
187 7.798933 8.987447 9.192075 8.743372 8.148735 1.098612
183 10.514529 10.690808 9.911952 10.505999 5.476464 10.777768
In [20]:
print("Number of data points considered outliers: {}".format(len(outlier_indices)))
Number of data points considered outliers: 42

We can see that there is quite a lot of samples that are considered outliers for at least one feature.

It might not be a good idea to eliminate all 42 outliers for a dataset with only 440 data points in total. Removing all these points could negatively affect a model we create from it. So instead we will need to pick out a subset of the outliers. We will concentrate on only those data points that are outliers for more than one feature at once.

In [21]:
# Row indices that are marked as outliers for more than one feature
dup_outliers = [index for index, count in outliers_tally.items() if count > 1]

print("Data Points Considered Outliers for more than one feature")
Data Points Considered Outliers for more than one feature
[128, 65, 66, 75, 154]

This gives us a much smaller set of data points, and we could potentially look at whether these indeed pick out the worst offending data points by visualising those points. These points will be plotted as red dots in the following plot, the remaining outliers will be green, and the non-outlier points will be small grey dots.

In [22]:
def format_point_aesthetics(data, outlier_indices=[], removal_indices=[]):
    data:             The full data 
    outlier_indices:  List of indices of all the points considered outliers
    removal_indices:  List of indices of the points to consider removing
    Returns:          A dictionary containing the "size" and "color" for each 
                      data point. It assigns larger red values for for data 
                      points to be removed, medium sized green for the outliers 
                      that are not to be removed, and all others are assigned 
                      to be small and grey.
    colors = ["#FF0000" if index in  removal_indices     # red for removals
              else "#00FF00" if index in outlier_indices # green for outliers 
              else "#AAAAAA" for index in data.index]    # grey for all others
    sizes = [50 if index in  removal_indices       # biggest for removals
             else 30 if index in outlier_indices   # medium for outliers
             else 5 for index in data.index]       # small for all others
    return {"color": colors, "size": sizes}
In [23]:
# Plot the data, highlighting outliers and potential points to remove
point_aes = format_point_aesthetics(log_data, outlier_indices, dup_outliers)
img = pd.scatter_matrix(log_data, alpha = 0.8, figsize=(14, 9), diagonal='kde', 
                        marker=".", c=point_aes["color"], s = point_aes["size"], 

We can see that some of those points that are considered outliers for more than one feature are indeed some of the worst offending points that really seem to lie way outside the rest of the data.

However, when each one of these candidate outliers are visualised separately (not shown here to avoid having too many plots), it is not entirely clear that points 65 and 66 should be removed. They do not stand out as sticking out too much from the rest of the points for any of the cells in the plot, so they will not be removed from the data. This leaves us with points 128, 154 and 75 as candidates to remove.

In [24]:
removals = [128, 154, 75]

It is also clear that there are other extreme outliers. So these will be chose manually. The following are some aditional data points that will be flagged for removal, the indices were extracted by looking up the outliers dataframe printed out above.

  • There are some extreme points with a 'Fresh' value of less than 2
  • There are some extreme points with a 'Delicatessen' value of approximately less than 2
  • There is an aditional extreme point with a 'Detergents_Paper' value of approximately 1
In [25]:
# Include more hand selected outliers for removal
removals.extend([95, 338,             # Fresh < 2
                 142, 187, 109,       # Delicatessen ~< 2
                 161,                 # Detergent ~ 1

With those values selected, we can visualise the points chosen for removal in the plot below. It may appear if some of the points are not outliers at all since they appear to be right in the center of some of the cells below. However, those data points that appear in the center of some of the cells, actually appear as extreme outliers in some of the other cells.

In [26]:
# Plot the data, highlighting outliers and manually selected points to remove
point_aes = format_point_aesthetics(log_data, outlier_indices, removals)
img = pd.scatter_matrix(log_data, alpha = 0.8, figsize=(14, 9), diagonal='kde', 
                        marker=".", c=point_aes["color"], s = point_aes["size"],