**Ronny Restrepo**

Being able to categorize customers into meaningful groups (customer segmentation) based on spending habits is valuable for businesses. Doing so can give a business insight regarding how to best meet the specific needs of different groups of customers. It can also allow a business to perform A/B testing more effectively, by discovering how particular changes will affect different groups of customers differently.

This project made use of the Wholesale Customers Dataset from the UCI Machine Learning Repository, which contains the annual spending habits of 440 clients of a wholesale distributor. The goal was to find the best way to describe the variation among the different types of customers that a wholesale distributor interacts with.

An unsupervised learning approach was used to create meaningfully different groups of customers. Specifically, Principal Component Analysis (PCA) was used to reduce the features down to just two dimensions, keeping the analysis interpretable and capable of being visualized. A clustering algorithm (a Gaussian Mixture Model) was then used to cluster the customers into separate groups. Different numbers of clusters were tried and evaluated using the silhouette score. The best clustering turned out to split the customers into two separate groups, with a silhouette score of 0.429.
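The pipeline summarized above can be sketched with the modern scikit-learn API. This is a minimal illustration, not the notebook's actual code: the synthetic two-blob data below is only a stand-in for the real 440 × 6 spending matrix, and the model settings are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Synthetic stand-in for the 440 x 6 spending matrix (two well-separated groups)
X = np.vstack([rng.normal(0, 1, (220, 6)),
               rng.normal(4, 1, (220, 6))])

# Reduce to two components, keeping the analysis easy to visualise
reduced = PCA(n_components=2).fit_transform(X)

# Try several cluster counts and score each clustering with the silhouette score
scores = {}
for k in range(2, 6):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(reduced)
    scores[k] = silhouette_score(reduced, gmm.predict(reduced))

best_k = max(scores, key=scores.get)
print("Best number of clusters:", best_k)
```

On data with two clearly separated groups like this, the silhouette score peaks at two clusters, mirroring the result reported for the wholesale data.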

The Wholesale Customers Dataset from the UCI Machine Learning Repository contains the annual spending habits of 440 clients of a wholesale distributor, covering a diverse set of product categories.

Since the purpose of this project is to discover whether meaningful clusters can be found from the spending habits of customers alone, two variables were removed from the original dataset: the `'Channel'` and `'Region'` features. What remains are six product categories.

In [1]:

```
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames
from collections import Counter
# Import plotting library
import matplotlib.pyplot as plt
# Show plots inline (nicely formatted in the notebook)
%matplotlib inline
```

In [2]:

```
# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis=1, inplace=True)
    print("Wholesale customers dataset has {} samples with {} features each.".format(*data.shape))
except Exception:
    print("Dataset could not be loaded.")
```

We can observe some basic statistical descriptions of the data, such as the mean, standard deviation, and quartiles. Each column represents one of the six product categories.

In [3]:

```
# Description of the dataset
display(data.describe())
```

A few data points are selected in order to maintain an intuition for how the data is being transformed through the analysis. Three samples were selected manually, specifically chosen to be very different from each other.

In [4]:

```
# Indices of three selected samples from data
indices = [43,181, 122]
# DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print("Chosen samples of wholesale customers dataset:")
display(samples)
```

To more easily compare how the different clients differ, the following table shows the product category purchases converted into z-scores, which express each client's purchasing habits as the number of standard deviations from the average client.

In [5]:

```
(samples - data.mean()) / data.std()
```

Out[5]:

We can see that client 0 spent above average on Milk, Grocery, and Detergents_Paper, but little on perishables like Fresh food and Delicatessen. This suggests that this client could be a convenience store.

Client 1 spent above average on all product categories, in many cases several standard deviations above average. This implies a client with very large purchasing power and a large clientele, and it is therefore likely to be a supermarket that re-sells all of those items to the general public.

Client 2 spent below average on all product categories except Fresh food (which was close to average). Its greatest expenditures were on Fresh food, Frozen food, and Delicatessen. This could be the spending pattern of a restaurant.

An interesting question to ask ourselves is: *can we predict how much a customer will spend annually on a particular product category based on how much they spend on the other categories of items?*

Answering this question will allow us to discover which variables are most relevant. It will also allow us to see which ones are redundant or can be explained in terms of the other variables.

We can answer this by training six separate supervised regression models. For each model, we set aside a different feature as the target label to be predicted and use the remaining features as the inputs. Each model is then evaluated on a held-out test set to see how well it can predict the target labels.

In [10]:

```
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Seeing how well each feature in the dataset can be explained by the rest of
# the data
for target_feature_name in data.columns:
    # Create a new dataframe with one feature removed
    new_data = data.drop(target_feature_name, axis=1)
    # Split the new data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(new_data,
                                                        data[target_feature_name],
                                                        test_size=0.25,
                                                        random_state=42)
    # Create a decision tree regressor and fit it to the training set
    regressor = DecisionTreeRegressor(max_depth=5, random_state=23)
    regressor = regressor.fit(X_train, y_train)
    # Report the score of the prediction using the testing set
    score = regressor.score(X_test, y_test)
    print("R^2 score when using '{}' "
          "as the target: {}".format(target_feature_name, score))
```

As can be seen from the above output, when we use `'Grocery'` as the variable to be predicted, we get an $R^2$ score of approximately 0.75. This indicates that 75% of its variability can be explained by the variability of the rest of the features, making it the least important feature in our dataset.

If we had to reduce the number of features in the dataset, `'Grocery'` would be the first variable to consider removing. However, 25% of its variability is still not explained by any of the other features, so it might be worth keeping in order to boost the performance of our model.
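As a reminder of what that score measures, $R^2 = 1 - SS_{res}/SS_{tot}$: the fraction of the target's variance accounted for by the predictions. A tiny hand-worked example (with made-up numbers, not values from this dataset):

```python
import numpy as np

# Hypothetical target values and model predictions (illustrative only)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 41.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares: 18
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 500
r2 = 1 - ss_res / ss_tot                        # 1 - 18/500 = 0.964
print(r2)
```

This is the same quantity `regressor.score()` returns above; a score near zero (or negative) means the remaining features tell us almost nothing about the target.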

If we contrast this with `'Delicatessen'`, we see that the regression model used to predict that feature from the remaining features does incredibly poorly. This suggests that `'Delicatessen'` is an important feature that we should definitely keep around.

To get a deeper understanding of how the variables in the dataset relate to each other, we can construct a scatter matrix of each of the six product features.

What we should hopefully find is that the most relevant variables are uncorrelated with the other variables. And since we identified `'Grocery'` and `'Detergents_Paper'` as the two least relevant variables, it should not be surprising if they are correlated with other variables.

In [11]:

```
# Produce a scatter matrix for each pair of features in the data
pd.plotting.scatter_matrix(data, alpha=0.3, figsize=(14, 8), diagonal='kde');
```

We can see that there appears to be a clear correlation between the two least relevant variables, `'Grocery'` and `'Detergents_Paper'`. We can get a more precise understanding of this correlation by creating a correlation matrix heatmap.

In [12]:

```
import seaborn as sns
mask = np.triu(np.ones((6,6)), k=0)
sns.heatmap(data.corr(), annot=True, square=True, mask=mask, linewidths=2)
```

Out[12]:

We can see that there is a very strong linear correlation between the `'Grocery'` and `'Detergents_Paper'` features. Previously, we saw that the `'Grocery'` feature could be predicted from the remaining variables with an $R^2$ score of approximately 0.75; the scatter plot and correlation matrix heatmap show us that `'Detergents_Paper'` is the variable that contributes the most towards that prediction.

To a lesser extent, there is also a correlation between `'Milk'` and `'Grocery'`, and between `'Milk'` and `'Detergents_Paper'`. This coincides with the previous observation that `'Milk'` and `'Detergents_Paper'` could each be explained by the remaining variables with an $R^2$ score between 0.4 and 0.5.

We can also see from the scatter plot matrix that the features in this data are heavily skewed, with most of the data points concentrated in the lower region of values. It would therefore be beneficial to apply a transformation to make the data more normally distributed.

Financial data is often heavily skewed, and it is usually appropriate to apply a non-linear scaling. We could achieve this through a Box-Cox transformation, which estimates the power transformation of the data that best reduces skewness. Alternatively, a simpler approach that works in most cases is to apply the natural logarithm.
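The two options can be compared on synthetic data; the log-normal values below are illustrative stand-ins for the real spending figures. `scipy.stats.boxcox` fits the power transform automatically, while `np.log` is the simple alternative this project uses.

```python
import numpy as np
from scipy import stats

# Illustrative right-skewed data (log-normal), not the real dataset
skewed = np.random.RandomState(1).lognormal(mean=3, sigma=1, size=500)

# Box-Cox: fits the power transform that best normalises the data,
# returning the transformed values and the fitted lambda parameter
transformed, lmbda = stats.boxcox(skewed)

# Natural log: the simpler approach used in this project
logged = np.log(skewed)

print("skewness before:", stats.skew(skewed))
print("skewness after Box-Cox:", stats.skew(transformed))
print("skewness after log:", stats.skew(logged))
```

Both transforms sharply reduce the skewness; for data that is roughly log-normal, Box-Cox tends to fit a lambda near zero, which makes it nearly equivalent to taking the log.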

In [16]:

```
import matplotlib
matplotlib.style.use('ggplot')
```

In [17]:

```
# Scale the data using the natural log
log_data = np.log(data)
# Scale the three selected clients data in the same way
log_samples = np.log(samples)
# Produce a scatter matrix for each pair of newly-transformed features
pd.plotting.scatter_matrix(log_data, alpha=0.3, figsize=(14, 8), diagonal='kde');
```

After scaling the data using the natural logarithm, we see that the distribution of each feature is much closer to normal. We can also see that the features that appeared to be correlated remain correlated after the transformation. We can get a more precise calculation of those correlations by again running a correlation matrix heatmap on the transformed data.

In [18]:

```
mask = np.triu(np.ones((6,6)), k=0)
sns.heatmap(log_data.corr(), annot=True, square=True, mask=mask, linewidths=2)
```

Out[18]:

Although the specific values of the correlations have changed, we see that the same ones still remain high.

In order to detect outliers that could skew the results of the analysis, Tukey's method for identifying outliers will be used: an outlier is defined as any data point that lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile for a given variable.

In [19]:

```
# For each feature find the data points with extreme high or low values
outliers_tally = Counter()
for feature in log_data.keys():
    # Calculate Q1 and Q3 (25th and 75th percentiles) for the given feature
    Q1 = log_data[feature].quantile(0.25)
    Q3 = log_data[feature].quantile(0.75)
    # Use the IQR to calculate an outlier boundary
    step = 1.5 * (Q3 - Q1)
    # Get the outliers for this feature
    new_outliers = log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))]
    # Add these outlier indices to the tally of all outlier indices
    outliers_tally.update(new_outliers.index)

# Separate the outliers
outlier_indices = list(outliers_tally.keys())
outliers_df = log_data.loc[outlier_indices]

# Show the outliers
display(outliers_df)
```

In [20]:

```
print("Number of data points considered outliers: {}".format(len(outlier_indices)))
```

We can see that quite a lot of samples are considered outliers for at least one feature.

It might not be a good idea to eliminate all 42 outliers from a dataset with only 440 data points in total, since removing that many points could negatively affect any model we build from it. Instead, we will pick out a subset of the outliers, concentrating on the data points that are outliers for more than one feature at once.

In [21]:

```
# Row indices that are marked as outliers for more than one feature
dup_outliers = [index for index, count in outliers_tally.items() if count > 1]
print("Data Points Considered Outliers for more than one feature")
print(dup_outliers)
```

This gives us a much smaller set of data points, and we can check whether these really are the worst offenders by visualising them. These points are plotted as red dots in the following plot; the remaining outliers are green, and the non-outlier points are small grey dots.

In [22]:

```
def format_point_aesthetics(data, outlier_indices=[], removal_indices=[]):
    """
    data:             The full data
    outlier_indices:  List of indices of all the points considered outliers
    removal_indices:  List of indices of the points to consider removing
    Returns: A dictionary containing the "size" and "color" for each
             data point. It assigns larger red markers to the data points
             to be removed, medium-sized green markers to the outliers
             that are not to be removed, and small grey markers to all
             others.
    """
    colors = ["#FF0000" if index in removal_indices     # red for removals
              else "#00FF00" if index in outlier_indices  # green for outliers
              else "#AAAAAA" for index in data.index]   # grey for all others
    sizes = [50 if index in removal_indices       # biggest for removals
             else 30 if index in outlier_indices  # medium for outliers
             else 5 for index in data.index]      # small for all others
    return {"color": colors, "size": sizes}
```

In [23]:

```
# Plot the data, highlighting outliers and potential points to remove
point_aes = format_point_aesthetics(log_data, outlier_indices, dup_outliers)
img = pd.plotting.scatter_matrix(log_data, alpha=0.8, figsize=(14, 9), diagonal='kde',
                                 marker=".", c=point_aes["color"], s=point_aes["size"],
                                 linewidths=0)
```

We can see that some of the points that are outliers for more than one feature are indeed among the worst offenders, lying well outside the rest of the data.

However, when each of these candidate outliers is visualised separately (not shown here to avoid having too many plots), it is not entirely clear that points 65 and 66 should be removed. They do not stick out from the rest of the points in any of the cells of the plot, so they will not be removed from the data. This leaves points 128, 154, and 75 as candidates to remove.

In [24]:

```
removals = [128, 154, 75]
```

It is also clear that there are other extreme outliers, so these will be chosen manually. The following additional data points will be flagged for removal; their indices were extracted by looking them up in the outliers dataframe printed above.

- There are some extreme points with a `'Fresh'` value of less than 2
- There are some extreme points with a `'Delicatessen'` value of approximately less than 2
- There is an additional extreme point with a `'Detergents_Paper'` value of approximately 1

In [25]:

```
# Include more hand-selected outliers for removal
removals.extend([95, 338,        # Fresh < 2
                 142, 187, 109,  # Delicatessen ~< 2
                 161,            # Detergents_Paper ~ 1
                 ])
```

With those values selected, we can visualise the points chosen for removal in the plot below. It may appear as if some of the points are not outliers at all, since they sit right in the center of some of the cells below. However, those same data points appear as extreme outliers in other cells.

In [26]:

```
# Plot the data, highlighting outliers and manually selected points to remove
point_aes = format_point_aesthetics(log_data, outlier_indices, removals)
img = pd.plotting.scatter_matrix(log_data, alpha=0.8, figsize=(14, 9), diagonal='kde',
                                 marker=".", c=point_aes["color"], s=point_aes["size"],
                                 linewidths=0)
```