Machine Learning Engineer Nanodegree

Unsupervised Learning

Project 3: Creating Customer Segments

Ronny Restrepo

Welcome to the third project of the Machine Learning Engineer Nanodegree! In this notebook, some template code has already been provided for you, and it will be your job to implement the additional functionality necessary to successfully complete this project. Sections that begin with 'Implementation' in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a 'TODO' statement. Please be sure to read the instructions carefully!

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a 'Question X' header. Carefully read each question and provide thorough answers in the following text boxes that begin with 'Answer:'. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.

Note: Code and Markdown cells can be executed using the Shift + Enter keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.

Getting Started

In this project, you will analyze a dataset containing data on various customers' annual spending amounts (reported in monetary units) across diverse product categories, in order to uncover its internal structure. One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer.

The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features 'Channel' and 'Region' will be excluded in the analysis — with focus instead on the six product categories recorded for customers.

Run the code block below to load the wholesale customers dataset, along with a few of the necessary Python libraries required for this project. You will know the dataset loaded successfully if the size of the dataset is reported.

In [2]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
import renders as rs
from IPython.display import display # Allows the use of display() for DataFrames
from collections import Counter 

# Import plotting library 
import matplotlib.pyplot as plt

# Show plots inline (nicely formatted in the notebook)
%matplotlib inline

# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print "Wholesale customers dataset has {} samples with {} features each.".format(*data.shape)
except:
    print "Dataset could not be loaded. Is the dataset missing?"
Wholesale customers dataset has 440 samples with 6 features each.

Data Exploration

In this section, you will begin exploring the data through visualizations and code to understand how each feature is related to the others. You will observe a statistical description of the dataset, consider the relevance of each feature, and select a few sample data points from the dataset which you will track through the course of this project.

Run the code block below to observe a statistical description of the dataset. Note that the dataset is composed of six important product categories: 'Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', and 'Delicatessen'. Consider what each category represents in terms of products you could purchase.

In [3]:
# Display a description of the dataset
display(data.describe())
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
count 440.000000 440.000000 440.000000 440.000000 440.000000 440.000000
mean 12000.297727 5796.265909 7951.277273 3071.931818 2881.493182 1524.870455
std 12647.328865 7380.377175 9503.162829 4854.673333 4767.854448 2820.105937
min 3.000000 55.000000 3.000000 25.000000 3.000000 3.000000
25% 3127.750000 1533.000000 2153.000000 742.250000 256.750000 408.250000
50% 8504.000000 3627.000000 4755.500000 1526.000000 816.500000 965.500000
75% 16933.750000 7190.250000 10655.750000 3554.250000 3922.000000 1820.250000
max 112151.000000 73498.000000 92780.000000 60869.000000 40827.000000 47943.000000

Implementation: Selecting Samples

To get a better understanding of the customers and how their data will transform through the analysis, it would be best to select a few sample data points and explore them in more detail. In the code block below, add three indices of your choice to the indices list which will represent the customers to track. It is suggested to try different sets of samples until you obtain customers that vary significantly from one another.

In [4]:
# Select three indices of your choice you wish to sample from the dataset
indices = [43, 181, 122]

# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print "Chosen samples of wholesale customers dataset:"
display(samples)
Chosen samples of wholesale customers dataset:
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 630 11095 23998 787 9529 72
1 112151 29627 18148 16745 4948 8550
2 12212 201 245 1991 25 860

Question 1

Consider the total purchase cost of each product category and the statistical description of the dataset above for your sample customers.
What kind of establishment (customer) could each of the three samples you've chosen represent?
Hint: Examples of establishments include places like markets, cafes, and retailers, among many others. Avoid using names for establishments, such as saying "McDonalds" when describing a sample customer as a restaurant.

Answer: To compare the chosen clients more easily, the following table shows each client's product-category purchases converted into z-scores, which express each purchase as the number of standard deviations it lies above or below the average client.

In [23]:
(samples - data.mean()) / data.std()
Out[23]:
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 -0.899028 0.717949 1.688567 -0.470666 1.394234 -0.515183
1 7.918724 3.228932 1.072982 2.816475 0.433425 2.491087
2 0.016739 -0.758127 -0.810917 -0.222658 -0.599115 -0.235761

We can see that client 0 spent above average on Milk, Grocery, and Detergents_Paper, but comparatively little on perishables such as fresh food and delicatessen items. This suggests that this client could be a convenience store.

Client 1 spent above average on all product categories, in many cases several standard deviations above average. This implies a client with very large purchasing power and a large clientele, and it is therefore likely to be a supermarket that re-sells all of those items to the general public.

Client 2 spent below average on all product categories except fresh food (which was close to average). Its greatest expenditures were on fresh, frozen, and delicatessen items, indicating that it might be a restaurant.

Implementation: Feature Relevance

One interesting thought to consider is if one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.

In the code block below, you will need to implement the following:

  • Assign new_data a copy of the data by removing a feature of your choice using the DataFrame.drop function.
  • Use sklearn.cross_validation.train_test_split to split the dataset into training and testing sets.
    • Use the removed feature as your target label. Set a test_size of 0.25 and set a random_state.
  • Import a decision tree regressor, set a random_state, and fit the learner to the training data.
  • Report the prediction score of the testing set using the regressor's score function.
In [25]:
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Create a new dataframe with one feature removed
target_feature_name = "Grocery"
new_data = data.drop(target_feature_name, axis = 1)

# Split the new data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(new_data, 
                                                    data[target_feature_name], 
                                                    test_size=0.25, 
                                                    random_state=42)

# Create a decision tree regressor and fit it to the training set
regressor = DecisionTreeRegressor(max_depth=5, random_state=23)
regressor = regressor.fit(X_train, y_train)

# Report the score of the prediction using the testing set
score = regressor.score(X_test, y_test)
print "R^2 score when using 'Grocery' as the predictor: ", score
R^2 score when using 'Grocery' as the predictor:  0.749486883359
In [26]:
# Seeing how well each feature in the dataset can be explained by the rest of 
# the data 
for target_feature_name in data.columns:
    # Create a new dataframe with one feature removed
    new_data = data.drop(target_feature_name, axis = 1)

    # Split the new data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(new_data, 
                                                    data[target_feature_name], 
                                                    test_size=0.25, 
                                                    random_state=42)

    # Create a decision tree regressor and fit it to the training set
    regressor = DecisionTreeRegressor(max_depth=5, random_state=23)
    regressor = regressor.fit(X_train, y_train)

    # Report the score of the prediction using the testing set
    score = regressor.score(X_test, y_test)
    print "R^2 score when using '{}' "\
          "as the predictor: {}".format(target_feature_name, score)
R^2 score when using 'Fresh' as the predictor: -0.195657966759
R^2 score when using 'Milk' as the predictor: 0.411045300534
R^2 score when using 'Grocery' as the predictor: 0.749486883359
R^2 score when using 'Frozen' as the predictor: 0.0681053929228
R^2 score when using 'Detergents_Paper' as the predictor: 0.476686872027
R^2 score when using 'Delicatessen' as the predictor: -0.990765315515

Question 2

Which feature did you attempt to predict? What was the reported prediction score? Is this feature necessary for identifying customers' spending habits?
Hint: The coefficient of determination, R^2, is scored between 0 and 1, with 1 being a perfect fit. A negative R^2 implies the model fails to fit the data.

Answer: When we use 'Grocery' as the variable to be predicted and fit a regression model to the remaining features, we get an $R^2$ score of approximately 0.75, meaning roughly 75% of its variability can be explained by the other features. This suggests that 'Grocery' is probably the least necessary feature in our dataset: if we had to reduce the number of features, it would be the first candidate for removal. However, 25% of its variability is still not explained by any of the other features, so it might be worth keeping in order to boost the performance of our model. If we contrast this with 'Delicatessen', we see that the regression model used to predict that feature from the remaining features does incredibly poorly (a negative $R^2$). This suggests that 'Delicatessen' carries information the other features do not, and is an important feature that we should definitely keep.
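For reference, the score reported by the regressor's score function is the coefficient of determination,

$$R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2},$$

where $y_i$ are the true 'Grocery' values in the test set, $\hat{y}_i$ are the model's predictions, and $\bar{y}$ is the mean of the test targets. A score of 0 means the model does no better than always predicting the mean, and a negative score (as seen above for 'Fresh' and 'Delicatessen') means it does worse than that baseline.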

Visualize Feature Distributions

To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data. If you found that the feature you attempted to predict above is relevant for identifying a specific customer, then the scatter matrix below may not show any correlation between that feature and the others. Conversely, if you believe that feature is not relevant for identifying a specific customer, the scatter matrix might show a correlation between that feature and another feature in the data. Run the code block below to produce a scatter matrix.

In [7]:
# Produce a scatter matrix for each pair of features in the data
pd.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

Question 3

Are there any pairs of features which exhibit some degree of correlation? Does this confirm or deny your suspicions about the relevance of the feature you attempted to predict? How is the data for those features distributed?
Hint: Is the data normally distributed? Where do most of the data points lie?

Answer: To make the correlations clearer, a heatmap of the correlation matrix is plotted below.

In [43]:
import seaborn as sns

# Heatmap of the correlation matrix, annotated with the correlation coefficients
sns.heatmap(data.corr(), annot=True, square=True, linewidths=2)
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1d9ebdb190>

There is a very strong linear correlation between the 'Grocery' and 'Detergents_Paper' features. Previously, we saw that 'Grocery' could be predicted from the remaining variables with an $R^2$ score of approximately 0.75; the scatter matrix and heatmap show that 'Detergents_Paper' is the variable that contributes the most towards that prediction. To a lesser extent there is also a correlation between 'Milk' and 'Grocery', and between 'Milk' and 'Detergents_Paper'. This coincides with the earlier observation that 'Milk' and 'Detergents_Paper' could each be explained by the remaining variables with an $R^2$ score between 0.4 and 0.5. The features in this data are heavily right-skewed: most of the data points lie in the lower region of values, with a long tail of very large purchases.
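As a quick check on the skew claim (an illustrative addition, not part of the project template), the sample skewness of each feature can be printed with the pandas skew method; large positive values confirm the long right tails visible in the scatter matrix.

In [ ]:
# Sample skewness of each feature; positive values indicate a right-skewed
# distribution (assumes the data frame loaded in the cells above)
print data.skew()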

Data Preprocessing

In this section, you will preprocess the data to create a better representation of customers by performing a scaling on the data and detecting (and optionally removing) outliers. Preprocessing data is often a critical step in assuring that the results you obtain from your analysis are significant and meaningful.

Implementation: Feature Scaling

If data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling, particularly for financial data. One way to achieve this scaling is by using a Box-Cox transformation, which calculates the best power transformation of the data that reduces skewness. A simpler approach, which works in most cases, is to apply the natural logarithm.
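As an illustrative aside (not required for this project), a Box-Cox transform could be fitted per feature with SciPy's stats.boxcox function, assuming SciPy is available; it requires strictly positive values, which holds for this dataset since the smallest recorded purchase is 3.

In [ ]:
from scipy import stats

# Illustrative only: let Box-Cox choose the power transformation (lambda)
# that best reduces skew, one feature at a time. Assumes the data frame
# loaded in the cells above; all values must be strictly positive.
boxcox_data = data.copy()
for feature in data.columns:
    transformed, best_lambda = stats.boxcox(data[feature])
    boxcox_data[feature] = transformed
    print "Best Box-Cox lambda for '{}': {:.3f}".format(feature, best_lambda)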

In the code block below, you will need to implement the following:

  • Assign a copy of the data to log_data after applying a logarithm scaling. Use the np.log function for this.
  • Assign a copy of the sample data to log_samples after applying a logarithm scaling. Again, use np.log.
In [56]:
import matplotlib
matplotlib.style.use('ggplot')
In [57]:
# Scale the data using the natural logarithm
log_data = np.log(data)

# Scale the sample data (of the 3 selected clients) using the natural logarithm
log_samples = np.log(samples)

# Produce a scatter matrix for each pair of newly-transformed features
pd.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

Observation

After applying a natural logarithm scaling to the data, the distribution of each feature should appear much more normal. For any pairs of features you may have identified earlier as being correlated, observe here whether that correlation is still present (and whether it is now stronger or weaker than before).
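To make that comparison concrete, the cell below (an illustrative addition to the template) prints the correlation of the most strongly related pair before and after the transform; a value that remains high indicates the relationship survives the log scaling.

In [ ]:
# Correlation of 'Grocery' with 'Detergents_Paper' before and after the
# log transform (assumes data and log_data from the cells above)
print "Correlation (raw data): {:.3f}".format(data['Grocery'].corr(data['Detergents_Paper']))
print "Correlation (log data): {:.3f}".format(log_data['Grocery'].corr(log_data['Detergents_Paper']))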

Run the code below to see how the sample data has changed after having the natural logarithm applied to it.

In [47]:
# Display the log-transformed sample data
display(log_samples)
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 6.445720 9.314250 10.085726 6.668228 9.162095 4.276666
1 11.627601 10.296441 9.806316 9.725855 8.506739 9.053687
2 9.410174 5.303305 5.501258 7.596392 3.218876 6.756932

Implementation: Outlier Detection

Detecting outliers in the data is extremely important in the data preprocessing step of any analysis. The presence of outliers can often skew results which take into consideration these data points. There are many "rules of thumb" for what constitutes an outlier in a dataset. Here, we will use Tukey's Method for identifying outliers: an outlier step is calculated as 1.5 times the interquartile range (IQR). A data point with a feature value beyond an outlier step outside of the IQR for that feature is considered abnormal.
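To make the rule concrete, here is a minimal sketch for a single feature using np.percentile, as suggested in the steps that follow (the implemented cell further down uses the equivalent pandas quantile method):

In [ ]:
# Illustrative sketch of Tukey's rule for one feature (assumes log_data
# from the cells above)
feature = 'Fresh'
Q1 = np.percentile(log_data[feature], 25)   # 25th percentile
Q3 = np.percentile(log_data[feature], 75)   # 75th percentile
step = 1.5 * (Q3 - Q1)                      # outlier step
flagged = log_data[(log_data[feature] < Q1 - step) |
                   (log_data[feature] > Q3 + step)]
print "Points flagged as '{}' outliers: {}".format(feature, len(flagged))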

In the code block below, you will need to implement the following:

  • Assign the value of the 25th percentile for the given feature to Q1. Use np.percentile for this.
  • Assign the value of the 75th percentile for the given feature to Q3. Again, use np.percentile.
  • Assign the calculation of an outlier step for the given feature to step.
  • Optionally remove data points from the dataset by adding indices to the outliers list.

NOTE: If you choose to remove any outliers, ensure that the sample data does not contain any of these points!
Once you have performed this implementation, the dataset will be stored in the variable good_data.

In [48]:
# For each feature find the data points with extreme high or low values
outliers_tally = Counter()
for feature in log_data.keys():
    
    # Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = log_data[feature].quantile(0.25)
    
    # Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = log_data[feature].quantile(0.75)
    
    # Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
    step = 1.5 * (Q3 - Q1)
    
    # Find the data points flagged as outliers for this feature
    new_outliers = log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))]
    #print "Data points considered outliers for the feature '{}':".format(feature)
    #display(new_outliers)
    
    # Add these outlier indices to the set of all outlier indices
    outliers_tally.update(new_outliers.index)

# Separate the outliers
outlier_indices = outliers_tally.keys()
outliers_df = log_data.loc[outlier_indices]

# Show the outliers
display(outliers_df) 
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
128 4.941642 9.087834 8.248791 4.955827 6.967909 1.098612
193 5.192957 8.156223 9.917982 6.865891 8.633731 6.501290
264 6.978214 9.177714 9.645041 4.110874 8.696176 7.142827
137 8.034955 8.997147 9.021840 6.493754 6.580639 3.583519
142 10.519646 8.875147 9.018332 8.004700 2.995732 1.098612
145 10.000569 9.034080 10.457143 3.737670 9.440738 8.396155
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
412 4.574711 8.190077 9.425452 4.584967 7.996317 4.127134
285 10.602965 6.461468 8.188689 6.948897 6.077642 2.890372
161 9.428190 6.291569 5.645447 6.995766 1.098612 7.711101
420 8.402007 8.569026 9.490015 3.218876 8.827321 7.239215
38 8.431853 9.663261 9.723703 3.496508 8.847360 6.070738
171 5.298317 10.160530 9.894245 6.478510 9.079434 8.740337
429 9.060331 7.467371 8.183118 3.850148 4.430817 7.824446
175 7.759187 8.967632 9.382106 3.951244 8.341887 7.436617
304 5.081404 8.917311 10.117510 6.424869 9.374413 7.787382
305 5.493061 9.468001 9.088399 6.683361 8.271037 5.351858
439 7.932721 7.437206 7.828038 4.174387 6.167516 3.951244
184 5.789960 6.822197 8.457443 4.304065 5.811141 2.397895
57 8.597297 9.203618 9.257892 3.637586 8.932213 7.156177
187 7.798933 8.987447 9.192075 8.743372 8.148735 1.098612
65 4.442651 9.950323 10.732651 3.583519 10.095388 7.260523
66 2.197225 7.335634 8.911530 5.164786 8.151333 3.295837
203 6.368187 6.529419 7.703459 6.150603 6.860664 2.890372
325 10.395650 9.728181 9.519735 11.016479 7.148346 8.632128
289 10.663966 5.655992 6.154858 7.235619 3.465736 3.091042
75 9.923192 7.036148 1.098612 8.390949 1.098612 6.882437
81 5.389072 9.163249 9.575192 5.645447 8.964184 5.049856
338 1.098612 5.808142 8.856661 9.655090 2.708050 6.309918
86 10.039983 11.205013 10.377047 6.894670 9.906981 6.805723
343 7.431892 8.848509 10.177932 7.283448 9.646593 3.610918
218 2.890372 8.923191 9.629380 7.158514 8.475746 8.759669
95 1.098612 7.979339 8.740657 6.086775 5.407172 6.563856
96 3.135494 7.869402 9.001839 4.976734 8.262043 5.379897
353 4.762174 8.742574 9.961898 5.429346 9.069007 7.013016
98 6.220590 4.718499 6.656727 6.796824 4.025352 4.882802
355 5.247024 6.588926 7.606885 5.501258 5.214936 4.844187
356 10.029503 4.897840 5.384495 8.057377 2.197225 6.306275
357 3.610918 7.150701 10.011086 4.919981 8.816853 4.700480
233 6.871091 8.513988 8.106515 6.842683 6.013715 1.945910
109 7.248504 9.724899 10.274568 6.511745 6.728629 1.098612
183 10.514529 10.690808 9.911952 10.505999 5.476464 10.777768
In [49]:
print "Number of data points considered outliers: {}".format(len(outlier_indices))
Number of data points considered outliers: 42

Question 4

Are there any data points considered outliers for more than one feature? Should these data points be removed from the dataset? If any data points were added to the outliers list to be removed, explain why.

Answer: 42 data points flagged as outliers is a lot for a dataset with only 440 points in total. Removing all of them could negatively affect any model we build from the data, so instead we should pick out a subset of the outliers to remove. One possibility is to look at the data points that are considered an outlier for more than one feature.

In [50]:
# Row indices that are marked as outliers for more than one feature
dup_outliers = [index for index, count in outliers_tally.items() if count > 1]

print "Data Points Considered Outliers for more than one feature"
print dup_outliers
Data Points Considered Outliers for more than one feature
[128, 154, 65, 66, 75]

This gives us a much smaller set of data points, and we can check whether these really are the worst offenders by visualising them. These points will be plotted as red dots in the following plot; the remaining outliers will be green, and the non-outlier points will be small grey dots.

In [51]:
def format_point_aesthetics(data, outlier_indices=[], removal_indices=[]):
    """
    data:             The full data 
    outlier_indices:  List of indices of all the points considered outliers
    removal_indices:  List of indices of the points to consider removing
    
    Returns:          A dictionary containing the "size" and "color" for each 
                      data point. It assigns a larger size and red color to the 
                      data points to be removed, a medium size and green color 
                      to the outliers that are not to be removed, and a small 
                      size and grey color to all other points.
    """
    colors = ["#FF0000" if index in  removal_indices     # red for removals
              else "#00FF00" if index in outlier_indices # green for outliers 
              else "#AAAAAA" for index in data.index]    # grey for all others
    
    sizes = [50 if index in  removal_indices       # biggest for removals
             else 30 if index in outlier_indices   # medium for outliers
             else 5 for index in data.index]       # small for all others
    
    return {"color": colors, "size": sizes}
    
In [58]:
# Plot the data, highlighting outliers and potential points to remove
point_aes = format_point_aesthetics(log_data, outlier_indices, dup_outliers)
img = pd.scatter_matrix(log_data, alpha = 0.8, figsize=(14, 9), diagonal='kde', 
                        marker=".", c=point_aes["color"], s = point_aes["size"], 
                        linewidths=0)