Steven P Fonseca
Toggle Navigation
  • Home
  • Architecture
  • DevOps
  • Data + AI/ML
  • Leadership
  • Engineering
  • About

Unsupervised Machine Learning using K-means and Hierarchical Clustering

  • Print
Details
Written by Steven Fonseca
Published: 30 January 2021

 

 

 

 

This Jupyter Notebook demonstrates an end-to-end data science workflow that uses K-Means and Hierarchical Clustering machine learning models built using Python, Pandas, Numpy, SciKit Learn, SciPy, Matplotlib, and Seaborn. The objective of the workflow is to segment credit card consumers to improve the effectiveness of running targeted marketing campagins to upsell products and to provide these customers with improved customer service support.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
from scipy.stats import zscore
In [3]:
from scipy.spatial.distance import cdist
In [4]:
from sklearn.cluster import KMeans
In [5]:
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import fcluster
In [6]:
from sklearn.metrics import silhouette_score
In [7]:
%matplotlib inline
 

Exploratory Data Analysis and Preparation¶

In [8]:
originalData = pd.read_excel("CreditCardCustomerData.xlsx")
In [9]:
originalData.shape
Out[9]:
(660, 7)
In [10]:
originalData.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Sl_No                660 non-null    int64
 1   Customer Key         660 non-null    int64
 2   Avg_Credit_Limit     660 non-null    int64
 3   Total_Credit_Cards   660 non-null    int64
 4   Total_visits_bank    660 non-null    int64
 5   Total_visits_online  660 non-null    int64
 6   Total_calls_made     660 non-null    int64
dtypes: int64(7)
memory usage: 36.2 KB
In [11]:
# kmeans and hierarchical analsysis require that datasets have no null / nan values.  the isnull 
# function shows us that there are no null / nan values

originalData.isnull().sum()
Out[11]:
Sl_No                  0
Customer Key           0
Avg_Credit_Limit       0
Total_Credit_Cards     0
Total_visits_bank      0
Total_visits_online    0
Total_calls_made       0
dtype: int64
In [12]:
# Sl_N0 appears to be a numerical index for each row of data
# Each row of data also has a bank Customer Key
# these features will not help with clustering so can be dropped

originalData.head(10)
Out[12]:
  Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 1 87073 100000 2 1 1 0
1 2 38414 50000 3 0 10 9
2 3 17341 50000 7 1 3 4
3 4 40496 30000 5 1 1 4
4 5 47437 100000 6 0 12 3
5 6 58634 20000 3 0 1 8
6 7 48370 100000 5 0 11 2
7 8 37376 15000 3 0 1 1
8 9 82490 5000 2 0 2 2
9 10 44770 3000 4 0 1 7
In [13]:
# delete Sl_No index from the dataset
# use a copy of the unmodifed data

preparedData = originalData.drop("Sl_No", axis=1)
In [14]:
# The range of possible values for the average credit limit is quite a bit larger than the other 
# features so the dataset will need to be scaled prior to performing cluster analysis

originalData.describe()
Out[14]:
  Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
count 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000 660.000000
mean 330.500000 55141.443939 34574.242424 4.706061 2.403030 2.606061 3.583333
std 190.669872 25627.772200 37625.487804 2.167835 1.631813 2.935724 2.865317
min 1.000000 11265.000000 3000.000000 1.000000 0.000000 0.000000 0.000000
25% 165.750000 33825.250000 10000.000000 3.000000 1.000000 1.000000 1.000000
50% 330.500000 53874.500000 18000.000000 5.000000 2.000000 2.000000 3.000000
75% 495.250000 77202.500000 48000.000000 6.000000 4.000000 4.000000 5.000000
max 660.000000 99843.000000 200000.000000 10.000000 5.000000 15.000000 10.000000
In [15]:
# Temporarily record if a row of data is duplicated within the dataset
# Display duplicates organized by Customer key

preparedData["Duplicate"] = preparedData.duplicated(subset=['Customer Key'], keep=False)
preparedData[preparedData["Duplicate"] == True].sort_values(by=["Customer Key"])
Out[15]:
  Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made Duplicate
48 37252 6000 4 0 2 8 True
432 37252 59000 6 2 1 2 True
4 47437 100000 6 0 12 3 True
332 47437 17000 7 3 1 0 True
411 50706 44000 4 5 0 2 True
541 50706 60000 7 5 2 2 True
391 96929 13000 4 5 0 0 True
398 96929 67000 6 2 2 2 True
104 97935 17000 2 1 2 10 True
632 97935 187000 7 1 7 0 True
In [16]:
# It appears that for 5 customers there duplicate entries.  For these entries, it is as though the state 
# of the customer was recorded at two different moments of time, as evidenced by the index.  While we 
# cannot determine if this is definitely the case, with that as our assumption, we will drop the duplicates 
# with the lower index number in hopes that the newer entries more accurately reflect customer state.

preparedData.drop([48, 4, 411, 391, 104], inplace=True)
preparedData.drop("Duplicate", axis=1, inplace=True)
preparedData.drop("Customer Key", axis=1, inplace=True)
In [17]:
# make a copy of the prepared data before this data is scaled. 

unscaledPreparedData = preparedData.copy()
In [18]:
# reset the indexes so that future joins of the dataset can be completed to enable storage of the 
# clustering information to be stored in the dataframe

preparedData = preparedData.reset_index(drop=True)
unscaledPreparedData = unscaledPreparedData.reset_index(drop=True)
In [19]:
# scale dataset and visually inspect

preparedData = preparedData.apply(zscore)
preparedData.head(10)
Out[19]:
  Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 1.739436 -1.249084 -0.864813 -0.548851 -1.252966
1 0.410337 -0.788414 -1.480522 2.535493 1.900849
2 0.410337 1.054266 -0.864813 0.136559 0.148730
3 -0.121303 0.132926 -0.864813 -0.548851 0.148730
4 -0.387123 -0.788414 -1.480522 -0.548851 1.550425
5 1.739436 0.132926 -1.480522 2.878198 -0.552118
6 -0.520033 -0.788414 -1.480522 -0.548851 -0.902542
7 -0.785853 -1.249084 -1.480522 -0.206146 -0.552118
8 -0.839017 -0.327744 -1.480522 -0.548851 1.200001
9 -0.652943 -0.327744 -1.480522 0.821969 0.499153
In [20]:
sns.pairplot(preparedData, diag_kind="kde");
 
 

Reviewing the kernal density plots to gain some approximated insights into unique customer segments for the given features: avg credit limit 2, total credit cards 4, total bank visits 4, total visits online 2, total calls made 3.

In [21]:
correlations = preparedData.corr()
sns.heatmap(correlations, annot = True);
 
 
  • There is a postive correlation of total credit cards and average credit limit
  • There is a negative correlation with total credit cards and total calls made. That is, if a customer does not have many credit cards, they are less likely to call into the bank.
 

K-means Clustering Analysis¶

In [22]:
# run the k-means algorithm for 2-10 customer segments
clusters = range(2,10)

meanDistortions = []
kmeansSilhouetteScores = {}

# for each number of segments, group the customers into their respective segment
for k in clusters:
    model = KMeans(n_clusters = k)
    model.fit(preparedData)
    prediction = model.predict(preparedData)
    
    # calculate the silhouette scores to aid in determining which number of customer segments best 
    # organizes the customer base
    
    kmeansSilhouetteScores[k] = silhouette_score(preparedData, prediction)
    
    
    meanDistortions.append(sum(np.min(cdist(preparedData, model.cluster_centers_, 'euclidean'), axis=1)) / preparedData
                           .shape[0])

# plot the average distortion and look the for the elbow of the graph to approximate which value 
# of k most effectively organizes customers into segments such that customers within a segment are similar.

plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method');
    
 
 

Using the elbow method, it appears that k = 3 best organizes customers into segments based on their similarity.

In [23]:
# { k : silhouette score}
kmeansSilhouetteScores
Out[23]:
{2: 0.4191048769542048,
 3: 0.5156757893111308,
 4: 0.3554352396755402,
 5: 0.2714987436128599,
 6: 0.2553546560260603,
 7: 0.2481432305370004,
 8: 0.22673845994732716,
 9: 0.21446088021595203}
 

The silhouette scores confirm that k = 3 is best.

In [24]:
# run the model once more with k = 3
kmeans = KMeans(n_clusters=3, n_init=15, random_state=10)
kmeans.fit(preparedData)
Out[24]:
KMeans(n_clusters=3, n_init=15, random_state=10)
In [25]:
# prepare to store the segment assignments to each customer row

segments = pd.DataFrame(kmeans.labels_, columns = list(["kmeans segment"]))
segments["kmeans segment"] = segments["kmeans segment"].astype("category")
In [26]:
# store the segment assignments to each customer row
# Note that segment 0 has a small number of customers 

unscaledPreparedData = unscaledPreparedData.join(segments)
unscaledPreparedData["kmeans segment"].value_counts()
Out[26]:
2    384
1    222
0     49
Name: kmeans segment, dtype: int64
In [27]:
# store the segment assignments to each customer row
preparedData = preparedData.join(segments)
In [28]:
preparedData.boxplot(by="kmeans segment", layout=(1,5), figsize=(20,10));
 
 

Key Questions¶

 
  • There are 3 reasonably distinct customer segments

Customer Segment Comparison

  • Segment 0 are customers that have the most number of credit cards and a higher average credit limit. They greatly prefer to receive support from the bank through utilizing online services.
  • Segment 1 are customers that have the least number of credit cards and the lowest average credit limit. They obtain bank support through a variety of channels, with more frequent use of the phone to obtain support over and above other customers.
  • Segment 2 are customers that have a medium number of credit cards and have only a sligtly higher average credit limit over Segment 1 customers. They prefer visiting the bank more than other customers.

Recommendations

  • Do not upsell to Segment 0 customers, upsell to Segment 1 and 2 customers.
  • Run a campaign to encourage Segment 1 and 2 customers to make better use of online services
  • Run a campaign that targets Segment 1 upselling over the phone.
  • Run a campaign that targets Segmetn 2 upselling during at branch visits.
 

Hierarchical Clustering¶

 

Determine Best Linkage Method¶

In [29]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}
 
In [30]:
# run hierarchical clustering for each type of linkage method and use the cophenetic correlations to 
# determine which method is best

linkageMethods = ["single", "complete", "average", "weighted", "ward", "centroid"]

cophenetic_correlations = {}

bestC = 0

for method in linkageMethods:
    z = linkage(preparedData, metric="euclidean", method=method)
    c, coph_distances = cophenet(z, pdist(preparedData))
    cophenetic_correlations[method] = c

    # record the best method, z, and cophenetic correlation of the set for use in the subsequent analysis

    if c > bestC:
        bestC = c
        bestMethod = method
        bestZ = z

    # plot a dendrogram for each linkage method to show the clustering
    plt.figure(figsize=(20, 10))
    plt.title(method + " Hierarchical Clustering Dendogram")
    plt.xlabel("Sample Index")
    plt.ylabel("Distance")
    dendrogram(z, leaf_rotation=90.,color_threshold=600,  leaf_font_size=10. )
    plt.tight_layout()
    ax = plt.gca()
    ax.axes.xaxis.set_visible(False)
    plt.show()
    
 
 
 
 
 
 
In [31]:
cophenetic_correlations
Out[31]:
{'single': 0.8788032031881882,
 'complete': 0.8842181144483582,
 'average': 0.9169396720532498,
 'weighted': 0.8902730773705734,
 'ward': 0.7727916416482941,
 'centroid': 0.9140728417938592}
In [32]:
# the best linkage method has the highest cophenetic correlation of 0.917
bestC
Out[32]:
0.9169396720532498
In [33]:
# the best linkage method is "average"
bestMethod
Out[33]:
'average'
In [34]:
# plot the dendrogram using the "average" linkage method
# estimate the distance values in preparation for determing its optimal value, choosing distances values
# that segments customers into 2-10 groups
# plot a horizontal line for each distance value to visually confirm the number of customer 
# segments that will result when using this distance to perform clustering

plt.figure(figsize=(20, 10))
plt.title(bestMethod + " Hierarchical Clustering Dendogram")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
dendrogram(bestZ, truncate_mode="lastp", p=10, leaf_rotation=90.,color_threshold=600,  leaf_font_size=10. )
plt.tight_layout()
ax = plt.gca()
ax.axes.xaxis.set_visible(False)
plt.axhline(y=5.8, color='g', linestyle='-')   # 1 customer segment
plt.axhline(y=5, color='b', linestyle='-')     # 2 customer segments
plt.axhline(y=3.3, color='g', linestyle='-')   # 3 customer segments
plt.axhline(y=3, color='b', linestyle='-')     # 4 customer segments
plt.axhline(y=2.7, color='g', linestyle='-')   # 5 customer segments
plt.axhline(y=2.5, color='b', linestyle='-')   # 6 customer segments
plt.axhline(y=2.35, color='g', linestyle='-')  # 7 customer segments
plt.axhline(y=2.25, color='b', linestyle='-')  # 8 customer segments
plt.axhline(y=2.05, color='g', linestyle='-')  # 9 customer segments
plt.axhline(y=1.80, color='b', linestyle='-') # 10 customer segments
plt.show()
 
In [35]:
# record the distances to use when performing hierarchical clustering, dropping the distance 
# that leads to 1 customer segment

maxDistances = [5, 3.3, 3, 2.7, 2.5, 2.35, 2.25, 2.05, 1.80]
In [36]:
# perform hierachical clustering with the "average" linkage method for 2-10 customer segments
# compute the silhouette score for each clustering run to determine the optimal 
# distance (and therfore number of segments) to use use when clustering

numberOfSegments = 2

hierarchicalSilhouetteScores = {}

bestSilhouetteScore = -2 

for distance in maxDistances:
    clusters = fcluster(bestZ, distance, criterion="distance")
    silhouetteScore = silhouette_score(preparedData, clusters)
   
    if distance == 3.3:
        segment3Clusters = clusters
        
    if silhouetteScore > bestSilhouetteScore:
        bestSilhouetteScore = silhouetteScore
        bestClusters = clusters
        
    hierarchicalSilhouetteScores[numberOfSegments] = silhouetteScore
    numberOfSegments = numberOfSegments + 1
In [37]:
hierarchicalSilhouetteScores
Out[37]:
{2: 0.5808380490115918,
 3: 0.5410850172396671,
 4: 0.4929152942108252,
 5: 0.453667191240895,
 6: 0.4264810940765058,
 7: 0.4281357822311226,
 8: 0.35893919580043504,
 9: 0.35893919580043504,
 10: 0.2480011500545616}
 

Clustering customer into 2 or 3 segments is the most optimal.

In [38]:
# add the segment cluster information to rows of the prepared dataset in preparation for further analysis

segments = pd.DataFrame(bestClusters, columns = list(["hierarchical best segment"]))
segments["hierarchical best segment"] = segments["hierarchical best segment"].astype("category")
unscaledPreparedData = unscaledPreparedData.join(segments)
preparedData = preparedData.join(segments)

segments = pd.DataFrame(segment3Clusters, columns = list(["hierarchical 3 segments"]))
segments["hierarchical 3 segments"] = segments["hierarchical 3 segments"].astype("category")
unscaledPreparedData = unscaledPreparedData.join(segments)
preparedData = preparedData.join(segments)
In [39]:
# clustering customers into 2 segments may not be that good of an idea since the vast majority of customers 
# would end up in a single segment and therefore be treated the same.

unscaledPreparedData["hierarchical best segment"].value_counts()
Out[39]:
2    606
1     49
Name: hierarchical best segment, dtype: int64
In [40]:
# a 3 segment clustering better distributes customers and may make it easier for the bank to tailor its 
# service approach and upselling more effectively

unscaledPreparedData["hierarchical 3 segments"].value_counts()
Out[40]:
3    384
2    222
1     49
Name: hierarchical 3 segments, dtype: int64
In [41]:
preparedData.boxplot(by="hierarchical best segment", layout=(1,5), figsize=(20,10));
 
In [42]:
preparedData.boxplot(by="hierarchical 3 segments", layout=(1,5), figsize=(20,10));
 
 

Key Questions¶

 
  • Organizing customers into 3 segments appears to be best. The silhouette score indicates that 2 segment clustering leads to groups where customers within a group are most similar. However, the score is only slightly better than 3 segment clustering and 3 segment clustering aligns better with intuition of how to think about the customers. We could get some feedback from a subject matter expert to confirm.

Customer Segment Comparison

  • The 3 segment clustering leads to the same types of customers in each segment as for the k-means 3 segment clustering so please see k-means Key Questions section for those details.
  • For the 2 segment clustering, Segment 0 customers have significantly more credits and a higher average credit card limit. They are much more likely to utilize online services. Segment 1 customers do not use utilize credit cards that much and do not have a large average credit card limit. They do not tend to use online services, preferring to receive support via the phone or through branch visits.

Recommendations

  • For the 3 segment clustering, the recommendations are the same as with the k-means clustering so please see that section.
  • For the 2 segment clustering, we would lose the fidelity of being able to differentiate Segment 1 customer preference for phone versus branch visits. Given that, we could just recommend a campagin to encourse greater use of online services. This segment would also be good to target for upselling. Segment 0 should just be ignored at current as they are "ideal" customers.
 

Kmeans versus Hierarchical Clustering Comparison¶

 
  • For finding the optimal number of segments, the clustering methods produced a similar result.
  • Based on the silhouette score comparison, it appears that for more fine grained customer segmentation that hierarchical clustering did a better job of grouping customers. This could be the result of linkage method optimization on the sample set or it could be that the model just performs better. Further analysis would be needed to determine the cause.
  • Next
  • You are here:  
  • Home
  • Unsupervised Machine Learning using K-means and Hierarchical Clustering

Back to Top

© 2025 Steven P Fonseca