Details: Written by Steven Fonseca; Published: 30 January 2021

This Jupyter Notebook demonstrates an end-to-end data science workflow that uses K-Means and Hierarchical Clustering machine learning models built using Python, Pandas, Numpy, SciKit Learn, SciPy, Matplotlib, and Seaborn. The objective of the workflow is to segment credit card consumers to improve the effectiveness of running targeted marketing campagins to upsell products and to provide these customers with improved customer service support.

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:

from scipy.stats import zscore

In [3]:

from scipy.spatial.distance import cdist

In [4]:

from sklearn.cluster import KMeans

In [5]:

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import fcluster

In [6]:

from sklearn.metrics import silhouette_score

In [7]:

%matplotlib inline

Exploratory Data Analysis and Preparation¶

In [8]:

originalData = pd.read_excel("CreditCardCustomerData.xlsx")

In [9]:

originalData.shape

Out[9]:

(660, 7)

In [10]:

originalData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Sl_No                660 non-null    int64
 1   Customer Key         660 non-null    int64
 2   Avg_Credit_Limit     660 non-null    int64
 3   Total_Credit_Cards   660 non-null    int64
 4   Total_visits_bank    660 non-null    int64
 5   Total_visits_online  660 non-null    int64
 6   Total_calls_made     660 non-null    int64
dtypes: int64(7)
memory usage: 36.2 KB

In [11]:

# kmeans and hierarchical analsysis require that datasets have no null / nan values.  the isnull 
# function shows us that there are no null / nan values

originalData.isnull().sum()

Out[11]:

Sl_No                  0
Customer Key           0
Avg_Credit_Limit       0
Total_Credit_Cards     0
Total_visits_bank      0
Total_visits_online    0
Total_calls_made       0
dtype: int64

In [12]:

# Sl_N0 appears to be a numerical index for each row of data
# Each row of data also has a bank Customer Key
# these features will not help with clustering so can be dropped

originalData.head(10)

Out[12]:

	Sl_No	Customer Key	Avg_Credit_Limit	Total_Credit_Cards	Total_visits_bank	Total_visits_online	Total_calls_made
0	1	87073	100000	2	1	1	0
1	2	38414	50000	3	0	10	9
2	3	17341	50000	7	1	3	4
3	4	40496	30000	5	1	1	4
4	5	47437	100000	6	0	12	3
5	6	58634	20000	3	0	1	8
6	7	48370	100000	5	0	11	2
7	8	37376	15000	3	0	1	1
8	9	82490	5000	2	0	2	2
9	10	44770	3000	4	0	1	7

In [13]:

# delete Sl_No index from the dataset
# use a copy of the unmodifed data

preparedData = originalData.drop("Sl_No", axis=1)

In [14]:

# The range of possible values for the average credit limit is quite a bit larger than the other 
# features so the dataset will need to be scaled prior to performing cluster analysis

originalData.describe()

Out[14]:

	Sl_No	Customer Key	Avg_Credit_Limit	Total_Credit_Cards	Total_visits_bank	Total_visits_online	Total_calls_made
count	660.000000	660.000000	660.000000	660.000000	660.000000	660.000000	660.000000
mean	330.500000	55141.443939	34574.242424	4.706061	2.403030	2.606061	3.583333
std	190.669872	25627.772200	37625.487804	2.167835	1.631813	2.935724	2.865317
min	1.000000	11265.000000	3000.000000	1.000000	0.000000	0.000000	0.000000
25%	165.750000	33825.250000	10000.000000	3.000000	1.000000	1.000000	1.000000
50%	330.500000	53874.500000	18000.000000	5.000000	2.000000	2.000000	3.000000
75%	495.250000	77202.500000	48000.000000	6.000000	4.000000	4.000000	5.000000
max	660.000000	99843.000000	200000.000000	10.000000	5.000000	15.000000	10.000000

In [15]:

# Temporarily record if a row of data is duplicated within the dataset
# Display duplicates organized by Customer key

preparedData["Duplicate"] = preparedData.duplicated(subset=['Customer Key'], keep=False)
preparedData[preparedData["Duplicate"] == True].sort_values(by=["Customer Key"])

Out[15]:

	Customer Key	Avg_Credit_Limit	Total_Credit_Cards	Total_visits_bank	Total_visits_online	Total_calls_made	Duplicate
48	37252	6000	4	0	2	8	True
432	37252	59000	6	2	1	2	True
4	47437	100000	6	0	12	3	True
332	47437	17000	7	3	1	0	True
411	50706	44000	4	5	0	2	True
541	50706	60000	7	5	2	2	True
391	96929	13000	4	5	0	0	True
398	96929	67000	6	2	2	2	True
104	97935	17000	2	1	2	10	True
632	97935	187000	7	1	7	0	True

In [16]:

# It appears that for 5 customers there duplicate entries.  For these entries, it is as though the state 
# of the customer was recorded at two different moments of time, as evidenced by the index.  While we 
# cannot determine if this is definitely the case, with that as our assumption, we will drop the duplicates 
# with the lower index number in hopes that the newer entries more accurately reflect customer state.

preparedData.drop([48, 4, 411, 391, 104], inplace=True)
preparedData.drop("Duplicate", axis=1, inplace=True)
preparedData.drop("Customer Key", axis=1, inplace=True)

In [17]:

# make a copy of the prepared data before this data is scaled. 

unscaledPreparedData = preparedData.copy()

In [18]:

# reset the indexes so that future joins of the dataset can be completed to enable storage of the 
# clustering information to be stored in the dataframe

preparedData = preparedData.reset_index(drop=True)
unscaledPreparedData = unscaledPreparedData.reset_index(drop=True)

In [19]:

# scale dataset and visually inspect

preparedData = preparedData.apply(zscore)
preparedData.head(10)

Out[19]:

	Avg_Credit_Limit	Total_Credit_Cards	Total_visits_bank	Total_visits_online	Total_calls_made
0	1.739436	-1.249084	-0.864813	-0.548851	-1.252966
1	0.410337	-0.788414	-1.480522	2.535493	1.900849
2	0.410337	1.054266	-0.864813	0.136559	0.148730
3	-0.121303	0.132926	-0.864813	-0.548851	0.148730
4	-0.387123	-0.788414	-1.480522	-0.548851	1.550425
5	1.739436	0.132926	-1.480522	2.878198	-0.552118
6	-0.520033	-0.788414	-1.480522	-0.548851	-0.902542
7	-0.785853	-1.249084	-1.480522	-0.206146	-0.552118
8	-0.839017	-0.327744	-1.480522	-0.548851	1.200001
9	-0.652943	-0.327744	-1.480522	0.821969	0.499153

In [20]:

sns.pairplot(preparedData, diag_kind="kde");

Reviewing the kernal density plots to gain some approximated insights into unique customer segments for the given features: avg credit limit 2, total credit cards 4, total bank visits 4, total visits online 2, total calls made 3.

In [21]:

correlations = preparedData.corr()
sns.heatmap(correlations, annot = True);

There is a postive correlation of total credit cards and average credit limit
There is a negative correlation with total credit cards and total calls made. That is, if a customer does not have many credit cards, they are less likely to call into the bank.

K-means Clustering Analysis¶

In [22]:

# run the k-means algorithm for 2-10 customer segments
clusters = range(2,10)

meanDistortions = []
kmeansSilhouetteScores = {}

# for each number of segments, group the customers into their respective segment
for k in clusters:
    model = KMeans(n_clusters = k)
    model.fit(preparedData)
    prediction = model.predict(preparedData)
    
    # calculate the silhouette scores to aid in determining which number of customer segments best 
    # organizes the customer base
    
    kmeansSilhouetteScores[k] = silhouette_score(preparedData, prediction)
    
    
    meanDistortions.append(sum(np.min(cdist(preparedData, model.cluster_centers_, 'euclidean'), axis=1)) / preparedData
                           .shape[0])

# plot the average distortion and look the for the elbow of the graph to approximate which value 
# of k most effectively organizes customers into segments such that customers within a segment are similar.

plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method');
    

Using the elbow method, it appears that k = 3 best organizes customers into segments based on their similarity.

In [23]:

# { k : silhouette score}
kmeansSilhouetteScores

Out[23]:

{2: 0.4191048769542048,
 3: 0.5156757893111308,
 4: 0.3554352396755402,
 5: 0.2714987436128599,
 6: 0.2553546560260603,
 7: 0.2481432305370004,
 8: 0.22673845994732716,
 9: 0.21446088021595203}

The silhouette scores confirm that k = 3 is best.

In [24]:

# run the model once more with k = 3
kmeans = KMeans(n_clusters=3, n_init=15, random_state=10)
kmeans.fit(preparedData)

Out[24]:

KMeans(n_clusters=3, n_init=15, random_state=10)

In [25]:

# prepare to store the segment assignments to each customer row

segments = pd.DataFrame(kmeans.labels_, columns = list(["kmeans segment"]))
segments["kmeans segment"] = segments["kmeans segment"].astype("category")

In [26]:

# store the segment assignments to each customer row
# Note that segment 0 has a small number of customers 

unscaledPreparedData = unscaledPreparedData.join(segments)
unscaledPreparedData["kmeans segment"].value_counts()

Out[26]:

2    384
1    222
0     49
Name: kmeans segment, dtype: int64

In [27]:

# store the segment assignments to each customer row
preparedData = preparedData.join(segments)

In [28]:

preparedData.boxplot(by="kmeans segment", layout=(1,5), figsize=(20,10));

Key Questions¶

There are 3 reasonably distinct customer segments

Customer Segment Comparison

Segment 0 are customers that have the most number of credit cards and a higher average credit limit. They greatly prefer to receive support from the bank through utilizing online services.
Segment 1 are customers that have the least number of credit cards and the lowest average credit limit. They obtain bank support through a variety of channels, with more frequent use of the phone to obtain support over and above other customers.
Segment 2 are customers that have a medium number of credit cards and have only a sligtly higher average credit limit over Segment 1 customers. They prefer visiting the bank more than other customers.

Recommendations

Do not upsell to Segment 0 customers, upsell to Segment 1 and 2 customers.
Run a campaign to encourage Segment 1 and 2 customers to make better use of online services
Run a campaign that targets Segment 1 upselling over the phone.
Run a campaign that targets Segmetn 2 upselling during at branch visits.

Hierarchical Clustering¶

Determine Best Linkage Method¶

In [29]:

%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

In [30]:

# run hierarchical clustering for each type of linkage method and use the cophenetic correlations to 
# determine which method is best

linkageMethods = ["single", "complete", "average", "weighted", "ward", "centroid"]

cophenetic_correlations = {}

bestC = 0

for method in linkageMethods:
    z = linkage(preparedData, metric="euclidean", method=method)
    c, coph_distances = cophenet(z, pdist(preparedData))
    cophenetic_correlations[method] = c

    # record the best method, z, and cophenetic correlation of the set for use in the subsequent analysis

    if c > bestC:
        bestC = c
        bestMethod = method
        bestZ = z

    # plot a dendrogram for each linkage method to show the clustering
    plt.figure(figsize=(20, 10))
    plt.title(method + " Hierarchical Clustering Dendogram")
    plt.xlabel("Sample Index")
    plt.ylabel("Distance")
    dendrogram(z, leaf_rotation=90.,color_threshold=600,  leaf_font_size=10. )
    plt.tight_layout()
    ax = plt.gca()
    ax.axes.xaxis.set_visible(False)
    plt.show()
    

In [31]:

cophenetic_correlations

Out[31]:

{'single': 0.8788032031881882,
 'complete': 0.8842181144483582,
 'average': 0.9169396720532498,
 'weighted': 0.8902730773705734,
 'ward': 0.7727916416482941,
 'centroid': 0.9140728417938592}

In [32]:

# the best linkage method has the highest cophenetic correlation of 0.917
bestC

Out[32]:

0.9169396720532498

In [33]:

# the best linkage method is "average"
bestMethod

Out[33]:

'average'

In [34]:

# plot the dendrogram using the "average" linkage method
# estimate the distance values in preparation for determing its optimal value, choosing distances values
# that segments customers into 2-10 groups
# plot a horizontal line for each distance value to visually confirm the number of customer 
# segments that will result when using this distance to perform clustering

plt.figure(figsize=(20, 10))
plt.title(bestMethod + " Hierarchical Clustering Dendogram")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
dendrogram(bestZ, truncate_mode="lastp", p=10, leaf_rotation=90.,color_threshold=600,  leaf_font_size=10. )
plt.tight_layout()
ax = plt.gca()
ax.axes.xaxis.set_visible(False)
plt.axhline(y=5.8, color='g', linestyle='-')   # 1 customer segment
plt.axhline(y=5, color='b', linestyle='-')     # 2 customer segments
plt.axhline(y=3.3, color='g', linestyle='-')   # 3 customer segments
plt.axhline(y=3, color='b', linestyle='-')     # 4 customer segments
plt.axhline(y=2.7, color='g', linestyle='-')   # 5 customer segments
plt.axhline(y=2.5, color='b', linestyle='-')   # 6 customer segments
plt.axhline(y=2.35, color='g', linestyle='-')  # 7 customer segments
plt.axhline(y=2.25, color='b', linestyle='-')  # 8 customer segments
plt.axhline(y=2.05, color='g', linestyle='-')  # 9 customer segments
plt.axhline(y=1.80, color='b', linestyle='-') # 10 customer segments
plt.show()

In [35]:

# record the distances to use when performing hierarchical clustering, dropping the distance 
# that leads to 1 customer segment

maxDistances = [5, 3.3, 3, 2.7, 2.5, 2.35, 2.25, 2.05, 1.80]

In [36]:

# perform hierachical clustering with the "average" linkage method for 2-10 customer segments
# compute the silhouette score for each clustering run to determine the optimal 
# distance (and therfore number of segments) to use use when clustering

numberOfSegments = 2

hierarchicalSilhouetteScores = {}

bestSilhouetteScore = -2 

for distance in maxDistances:
    clusters = fcluster(bestZ, distance, criterion="distance")
    silhouetteScore = silhouette_score(preparedData, clusters)
   
    if distance == 3.3:
        segment3Clusters = clusters
        
    if silhouetteScore > bestSilhouetteScore:
        bestSilhouetteScore = silhouetteScore
        bestClusters = clusters
        
    hierarchicalSilhouetteScores[numberOfSegments] = silhouetteScore
    numberOfSegments = numberOfSegments + 1

In [37]:

hierarchicalSilhouetteScores

Out[37]:

{2: 0.5808380490115918,
 3: 0.5410850172396671,
 4: 0.4929152942108252,
 5: 0.453667191240895,
 6: 0.4264810940765058,
 7: 0.4281357822311226,
 8: 0.35893919580043504,
 9: 0.35893919580043504,
 10: 0.2480011500545616}

Clustering customer into 2 or 3 segments is the most optimal.

In [38]:

# add the segment cluster information to rows of the prepared dataset in preparation for further analysis

segments = pd.DataFrame(bestClusters, columns = list(["hierarchical best segment"]))
segments["hierarchical best segment"] = segments["hierarchical best segment"].astype("category")
unscaledPreparedData = unscaledPreparedData.join(segments)
preparedData = preparedData.join(segments)

segments = pd.DataFrame(segment3Clusters, columns = list(["hierarchical 3 segments"]))
segments["hierarchical 3 segments"] = segments["hierarchical 3 segments"].astype("category")
unscaledPreparedData = unscaledPreparedData.join(segments)
preparedData = preparedData.join(segments)

In [39]:

# clustering customers into 2 segments may not be that good of an idea since the vast majority of customers 
# would end up in a single segment and therefore be treated the same.

unscaledPreparedData["hierarchical best segment"].value_counts()

Out[39]:

2    606
1     49
Name: hierarchical best segment, dtype: int64

In [40]:

# a 3 segment clustering better distributes customers and may make it easier for the bank to tailor its 
# service approach and upselling more effectively

unscaledPreparedData["hierarchical 3 segments"].value_counts()

Out[40]:

3    384
2    222
1     49
Name: hierarchical 3 segments, dtype: int64

In [41]:

preparedData.boxplot(by="hierarchical best segment", layout=(1,5), figsize=(20,10));

In [42]:

preparedData.boxplot(by="hierarchical 3 segments", layout=(1,5), figsize=(20,10));

Key Questions¶

Organizing customers into 3 segments appears to be best. The silhouette score indicates that 2 segment clustering leads to groups where customers within a group are most similar. However, the score is only slightly better than 3 segment clustering and 3 segment clustering aligns better with intuition of how to think about the customers. We could get some feedback from a subject matter expert to confirm.

Customer Segment Comparison

The 3 segment clustering leads to the same types of customers in each segment as for the k-means 3 segment clustering so please see k-means Key Questions section for those details.
For the 2 segment clustering, Segment 0 customers have significantly more credits and a higher average credit card limit. They are much more likely to utilize online services. Segment 1 customers do not use utilize credit cards that much and do not have a large average credit card limit. They do not tend to use online services, preferring to receive support via the phone or through branch visits.

Recommendations

For the 3 segment clustering, the recommendations are the same as with the k-means clustering so please see that section.
For the 2 segment clustering, we would lose the fidelity of being able to differentiate Segment 1 customer preference for phone versus branch visits. Given that, we could just recommend a campagin to encourse greater use of online services. This segment would also be good to target for upselling. Segment 0 should just be ignored at current as they are "ideal" customers.

Kmeans versus Hierarchical Clustering Comparison¶

For finding the optimal number of segments, the clustering methods produced a similar result.
Based on the silhouette score comparison, it appears that for more fine grained customer segmentation that hierarchical clustering did a better job of grouping customers. This could be the result of linkage method optimization on the sample set or it could be that the model just performs better. Further analysis would be needed to determine the cause.

Unsupervised Machine Learning using K-means and Hierarchical Clustering

Exploratory Data Analysis and Preparation¶

K-means Clustering Analysis¶

Key Questions¶

Hierarchical Clustering¶

Determine Best Linkage Method¶

Key Questions¶

Kmeans versus Hierarchical Clustering Comparison¶