This Jupyter Notebook demonstrates an end-to-end data science workflow that uses K-Means and Hierarchical Clustering models built with Python, pandas, NumPy, scikit-learn, SciPy, Matplotlib, and Seaborn. The objective of the workflow is to segment credit card customers so that the bank can run more effective targeted marketing campaigns to upsell products and provide these customers with improved customer service support.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore
from scipy.spatial.distance import cdist, pdist
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage, fcluster
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
%matplotlib inline
Exploratory Data Analysis and Preparation
originalData = pd.read_excel("CreditCardCustomerData.xlsx")
originalData.shape
(660, 7)
originalData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Sl_No                660 non-null    int64
 1   Customer Key         660 non-null    int64
 2   Avg_Credit_Limit     660 non-null    int64
 3   Total_Credit_Cards   660 non-null    int64
 4   Total_visits_bank    660 non-null    int64
 5   Total_visits_online  660 non-null    int64
 6   Total_calls_made     660 non-null    int64
dtypes: int64(7)
memory usage: 36.2 KB
# k-means and hierarchical clustering require that the dataset have no null / NaN values;
# the isnull function shows that there are none
originalData.isnull().sum()
Sl_No                  0
Customer Key           0
Avg_Credit_Limit       0
Total_Credit_Cards     0
Total_visits_bank      0
Total_visits_online    0
Total_calls_made       0
dtype: int64
# Sl_No appears to be a numerical index for each row of data
# Each row also has a bank Customer Key
# Neither feature will help with clustering, so both can be dropped
originalData.head(10)
|   | Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 87073 | 100000 | 2 | 1 | 1 | 0 |
| 1 | 2 | 38414 | 50000 | 3 | 0 | 10 | 9 |
| 2 | 3 | 17341 | 50000 | 7 | 1 | 3 | 4 |
| 3 | 4 | 40496 | 30000 | 5 | 1 | 1 | 4 |
| 4 | 5 | 47437 | 100000 | 6 | 0 | 12 | 3 |
| 5 | 6 | 58634 | 20000 | 3 | 0 | 1 | 8 |
| 6 | 7 | 48370 | 100000 | 5 | 0 | 11 | 2 |
| 7 | 8 | 37376 | 15000 | 3 | 0 | 1 | 1 |
| 8 | 9 | 82490 | 5000 | 2 | 0 | 2 | 2 |
| 9 | 10 | 44770 | 3000 | 4 | 0 | 1 | 7 |
# drop the Sl_No index from the dataset,
# working on a copy so the unmodified data is preserved
preparedData = originalData.drop("Sl_No", axis=1)
# The range of possible values for the average credit limit is much larger than that of the
# other features, so the dataset will need to be scaled prior to performing cluster analysis
originalData.describe()
|   | Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|---|
| count | 660.000000 | 660.000000 | 660.000000 | 660.000000 | 660.000000 | 660.000000 | 660.000000 |
| mean | 330.500000 | 55141.443939 | 34574.242424 | 4.706061 | 2.403030 | 2.606061 | 3.583333 |
| std | 190.669872 | 25627.772200 | 37625.487804 | 2.167835 | 1.631813 | 2.935724 | 2.865317 |
| min | 1.000000 | 11265.000000 | 3000.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 165.750000 | 33825.250000 | 10000.000000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 |
| 50% | 330.500000 | 53874.500000 | 18000.000000 | 5.000000 | 2.000000 | 2.000000 | 3.000000 |
| 75% | 495.250000 | 77202.500000 | 48000.000000 | 6.000000 | 4.000000 | 4.000000 | 5.000000 |
| max | 660.000000 | 99843.000000 | 200000.000000 | 10.000000 | 5.000000 | 15.000000 | 10.000000 |
# Temporarily flag rows that are duplicated within the dataset
# and display the duplicates organized by Customer Key
preparedData["Duplicate"] = preparedData.duplicated(subset=['Customer Key'], keep=False)
preparedData[preparedData["Duplicate"] == True].sort_values(by=["Customer Key"])
|   | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made | Duplicate |
|---|---|---|---|---|---|---|---|
| 48 | 37252 | 6000 | 4 | 0 | 2 | 8 | True |
| 432 | 37252 | 59000 | 6 | 2 | 1 | 2 | True |
| 4 | 47437 | 100000 | 6 | 0 | 12 | 3 | True |
| 332 | 47437 | 17000 | 7 | 3 | 1 | 0 | True |
| 411 | 50706 | 44000 | 4 | 5 | 0 | 2 | True |
| 541 | 50706 | 60000 | 7 | 5 | 2 | 2 | True |
| 391 | 96929 | 13000 | 4 | 5 | 0 | 0 | True |
| 398 | 96929 | 67000 | 6 | 2 | 2 | 2 | True |
| 104 | 97935 | 17000 | 2 | 1 | 2 | 10 | True |
| 632 | 97935 | 187000 | 7 | 1 | 7 | 0 | True |
# Five customers have duplicate entries. For each pair, it is as though the state of the
# customer was recorded at two different moments in time, as evidenced by the index. While we
# cannot be certain this is the case, under that assumption we drop the duplicate with the
# lower index number, on the expectation that the newer entry better reflects customer state.
preparedData.drop([48, 4, 411, 391, 104], inplace=True)
preparedData.drop("Duplicate", axis=1, inplace=True)
preparedData.drop("Customer Key", axis=1, inplace=True)
# make a copy of the prepared data before this data is scaled.
unscaledPreparedData = preparedData.copy()
# reset the indexes so that future joins can store the clustering
# information in the dataframes
preparedData = preparedData.reset_index(drop=True)
unscaledPreparedData = unscaledPreparedData.reset_index(drop=True)
# scale dataset and visually inspect
preparedData = preparedData.apply(zscore)
preparedData.head(10)
|   | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|
| 0 | 1.739436 | -1.249084 | -0.864813 | -0.548851 | -1.252966 |
| 1 | 0.410337 | -0.788414 | -1.480522 | 2.535493 | 1.900849 |
| 2 | 0.410337 | 1.054266 | -0.864813 | 0.136559 | 0.148730 |
| 3 | -0.121303 | 0.132926 | -0.864813 | -0.548851 | 0.148730 |
| 4 | -0.387123 | -0.788414 | -1.480522 | -0.548851 | 1.550425 |
| 5 | 1.739436 | 0.132926 | -1.480522 | 2.878198 | -0.552118 |
| 6 | -0.520033 | -0.788414 | -1.480522 | -0.548851 | -0.902542 |
| 7 | -0.785853 | -1.249084 | -1.480522 | -0.206146 | -0.552118 |
| 8 | -0.839017 | -0.327744 | -1.480522 | -0.548851 | 1.200001 |
| 9 | -0.652943 | -0.327744 | -1.480522 | 0.821969 | 0.499153 |
sns.pairplot(preparedData, diag_kind="kde");
Reviewing the kernel density plots on the pairplot diagonal gives a rough estimate of the number of distinct customer segments suggested by each feature: Avg_Credit_Limit 2, Total_Credit_Cards 4, Total_visits_bank 4, Total_visits_online 2, Total_calls_made 3.
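As a cross-check on those counts, each feature's density can be plotted on its own. This is a minimal sketch of the same diagonal panels (not part of the original analysis; it assumes seaborn ≥ 0.11 for the data=/x= call signature):
# plot a standalone KDE per feature to double-check the approximate number of
# modes (candidate segments) read off the pairplot diagonal
fig, axes = plt.subplots(1, 5, figsize=(20, 4))
for ax, column in zip(axes, preparedData.columns):
    sns.kdeplot(data=preparedData, x=column, ax=ax)
    ax.set_title(column)
plt.tight_layout()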
correlations = preparedData.corr()
sns.heatmap(correlations, annot = True);
- There is a positive correlation between total credit cards and average credit limit.
- There is a negative correlation between total credit cards and total calls made. That is, customers with fewer credit cards tend to make more calls to the bank.
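The same observations can be extracted programmatically by ranking the unique feature pairs in the correlation matrix. A small sketch (assumes pandas ≥ 1.1 for the key argument to sort_values):
# keep only the upper triangle (unique pairs, no self-correlations),
# then rank the pairs by absolute correlation strength
upperTriangle = correlations.where(np.triu(np.ones(correlations.shape, dtype=bool), k=1))
upperTriangle.stack().sort_values(key=abs, ascending=False)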
K-means Clustering Analysis
# run the k-means algorithm for 2 to 9 customer segments
clusters = range(2, 10)
meanDistortions = []
kmeansSilhouetteScores = {}
# for each candidate number of segments, group the customers into their respective segment
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(preparedData)
    prediction = model.predict(preparedData)

    # calculate the silhouette score to aid in determining which number of customer
    # segments best organizes the customer base
    kmeansSilhouetteScores[k] = silhouette_score(preparedData, prediction)

    # average distortion: mean distance from each customer to the nearest cluster center
    meanDistortions.append(
        sum(np.min(cdist(preparedData, model.cluster_centers_, "euclidean"), axis=1))
        / preparedData.shape[0]
    )
# plot the average distortion and look for the elbow of the graph to approximate which value
# of k most effectively organizes customers into segments such that customers within a segment are similar
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method');
Using the elbow method, it appears that k = 3 best organizes customers into segments based on their similarity.
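As a rough numerical cross-check of the visual reading, the elbow can be approximated as the k with the largest second difference of the distortion curve (a heuristic sketch, not part of the original analysis):
# the second difference is largest where the curve bends most sharply;
# index i of the second differences corresponds to clusters[i + 1]
secondDifferences = np.diff(meanDistortions, n=2)
elbowK = clusters[int(np.argmax(secondDifferences)) + 1]
elbowK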
# { k : silhouette score}
kmeansSilhouetteScores
{2: 0.4191048769542048, 3: 0.5156757893111308, 4: 0.3554352396755402, 5: 0.2714987436128599, 6: 0.2553546560260603, 7: 0.2481432305370004, 8: 0.22673845994732716, 9: 0.21446088021595203}
The silhouette scores confirm that k = 3 is best.
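Making that selection explicit (a one-liner sketch):
# choose the k whose clustering has the highest silhouette score
bestK = max(kmeansSilhouetteScores, key=kmeansSilhouetteScores.get)
bestK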
# run the model once more with k = 3
kmeans = KMeans(n_clusters=3, n_init=15, random_state=10)
kmeans.fit(preparedData)
KMeans(n_clusters=3, n_init=15, random_state=10)
# prepare to store the segment assignments to each customer row
segments = pd.DataFrame(kmeans.labels_, columns=["kmeans segment"])
segments["kmeans segment"] = segments["kmeans segment"].astype("category")
# store the segment assignments to each customer row
# Note that segment 0 has a small number of customers
unscaledPreparedData = unscaledPreparedData.join(segments)
unscaledPreparedData["kmeans segment"].value_counts()
2    384
1    222
0     49
Name: kmeans segment, dtype: int64
# store the segment assignments to each customer row
preparedData = preparedData.join(segments)
preparedData.boxplot(by="kmeans segment", layout=(1,5), figsize=(20,10));
Key Questions
- There are 3 reasonably distinct customer segments.
Customer Segment Comparison
- Segment 0 customers have the most credit cards and a higher average credit limit. They strongly prefer to receive support from the bank through online services.
- Segment 1 customers have the fewest credit cards and the lowest average credit limit. They obtain bank support through a variety of channels, using the phone more frequently than other customers.
- Segment 2 customers have a medium number of credit cards and only a slightly higher average credit limit than Segment 1 customers. They prefer visiting the bank more than other customers do. (The per-segment profile sketch below supports this reading.)
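A quick way to back these comparisons with numbers is to profile each segment on the unscaled features (a sketch; medians are used to match the boxplots above):
# median of each original feature per k-means segment
unscaledPreparedData.groupby("kmeans segment").median()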
Recommendations
- Do not upsell to Segment 0 customers; upsell to Segment 1 and 2 customers.
- Run a campaign to encourage Segment 1 and 2 customers to make better use of online services.
- Run a campaign that targets Segment 1 with upselling over the phone.
- Run a campaign that targets Segment 2 with upselling during branch visits.
Hierarchical Clustering
Determine Best Linkage Method
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}
# run hierarchical clustering for each type of linkage method and use the cophenetic
# correlations to determine which method is best
# note: cluster on the original features only, excluding the k-means segment labels
# joined to preparedData above
features = preparedData.drop(columns=["kmeans segment"])

linkageMethods = ["single", "complete", "average", "weighted", "ward", "centroid"]
cophenetic_correlations = {}
bestC = 0
for method in linkageMethods:
    z = linkage(features, metric="euclidean", method=method)
    c, coph_distances = cophenet(z, pdist(features))
    cophenetic_correlations[method] = c

    # record the best method, z, and cophenetic correlation for use in the subsequent analysis
    if c > bestC:
        bestC = c
        bestMethod = method
        bestZ = z

    # plot a dendrogram for each linkage method to show the clustering
    plt.figure(figsize=(20, 10))
    plt.title(method + " Hierarchical Clustering Dendrogram")
    plt.xlabel("Sample Index")
    plt.ylabel("Distance")
    dendrogram(z, leaf_rotation=90., color_threshold=600, leaf_font_size=10.)
    plt.tight_layout()
    ax = plt.gca()
    ax.axes.xaxis.set_visible(False)
    plt.show()
cophenetic_correlations
{'single': 0.8788032031881882, 'complete': 0.8842181144483582, 'average': 0.9169396720532498, 'weighted': 0.8902730773705734, 'ward': 0.7727916416482941, 'centroid': 0.9140728417938592}
# the best linkage method has the highest cophenetic correlation of 0.917
bestC
0.9169396720532498
# the best linkage method is "average"
bestMethod
'average'
# plot the dendrogram using the "average" linkage method
# estimate distance values in preparation for determining the optimal one, choosing
# distances that segment customers into 2-10 groups
# plot a horizontal line for each distance value to visually confirm the number of customer
# segments that will result when using this distance to perform clustering
plt.figure(figsize=(20, 10))
plt.title(bestMethod + " Hierarchical Clustering Dendrogram")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
dendrogram(bestZ, truncate_mode="lastp", p=10, leaf_rotation=90., color_threshold=600, leaf_font_size=10.)
plt.tight_layout()
ax = plt.gca()
ax.axes.xaxis.set_visible(False)
plt.axhline(y=5.8, color='g', linestyle='-') # 1 customer segment
plt.axhline(y=5, color='b', linestyle='-') # 2 customer segments
plt.axhline(y=3.3, color='g', linestyle='-') # 3 customer segments
plt.axhline(y=3, color='b', linestyle='-') # 4 customer segments
plt.axhline(y=2.7, color='g', linestyle='-') # 5 customer segments
plt.axhline(y=2.5, color='b', linestyle='-') # 6 customer segments
plt.axhline(y=2.35, color='g', linestyle='-') # 7 customer segments
plt.axhline(y=2.25, color='b', linestyle='-') # 8 customer segments
plt.axhline(y=2.05, color='g', linestyle='-') # 9 customer segments
plt.axhline(y=1.80, color='b', linestyle='-') # 10 customer segments
plt.show()
# record the distances to use when performing hierarchical clustering, dropping the distance
# that leads to 1 customer segment
maxDistances = [5, 3.3, 3, 2.7, 2.5, 2.35, 2.25, 2.05, 1.80]
# perform hierarchical clustering with the "average" linkage method for 2-10 customer segments
# compute the silhouette score for each clustering run to determine the optimal
# distance (and therefore number of segments) to use when clustering
numberOfSegments = 2
hierarchicalSilhouetteScores = {}
bestSilhouetteScore = -2
for distance in maxDistances:
    clusters = fcluster(bestZ, distance, criterion="distance")
    # score each clustering on the feature columns only
    silhouetteScore = silhouette_score(features, clusters)
    if distance == 3.3:
        segment3Clusters = clusters
    if silhouetteScore > bestSilhouetteScore:
        bestSilhouetteScore = silhouetteScore
        bestClusters = clusters
    hierarchicalSilhouetteScores[numberOfSegments] = silhouetteScore
    numberOfSegments = numberOfSegments + 1
hierarchicalSilhouetteScores
{2: 0.5808380490115918, 3: 0.5410850172396671, 4: 0.4929152942108252, 5: 0.453667191240895, 6: 0.4264810940765058, 7: 0.4281357822311226, 8: 0.35893919580043504, 9: 0.35893919580043504, 10: 0.2480011500545616}
Clustering customers into 2 or 3 segments is optimal.
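As an aside, the AgglomerativeClustering import at the top offers an equivalent scikit-learn route to the same 3-segment solution (a sketch; average linkage on Euclidean distances mirrors the SciPy analysis, though the label numbering may differ from fcluster's 1-based labels):
# equivalent 3-segment clustering with scikit-learn's AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=3, linkage="average")
aggLabels = agg.fit_predict(features)
pd.Series(aggLabels).value_counts()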
# add the segment cluster information to rows of the prepared dataset in preparation for further analysis
segments = pd.DataFrame(bestClusters, columns=["hierarchical best segment"])
segments["hierarchical best segment"] = segments["hierarchical best segment"].astype("category")
unscaledPreparedData = unscaledPreparedData.join(segments)
preparedData = preparedData.join(segments)
segments = pd.DataFrame(segment3Clusters, columns=["hierarchical 3 segments"])
segments["hierarchical 3 segments"] = segments["hierarchical 3 segments"].astype("category")
unscaledPreparedData = unscaledPreparedData.join(segments)
preparedData = preparedData.join(segments)
# clustering customers into 2 segments may not be a good idea, since the vast majority of
# customers would end up in a single segment and therefore be treated the same
unscaledPreparedData["hierarchical best segment"].value_counts()
2    606
1     49
Name: hierarchical best segment, dtype: int64
# a 3 segment clustering distributes customers better and may make it easier for the bank
# to tailor its service approach and upselling more effectively
unscaledPreparedData["hierarchical 3 segments"].value_counts()
3    384
2    222
1     49
Name: hierarchical 3 segments, dtype: int64
preparedData.boxplot(by="hierarchical best segment", layout=(1,5), figsize=(20,10));
preparedData.boxplot(by="hierarchical 3 segments", layout=(1,5), figsize=(20,10));
Key Questions
- Organizing customers into 3 segments appears to be best. The silhouette score indicates that 2-segment clustering produces groups whose members are most similar; however, the score is only slightly better than for 3 segments, and 3-segment clustering aligns better with an intuitive view of the customer base. Feedback from a subject matter expert could confirm this.
Customer Segment Comparison
- The 3 segment clustering produces the same types of customers in each segment as the k-means 3 segment clustering, so see the k-means Key Questions section for those details.
- For the 2 segment clustering (labeled 1 and 2 in the value counts above), Segment 1 customers have significantly more credit cards and a higher average credit limit. They are much more likely to utilize online services. Segment 2 customers do not use credit cards much and do not have a large average credit limit. They tend not to use online services, preferring to receive support via the phone or through branch visits.
Recommendations
- For the 3 segment clustering, the recommendations are the same as with the k-means clustering so please see that section.
- For the 2 segment clustering, we would lose the fidelity to differentiate Segment 2 customers' preference for phone versus branch visits. Given that, we could simply recommend a campaign to encourage greater use of online services; this segment would also be a good target for upselling. Segment 1 can be ignored for now as these are "ideal" customers.
K-means versus Hierarchical Clustering Comparison
- For finding the optimal number of segments, the two clustering methods produced a similar result.
- Based on the silhouette score comparison, it appears that for more fine-grained customer segmentation, hierarchical clustering did a better job of grouping customers. This could be the result of optimizing the linkage method on this sample set, or the model may simply perform better here; further analysis would be needed to determine the cause.
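To quantify how closely the two 3-segment solutions agree, the label columns stored above can be cross-tabulated and scored with the adjusted Rand index (a sketch; the index is 1.0 for identical partitions regardless of label numbering):
from sklearn.metrics import adjusted_rand_score

# contingency table of k-means segments versus hierarchical 3-segment labels
print(pd.crosstab(unscaledPreparedData["kmeans segment"],
                  unscaledPreparedData["hierarchical 3 segments"]))
# 1.0 means the two methods partition the customers identically
adjusted_rand_score(unscaledPreparedData["kmeans segment"],
                    unscaledPreparedData["hierarchical 3 segments"])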