Improving customer segmentation

Improving customer segmentation#

A DataCamp challenge May, 2023

k-means clustering

The project#

You work for a medical device manufacturer in Switzerland. Your company manufactures orthopedic devices and sells them worldwide. The company sells directly to individual doctors who use them on rehabilitation and physical therapy patients.

Historically, the sales and customer support departments have grouped doctors by geography. However, the region is not a good predictor of the number of purchases a doctor will make or their support needs.

Your team wants to use a data-centric approach to segmenting doctors to improve marketing, customer service, and product planning

The company stores the information you need in the following four tables. Some of the fields are anonymized to comply with privacy regulations.

Doctors contains information on doctors. Each row represents one doctor.

“DoctorID” - is a unique identifier for each doctor.
“Region” - the current geographical region of the doctor.
“Category” - the type of doctor, either ‘Specialist’ or ‘General Practitioner.’
“Rank” - is an internal ranking system. It is an ordered variable: The highest level is Ambassadors, followed by Titanium Plus, Titanium, Platinum Plus, Platinum, Gold Plus, Gold, Silver Plus, and the lowest level is Silver.
“Incidence rate” and “R rate” - relate to the amount of re-work each doctor generates.
“Satisfaction” - measures doctors’ satisfaction with the company.
“Experience” - relates to the doctor’s experience with the company.
“Purchases” - purchases over the last year.

Orders contains details on orders. Each row represents one order; a doctor can place multiple orders.

“DoctorID” - doctor id (matches the other tables).
“OrderID” - order identifier.
“OrderNum” - order number.
“Conditions A through J” - map the different settings of the devices in each order. Each order goes to an individual patient.

Complaints collects information on doctor complaints.

“DoctorID” - doctor id (matches the other tables).
“Complaint Type” - the company’s classification of the complaints.
“Qty” - number of complaints per complaint type per doctor.

Instructions has information on whether the doctor includes special instructions on their orders.

“DoctorID” - doctor id (matches the other tables).
“Instructions” - ‘Yes’ when the doctor includes special instructions, ‘No’ when they do not.

Create a report that covers the following:

How many doctors are there in each region? What is the average number of purchases per region?
Can you find a relationship between purchases and complaints?
Define new doctor segments that help the company improve marketing efforts and customer service.
Identify which features impact the new segmentation strategy the most.
Your team will need to explain the new segments to the rest of the company. Describe which characteristics distinguish the newly defined segments.

Data validation#

Let’s first take a look at the doctors table:

    DoctorID   Region              Category        Rank  Incidence rate  \
0     AHDCBA     4 15            Specialist  Ambassador           49.00   
1     ABHAHF   1 8 T4  General Practitioner  Ambassador           37.00   
2      FDHFJ   1 9 T4            Specialist  Ambassador           33.00   
3     BJJHCA  1 10 T3            Specialist  Ambassador           28.00   
4      FJBEA  1 14 T4            Specialist  Ambassador           23.00   
..       ...      ...                   ...         ...             ...   
432   AIABDJ     1 10            Specialist  Ambassador            2.18   
433   BBAJCF   1 9 T4            Specialist  Ambassador            2.17   
434    GGCFB  1 19 T4            Specialist  Ambassador            2.14   
435    FDCEG      1 9            Specialist  Ambassador            2.13   
436    EIEIB     1 13            Specialist  Ambassador            2.05   

     R rate Satisfaction  Experience  Purchases  
0      0.90        53.85        1.20       49.0  
1      0.00       100.00        0.00       38.0  
2      1.53           --        0.00       34.0  
3      2.03           --        0.48       29.0  
4      0.96        76.79        0.75       24.0  
..      ...          ...         ...        ...  
432    0.80        11.76        0.77       35.0  
433    1.68           --        0.11       19.0  
434    0.77           --        0.27       22.0  
435    0.84       100.00        0.32       25.0  
436    1.56        11.11        0.39       64.0  

[437 rows x 9 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 437 entries, 0 to 436
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   DoctorID        437 non-null    object 
 1   Region          437 non-null    object 
 2   Category        437 non-null    object 
 3   Rank            435 non-null    object 
 4   Incidence rate  437 non-null    float64
 5   R rate          437 non-null    float64
 6   Satisfaction    437 non-null    object 
 7   Experience      437 non-null    float64
 8   Purchases       437 non-null    float64
dtypes: float64(4), object(5)
memory usage: 30.9+ KB
None

I will make some changes addressing missing values, data-type reassignments and replacements.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 435 entries, 0 to 436
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   DoctorID        435 non-null    object  
 1   Region          435 non-null    category
 2   Category        435 non-null    category
 3   Rank            435 non-null    category
 4   Incidence rate  435 non-null    float64 
 5   R rate          435 non-null    float64 
 6   Satisfaction    435 non-null    float64 
 7   Experience      435 non-null    float64 
 8   Purchases       435 non-null    float64 
dtypes: category(3), float64(5), object(1)
memory usage: 27.0+ KB

Let’s take a look at the other tables.

    DoctorID Complaint Type  Qty
0      EHAHI        Correct   10
1      EHDGF        Correct    2
2      EHDGF        Unknown    3
3      EHDIJ        Correct    8
4      EHDIJ      Incorrect    2
..       ...            ...  ...
430   BHGIFC      Incorrect    1
431   BHHDDF        Correct    1
432   CJAFAB      Incorrect    1
433   CAAHID        Correct    2
434   CAECFG        Correct    2

[435 rows x 3 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435 entries, 0 to 434
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   DoctorID        435 non-null    object
 1   Complaint Type  433 non-null    object
 2   Qty             435 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 10.3+ KB
None

   DoctorID Instructions
0    ADIFBD          Yes
1    ABHBED           No
2     FJFEG          Yes
3    AEBDAB           No
4    AJCBFE          Yes
..      ...          ...
72   ABEAFF          Yes
73    FCGCI          Yes
74    FBAHD          Yes
75    FCABB          Yes
76    GHDFB          Yes

[77 rows x 2 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   DoctorID      77 non-null     object
 1   Instructions  77 non-null     object
dtypes: object(2)
memory usage: 1.3+ KB
None

The data is now ready for analysis.

Data analysis#

Q1: How many doctors are there in each region? What is the average number of purchases per region?#

Show code cell source Hide code cell source

# Count categories of doctors by region
reg_cat = doctors.pivot_table(index='Region', columns='Category', values='DoctorID', aggfunc='count')

# Create new column with total doctors by region
reg_cat['total_docs'] = reg_cat.sum(axis=1)

# Sort by total number of doctors
reg_cat = reg_cat.sort_values('total_docs', ascending=False)

# Calculate purchase average by region
reg_pur_mean = doctors.pivot_table(index='Region', values='Purchases', aggfunc='mean')
reg_pur_mean = reg_pur_mean.rename(columns={'Purchases': 'purchases_mean'})

# Calculate purchase total by region
reg_pur_sum = doctors.pivot_table(index='Region', values='Purchases', aggfunc='sum')
reg_pur_sum = reg_pur_sum.rename(columns={'Purchases': 'purchases_sum'})

# Join tables on index
reg = reg_cat.join(reg_pur_sum).join(reg_pur_mean)


# Plot
fig, ax = plt.subplots(1, 2, sharey=True, figsize=(7, 9))

fig.suptitle('Total purchases per region', fontsize=14, y=0.92)
plt.subplots_adjust(wspace=0)
sns.despine()

(-reg['purchases_sum']).plot(ax=ax[0], kind='barh', color='slategrey')
reg[['Specialist', 'General Practitioner']].plot(ax=ax[1], kind='barh',
                                                 stacked=True,
                                                 color=['lightgray', 'chocolate'])

for i in range(2):
    ax[i].grid(axis="x")
    ax[i].set_axisbelow(True)
    ax[i].set_ylabel('Region', fontsize=14)
    ax[i].tick_params(axis='x', labelsize=12, rotation=0)
    ax[i].tick_params(axis='y', labelsize=12)

ax[0].set_xlabel('Total purchases', fontsize=14)
ax[1].set_xlabel('Number of doctors', fontsize=14)

ax[0].invert_yaxis()

ax[1].legend(title='Doctor category',
             loc='center', bbox_to_anchor=(0.8, 0.4),
             title_fontsize=12, fontsize=12)

ax[0].set_xticks(list(range(0, -401, -100)), labels=list(range(0, 401, 100)))

plt.show()

print(f"\nCorrelation coef. -> {reg['total_docs'].corr(reg['purchases_sum']):.2f}")

# Plot scatter
fig, ax = plt.subplots(figsize=(7.5, 5))

ax.plot(reg['total_docs'], reg['purchases_sum'], marker='o', linestyle='none',
        alpha=0.75, color='slategrey')

sns.despine()
ax.grid(axis="both")
ax.set_axisbelow(True)
ax.set_title('Regions', fontsize=14)
ax.set_xlabel('Number of doctors', fontsize=14)
ax.set_ylabel('Total purchases', fontsize=14)

ax.annotate('1 19 20', xy=(1, 132), xytext=(1, 170), fontsize=11,
            arrowprops={"arrowstyle":"-|>", "color":"black", 'linestyle':"--"})
ax.annotate('1 14', xy=(31.5, 389), xytext=(27, 388), fontsize=11,
            arrowprops={"arrowstyle":"-|>", "color":"black", 'linestyle':"--"})

ax.set_xlim(0)
ax.set_ylim(0)

plt.show()

../_images/c9548224c02aab8f7971b161e93f45ca8067b6b2116a5f628b3ad31fd700ae84.png

Correlation coef. -> 0.91

../_images/86492374d5b9cf80873b1a0c8dc24db1465a3cb7217b107248e0f03426ac28ba.png

Customer segmentation is based on the region, and regions are likely sorted by the number of customers in each one of them. The graphs demonstrate a strong correlation between the total number of purchases and the number of customers-doctors.

../_images/324b5b880da6abb849aa1cb9853ceb7ca759f8cb8863dfc488c1c407f5904c2f.png

Correlation coef. -> -0.13

../_images/2fb6d236fef77b04a43b9ac91d12bcb92ee77a14361f80ecbdde4a215750f0b6.png

However, when considering average purchases, we can observe that the best customers are dispersed across various regions. This likely creates challenges in targeting specific marketing and sales campaigns based on the average purchasing history of customers.

Q2: Can you find a relationship between purchases and complaints?#

../_images/26bc002b4ee91fd79b7e2dc508b5cff1f5ac24d0bd90eb654037b18b57d036e1.png

To facilitate easier viewing and minimize their impact, I will remove the three outliers marked in the graph.

Correlation coef. -> 0.07

../_images/0eb9e0b0f2441c902664eff0df985152764046da792504d8d5c595252393953f.png

There is no relationship between purchases and complaints. Knowing one does not allow us to deduce any information about the other.

Q3: Define new doctor segments that help the company improve marketing efforts and customer service.#

        Purchases  complaints
count  432.000000  432.000000
mean    10.375000    0.537037
std      9.344558    1.764175
min      3.000000    0.000000
25%      4.000000    0.000000
50%      7.000000    0.000000
75%     13.000000    0.000000
max     64.000000   14.000000

../_images/9a747e2fbfce6b0d9bbfedb96323437555dc3e905a0f9b4a5638c5502ce5297f.png

I will proceed with a logarithmic transformation to manage data skewness. Afterward, I will standardize the data to prepare it for the K-Means algorithm.

To determine the optimal number of clusters for K-Means, I will employ the elbow criterion method. For each number of clusters, I will calculate the sum of squared errors (SSE), which represents the sum of squared distances from each data point to its cluster center. By examining the chart, I will identify the elbow of the curve, which is the point where the SSE significantly levels off and becomes less substantial.

../_images/cca930fa115dae52111cf18d1acf65c67ff4c25211e59cb0c010adbe05a8512f.png

Number of clusters selected -> 3

	Purchases	complaints
	mean	mean	count
Cluster
0	5.3	0.1	258
1	11.2	5.4	36
2	19.6	0.2	138

In the table, we can already observe distinct differences among the three clusters. Cluster 1 exhibits a higher mean value for complaints, while cluster 2 demonstrates a higher average for purchases.

../_images/d6110775f9ad28f061612b32f6b37a9e5381a45248b622db74e8a8b9185af749.png

The snake plot of the normalized variables allows us to clearly observe the aforementioned differences between the clusters.

../_images/1933ebeb739abfc365ccba7a599089953367ecba75399bfc9652f42cf8ff0577.png

Finally, we revisit the scatter plot where we previously observed no relationship between variables. This time, each cluster is assigned a distinct color for easier identification. Additionally, I will assign descriptive names to each cluster:

Cluster 1 -> complainers (due to their higher average of complaints)
Cluster 2 -> purchasers (due to their higher average of purchases)
Cluster 0 -> lowers (simply a descriptive name, not indicating high purchases or complaints)

../_images/692e5e6d6843d6f44c261cb3a05378d585eab469fac215b69d681f8f86831e41.png

These graphs serve to confirm the meaningfulness of the descriptive names that were assigned.

Q4: Identify which features impact the new segmentation strategy the most.#

../_images/31af835c3d2842b6c514a28759e0c22ece01cad644f24f37fd1914feb100f35b.png

The new segmentation strategy imbalances the number of customers.

../_images/2c7c5a6d05537002800557d16df38bd837c858d9bcbdb382006c38c60a8c4460.png

Almost all complainers and the great majority of purchasers come from specialist doctors.

../_images/c9965566142f328fcb3b0c5d2beed6c8741d34a3689ba409db4f071f489216ff.png

This graph illustrates the distribution of customers from each cluster in relation to the internal ranking of the company.

../_images/24322583f0af815a035f44ec9649354fcc8aa857f4dcce7b16d833d6fc59b4a5.png

The amount of re-work each doctor generates.

../_images/e6736a7fd6c6d2cc73e3b4d2432e721c0f07c492b5e670b2377d6273c4ab6628.png

../_images/81c0465807515e40d8f9e605a319993d0b2471e8ecf42ad0f1ce46674dbc076b.png

../_images/449567b66166a8584b4f21b60efb69b37a3a30163417916d2384d6c61681b2a9.png

Conclusions#

In this project, a new customer segmentation strategy was proposed, taking into account customer purchasing and complaining rates. This classification approach would assist the company in addressing marketing and sales activities with greater precision, as well as improving customer service and support.