Reducing hospital readmissions

A DataCamp challenge Mar, 2023

Hypothesis test Predictive analytics

The project#

You work for a consulting company helping a hospital group better understand patient readmissions. The hospital gave you access to ten years of information on patients readmitted to the hospital after being discharged. The doctors want you to assess if initial diagnoses, number of procedures, or other variables could help them better understand the probability of readmission.

They want to focus follow-up calls and attention on those patients with a higher probability of readmission.

What is the most common primary diagnosis by age group?
Some doctors believe diabetes might play a central role in readmission. Explore the effect of a diabetes diagnosis on readmission rates.
On what groups of patients should the hospital focus their follow-up efforts to better monitor patients with a high probability of readmission?

You have access to ten years of patient information (source):

Column	Description
`age`	age bracket of the patient
`time_in_hospital`	days (from 1 to 14)
`n_procedures`	number of procedures performed during the hospital stay
`n_lab_procedures`	number of laboratory procedures performed during the hospital stay
`n_medications`	number of medications administered during the hospital stay
`n_outpatient`	number of outpatient visits in the year before a hospital stay
`n_inpatient`	number of inpatient visits in the year before the hospital stay
`n_emergency`	number of visits to the emergency room in the year before the hospital stay
`medical_specialty`	the specialty of the admitting physician
`diag_1`	primary diagnosis (Circulatory, Respiratory, Digestive, etc.)
`diag_2`	secondary diagnosis
`diag_3`	additional secondary diagnosis
`glucose_test`	whether the glucose serum came out as high (> 200), normal, or not performed
`A1Ctest`	whether the A1C level of the patient came out as high (> 7%), normal, or not performed
`change`	whether there was a change in the diabetes medication (‘yes’ or ‘no’)
`diabetes_med`	whether a diabetes medication was prescribed (‘yes’ or ‘no’)
`readmitted`	if the patient was readmitted at the hospital (‘yes’ or ‘no’)

Acknowledgments: Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records,” BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.

Data validation#

Read the data#

           age  time_in_hospital  n_lab_procedures  n_procedures  \
0      [70-80)                 8                72             1   
1      [70-80)                 3                34             2   
2      [50-60)                 5                45             0   
3      [70-80)                 2                36             0   
4      [60-70)                 1                42             0   
...        ...               ...               ...           ...   
24995  [80-90)                14                77             1   
24996  [80-90)                 2                66             0   
24997  [70-80)                 5                12             0   
24998  [70-80)                 2                61             3   
24999  [50-60)                10                37             1   

       n_medications  n_outpatient  n_inpatient  n_emergency  \
0                 18             2            0            0   
1                 13             0            0            0   
2                 18             0            0            0   
3                 12             1            0            0   
4                  7             0            0            0   
...              ...           ...          ...          ...   
24995             30             0            0            0   
24996             24             0            0            0   
24997              6             0            1            0   
24998             15             0            0            0   
24999             24             0            0            0   

            medical_specialty       diag_1       diag_2       diag_3  \
0                     Missing  Circulatory  Respiratory        Other   
1                       Other        Other        Other        Other   
2                     Missing  Circulatory  Circulatory  Circulatory   
3                     Missing  Circulatory        Other     Diabetes   
4            InternalMedicine        Other  Circulatory  Respiratory   
...                       ...          ...          ...          ...   
24995                 Missing  Circulatory        Other  Circulatory   
24996                 Missing    Digestive       Injury        Other   
24997                 Missing        Other        Other        Other   
24998  Family/GeneralPractice  Respiratory     Diabetes        Other   
24999                 Missing        Other     Diabetes  Circulatory   

      glucose_test A1Ctest change diabetes_med readmitted  
0               no      no     no          yes         no  
1               no      no     no          yes         no  
2               no      no    yes          yes        yes  
3               no      no    yes          yes        yes  
4               no      no     no          yes         no  
...            ...     ...    ...          ...        ...  
24995           no  normal     no           no        yes  
24996           no    high    yes          yes        yes  
24997       normal      no     no           no        yes  
24998           no      no    yes          yes         no  
24999           no      no     no           no        yes  

[25000 rows x 17 columns]

Check data integrity#

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   age                25000 non-null  object
 1   time_in_hospital   25000 non-null  int64 
 2   n_lab_procedures   25000 non-null  int64 
 3   n_procedures       25000 non-null  int64 
 4   n_medications      25000 non-null  int64 
 5   n_outpatient       25000 non-null  int64 
 6   n_inpatient        25000 non-null  int64 
 7   n_emergency        25000 non-null  int64 
 8   medical_specialty  25000 non-null  object
 9   diag_1             25000 non-null  object
 10  diag_2             25000 non-null  object
 11  diag_3             25000 non-null  object
 12  glucose_test       25000 non-null  object
 13  A1Ctest            25000 non-null  object
 14  change             25000 non-null  object
 15  diabetes_med       25000 non-null  object
 16  readmitted         25000 non-null  object
dtypes: int64(7), object(10)
memory usage: 3.2+ MB

We can see that there are no missing values.

There are also no duplicated rows.

Check categorical variables#

Let’s see if variables of type ‘object’ (strings) contain categories.

../_images/2a2af652339a7f11e8074a99417060c45e6ce10c883424c49949ee4287a24440.png

All of them are categorical variables.

The target variable, ‘readmitted’, is quite balanced, with the class of interest, ‘yes’, being well-represented and almost equal to the other class. I am going to replace the values in the target variable from ‘no-yes’ to numerical ‘0-1’ right away (without waiting for the creation of dummies) because it will facilitate some early analysis.

The number of values in the ‘Missing’ category for variables ‘diag_1’, ‘diag_2’, and ‘diag_3’ is low, so I will assign them to the ‘Other’ category to reduce dimensionality.

All the remaining ‘object’ columns can be converted to categorical.
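The conversions described above could look roughly like the following. This is a minimal sketch, not the notebook's actual cell: the tiny `df` below is made up and only borrows three column names from the dataset.

```python
import pandas as pd

# Tiny made-up sample with three of the dataset's column names (illustration only)
df = pd.DataFrame({
    "age": ["[70-80)", "[50-60)", "[60-70)"],
    "diag_1": ["Circulatory", "Missing", "Diabetes"],
    "readmitted": ["no", "yes", "no"],
})

# Encode the target as 0/1 so that its mean is directly the readmission rate
df["readmitted"] = df["readmitted"].map({"no": 0, "yes": 1}).astype("int32")

# Fold the rare 'Missing' diagnosis label into 'Other'
df["diag_1"] = df["diag_1"].replace("Missing", "Other")

# Convert the remaining string columns to the 'category' dtype
for col in ["age", "diag_1"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
```

With the target encoded as 0/1, `df["readmitted"].mean()` gives the overall readmission rate, which is what makes the early analysis easier.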

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   age                25000 non-null  category
 1   time_in_hospital   25000 non-null  int64   
 2   n_lab_procedures   25000 non-null  int64   
 3   n_procedures       25000 non-null  int64   
 4   n_medications      25000 non-null  int64   
 5   n_outpatient       25000 non-null  int64   
 6   n_inpatient        25000 non-null  int64   
 7   n_emergency        25000 non-null  int64   
 8   medical_specialty  25000 non-null  category
 9   diag_1             25000 non-null  category
 10  diag_2             25000 non-null  category
 11  diag_3             25000 non-null  category
 12  glucose_test       25000 non-null  category
 13  A1Ctest            25000 non-null  category
 14  change             25000 non-null  category
 15  diabetes_med       25000 non-null  category
 16  readmitted         25000 non-null  int32   
dtypes: category(9), int32(1), int64(7)
memory usage: 1.6 MB

We can see that the memory usage has been reduced by half.

Check numerical variables#

I will plot histograms for them to get a sense of their value ranges.

../_images/fb6879642cb9dc25bb3514f5daa1cb33118a55f7fbd0281adc4edc7a9ffb292a.png

It appears that there may be some outliers. I will check this by creating a box plot, which will also help me to see more clearly that these numerical values are of the same order of magnitude.

../_images/c89d6cfe584aa60ced98ffda99a655057ad1e1a89526f11604c2713878ce267b.png

To reduce the impact of outliers on the model outcome, I will winsorize the numerical variables. Winsorization does not drop any rows; it caps extreme values at lower and upper limits defined by percentiles. I will cap the lowest 5% and the highest 5% of the values of each variable (that is, at the 5th and 95th percentiles).
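One way to cap values at percentile limits is to clip each column at its own quantiles. The sketch below assumes that approach; the helper name `winsorize_columns` and the synthetic data are hypothetical, and the notebook may have used a different implementation (for example `scipy.stats.mstats.winsorize`).

```python
import numpy as np
import pandas as pd

def winsorize_columns(df, columns, lower=0.05, upper=0.95):
    """Return a copy of df with each listed column capped at its lower/upper quantiles."""
    capped = df.copy()
    for col in columns:
        low, high = capped[col].quantile([lower, upper])
        capped[col] = capped[col].clip(lower=low, upper=high)
    return capped

# Synthetic counts with one artificial extreme value (illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"n_medications": rng.poisson(16, size=1000)})
demo.loc[0, "n_medications"] = 80

capped = winsorize_columns(demo, ["n_medications"])
print(demo["n_medications"].max(), capped["n_medications"].max())
```

No rows are removed: the extreme value is replaced by the 95th-percentile limit, so the sample size stays the same.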

../_images/48e3a3aa3fec81afdab5b49daf7cc6a85c693b918fcb071149b3877a69cb4b7a.png

Data analysis#

I will answer each of the three questions that were asked.

Q1: What is the most common primary diagnosis by age group?#

../_images/b67795f171f1c9e10862b577ef4037dfebc667949aa2bc2a8f1d3b70bf94367c.png

As shown in the graphs, circulatory disease is the most common primary diagnosis across most of the age groups.
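The same question can be answered in tabular form by counting diagnoses within each age bracket and taking the most frequent one. The following is a small sketch on made-up rows; only the column names `age` and `diag_1` come from the dataset.

```python
import pandas as pd

# Made-up rows (illustration only)
demo = pd.DataFrame({
    "age": ["[40-50)", "[40-50)", "[40-50)", "[70-80)", "[70-80)", "[70-80)"],
    "diag_1": ["Other", "Other", "Circulatory", "Circulatory", "Circulatory", "Respiratory"],
})

# Count primary diagnoses within each age bracket
counts = pd.crosstab(demo["age"], demo["diag_1"])

# For each age bracket, keep the diagnosis with the highest count
most_common = counts.idxmax(axis=1)
print(most_common)
```

Because 'Other' aggregates many diagnoses, it may be worth inspecting `counts` itself rather than only the top entry per bracket.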

Q2: What is the effect of a diabetes diagnosis on the rate of readmissions?#

I will begin by plotting the incidence of the target variable (the readmission rate) for each value of the predictor. As I intend to use this type of graph for other variables in the project, I will define a function to create the so-called Predictor Insight Graph, which shows the effect of a predictor variable on the target variable.
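The table behind a Predictor Insight Graph holds, for each value of the predictor, the group size and the mean of the 0/1 target. The sketch below computes only that table; the function name `predictor_insight_table` and the sample rows are hypothetical, and the plotting step (typically bars for size and a line for incidence) is left out.

```python
import pandas as pd

def predictor_insight_table(df, predictor, target):
    """Group size and target incidence (mean of a 0/1 target) per predictor value."""
    return (
        df.groupby(predictor, observed=True)[target]
          .agg(size="count", incidence="mean")
          .reset_index()
    )

# Made-up rows (illustration only)
demo = pd.DataFrame({
    "diag_1": ["Diabetes", "Diabetes", "Other", "Other", "Other", "Other"],
    "readmitted": [1, 1, 1, 0, 0, 0],
})

pig = predictor_insight_table(demo, "diag_1", "readmitted")
print(pig)
```

Showing the size next to the incidence matters because a high incidence in a very small group is much weaker evidence than the same incidence in a large one.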

../_images/1efc59eabe84172b9c0c418334e7f0f8eb3cc31aea5b4a734beda7379155a69b.png

The graph shows that the incidence in the target variable is highest for diabetes as the primary diagnosis.

However, is this difference statistically significant? I will conduct a statistical test to answer this question.

Let’s calculate the distributions of the readmission rate for:

Patients with diabetes as primary diagnosis.
Patients without diabetes as primary diagnosis.

Readmission rate means:
0.536 <- with diabetes as primary diagnosis
0.465 <- without diabetes as primary diagnosis

../_images/a4fb6fc376d8c08e33f4ad474a611a811dfc454b14894b92102a22188c6ee9ab.png

The two distributions do not overlap, which suggests that the difference is statistically significant.

Nevertheless, let’s perform the hypothesis test. I will assume the null hypothesis that there is no difference between the two groups. To test this, I will shift the distribution values to equalize their means, and then determine how often we could obtain purely by chance the difference in mean values that we are currently observing.

p-value = 0.0

Out of 10,000 bootstrap replicates generated under the null hypothesis, not a single one showed a difference in means as large as the one observed, so the estimated p-value is below 1/10,000. Therefore, we reject the null hypothesis and conclude that the difference is statistically significant. This means that having diabetes as a primary diagnosis is associated with a higher rate of readmission.
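The shifted-means bootstrap test described above can be sketched as follows. This is an illustration under assumptions, not the notebook's code: the function name, the one-sided comparison, and the synthetic 0/1 samples (with rates chosen near the reported ones) are all made up for the example.

```python
import numpy as np

def bootstrap_diff_means_pvalue(a, b, n_reps=10_000, seed=0):
    """One-sided bootstrap p-value for mean(a) - mean(b) under H0: equal means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()

    # Shift both samples so that they share the pooled mean (this enforces H0)
    pooled_mean = np.concatenate([a, b]).mean()
    a_shifted = a - a.mean() + pooled_mean
    b_shifted = b - b.mean() + pooled_mean

    # Resample with replacement and record the difference in means each time
    diffs = np.empty(n_reps)
    for i in range(n_reps):
        diffs[i] = (rng.choice(a_shifted, size=a.size).mean()
                    - rng.choice(b_shifted, size=b.size).mean())

    # Fraction of replicates at least as large as the observed difference
    return observed, np.mean(diffs >= observed)

# Synthetic 0/1 outcomes (illustration only, not the hospital data)
rng = np.random.default_rng(1)
with_diabetes = rng.binomial(1, 0.54, size=1500)
without_diabetes = rng.binomial(1, 0.46, size=5000)

observed, p_value = bootstrap_diff_means_pvalue(with_diabetes, without_diabetes)
print(f"observed difference = {observed:.3f}, p-value = {p_value}")
```

A reported p-value of 0.0 from such a procedure means that no replicate reached the observed difference, so the true p-value is smaller than one over the number of replicates rather than exactly zero.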

Would the effect be the same for these two groups?

Patients with diabetes as the primary, secondary, or tertiary diagnosis.
Patients without diabetes in any of their diagnoses.

Readmission rate means:
0.465 <- with diabetes in any diagnosis
0.473 <- without diabetes in any diagnosis

../_images/1939590adacb8e1eb4581bdf558a957a7c1a52b944b43c4159489e3f2fafe29c.png

In this case, when considering secondary and tertiary diagnoses, the overlapping of the distributions suggests that there is no significant difference in readmission rates between patients with and without diabetes.

Q3: Identifying patients with a high probability of readmission#

I am going to identify the variables that have the greatest impact on the readmission rate. I will use a simple logistic regression model for the sake of interpretability.

Data preprocessing#

This process consists of:

Separating variables (features) and target.
Converting categorical variables to numerical (avoiding multicollinearity).
Splitting the data into training and testing sets.
Scaling the data (so that the Logistic Regression coefficients are comparable across variables).
Reconstructing complete basetables (features + target) to perform predictive analysis.
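The steps above can be sketched with pandas and scikit-learn. The sample data, the 70/30 split, the stratification, and the random seed below are assumptions made for illustration; the notebook's actual settings are not shown in this report.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up sample standing in for the cleaned dataset (illustration only)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "n_inpatient": rng.integers(0, 5, n),
    "time_in_hospital": rng.integers(1, 15, n),
    "diabetes_med": rng.choice(["yes", "no"], n),
    "readmitted": rng.integers(0, 2, n),
})

# 1. Separate features and target
X = df.drop(columns="readmitted")
y = df["readmitted"]

# 2. One-hot encode categoricals; drop_first avoids perfectly collinear dummies
X = pd.get_dummies(X, drop_first=True, dtype=int)

# 3. Train/test split (assumed 70/30, stratified on the target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# 4. Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_s = pd.DataFrame(scaler.fit_transform(X_train),
                         columns=X.columns, index=X_train.index)
X_test_s = pd.DataFrame(scaler.transform(X_test),
                        columns=X.columns, index=X_test.index)

# 5. Rebuild the basetables (features + target)
train = pd.concat([X_train_s, y_train], axis=1)
test = pd.concat([X_test_s, y_test], axis=1)
print(train.head())
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing, in line with the data-leakage concern raised in the variable selection step.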

Variable selection#

Once we have the train and test basetables ready, we can proceed with the process of selecting the variables that have the highest predictive power.

To do so, I will use a forward stepwise variable selection procedure, in which AUC scores are considered as a metric. Variables will be sorted according to the predictive power achieved if we include them progressively in a Logistic Regression model. The process will be carried out only in the training basetable to avoid data leakage.

Step 1: variable 'n_inpatient' added
Step 2: variable 'n_outpatient' added
Step 3: variable 'diabetes_med_yes' added
Step 4: variable 'time_in_hospital' added
Step 5: variable 'n_emergency' added
Step 6: variable 'age_[80-90)' added
Step 7: variable 'age_[70-80)' added
Step 8: variable 'medical_specialty_Surgery' added
Step 9: variable 'medical_specialty_Other' added
Step 10: variable 'medical_specialty_InternalMedicine' added
Step 11: variable 'n_procedures' added
Step 12: variable 'diag_1_Diabetes' added
Step 13: variable 'A1Ctest_normal' added
Step 14: variable 'diag_2_Injury' added
Step 15: variable 'n_medications' added
Step 16: variable 'diag_3_Other' added
Step 17: variable 'n_lab_procedures' added
Step 18: variable 'age_[60-70)' added
Step 19: variable 'diag_1_Other' added
Step 20: variable 'diag_1_Injury' added
Step 21: variable 'diag_1_Musculoskeletal' added
Step 22: variable 'diag_1_Respiratory' added
Step 23: variable 'medical_specialty_Emergency/Trauma' added
Step 24: variable 'A1Ctest_no' added
Step 25: variable 'glucose_test_no' added
Step 26: variable 'diag_2_Respiratory' added
Step 27: variable 'diag_2_Digestive' added
Step 28: variable 'diag_3_Respiratory' added
Step 29: variable 'diag_3_Diabetes' added
Step 30: variable 'diag_3_Injury' added
Step 31: variable 'diag_3_Musculoskeletal' added
Step 32: variable 'diag_2_Musculoskeletal' added
Step 33: variable 'age_[50-60)' added
Step 34: variable 'diag_3_Digestive' added
Step 35: variable 'medical_specialty_Missing' added
Step 36: variable 'medical_specialty_Family/GeneralPractice' added
Step 37: variable 'diag_2_Diabetes' added
Step 38: variable 'change_yes' added
Step 39: variable 'glucose_test_normal' added
Step 40: variable 'diag_1_Digestive' added
Step 41: variable 'age_[90-100)' added
Step 42: variable 'diag_2_Other' added
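A forward stepwise selection of the kind that produced the list above can be sketched as a greedy loop: at each step, add the candidate that gives the highest AUC together with the variables already chosen. The helper names `auc_for` and `forward_stepwise`, and the synthetic data, are hypothetical; the notebook's own helper functions are not shown in this report.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def auc_for(variables, target, basetable):
    """Fit a logistic regression on `variables` and return its AUC on `basetable`."""
    X, y = basetable[variables], basetable[target]
    model = LogisticRegression(max_iter=1000).fit(X, y)
    return roc_auc_score(y, model.predict_proba(X)[:, 1])

def forward_stepwise(candidates, target, basetable):
    """Order `candidates` by greedily adding the variable that maximizes AUC."""
    selected, remaining = [], list(candidates)
    while remaining:
        best = max(remaining,
                   key=lambda v: auc_for(selected + [v], target, basetable))
        selected.append(best)
        remaining.remove(best)
        print(f"Step {len(selected)}: variable '{best}' added")
    return selected

# Synthetic training basetable (illustration only)
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
logit = 2.0 * demo["strong"] + 0.5 * demo["weak"]
demo["readmitted"] = rng.binomial(1, 1 / (1 + np.exp(-logit.to_numpy())))

order = forward_stepwise(["noise", "weak", "strong"], "readmitted", demo)
```

In this sketch the AUC is computed on the same basetable that is passed in, so passing only the training basetable mirrors the report's choice to keep the test set out of the selection.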

We will now visualize the performance evolution as variables are included in the model in the order defined by the list. We will consider both the train and test basetables to check the validity of the results.

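import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# `current_variables` (the ordered output of the stepwise selection), `train`, `test`
# and the helper `auc_train_test` are assumed to be defined in earlier notebook cells.

# Init lists
auc_values_train = []
auc_values_test = []
variables_evaluate = []

# Iterate over the variables in the order given by the stepwise selection
for v in current_variables: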
  
    # Add the variable
    variables_evaluate.append(v)
    
    # Calculate the train and test AUC of this set of variables
    auc_train, auc_test = auc_train_test(variables_evaluate, ["readmitted"], train, test)
    
    # Append the values to the lists
    auc_values_train.append(auc_train)
    auc_values_test.append(auc_test)

# Create dataframe to plot results
aucs = pd.concat([pd.DataFrame(np.array(auc_values_train),
                               columns=['Train'],
                               index=current_variables),
                  pd.DataFrame(np.array(auc_values_test),
                               columns=['Test'],
                               index=current_variables)],
                 axis=1)

# Plot
fig, ax = plt.subplots(figsize=(7, 12))

ax.plot(aucs['Train'], aucs.index, label='Train')
ax.plot(aucs['Test'], aucs.index, label='Test')

sns.despine()
ax.grid(axis="both")
ax.set_axisbelow(True)

ax.set_title('', fontsize=14)
ax.set_xlabel('AUC performance score', fontsize=14)
ax.set_ylabel("", fontsize=14)

ax.tick_params(axis='x', labelsize=12, rotation=0)
ax.tick_params(axis='y', labelsize=12)

ax.legend(title='Data set', loc='upper right', title_fontsize=13, fontsize=13)

ax.annotate('',
            xy=(0.602, 19),
            xytext=(0.602, 0), fontsize=12,
            arrowprops={"arrowstyle":"-|>", "color":"black", 'linestyle':"--"})

ax.annotate('',
            xy=(0.651, 19), 
            xytext=(0.602, 19), fontsize=12,
            arrowprops={"arrowstyle":"-|>", "color":"black", 'linestyle':"--"})

ax.annotate("Stepwise\nvariable selection list", (0.603, 10), size=12)
ax.annotate("Cut-off:\nperformance improvement no longer significant", (0.605, 18.5), size=12)

ax.invert_yaxis()

plt.show()

# Selected variables
n_variables = 19
selected_variables = current_variables[:n_variables]

../_images/5ba338931e8cdbdb25f8961618db4f27c989c9435ad0c077b966fb2ea73674f1.png

After conducting the forward stepwise variable selection procedure, a total of 19 variables were selected based on their predictive power. To ensure that we are not missing any important variables, I will compare the accuracy, precision, and recall scores of the Logistic Regression model when fitted with all variables and when fitted only with the 19 selected ones.

../_images/3eb9513c5feb05de5bc56b342fc1f3e117a2b3f41b4721b23105a47db2fe0f84.png

The model performance results are not impressive, especially regarding recall (sensitivity). Nevertheless, this comparison demonstrates that we are not losing any predictive information if we only consider the selected variables. Additionally, since this project aims to assess the most important predictors, we will not attempt to optimize the model results.

We will focus instead on the predictive power of the selected variables to try to gain insight about which factors contribute the most to readmissions in the hospital.

Coefficients of the Logistic Regression model tell us about the importance of each variable.

../_images/9b008270ab9590f0e223304363f65d728e3bc491c993d12ca7651bfed48d01a9.png

In the graph, we can see the values of the coefficients for each variable, sorted according to their absolute values (predictive power).
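Ranking coefficients by absolute value can be sketched as follows. The data is synthetic and already on a common scale; the column names are borrowed from the dataset only for readability, and the true effects are chosen arbitrarily for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic, already-scaled features (illustration only)
rng = np.random.default_rng(1)
n = 1000
X = pd.DataFrame({
    "n_inpatient": rng.normal(size=n),
    "n_procedures": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
logit = 1.5 * X["n_inpatient"] - 0.7 * X["n_procedures"]
y = rng.binomial(1, 1 / (1 + np.exp(-logit.to_numpy())))

# Fit the model and attach the variable names to the coefficients
model = LogisticRegression(max_iter=1000).fit(X, y)
coefs = pd.Series(model.coef_[0], index=X.columns)

# Sort by absolute value while keeping the sign of each coefficient
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Comparing coefficient magnitudes this way only makes sense when the features are on the same scale, which is why the scaling step in the preprocessing matters here. The sign indicates the direction of the association with readmission.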

However, the stepwise procedure ordered the variables by the progressive improvement they bring to the model performance. Both lists contain the same variables, but their order of importance differs.

../_images/d57000bfc7e8f0c9f611d85470cb86fae18825888a7acda3cd0643b1490bbb36.png

We can see that the first and second most important variables are the same in both lists, but the third most important coefficient (‘n_emergency’) does not come in the third position on the selected list. Instead, ‘diabetes_med_yes’ was selected, which, in principle, is a variable with less predictive power than ‘n_emergency’, as seen in the graph above.

The following graph illustrates these differences in the relative positions.

../_images/0fa8b9d7d4e17fb4eeceef719a148f086f71a6c2151fba5ad02d07ef26f5ccee.png

These differences in the order of importance are due to correlations between variables. I will print the correlation matrix to illustrate this.

../_images/3a954df87be293fdb4e35b348809585d88e0d5c06f716126cbb9d10126eb38e7.png

We can deduce that ‘n_emergency’ was not included in the third position in the selected list because it has a relatively high correlation with the first two included variables (‘n_inpatient’ and ‘n_outpatient’). Therefore, the stepwise algorithm correctly selects the next most powerful variable, ‘diabetes_med_yes’.
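The effect of correlated predictors can be illustrated with a small synthetic example: two visit counts that share a common component are strongly correlated with each other, so once one of them is in the model, the other adds little new information. The data below is made up; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic counts: two of them share a common component (illustration only)
rng = np.random.default_rng(2)
n = 500
shared = rng.poisson(1.0, n)
demo = pd.DataFrame({
    "n_inpatient": shared + rng.poisson(0.5, n),
    "n_emergency": shared + rng.poisson(0.5, n),
    "n_procedures": rng.poisson(1.5, n),
})

# Pearson correlation matrix of the numerical predictors
corr = demo.corr()
print(corr.round(2))
```

The resulting matrix can be passed to a heatmap function for display; the tabular form is already enough to spot pairs of predictors that carry overlapping information.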

The changes in the order of the remaining variables can be explained in a similar way, as they are also influenced by correlations with other variables.

Predictor Insight Graphs#

Finally, I will plot the Predictor Insight Graphs for the first 12 variables in the selected list, to gain an intuitive sense of their impact on the target.

../_images/f896874a9888ba702521e08f845bcac6a862913d058045d20072977e06c3def2.png

../_images/49e2673ef80b47bb35b5dc5ebe8cba85f07ed4247a7c1c3bd2fdf6963d5bd276.png

../_images/be90245a2287b71d5e8a79e31cb0a638c2a50ae1e8bfcd478edfb9cd5bb60329.png

../_images/aada2cea2da74dbe7dcd75b8119648fa7a550418e7e386c58a6d1148f3e37ec6.png

../_images/cf206dab307bce7b4c947a4e6899f4bcd8f8a13bea9bb4c6d653095c3297f8d9.png

../_images/c2e665c2fab56fc7d461bf7dbd54decaf3db774d940363e9cd00b33946ccc64b.png

../_images/f9571ad28487d8ab5ce6c1340d783c7ddea14d0c452d39c2275f8ee3185f010d.png

../_images/a8216093725cf9945f45402f8bc8c982276f37acc1bdd6be519aece57da4c83e.png

../_images/1999924284662558d752c9025411c70debf24f55a743520becb980d18d2c135f.png

../_images/81d64633b13a9d83a9831d950076210f43ff5d1d3415c60487a1fc5116ad404a.png

../_images/f67387e4ab3432d633d464e57f660f60f100c606ebc356deb67d6079e919b5d4.png

../_images/9b3a695ed91d064337a45829b4e8d774637602f78df17cc6406db709f6d9e8b8.png

We will conclude the analysis with this final Predictor Insight Graph, which corresponds to the incidence of diabetes as the primary diagnosis. This graph is the same as the one constructed while answering the second question addressed in this report, except that all non-diabetes diagnoses are aggregated. Looking at this graph, it may seem that the increase in incidence (the readmission rate) is not very large. However, as we learned earlier, it is indeed significant, considering the sizes and corresponding distributions.

Diabetes as the primary diagnosis plays a part, but this variable is in the 12th position in the list. Before it, there are other factors that have more influence on the readmission rate.

Conclusions#

The hospital should focus its follow-up efforts on monitoring these patients:

The most important predictor is n_inpatient, which represents the number of inpatient visits in the year before the hospital stay. The readmission rate increases significantly when patients have more inpatient visits prior to their hospital stay, so it is important to closely monitor these individuals. Similarly, n_outpatient (the number of outpatient visits in the year before a hospital stay) is also a significant predictor, ranking second in importance.
In addition to the aforementioned predictors, it is important to look out for patients with a diabetes medication prescribed (diabetes_med_yes) as this variable also has significant predictive power for readmission rates.
Furthermore, the length of stay in the hospital upon admission (time_in_hospital) also plays a crucial role. The longer a patient stays in the hospital (from 1 to 14 days), the higher the probability of readmission.
The number of visits to the emergency room in the year before the hospital stay (n_emergency) should also be taken into account. Even a single visit to the emergency room indicates a higher probability of readmission.
Age, of course, is also an important factor. Patients between the ages of 70-80 and 80-90 are more likely to be readmitted to the hospital.
Some medical specialties are negatively correlated with readmission. These include Surgery, Internal Medicine, and Other. Therefore, patients admitted by physicians in these specialties have a lower probability of being readmitted.
The n_procedures (number of procedures performed during the hospital stay) is also negatively correlated.
As doctors’ intuition correctly anticipated, having Diabetes as a primary diagnosis (diag_1_Diabetes) also has a significant effect on readmission.

The remaining selected variables may also be worth a look. The number of medications (n_medications), the number of laboratory procedures (n_lab_procedures), and the 60-70 age group are positively correlated with readmission. On the other hand, A1Ctest_normal, diag_2_Injury, diag_3_Other, and diag_1_Other are negatively correlated with it.