Predicting hotel cancellations#
A DataCamp challenge, May 2023
Predictive analytics
The project#
You are supporting a hotel with a project aimed at increasing revenue from its room bookings. The hotel believes that data science can help reduce the number of cancellations. This is where you come in!
They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.
Produce recommendations for the hotel on what factors affect whether customers cancel their booking.
They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:

| Column | Description |
|---|---|
| Booking_ID | Unique identifier of the booking. |
| no_of_adults | The number of adults. |
| no_of_children | The number of children. |
| no_of_weekend_nights | Number of weekend nights (Saturday or Sunday). |
| no_of_week_nights | Number of week nights (Monday to Friday). |
| type_of_meal_plan | Type of meal plan included in the booking. |
| required_car_parking_space | Whether a car parking space is required. |
| room_type_reserved | The type of room reserved. |
| lead_time | Number of days before the arrival date the booking was made. |
| arrival_year | Year of arrival. |
| arrival_month | Month of arrival. |
| arrival_date | Date of the month for arrival. |
| market_segment_type | How the booking was made. |
| repeated_guest | Whether the guest has previously stayed at the hotel. |
| no_of_previous_cancellations | Number of previous cancellations. |
| no_of_previous_bookings_not_canceled | Number of previous bookings that were not canceled. |
| avg_price_per_room | Average price per day of the booking. |
| no_of_special_requests | Count of special requests made as part of the booking. |
| booking_status | Whether the booking was cancelled or not. |
Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset
Data validation#
Read the data#
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
import missingno as msno
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, recall_score, precision_score
from scipy.stats.mstats import winsorize
# Read the data from file
hotels = pd.read_csv("data/hotel_bookings.csv")
print(hotels)
Booking_ID no_of_adults no_of_children no_of_weekend_nights \
0 INN00001 NaN NaN NaN
1 INN00002 2.0 0.0 2.0
2 INN00003 1.0 0.0 2.0
3 INN00004 2.0 0.0 0.0
4 INN00005 2.0 0.0 1.0
... ... ... ... ...
36270 INN36271 3.0 0.0 2.0
36271 INN36272 2.0 0.0 1.0
36272 INN36273 2.0 0.0 2.0
36273 INN36274 2.0 0.0 0.0
36274 INN36275 2.0 0.0 1.0
no_of_week_nights type_of_meal_plan required_car_parking_space \
0 NaN NaN NaN
1 3.0 Not Selected 0.0
2 1.0 Meal Plan 1 0.0
3 2.0 Meal Plan 1 0.0
4 1.0 Not Selected 0.0
... ... ... ...
36270 NaN Meal Plan 1 0.0
36271 3.0 Meal Plan 1 0.0
36272 6.0 Meal Plan 1 0.0
36273 3.0 Not Selected 0.0
36274 2.0 Meal Plan 1 NaN
room_type_reserved lead_time arrival_year arrival_month \
0 NaN NaN NaN NaN
1 Room_Type 1 5.0 2018.0 11.0
2 Room_Type 1 1.0 2018.0 2.0
3 Room_Type 1 211.0 2018.0 5.0
4 Room_Type 1 48.0 2018.0 4.0
... ... ... ... ...
36270 NaN 85.0 2018.0 8.0
36271 Room_Type 1 228.0 2018.0 10.0
36272 Room_Type 1 148.0 2018.0 7.0
36273 Room_Type 1 63.0 2018.0 4.0
36274 Room_Type 1 207.0 2018.0 12.0
arrival_date market_segment_type repeated_guest \
0 NaN NaN NaN
1 6.0 Online 0.0
2 28.0 Online 0.0
3 20.0 Online 0.0
4 11.0 Online 0.0
... ... ... ...
36270 3.0 Online NaN
36271 17.0 Online 0.0
36272 1.0 Online 0.0
36273 21.0 Online 0.0
36274 30.0 Offline 0.0
no_of_previous_cancellations no_of_previous_bookings_not_canceled \
0 NaN NaN
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
... ... ...
36270 0.0 0.0
36271 0.0 0.0
36272 0.0 0.0
36273 0.0 0.0
36274 0.0 0.0
avg_price_per_room no_of_special_requests booking_status
0 NaN NaN Not_Canceled
1 106.68 1.0 Not_Canceled
2 60.00 0.0 Canceled
3 100.00 0.0 Canceled
4 94.50 0.0 Canceled
... ... ... ...
36270 167.80 1.0 Not_Canceled
36271 90.95 2.0 Canceled
36272 98.39 2.0 Not_Canceled
36273 94.50 0.0 Canceled
36274 161.67 0.0 Not_Canceled
[36275 rows x 19 columns]
Check data integrity#
# Store initial shape of the dataframe
hotels_init_shape = hotels.shape
# Inspect the dataframe
hotels.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Booking_ID 36275 non-null object
1 no_of_adults 35862 non-null float64
2 no_of_children 35951 non-null float64
3 no_of_weekend_nights 35908 non-null float64
4 no_of_week_nights 35468 non-null float64
5 type_of_meal_plan 35749 non-null object
6 required_car_parking_space 33683 non-null float64
7 room_type_reserved 35104 non-null object
8 lead_time 35803 non-null float64
9 arrival_year 35897 non-null float64
10 arrival_month 35771 non-null float64
11 arrival_date 35294 non-null float64
12 market_segment_type 34763 non-null object
13 repeated_guest 35689 non-null float64
14 no_of_previous_cancellations 35778 non-null float64
15 no_of_previous_bookings_not_canceled 35725 non-null float64
16 avg_price_per_room 35815 non-null float64
17 no_of_special_requests 35486 non-null float64
18 booking_status 36275 non-null object
dtypes: float64(14), object(5)
memory usage: 5.3+ MB
Duplicates#
I will search for duplicate rows while excluding the ‘Booking_ID’ column, which serves as a unique identifier for each record.
# Complete duplicated rows
print(f'duplicate rows -> {hotels.duplicated().sum()}')
# Check for duplicates excluding unique identifier column
n_duplicates = hotels.duplicated(subset=hotels.columns[1:]).sum()
print(f'duplicate rows (excluding ID) -> {n_duplicates}')
duplicate rows -> 0
duplicate rows (excluding ID) -> 7445
There is a considerable number of duplicates relative to the size of the data set.
# Print percentage of duplicates
print(f'{100 * n_duplicates / hotels_init_shape[0]:.0f} % of duplicates')
21 % of duplicates
Since these duplicates have distinct Booking_ID identifiers, it is worth considering whether they are data errors or genuine bookings that happen to share identical values. Given the large number of duplicates, it is unlikely (though not impossible) that so many distinct bookings repeat every single value, so the cause would merit further investigation with the hotel.
In any case, after some exploratory analysis I could not identify any pattern in the duplicate records. Therefore, to prevent bias or overfitting issues in the training and test data sets, I will remove all duplicate records and move on.
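For reference, below is the kind of quick check I ran: a sketch comparing the target distribution inside and outside the duplicate group.
# Flag rows duplicated on all columns except the ID
dup_mask = hotels.duplicated(subset=hotels.columns[1:], keep=False)
# Compare the target distribution inside and outside the duplicate group
print(hotels.loc[dup_mask, 'booking_status'].value_counts(normalize=True))
print(hotels.loc[~dup_mask, 'booking_status'].value_counts(normalize=True))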
# Drop duplicated rows keeping only the last
hotels.drop_duplicates(subset=hotels.columns[1:], keep='last', inplace=True)
# Check again for duplicates using subset of columns
print(f'duplicated rows (excluding ID) -> {hotels.duplicated(subset=hotels.columns[1:]).sum()}')
print(f'actual rows of dataframe -> {hotels.shape[0]}')
duplicated rows (excluding ID) -> 0
actual rows of dataframe -> 28830
Missing values#
# Check for missing values
hotels.isna().sum().sort_values(ascending=False)
required_car_parking_space 2465
market_segment_type 1486
room_type_reserved 1137
arrival_date 946
no_of_week_nights 788
no_of_special_requests 782
repeated_guest 575
no_of_previous_bookings_not_canceled 547
type_of_meal_plan 519
arrival_month 503
no_of_previous_cancellations 492
lead_time 470
avg_price_per_room 452
no_of_adults 408
arrival_year 373
no_of_weekend_nights 367
no_of_children 318
Booking_ID 0
booking_status 0
dtype: int64
There are quite a number of missing values. Let’s take a look at the missingness matrix.
# Visualize missingness matrix
msno.matrix(hotels)
plt.show()

Before addressing individual columns, I will start by checking rows with multiple missing values.
# Rows with multiple missing values
hotels.isna().sum(axis=1).sort_values(ascending=False).head(10)
0 17
22176 9
33264 8
11088 8
27720 8
16632 7
5544 7
7392 7
29568 7
15840 7
dtype: int64
I will drop the first row, because all of its feature values are missing.
# Drop the first row
hotels.drop(0, inplace=True)
I will now check missing values column by column, deciding how to deal with them in each case.
Let’s start with the column with the most missing values: required_car_parking_space.
# Look at the unique values
list(hotels['required_car_parking_space'].unique())
[0.0, nan, 1.0]
In this case it makes sense to assign missing values to '0' (no car parking space required).
# Fill missing values
hotels["required_car_parking_space"].fillna(0, inplace=True)
The next column with many missing values is market_segment_type (how the booking was made).
# Look at the unique values
list(hotels['market_segment_type'].unique())
['Online', 'Offline', nan, 'Aviation', 'Complementary', 'Corporate']
I could create a new category for the missing values, but first I am going to sort the data frame by this column to see whether there are structural patterns related to them.
# Visualize missingness matrix
msno.matrix(hotels.sort_values('market_segment_type'))
plt.show()

In the matrix, we can see that the missing values in the market_segment_type column occur in the same rows as all of the missing values in arrival_year and arrival_month. Because of that, I have decided to remove those rows to get rid of these missing values.
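This visual pattern can also be cross-checked numerically; a quick sketch:
# Among rows missing market_segment_type, check what fraction
# also lack arrival_year and arrival_month
mask = hotels['market_segment_type'].isna()
print(hotels.loc[mask, ['arrival_year', 'arrival_month']].isna().mean())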
# Drop rows with missing values in column
hotels.dropna(subset='market_segment_type', inplace=True)
Let’s take a look at the missing values in room_type_reserved.
# Look at the unique values
list(hotels['room_type_reserved'].unique())
['Room_Type 1',
'Room_Type 4',
nan,
'Room_Type 2',
'Room_Type 6',
'Room_Type 7',
'Room_Type 5',
'Room_Type 3']
# Visualize missingness matrix
msno.matrix(hotels.sort_values('room_type_reserved'))
plt.show()

As in the previous case, I am not going to create a new category for the missing values in room_type_reserved, because they come along with missing values in repeated_guest, so I will drop all of those rows.
# Drop rows with missing values in column
hotels.dropna(subset='room_type_reserved', inplace=True)
Let’s take a look at the column with the next-highest number of missing values, arrival_date.
# Look at the unique values
hotels['arrival_date'].unique()
array([ 6., 11., 15., 18., 30., 26., 20., 5., 10., 28., 19., 7., 9.,
27., nan, 1., 21., 29., 16., 13., 2., 3., 25., 14., 4., 17.,
22., 23., 31., 8., 12., 24.])
# Visualize missingness matrix
msno.matrix(hotels.sort_values('arrival_date'))
plt.show()

Missing values in arrival_date do not come along with any significant number of missing values in other columns of the same rows. After inspecting the data frame, I cannot see any pattern for these rows in any other column, so I will simply drop them.
# Drop rows with missing values in column
hotels.dropna(subset='arrival_date', inplace=True)
Let’s take a look at the next column with the most missing values: no_of_week_nights. These values could be related to the no_of_weekend_nights column, so let’s consider the two together.
# Look at the unique values
print(f"Week nights -> {hotels['no_of_week_nights'].unique()}")
print(f"Weekend nights -> {hotels['no_of_weekend_nights'].unique()}")
Week nights -> [ 3. 1. 4. 5. 0. 2. nan 10. 6. 11. 7. 15. 9. 13. 8. 14. 12. 17.
16.]
Weekend nights -> [ 2. 1. 0. nan 4. 3. 6. 5.]
I will follow these criteria to resolve missing values in the two columns:
If both are missing, I will drop that row.
If one of them is missing and the value of the other one is 0, then I will drop that row.
If one of them is missing and the value of the other one is not 0, then I will assign 0.
# Drop rows with missing values in both column
hotels.dropna(subset=['no_of_week_nights', 'no_of_weekend_nights'], how='all', inplace=True)
# Drop if missing value in one column and 0 in the other
hotels.drop(hotels[(hotels['no_of_week_nights'].isna())\
& (hotels['no_of_weekend_nights'] == 0)].index, inplace=True)
hotels.drop(hotels[(hotels['no_of_weekend_nights'].isna())\
& (hotels['no_of_week_nights'] == 0)].index, inplace=True)
# Assign value 0 if missing value but the other column has a non-zero value
hotels.loc[(hotels['no_of_week_nights'].isna())\
& (hotels['no_of_weekend_nights'] != 0),
'no_of_week_nights'] = 0
hotels.loc[(hotels['no_of_week_nights'] != 0)\
& (hotels['no_of_weekend_nights'].isna()),
'no_of_weekend_nights'] = 0
Let’s take a look at the next one: no_of_special_requests.
# Look at the unique values
hotels['no_of_special_requests'].unique()
array([ 1., 0., 3., 2., nan, 4., 5.])
It makes sense to assign ‘0’ to missing values in this column.
# Fill missing values in column
hotels['no_of_special_requests'].fillna(0, inplace=True)
Let’s take a look at type_of_meal_plan.
# Look at the unique values
hotels['type_of_meal_plan'].unique()
array(['Not Selected', 'Meal Plan 1', nan, 'Meal Plan 2', 'Meal Plan 3'],
dtype=object)
I will assign missing values to the ‘Not Selected’ category.
# Fill missing values in column
hotels['type_of_meal_plan'].fillna('Not Selected', inplace=True)
I will directly drop the rows with missing values in lead_time, and likewise those with missing values in avg_price_per_room.
# Drop rows with missing values in column
hotels.dropna(subset='lead_time', inplace=True)
# Drop rows with missing values in column
hotels.dropna(subset='avg_price_per_room', inplace=True)
Let’s now take a look at the no_of_adults and no_of_children columns. I will use the following criteria:
I will drop rows with missing values in no_of_adults.
I will assign ‘0’ to missing no_of_children if no_of_adults is not ‘0’.
# Drop rows with missing values in column
hotels.dropna(subset='no_of_adults', inplace=True)
# Assign value 0 if missing value but the other column has a non-zero value
hotels.loc[(hotels['no_of_children'].isna()) & (hotels['no_of_adults'] != 0),
'no_of_children'] = 0
# Drop any remaining rows with missing values in the column
hotels.dropna(subset='no_of_children', inplace=True)
Let’s take a look at (finally!) the last features with missing values: no_of_previous_cancellations and no_of_previous_bookings_not_canceled.
# Look at the unique values
hotels['no_of_previous_cancellations'].unique()
array([ 0., nan, 3., 1., 2., 11., 4., 5., 6., 13.])
# Look at the unique values
hotels['no_of_previous_bookings_not_canceled'].unique()
array([ 0., nan, 5., 1., 3., 4., 12., 19., 2., 15., 17., 7., 20.,
16., 50., 13., 6., 14., 34., 18., 10., 23., 11., 8., 49., 47.,
53., 9., 33., 24., 52., 22., 21., 48., 28., 39., 25., 31., 38.,
51., 42., 37., 35., 56., 44., 27., 32., 55., 26., 45., 30., 57.,
46., 54., 43., 58., 41., 29., 40., 36.])
I will directly drop rows with missing values in either column.
# Drop rows with missing values in column
hotels.dropna(subset=['no_of_previous_cancellations',
'no_of_previous_bookings_not_canceled'], how='any', inplace=True)
Finally, let’s confirm that we have removed all missing values.
# Visualize missingness matrix
msno.matrix(hotels)
plt.show()

# Check for missing values
hotels.isna().sum()
Booking_ID 0
no_of_adults 0
no_of_children 0
no_of_weekend_nights 0
no_of_week_nights 0
type_of_meal_plan 0
required_car_parking_space 0
room_type_reserved 0
lead_time 0
arrival_year 0
arrival_month 0
arrival_date 0
market_segment_type 0
repeated_guest 0
no_of_previous_cancellations 0
no_of_previous_bookings_not_canceled 0
avg_price_per_room 0
no_of_special_requests 0
booking_status 0
dtype: int64
All clean now!
After cleaning the data of duplicates and missing values, the data frame has been reduced to:
# Print actual data frame size
print(f'Actual rows -> {hotels.shape[0]}')
print(f'{100 * hotels.shape[0] / hotels_init_shape[0]:.0f} % of the initial rows')
Actual rows -> 23294
64 % of the initial rows
Data consistency#
I am going to check date consistency by creating a new column in datetime format. pd.to_datetime will attempt to combine the year, month, and day into a valid date; if the conversion fails, the result is coerced to a missing value.
# Create new column 'date', coerce errors to detect date inconsistencies
hotels['date'] = pd.to_datetime(dict(year=hotels['arrival_year'],
month=hotels['arrival_month'],
day=hotels['arrival_date']),
errors='coerce')
# Sum date inconsistencies (coerced errors are set to 'nan')
hotels['date'].isna().sum()
31
Let’s find out what these date inconsistencies are about.
# Print date errors
print(hotels.loc[hotels['date'].isna(), ['arrival_year', 'arrival_month', 'arrival_date']])
arrival_year arrival_month arrival_date
2626 2018.0 2.0 29.0
3677 2018.0 2.0 29.0
5600 2018.0 2.0 29.0
7648 2018.0 2.0 29.0
8000 2018.0 2.0 29.0
9153 2018.0 2.0 29.0
9245 2018.0 2.0 29.0
9664 2018.0 2.0 29.0
9934 2018.0 2.0 29.0
10593 2018.0 2.0 29.0
10652 2018.0 2.0 29.0
10747 2018.0 2.0 29.0
11881 2018.0 2.0 29.0
13958 2018.0 2.0 29.0
15363 2018.0 2.0 29.0
17202 2018.0 2.0 29.0
18380 2018.0 2.0 29.0
18534 2018.0 2.0 29.0
18680 2018.0 2.0 29.0
19013 2018.0 2.0 29.0
20419 2018.0 2.0 29.0
21674 2018.0 2.0 29.0
21688 2018.0 2.0 29.0
26108 2018.0 2.0 29.0
27928 2018.0 2.0 29.0
30616 2018.0 2.0 29.0
30632 2018.0 2.0 29.0
30839 2018.0 2.0 29.0
32041 2018.0 2.0 29.0
34638 2018.0 2.0 29.0
35481 2018.0 2.0 29.0
The year 2018 was not a leap year, so these entries with the date ‘2018-02-29’ are invalid. I will drop them.
# Drop rows with the nonexistent date 2018-02-29
hotels.drop(hotels[(hotels['arrival_year'] == 2018) \
& (hotels['arrival_month'] == 2) \
& (hotels['arrival_date'] == 29)].index, inplace=True)
# Drop no longer necessary auxiliary column 'date'
hotels.drop('date', axis=1, inplace=True)
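As a sanity check, the leap-year claim can be verified with the calendar module imported at the top of the notebook:
# Confirm that 2018 was not a leap year
print(calendar.isleap(2018))  # False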
Check categorical variables#
Let’s see whether the variables of type ‘object’ (strings) in fact contain categories.
# Select column names of object type, excluding ID column
object_cols = hotels.select_dtypes(include="object").columns[1:]
# Plot their values and counts
fig, ax = plt.subplots(len(object_cols) // 2, 2, figsize=(10, 7))
for i, col in enumerate(object_cols):
    x, y = divmod(i, 2)
    sns.despine()
    hotels[col].value_counts().plot(ax=ax[x, y], kind='bar')
    ax[x, y].set_title(col)
fig.tight_layout()
plt.show()

The target variable, booking_status, is quite imbalanced, with the class of interest, ‘Canceled’, less represented than the other class. I am going to replace the values in the target variable from ‘Not_Canceled’/‘Canceled’ to numerical 0/1 right away (without waiting for the creation of dummies), because it will facilitate some early analysis.
# Replace to numerical values
hotels['booking_status'] = hotels['booking_status'].replace({'Not_Canceled': 0, 'Canceled': 1})\
.astype('int')
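To quantify the imbalance, a quick look at the class proportions (1 = cancelled):
# Proportion of each class in the target variable
print(hotels['booking_status'].value_counts(normalize=True))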
All the remaining ‘object’ columns can be converted to categorical.
# Create a dictionary of column and data type mappings
conversion_dict = {k: "category" for k in hotels.select_dtypes(include="object").columns[1:]}
# Convert our DataFrame and check the data types
hotels = hotels.astype(conversion_dict)
hotels.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23263 entries, 1 to 36274
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Booking_ID 23263 non-null object
1 no_of_adults 23263 non-null float64
2 no_of_children 23263 non-null float64
3 no_of_weekend_nights 23263 non-null float64
4 no_of_week_nights 23263 non-null float64
5 type_of_meal_plan 23263 non-null category
6 required_car_parking_space 23263 non-null float64
7 room_type_reserved 23263 non-null category
8 lead_time 23263 non-null float64
9 arrival_year 23263 non-null float64
10 arrival_month 23263 non-null float64
11 arrival_date 23263 non-null float64
12 market_segment_type 23263 non-null category
13 repeated_guest 23263 non-null float64
14 no_of_previous_cancellations 23263 non-null float64
15 no_of_previous_bookings_not_canceled 23263 non-null float64
16 avg_price_per_room 23263 non-null float64
17 no_of_special_requests 23263 non-null float64
18 booking_status 23263 non-null int32
dtypes: category(3), float64(14), int32(1), object(1)
memory usage: 3.0+ MB
Check numerical variables#
The numerical variables are all integer-valued, except for avg_price_per_room, which I will leave as a float.
# Create a dictionary of column and data type mappings
conversion_dict = {k: 'int' for k in hotels.select_dtypes(include='float64').columns}
# Remove element to maintain as float
del conversion_dict['avg_price_per_room']
# Convert our DataFrame and check the data types
hotels = hotels.astype(conversion_dict)
hotels.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23263 entries, 1 to 36274
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Booking_ID 23263 non-null object
1 no_of_adults 23263 non-null int32
2 no_of_children 23263 non-null int32
3 no_of_weekend_nights 23263 non-null int32
4 no_of_week_nights 23263 non-null int32
5 type_of_meal_plan 23263 non-null category
6 required_car_parking_space 23263 non-null int32
7 room_type_reserved 23263 non-null category
8 lead_time 23263 non-null int32
9 arrival_year 23263 non-null int32
10 arrival_month 23263 non-null int32
11 arrival_date 23263 non-null int32
12 market_segment_type 23263 non-null category
13 repeated_guest 23263 non-null int32
14 no_of_previous_cancellations 23263 non-null int32
15 no_of_previous_bookings_not_canceled 23263 non-null int32
16 avg_price_per_room 23263 non-null float64
17 no_of_special_requests 23263 non-null int32
18 booking_status 23263 non-null int32
dtypes: category(3), float64(1), int32(14), object(1)
memory usage: 1.8+ MB
Let’s plot numerical data ranges.
# Plot
fig, ax = plt.subplots(figsize=(7, 5))
hotels[hotels.select_dtypes(include=['int', 'float']).columns].plot(ax=ax, kind='box')
sns.despine()
ax.grid(axis="y")
ax.set_axisbelow(True)
ax.set_ylabel('Values', fontsize=14)
ax.tick_params(axis='x', labelsize=13, rotation=90)
ax.tick_params(axis='y', labelsize=12)
plt.show()

We can see that arrival_year has data out of range, simply because we are dealing with year numbers. In reality, this variable should be treated as categorical rather than numerical: it only takes two year values.
# Print unique values
print(list(hotels['arrival_year'].unique()))
# Convert to 'category'
hotels['arrival_year'] = hotels['arrival_year'].astype('category')
[2018, 2017]
# Plot
fig, ax = plt.subplots(figsize=(7, 5))
hotels[hotels.select_dtypes(include=['int', 'float']).columns].plot(ax=ax, kind='box')
sns.despine()
ax.grid(axis="y")
ax.set_axisbelow(True)
ax.set_ylabel('Values', fontsize=14)
ax.tick_params(axis='x', labelsize=13, rotation=90)
ax.tick_params(axis='y', labelsize=12)
plt.show()

To reduce the impact of outliers on the model outcome, I will winsorize the numerical variables. Winsorization caps extreme values at lower and upper limits based on percentiles; I will use a 5% limit for both the upper and lower bounds.
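To illustrate, here is a minimal toy example of winsorize, independent of the hotel data:
# With 10% limits on a 10-element array, the single lowest and highest
# values are clipped to their nearest remaining neighbours
toy = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(winsorize(toy, limits=[0.1, 0.1]))  # [2 2 3 4 5 6 7 8 9 9]
With that in mind, let’s apply it to the numerical columns.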
# Filter outliers with winsorization
limit = 0.05
for col in hotels.select_dtypes(include=['int', 'float']).columns:
    hotels[col] = winsorize(hotels[col], limits=[limit, limit])
# Plot resulting box plot
fig, ax = plt.subplots(figsize=(7, 5))
hotels[hotels.select_dtypes(include=['int', 'float']).columns].plot(ax=ax, kind='box')
sns.despine()
ax.grid(axis="y")
ax.set_axisbelow(True)
ax.set_ylabel('Values', fontsize=14)
ax.tick_params(axis='x', labelsize=13, rotation=90)
ax.tick_params(axis='y', labelsize=12)
plt.show()

We are now ready to proceed with the analysis!
Predictive analysis#
Data preprocessing#
This process consists of:
Separating variables (features) and target.
Converting categorical variables to numerical (avoiding multicollinearity).
Splitting the data into training and testing sets.
Scaling the data (necessary for Logistic Regression).
Reconstructing complete basetables (features + target) to perform predictive analysis.
# Define features
features = hotels.drop(['Booking_ID', 'booking_status'], axis=1)
# Define target
target = hotels['booking_status']
# Prepare features encoding categorical variables
X = pd.get_dummies(features,
drop_first=True) # Avoid multicollinearity
# Assign target
y = target
# Split dataset into 70% training and 30% test set, and stratify
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3,
random_state=42,
stratify=y)
# Scale X
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
# Reset index to concatenate later
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)
# Create the train and test basetables
train = pd.concat([X_train_scaled, y_train], axis=1)
test = pd.concat([X_test_scaled, y_test], axis=1)
Variable selection#
Once the train and test basetables are ready, we can proceed to select the variables with the highest predictive power.
To do so, I will use a forward stepwise variable selection procedure with AUC as the metric: variables are ranked by the predictive power achieved as they are progressively added to a Logistic Regression model. The process is carried out on the training basetable only, to avoid data leakage.
# Define function
def auc(variables, target, basetable):
    '''Returns AUC of a Logistic Regression model'''
    X = basetable[variables]
    y = np.ravel(basetable[target])
    logreg = LogisticRegression()
    logreg.fit(X, y)
    predictions = logreg.predict_proba(X)[:, 1]
    return roc_auc_score(y, predictions)

# Define function
def next_best(current_variables, candidate_variables, target, basetable):
    '''Returns next best variable to maximize AUC'''
    best_auc = -1
    best_variable = None
    for v in candidate_variables:
        auc_v = auc(current_variables + [v], target, basetable)
        if auc_v >= best_auc:
            best_auc = auc_v
            best_variable = v
    return best_variable

# Define function
def auc_train_test(variables, target, train, test):
    '''Returns AUC of train and test data sets'''
    return (auc(variables, target, train), auc(variables, target, test))

# Define candidate variables
candidate_variables = list(train.columns)
candidate_variables.remove("booking_status")
# Initialize current variables
current_variables = []
# The forward stepwise variable selection procedure
number_iterations = len(candidate_variables)  # All variables will be considered
for i in range(0, number_iterations):
    # Get next variable which maximizes AUC on the training data set
    next_variable = next_best(current_variables, candidate_variables, ["booking_status"], train)
    # Add it to the list
    current_variables = current_variables + [next_variable]
    # Remove it from the candidate variables' list
    candidate_variables.remove(next_variable)
    # Print which variable was added
    print(f"Step {i + 1}: variable '{next_variable}' added")
Step 1: variable 'lead_time' added
Step 2: variable 'no_of_special_requests' added
Step 3: variable 'market_segment_type_Online' added
Step 4: variable 'avg_price_per_room' added
Step 5: variable 'arrival_year_2018' added
Step 6: variable 'market_segment_type_Offline' added
Step 7: variable 'arrival_month' added
Step 8: variable 'room_type_reserved_Room_Type 2' added
Step 9: variable 'market_segment_type_Complementary' added
Step 10: variable 'market_segment_type_Corporate' added
Step 11: variable 'type_of_meal_plan_Not Selected' added
Step 12: variable 'no_of_weekend_nights' added
Step 13: variable 'no_of_week_nights' added
Step 14: variable 'type_of_meal_plan_Meal Plan 2' added
Step 15: variable 'type_of_meal_plan_Meal Plan 3' added
Step 16: variable 'room_type_reserved_Room_Type 5' added
Step 17: variable 'room_type_reserved_Room_Type 6' added
Step 18: variable 'no_of_children' added
Step 19: variable 'room_type_reserved_Room_Type 7' added
Step 20: variable 'room_type_reserved_Room_Type 4' added
Step 21: variable 'room_type_reserved_Room_Type 3' added
Step 22: variable 'no_of_adults' added
Step 23: variable 'no_of_previous_bookings_not_canceled' added
Step 24: variable 'no_of_previous_cancellations' added
Step 25: variable 'repeated_guest' added
Step 26: variable 'required_car_parking_space' added
Step 27: variable 'arrival_date' added
We will now visualize how performance evolves as variables are added to the model in the order defined by the list, considering both the train and test basetables to check the validity of the results.
# Init lists
auc_values_train = []
auc_values_test = []
variables_evaluate = []
# Iterate over the variables in current_variables
for v in current_variables:
    # Add the variable
    variables_evaluate.append(v)
    # Calculate the train and test AUC of this set of variables
    auc_train, auc_test = auc_train_test(variables_evaluate, ["booking_status"], train, test)
    # Append the values to the lists
    auc_values_train.append(auc_train)
    auc_values_test.append(auc_test)
# Create dataframe to plot results
aucs = pd.concat([pd.DataFrame(np.array(auc_values_train),
columns=['Train'],
index=current_variables),
pd.DataFrame(np.array(auc_values_test),
columns=['Test'],
index=current_variables)],
axis=1)
# Plot
fig, ax = plt.subplots(figsize=(7, 10))
ax.plot(aucs['Train'], aucs.index, label='Train')
ax.plot(aucs['Test'], aucs.index, label='Test')
sns.despine()
ax.grid(axis="both")
ax.set_axisbelow(True)
ax.set_title('', fontsize=14)
ax.set_xlabel('AUC performance score', fontsize=14)
ax.set_ylabel("", fontsize=14)
ax.tick_params(axis='x', labelsize=12, rotation=0)
ax.tick_params(axis='y', labelsize=12)
ax.legend(title='Data set', loc='center', title_fontsize=13, fontsize=13)
ax.annotate('',
xy=(0.727, 3),
xytext=(0.727, 0), fontsize=12,
arrowprops={"arrowstyle":"-|>", "color":"black", 'linewidth': '0.75','linestyle':"--"})
ax.annotate('',
xy=(0.843, 3),
xytext=(0.727, 3), fontsize=12,
arrowprops={"arrowstyle":"-|>", "color":"black", 'linewidth': '0.75','linestyle':"--"})
ax.annotate('',
xy=(0.843, 27),
xytext=(0.843, 3), fontsize=12,
arrowprops={"arrowstyle":"-|>", "color":"black", 'linewidth': '0.75','linestyle':"--"})
ax.annotate("Stepwise\nvariable selection list", (0.73, 2), size=12)
ax.annotate("Cut-off:\nperformance improvement no longer significant", (0.74, 4.6), size=12)
ax.invert_yaxis()
plt.show()
# Selected variables
n_variables = 4
selected_variables = current_variables[:n_variables]

After conducting the forward stepwise variable selection procedure, a total of 4 variables were selected based on their predictive power.
lead_time
no_of_special_requests
market_segment_type_Online
avg_price_per_room
To ensure that we are not missing any important variables, I will compare the accuracy, precision, and recall scores of the Logistic Regression model when fitted with all variables vs when fitted only with the 4 selected ones.
# Fit Logistic Regression model with all variables
scores_all = []
logreg_all = LogisticRegression()
logreg_all.fit(X_train_scaled, y_train)
y_pred_all = logreg_all.predict(X_test_scaled)
scores_all.append(accuracy_score(y_test, y_pred_all))
scores_all.append(precision_score(y_test, y_pred_all))
scores_all.append(recall_score(y_test, y_pred_all))
# Fit Logistic Regression model with selected variables only
scores_sel = []
logreg_sel = LogisticRegression()
logreg_sel.fit(X_train_scaled.loc[:, selected_variables], y_train)
y_pred_sel = logreg_sel.predict(X_test_scaled.loc[:, selected_variables])
scores_sel.append(accuracy_score(y_test, y_pred_sel))
scores_sel.append(precision_score(y_test, y_pred_sel))
scores_sel.append(recall_score(y_test, y_pred_sel))
# Create dataframe for plotting
metrics = pd.DataFrame(scores_all,
index=['accuracy', 'precision', 'recall'], columns=['all'])
metrics['sel'] = scores_sel
# Plot
fig, ax = plt.subplots(figsize=(7, 5))
metrics.plot(ax=ax, marker='o', linewidth=0.75)
sns.despine()
ax.grid(axis="y")
ax.set_axisbelow(True)
ax.set_title('', fontsize=14)
ax.set_xlabel('', fontsize=14)
ax.set_ylabel('Score', fontsize=14)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=12)
ax.set_xticks(range(0, 3), labels=list(metrics.index))
ax.legend(title='Variable set', labels=['All variables', 'Selected'],
loc='upper right', title_fontsize=13, fontsize=13)
ax.set_ylim(0, 1)
plt.show()

This comparison shows that we are not losing significant predictive information by considering only the selected variables.
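For a more detailed view of the same comparison, the classification_report imported earlier gives per-class precision and recall for both models:
# Per-class metrics for the full and the reduced variable sets
print(classification_report(y_test, y_pred_all))
print(classification_report(y_test, y_pred_sel))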
The coefficients of the Logistic Regression model tell us about the importance of each variable.
# Extract coefficients of the model fitted with selected variables
coefs = pd.DataFrame(logreg_sel.coef_[0],
index=X_train_scaled.loc[:, selected_variables].columns)\
.rename(columns={0: 'logreg_coef'})
# Add new column with their absolute value
coefs['coef_abs'] = coefs['logreg_coef'].abs()
# Sort data frame according to the absolute values
coefs = coefs.sort_values('coef_abs', ascending=False)
# Add new column with their position in the model coefficients list
coefs['coef_abs_pos'] = range(1, len(coefs) + 1)
# Plot
fig, ax = plt.subplots(figsize=(7, 3))
coefs['logreg_coef'].plot(kind='barh')
sns.despine()
ax.grid(axis="both")
ax.set_axisbelow(True)
ax.set_title('Sorted by predictive power\n(coefficient abs value)', fontsize=14)
ax.set_xlabel('Logistic Regression model coefficient value', fontsize=14)
ax.set_ylabel('', fontsize=14)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=12)
ax.invert_yaxis()
plt.show()

In the graph, we can see the coefficient values for each variable, sorted by their absolute values (predictive power).
However, we selected our own list of variables based on the progressive improvement in model performance. Both lists contain the same variables, but the order of importance is not exactly the same.
# Create another dataframe with selected variables
sel_vars = pd.DataFrame(selected_variables, columns=['selection'])
# Add column with their position in the selected variable list
sel_vars['selection_pos'] = range(1, len(sel_vars) + 1)
# Set index to prepare for the merging
sel_vars = sel_vars.set_index('selection')
# Merge both dataframes on the indexes
coefs_sels = coefs.merge(sel_vars, how='left', left_index=True, right_index=True)
coefs_sels_ = coefs_sels.sort_values('selection_pos')
# Plot
fig, ax = plt.subplots(figsize=(7, 3))
coefs_sels_['logreg_coef'].plot(kind='barh')
sns.despine()
ax.grid(axis="both")
ax.set_axisbelow(True)
ax.set_title('Sorted by stepwise selection order', fontsize=14)
ax.set_xlabel('Logistic Regression model coefficient value', fontsize=14)
ax.set_ylabel('', fontsize=14)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=12)
ax.invert_yaxis()
plt.show()

In our selection, lead_time comes first, instead of no_of_special_requests, as the most important predictive variable.
Predictor Insight Graphs#
Let’s finish our analysis by plotting Predictor Insight Graphs for the selected variables, to verify that the variables in the model are interpretable and that the results make sense.
# Define plotting function
def plot_pig(df, variable, target, sort=False, rotation=0):
    '''Create and plot Predictor Insight Graph for corresponding variable'''
    # Create Predictor Insight Graph table
    pig_table = df.groupby(variable)[target].agg([np.size, np.mean])
    # Sort values if requested
    if sort:
        pig_table = pig_table.sort_values('size', ascending=False)
    # Plot
    fig, ax = plt.subplots(figsize=(7, 5))
    ax2 = ax.twinx()
    pig_table['size'].plot(ax=ax, kind='bar', color='lightgrey')
    ax2.plot(ax.get_xticks(), pig_table['mean'], marker='o', linewidth=0.75, color='black')
    sns.despine()
    ax2.grid(axis="y")
    ax.set_axisbelow(True)
    ax.set_title('', fontsize=14)
    ax.set_xlabel(variable, fontsize=14)
    ax.set_ylabel('Size', fontsize=14)
    ax2.set_ylabel('Incidence', fontsize=14)
    ax.tick_params(axis='x', labelsize=13, rotation=rotation)
    ax.tick_params(axis='y', labelsize=12)
    ax2.set_yticks(np.arange(0, 1.25, 0.25), labels=np.arange(0, 1.25, 0.25))
    ax.set_ylim(0)
    ax2.set_ylim(0, 1)
    plt.show()
    return pig_table

# Take complete basetable with dummies
hotels_dummied = pd.get_dummies(hotels, drop_first=True)
Let’s begin with lead_time. This is a continuous variable with many unique values, so we need to discretize it (define intervals and group the values into them) before plotting.
# Check minimum and maximum values to define range
hotels_dummied['lead_time'].agg([min, max])
# Establish lead time intervals according to value ranges
bins = pd.IntervalIndex.from_tuples([(i, i + 10) for i in range(1, 220, 10)], closed='left')
# Create new column with time intervals
hotels_dummied['lead_time_intervals'] = pd.cut(hotels_dummied['lead_time'], bins)
# Plot
_ = plot_pig(hotels_dummied, 'lead_time_intervals', 'booking_status', rotation=90)

When the lead_time (the number of days between the booking date and the arrival date) increases, the incidence of the target (booking_status = 1, ‘Canceled’) increases as well, as shown in the graph. The effect is particularly pronounced when the lead time exceeds roughly three months, where the cancellation ratio rises dramatically.
# Plot
_ = plot_pig(hotels_dummied, 'no_of_special_requests', 'booking_status')

The variable no_of_special_requests is negatively correlated with the target (its coefficient in the Logistic Regression model was negative), meaning that the more special requests a customer makes as part of the booking, the lower the incidence of cancellation.
# Plot
_ = plot_pig(hotels_dummied, 'market_segment_type_Online', 'booking_status')

It seems that if the booking was made online, the chances of it being cancelled are clearly higher.
# Check minimum and maximum values to define range
hotels_dummied['avg_price_per_room'].agg([min, max])
# Establish price intervals according to value ranges
bins = pd.IntervalIndex.from_tuples([(i, i + 25) for i in range(50, 175, 25)], closed='left')
# Create new column with price intervals
hotels_dummied['avg_price_per_room_intervals'] = pd.cut(hotels_dummied['avg_price_per_room'], bins)
# Plot
_ = plot_pig(hotels_dummied, 'avg_price_per_room_intervals', 'booking_status')

Finally, the price of the room (a continuous variable that was also discretized) also has an important influence on predicting cancellations, especially in the range between low and medium-priced rooms.
Conclusions#
In summary, the main factors that contribute to the cancellation of bookings are:
The lead time between the reservation and the arrival date.
The number of special requests made by the customer.
Whether the booking was made online.
The price of the room.
To reduce the likelihood of cancellations, these variables should be closely monitored so that a warning is produced when the predicted probability of cancellation reaches a certain level. Further retention actions can then be taken with those customers.
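As a sketch of how such a warning could work (the function name and the 0.7 threshold are hypothetical placeholders; scaler, logreg_sel, X and selected_variables come from the cells above):
def flag_risky_bookings(new_bookings, threshold=0.7):
    '''Score dummy-encoded bookings and flag likely cancellations.
    Hypothetical helper: the threshold would need tuning in practice.'''
    # Scale with the scaler fitted on the training data, then keep
    # only the selected variables expected by the fitted model
    scaled = pd.DataFrame(scaler.transform(new_bookings[X.columns]),
                          columns=X.columns, index=new_bookings.index)
    proba = logreg_sel.predict_proba(scaled[selected_variables])[:, 1]
    return pd.DataFrame({'cancel_proba': proba, 'warning': proba >= threshold},
                        index=new_bookings.index)

# Example: score the (already encoded) test set
print(flag_risky_bookings(X_test).head())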