Predicting Industrial Machine Downtime#

A DataCamp challenge, March 2025

Classification

The project#

A manufacturer of high-precision metal components used in aerospace, automotive, and medical device applications operates three machines on its shop floor, each producing components of different sizes. Minimizing the downtime of these machines is vital for meeting production deadlines.

The aim is to take a data-driven approach to predicting machine downtime, so that proactive maintenance can be planned instead of reacting to machine failures. To support this, the company has been collecting operational data for over a year, together with a record of whether each machine was down at the time of each reading.

Develop a predictive model that could be combined with real-time operational data to detect likely machine failure.

The data#

The company has stored the machine operating data in a single table. Each row in the table represents the operational data for a single machine on a given day:

  • "Date" - the date the reading was taken on.

  • "Machine_ID" - the unique identifier of the machine being read.

  • "Assembly_Line_No" - the unique identifier of the assembly line the machine is located on.

  • "Hydraulic_Pressure(bar)", "Coolant_Pressure(bar)", and "Air_System_Pressure(bar)" - pressure measurements at different points in the machine.

  • "Coolant_Temperature", "Hydraulic_Oil_Temperature", and "Spindle_Bearing_Temperature" - temperature measurements (in Celsius) at different points in the machine.

  • "Spindle_Vibration", "Tool_Vibration", and "Spindle_Speed(RPM)" - vibration (measured in micrometers) and rotational speed measurements for the spindle and tool.

  • "Voltage(volts)" - the voltage supplied to the machine.

  • "Torque(Nm)" - the torque being generated by the machine.

  • "Cutting(KN)" - the cutting force of the tool.

  • "Downtime" - an indicator of whether the machine was down or not on the given day.

# Import packages
import calendar

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Read the data
downtime = pd.read_csv(
    "data/machine_downtime.csv",
    parse_dates=["Date"],
    date_format="%d-%m-%Y",
    index_col="Date",
)
downtime
Machine_ID Assembly_Line_No Hydraulic_Pressure(bar) Coolant_Pressure(bar) Air_System_Pressure(bar) Coolant_Temperature Hydraulic_Oil_Temperature Spindle_Bearing_Temperature Spindle_Vibration Tool_Vibration Spindle_Speed(RPM) Voltage(volts) Torque(Nm) Cutting(kN) Downtime
Date
2021-12-31 Makino-L1-Unit1-2013 Shopfloor-L1 71.040000 6.933725 6.284965 25.6 46.0 33.4 1.291 26.492 25892.0 335.0 24.055326 3.58 Machine_Failure
2021-12-31 Makino-L1-Unit1-2013 Shopfloor-L1 125.330000 4.936892 6.196733 35.3 47.4 34.6 1.382 25.274 19856.0 368.0 14.202890 2.68 Machine_Failure
2021-12-31 Makino-L3-Unit1-2015 Shopfloor-L3 71.120000 6.839413 6.655448 13.1 40.7 33.0 1.319 30.608 19851.0 325.0 24.049267 3.55 Machine_Failure
2022-05-31 Makino-L2-Unit1-2015 Shopfloor-L2 139.340000 4.574382 6.560394 24.4 44.2 40.6 0.618 30.791 18461.0 360.0 25.860029 3.55 Machine_Failure
2022-03-31 Makino-L1-Unit1-2013 Shopfloor-L1 60.510000 6.893182 6.141238 4.1 47.3 31.4 0.983 25.516 26526.0 354.0 25.515874 3.55 Machine_Failure
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2022-02-01 Makino-L1-Unit1-2013 Shopfloor-L1 112.715506 5.220885 6.196610 22.3 48.8 37.2 0.910 20.282 20974.0 282.0 22.761610 2.72 No_Machine_Failure
2022-02-01 Makino-L1-Unit1-2013 Shopfloor-L1 103.086653 5.211886 7.074653 11.9 48.3 31.5 1.106 34.708 20951.0 319.0 22.786597 2.94 No_Machine_Failure
2022-02-01 Makino-L2-Unit1-2015 Shopfloor-L2 118.643165 5.212991 6.530049 4.5 49.9 36.2 0.288 16.828 20958.0 335.0 22.778987 NaN No_Machine_Failure
2022-02-01 Makino-L3-Unit1-2015 Shopfloor-L3 145.855859 5.207777 6.402655 12.2 44.5 32.1 0.995 26.498 20935.0 376.0 22.804012 2.79 No_Machine_Failure
2022-02-01 Makino-L2-Unit1-2015 Shopfloor-L2 96.690000 5.936610 7.109355 29.8 53.2 36.2 0.840 31.580 23576.0 385.0 24.409551 3.55 Machine_Failure

2500 rows × 15 columns

Data validation#

Check data quality#

# Get dataframe info
downtime.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2500 entries, 2021-12-31 to 2022-02-01
Data columns (total 15 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Machine_ID                   2500 non-null   object 
 1   Assembly_Line_No             2500 non-null   object 
 2   Hydraulic_Pressure(bar)      2490 non-null   float64
 3   Coolant_Pressure(bar)        2481 non-null   float64
 4   Air_System_Pressure(bar)     2483 non-null   float64
 5   Coolant_Temperature          2488 non-null   float64
 6   Hydraulic_Oil_Temperature    2484 non-null   float64
 7   Spindle_Bearing_Temperature  2493 non-null   float64
 8   Spindle_Vibration            2489 non-null   float64
 9   Tool_Vibration               2489 non-null   float64
 10  Spindle_Speed(RPM)           2494 non-null   float64
 11  Voltage(volts)               2494 non-null   float64
 12  Torque(Nm)                   2479 non-null   float64
 13  Cutting(kN)                  2493 non-null   float64
 14  Downtime                     2500 non-null   object 
dtypes: float64(12), object(3)
memory usage: 312.5+ KB

Most columns have a handful of missing values. Since only 119 of the 2,500 rows are affected (about 5%), I will drop the records with missing values.

# Drop records with missing values
downtime.dropna(inplace=True)

Check for duplicated rows.

# Check duplicated rows
downtime.duplicated().sum()
0

There are no duplicated records.

Check value ranges#

Categorical columns#

Based on the dataframe overview, it appears that Machine_ID and Assembly_Line_No encode the same information. If that's the case, one of these columns is redundant.
Let's check.

# Plot
fig, ax = plt.subplots(1, 2, figsize=(8, 3))
downtime["Machine_ID"].value_counts().plot(ax=ax[0], kind="bar")
downtime["Assembly_Line_No"].value_counts().plot(ax=ax[1], kind="bar")
for i in range(2):
    ax[i].set_ylabel("Counts")
    ax[i].tick_params(axis="x", labelsize=10, rotation=45)
sns.despine()
plt.show()
../_images/f2194cfe406b224bb7d3f6355ccb677b0949c1a4f36314da3084c7208557df2e.png

The counts match. Let's confirm the one-to-one mapping by encoding each identifier as a number and computing the correlation coefficient between the two columns.

# Relate identifiers with numbers to encode machines and lines
machine_encoding = {"Makino-L1-Unit1-2013": 1, "Makino-L2-Unit1-2015": 2, "Makino-L3-Unit1-2015": 3}
line_encoding = {"Shopfloor-L1": 1, "Shopfloor-L2": 2, "Shopfloor-L3": 3}

# Replace machine ID names with numeric codes
downtime["Machine_ID"] = downtime["Machine_ID"].map(machine_encoding)

# Replace assembly line names with numeric codes
downtime["Assembly_Line_No"] = downtime["Assembly_Line_No"].map(line_encoding)

# Get correlation between these columns
print(f"Correlation -> {downtime['Machine_ID'].corr(downtime['Assembly_Line_No'])}")
Correlation -> 0.9999999999999999
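As an additional sanity check on the encoded columns (a quick sketch, run before dropping anything), a cross-tabulation should show each machine paired with exactly one assembly line:

# Cross-tabulate the encoded identifiers; a purely diagonal table
# confirms the one-to-one mapping between machines and assembly lines
print(pd.crosstab(downtime["Machine_ID"], downtime["Assembly_Line_No"]))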

They are indeed identical, so we can drop one of the columns, Assembly_Line_No for instance.

# Drop column
downtime.drop("Assembly_Line_No", axis=1, inplace=True)

Let’s check how balanced the target variable is:

# Plot
fig, ax = plt.subplots(figsize=(5, 3))
downtime["Downtime"].value_counts().plot(ax=ax, kind="bar")
ax.tick_params(axis="x", labelsize=10, rotation=45)
ax.set_ylabel("Counts")
sns.despine()
plt.show()
../_images/f462539947cbe59cceb51fb46a0ca4c9e075c8b315937f127b38504d296f904e.png

The classes are quite balanced. I will replace "Machine_Failure" with the numeric value 1 and "No_Machine_Failure" with 0.

# Replace target by numerical values
downtime["Downtime"] = downtime["Downtime"].apply(
    lambda x: 1 if x == "Machine_Failure" else 0
)

# Get dataframe info
downtime.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2381 entries, 2021-12-31 to 2022-02-01
Data columns (total 14 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Machine_ID                   2381 non-null   int64  
 1   Hydraulic_Pressure(bar)      2381 non-null   float64
 2   Coolant_Pressure(bar)        2381 non-null   float64
 3   Air_System_Pressure(bar)     2381 non-null   float64
 4   Coolant_Temperature          2381 non-null   float64
 5   Hydraulic_Oil_Temperature    2381 non-null   float64
 6   Spindle_Bearing_Temperature  2381 non-null   float64
 7   Spindle_Vibration            2381 non-null   float64
 8   Tool_Vibration               2381 non-null   float64
 9   Spindle_Speed(RPM)           2381 non-null   float64
 10  Voltage(volts)               2381 non-null   float64
 11  Torque(Nm)                   2381 non-null   float64
 12  Cutting(kN)                  2381 non-null   float64
 13  Downtime                     2381 non-null   int64  
dtypes: float64(12), int64(2)
memory usage: 279.0 KB

Numerical columns#

Apart from the target and Machine_ID, all features are of numerical float type. I will plot their value ranges to get a sense of their scales.

# Define feature list
features = downtime.select_dtypes(include="float64").columns

# Plot
rows = 3
cols = 4
fig, ax = plt.subplots(rows, cols, figsize=(11, 11))
i, j = 0, 0
for n, feature in enumerate(features):
    downtime[feature].plot(ax=ax[i, j], kind="box")
    j += 1  # Next col
    if (n + 1) % cols == 0:  # Next row, reset col
        i += 1
        j = 0

plt.show()
../_images/84e34968d51381852c4fd1195715e7355b7a8ebaba344f8b52f546d68a1ad90b.png
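A numeric summary complements the box plots (a small sketch reusing the same feature list):

# Summary statistics of the numerical features
print(downtime[features].describe().T[["mean", "std", "min", "max"]])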

Sort dates#

I will sort the dataframe by date (the index) and print the first and last recorded dates.

# Sort dataframe by date, which is the index
downtime.sort_index(inplace=True)

# Print
print(f"Data readings starting at: {downtime.index[0].strftime('%Y-%m-%d')}")
print(f"Data readings ending at: {downtime.index[-1].strftime('%Y-%m-%d')}")
Data readings starting at: 2021-11-24
Data readings ending at: 2022-06-19

Exploratory data analysis#

Downtime ratios#

By machine#

Which machine has the highest readings of machine downtime?

# Compare downtime metrics between machines
downtime.groupby("Machine_ID")["Downtime"].agg(["count", "sum", "mean"]).rename(
    columns={"sum": "downtimes", "mean": "downtime_ratio"}
)
count downtimes downtime_ratio
Machine_ID
1 830 440 0.530120
2 763 382 0.500655
3 788 409 0.519036

Machine 1 has the highest downtime ratio (it also has the most recorded readings).

Monthly#

# Group by month
downtime.groupby(downtime.index.month)["Downtime"].agg(["count", "sum", "mean"])
count sum mean
Date
1 170 84 0.494118
2 554 282 0.509025
3 1005 526 0.523383
4 512 274 0.535156
5 110 48 0.436364
6 7 3 0.428571
11 1 1 1.000000
12 22 13 0.590909

Months 11 (November) and 6 (June) have very few records, and 12 (December) only a few more, so their ratios should not be considered representative.

By machine and monthly#

# Group by month and by machine
downtime.pivot_table(
    index=downtime.index.month,
    columns="Machine_ID",
    values="Downtime",
    aggfunc=["count", "sum", "mean"],
)
count sum mean
Machine_ID 1 2 3 1 2 3 1 2 3
Date
1 61.0 54.0 55.0 33.0 22.0 29.0 0.540984 0.407407 0.527273
2 182.0 165.0 207.0 85.0 85.0 112.0 0.467033 0.515152 0.541063
3 349.0 323.0 333.0 191.0 166.0 169.0 0.547278 0.513932 0.507508
4 188.0 171.0 153.0 107.0 85.0 82.0 0.569149 0.497076 0.535948
5 39.0 40.0 31.0 16.0 18.0 14.0 0.410256 0.450000 0.451613
6 1.0 4.0 2.0 0.0 3.0 0.0 0.000000 0.750000 0.000000
11 NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.000000
12 10.0 6.0 6.0 8.0 3.0 2.0 0.800000 0.500000 0.333333
# Plot by month
fig, ax = plt.subplots(figsize=(6, 5))
sns.pointplot(
    ax=ax,
    x=downtime.index.month,
    y="Downtime",
    data=downtime,
    hue="Machine_ID",
    dodge=0.2,
)
ax.set_xlabel("Month (1: January - 12: December)")
ax.set_ylabel("Downtime Ratio")
sns.despine()
plt.show()
../_images/192c19a714ddc3cac0fb79afa21799ce6529f24bc86b3cfd20e6931ee53a38f7.png

As mentioned before, values in months 6, 11 and 12 are not representative.

By machine and day of the week#

# Plot by day of the week
fig, ax = plt.subplots(figsize=(6, 5))
sns.pointplot(
    ax=ax,
    x=downtime.index.dayofweek,
    y="Downtime",
    data=downtime,
    hue="Machine_ID",
    dodge=0.2,
)
ax.set_xlabel("")
ax.set_ylabel("Downtime Ratio")
ax.set_xticks(range(7), labels=list(calendar.day_abbr))
sns.despine()
plt.show()
../_images/219d8ad4c51124fc67995f6bb7a527ec39fd10ba563f0cfb1bbf4fb77926c330.png

The confidence intervals overlap, so there is no significant difference between days of the week.

Overall, the dates in the index of this dataframe do not appear to have any meaningful predictive power regarding machine downtime.
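To back this impression up, one could test whether the downtime frequency is independent of the calendar fields. The following is only a sketch, assuming scipy is available (it is not imported above); the resulting p-values would need to be inspected on the cleaned data.

from scipy.stats import chi2_contingency

# Chi-square test of independence between calendar fields and the downtime label
for name, values in [("day of week", downtime.index.dayofweek), ("month", downtime.index.month)]:
    contingency = pd.crosstab(values, downtime["Downtime"])
    chi2, p_value, dof, _ = chi2_contingency(contingency)
    print(f"{name}: chi2 = {chi2:.2f}, p-value = {p_value:.3f}")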

Correlations of features#

Let's calculate the correlation coefficients between the features.

# Get pearson correlation matrix
corr = downtime.corr()

# Create a mask to only show half of the matrix
mask = np.triu(np.ones_like(corr, dtype=bool))

# Create a diverging color palette between two HUSL colors.
cmap = sns.diverging_palette(h_neg=10, h_pos=240, as_cmap=True)

# Plot the heatmap
fig, ax = plt.subplots(figsize=(8, 7))

sns.heatmap(
    corr, ax=ax, mask=mask, center=0, cmap=cmap, linewidths=1, annot=True, fmt=".2f"
)

plt.show()
../_images/7dbe3e66155f38390720377ccb80f00e172b14aadca2af1c04fb65459e143c5d.png

According to their relatively high correlation with the target, the following features appear most connected to machine downtime:

  • Hydraulic Pressure (negatively)

  • Torque being generated by the machine (negatively)

  • Cutting force of the tool (positively)

Let's plot them!

# Define those correlated column names in a list
corr_columns = ["Hydraulic_Pressure(bar)", "Torque(Nm)", "Cutting(kN)"]

# Subset dataframe to rows with downtime = 1 and rows with downtime = 0
downtime_yes = downtime.loc[downtime["Downtime"] == 1, :]
downtime_no = downtime.loc[downtime["Downtime"] == 0, :]

# Plot the value of the feature in each downtime case
for column in corr_columns:
    # Plot
    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    downtime_no[column].plot(ax=ax[0], label="0")
    downtime_yes[column].plot(ax=ax[0], label="1")
    sns.boxplot(ax=ax[1], x="Downtime", y=column, data=downtime, hue="Downtime")
    ax[0].set_title(column)
    ax[0].set_xlabel("")
    ax[0].legend(title="Downtime")
    sns.despine()
    plt.show()
../_images/9494df6b3c567be1e9da388d81fb83c6c103f6d3b6be2636853785f7a562fd45.png ../_images/6e9e57c6b0e9f267f8de1bad00045d65715066ceda790c8a4408c9ad1a2ac3dd.png ../_images/e3112dfed9ddc537c25e3c8bd82d7643e114e08c0233c4a50e1ca95006463a81.png

Indeed, these plots show the correlations and their signs: downtime cases tend to have lower hydraulic pressure and torque, and higher cutting force.

Modelling a classifier#

Logistic Regression#

I will start with a Logistic Regression model.

# Define function for classification
def lr_classifier(X, y):

    # Split dataset into training and test set, and stratify
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )

    # Scale
    scaler = StandardScaler()
    X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
    X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

    # Instantiate model
    model = LogisticRegression()

    # Fit model to the training set
    model.fit(X_train_scaled, y_train)

    # Predict test set values
    y_pred = model.predict(X_test_scaled)

    # Print results
    print(f"{accuracy_score(y_test, y_pred):.2f} <- Test-set accuracy ")
    print(
        f"{accuracy_score(y_train, model.predict(X_train_scaled)):.2f} <- Train-set accuracy "
    )
    print(classification_report(y_test, y_pred))

    return model


# Define features and target
y = downtime["Downtime"]
X = downtime.drop("Downtime", axis=1)

# Classify
model = lr_classifier(X, y)
0.85 <- Test-set accuracy 
0.86 <- Train-set accuracy 
              precision    recall  f1-score   support

           0       0.84      0.85      0.84       345
           1       0.86      0.84      0.85       370

    accuracy                           0.85       715
   macro avg       0.85      0.85      0.85       715
weighted avg       0.85      0.85      0.85       715
# Plot the absolute values of the coefficients
plt.bar(X.columns, np.abs(model.coef_[0]))
plt.title("Logistic Regression model coefficients", fontsize=11)
plt.xticks(rotation=90)
sns.despine()
plt.show()
../_images/1801baab6e5ecd7083daf183d5c7499f52338f817fbbc5d051b4e34418a4623c.png

As expected, the coefficient magnitudes show that the three features identified earlier as most correlated with the target carry the most weight in the model.
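For a quick ranking of these drivers (a small sketch reusing the model and X objects from the cell above):

# Rank features by the absolute value of their logistic regression coefficients
coefficient_ranking = pd.Series(np.abs(model.coef_[0]), index=X.columns)
print(coefficient_ranking.sort_values(ascending=False).head())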

Modeling machines separately#

Let’s see results if we consider machines 1, 2 and 3 separately.

# Subset dataset for each machine
machine_1 = downtime.loc[downtime["Machine_ID"] == 1, :]
machine_2 = downtime.loc[downtime["Machine_ID"] == 2, :]
machine_3 = downtime.loc[downtime["Machine_ID"] == 3, :]

# Create list of those filtered datasets
machines = [machine_1, machine_2, machine_3]

# Create classifiers separately for each machine
for i, machine in enumerate(machines):
    # Define features and target
    y = machine["Downtime"]
    X = machine.drop("Downtime", axis=1)

    # Classify
    print(f"\nMachine {i + 1}:")
    model = lr_classifier(X, y)
Machine 1:
0.87 <- Test-set accuracy 
0.87 <- Train-set accuracy 
              precision    recall  f1-score   support

           0       0.88      0.84      0.86       117
           1       0.86      0.90      0.88       132

    accuracy                           0.87       249
   macro avg       0.87      0.87      0.87       249
weighted avg       0.87      0.87      0.87       249


Machine 2:
0.85 <- Test-set accuracy 
0.84 <- Train-set accuracy 
              precision    recall  f1-score   support

           0       0.86      0.84      0.85       114
           1       0.85      0.86      0.85       115

    accuracy                           0.85       229
   macro avg       0.85      0.85      0.85       229
weighted avg       0.85      0.85      0.85       229


Machine 3:
0.85 <- Test-set accuracy 
0.85 <- Train-set accuracy 
              precision    recall  f1-score   support

           0       0.90      0.78      0.84       114
           1       0.82      0.92      0.87       123

    accuracy                           0.85       237
   macro avg       0.86      0.85      0.85       237
weighted avg       0.86      0.85      0.85       237

Results are slightly better for machine 1 (0.87 test-set accuracy versus 0.85 for machines 2 and 3).

Random Forest Classifier#

Let's see if a Random Forest model can improve the overall results.

# Define features and target
y = downtime["Downtime"]
X = downtime.drop("Downtime", axis=1)

# Split dataset into training and test set, and stratify
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

# Instantiate model
model = RandomForestClassifier(n_estimators=100)

# Fit model to the training set
model.fit(X_train_scaled, y_train)

# Predict test set values
y_pred = model.predict(X_test_scaled)

# Print results
print(f"{accuracy_score(y_test, y_pred):.2f} <- Test-set accuracy ")
print(
    f"{accuracy_score(y_train, model.predict(X_train_scaled)):.2f} <- Train-set accuracy "
)
print(classification_report(y_test, y_pred))
0.99 <- Test-set accuracy 
1.00 <- Train-set accuracy 
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       345
           1       0.99      0.99      0.99       370

    accuracy                           0.99       715
   macro avg       0.99      0.99      0.99       715
weighted avg       0.99      0.99      0.99       715

Outstanding performance! These results show the Random Forest classifier's ability to capture complex, non-linear relationships in the data.
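With a perfect train-set score, it is worth checking that this is not an artifact of a single lucky split. The following is a minimal cross-validation sketch, assuming the scaler and the classifier are wrapped in a pipeline so the scaler is re-fit on each training fold (the random_state is added here only for reproducibility):

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# 5-fold cross-validation of a scaler + Random Forest pipeline on the full dataset
pipeline = make_pipeline(
    StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=42)
)
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"Cross-validated accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")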

# Plot feature importances
plt.bar(X.columns, model.feature_importances_)
plt.title("Random Forest feature importances", fontsize=11)
plt.xticks(rotation=90)
sns.despine()
plt.show()
../_images/3625c3810e25eca83dc7094509600029934679ad32e096d97bda9143b18ada1d.png

Conclusion#

Initially, I considered incorporating the time series data for predictive purposes. However, the exploratory analysis revealed no patterns associated with the dates, indicating that this was simply a feature-based classification problem.

While the Logistic Regression classifier performed reasonably well, the non-linearities in the dataset made the Random Forest classifier a more suitable model, yielding impressive results with minimal tuning.
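Finally, to illustrate how the chosen model could be combined with real-time operational data, the sketch below scores a single incoming reading. It assumes the objects from the Random Forest cell (model, scaler, X, X_test) are still in scope and simply reuses a test-set row as a stand-in for a live reading.

# Stand-in for a live sensor reading arriving from the shop floor
new_reading = X_test.iloc[[0]]

# Scale with the already-fitted scaler and estimate the probability of failure
new_reading_scaled = pd.DataFrame(scaler.transform(new_reading), columns=X.columns)
failure_probability = model.predict_proba(new_reading_scaled)[0, 1]
print(f"Estimated probability of machine failure: {failure_probability:.2f}")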