Predicting Industrial Machine Downtime#

A DataCamp challenge, March 2025

Classification

The project#

A manufacturer of high-precision metal components used in aerospace, automotive, and medical device applications operates three machines on its shop floor, each producing components of different sizes. Minimizing the downtime of these machines is vital for meeting production deadlines.

The aim is to take a data-driven approach to predicting machine downtime, so that proactive maintenance can be planned instead of reacting to machine failures. To support this, the company has been collecting operational data for over a year, together with a record of whether each machine was down at the time of each reading.

Develop a predictive model that could be combined with real-time operational data to detect likely machine failure.

The data#

The company has stored the machine operating data in a single table. Each row in the table represents the operational data for a single machine on a given day:

  • "Date" - the date the reading was taken on.

  • "Machine_ID" - the unique identifier of the machine being read.

  • "Assembly_Line_No" - the unique identifier of the assembly line the machine is located on.

  • "Hydraulic_Pressure(bar)", "Coolant_Pressure(bar)", and "Air_System_Pressure(bar)" - pressure measurements at different points in the machine.

  • "Coolant_Temperature", "Hydraulic_Oil_Temperature", and "Spindle_Bearing_Temperature" - temperature measurements (in Celsius) at different points in the machine.

  • "Spindle_Vibration", "Tool_Vibration", and "Spindle_Speed(RPM)" - vibration (measured in micrometers) and rotational speed measurements for the spindle and tool.

  • "Voltage(volts)" - the voltage supplied to the machine.

  • "Torque(Nm)" - the torque being generated by the machine.

  • "Cutting(KN)" - the cutting force of the tool.

  • "Downtime" - an indicator of whether the machine was down or not on the given day.

# Import packages
import calendar

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Read the data
downtime = pd.read_csv(
    "data/machine_downtime.csv",
    parse_dates=["Date"],
    date_format="%d-%m-%Y",
    index_col="Date",
)
downtime
Machine_ID Assembly_Line_No Hydraulic_Pressure(bar) Coolant_Pressure(bar) Air_System_Pressure(bar) Coolant_Temperature Hydraulic_Oil_Temperature Spindle_Bearing_Temperature Spindle_Vibration Tool_Vibration Spindle_Speed(RPM) Voltage(volts) Torque(Nm) Cutting(kN) Downtime
Date
2021-12-31 Makino-L1-Unit1-2013 Shopfloor-L1 71.040000 6.933725 6.284965 25.6 46.0 33.4 1.291 26.492 25892.0 335.0 24.055326 3.58 Machine_Failure
2021-12-31 Makino-L1-Unit1-2013 Shopfloor-L1 125.330000 4.936892 6.196733 35.3 47.4 34.6 1.382 25.274 19856.0 368.0 14.202890 2.68 Machine_Failure
2021-12-31 Makino-L3-Unit1-2015 Shopfloor-L3 71.120000 6.839413 6.655448 13.1 40.7 33.0 1.319 30.608 19851.0 325.0 24.049267 3.55 Machine_Failure
2022-05-31 Makino-L2-Unit1-2015 Shopfloor-L2 139.340000 4.574382 6.560394 24.4 44.2 40.6 0.618 30.791 18461.0 360.0 25.860029 3.55 Machine_Failure
2022-03-31 Makino-L1-Unit1-2013 Shopfloor-L1 60.510000 6.893182 6.141238 4.1 47.3 31.4 0.983 25.516 26526.0 354.0 25.515874 3.55 Machine_Failure
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2022-02-01 Makino-L1-Unit1-2013 Shopfloor-L1 112.715506 5.220885 6.196610 22.3 48.8 37.2 0.910 20.282 20974.0 282.0 22.761610 2.72 No_Machine_Failure
2022-02-01 Makino-L1-Unit1-2013 Shopfloor-L1 103.086653 5.211886 7.074653 11.9 48.3 31.5 1.106 34.708 20951.0 319.0 22.786597 2.94 No_Machine_Failure
2022-02-01 Makino-L2-Unit1-2015 Shopfloor-L2 118.643165 5.212991 6.530049 4.5 49.9 36.2 0.288 16.828 20958.0 335.0 22.778987 NaN No_Machine_Failure
2022-02-01 Makino-L3-Unit1-2015 Shopfloor-L3 145.855859 5.207777 6.402655 12.2 44.5 32.1 0.995 26.498 20935.0 376.0 22.804012 2.79 No_Machine_Failure
2022-02-01 Makino-L2-Unit1-2015 Shopfloor-L2 96.690000 5.936610 7.109355 29.8 53.2 36.2 0.840 31.580 23576.0 385.0 24.409551 3.55 Machine_Failure

2500 rows × 15 columns

Data validation#

Check data quality#

# Get dataframe info
downtime.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2500 entries, 2021-12-31 to 2022-02-01
Data columns (total 15 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Machine_ID                   2500 non-null   object 
 1   Assembly_Line_No             2500 non-null   object 
 2   Hydraulic_Pressure(bar)      2490 non-null   float64
 3   Coolant_Pressure(bar)        2481 non-null   float64
 4   Air_System_Pressure(bar)     2483 non-null   float64
 5   Coolant_Temperature          2488 non-null   float64
 6   Hydraulic_Oil_Temperature    2484 non-null   float64
 7   Spindle_Bearing_Temperature  2493 non-null   float64
 8   Spindle_Vibration            2489 non-null   float64
 9   Tool_Vibration               2489 non-null   float64
 10  Spindle_Speed(RPM)           2494 non-null   float64
 11  Voltage(volts)               2494 non-null   float64
 12  Torque(Nm)                   2479 non-null   float64
 13  Cutting(kN)                  2493 non-null   float64
 14  Downtime                     2500 non-null   object 
dtypes: float64(12), object(3)
memory usage: 312.5+ KB

Most columns have a handful of missing values. Since only 119 of the 2,500 rows are affected (about 5%), I will drop the records with missing values.

# Drop records with missing values
downtime.dropna(inplace=True)

Check for duplicated rows.

# Check duplicated rows
downtime.duplicated().sum()
0

There are no duplicated records.

Check value ranges#

Categorical columns#

Based on the dataframe overview, it appears that Machine_ID and Assembly_Line_No encode the same information. If that's the case, one of these columns is redundant.
Let's check.

# Plot
fig, ax = plt.subplots(1, 2, figsize=(8, 3))
downtime["Machine_ID"].value_counts().plot(ax=ax[0], kind="bar")
downtime["Assembly_Line_No"].value_counts().plot(ax=ax[1], kind="bar")
for i in range(2):
    ax[i].set_ylabel("Counts")
    ax[i].tick_params(axis="x", labelsize=10, rotation=45)
sns.despine()
plt.show()
../_images/f2194cfe406b224bb7d3f6355ccb677b0949c1a4f36314da3084c7208557df2e.png

The counts match. Let's confirm the one-to-one mapping by encoding each identifier as a number and computing the correlation coefficient between the two columns.

# Relate identifiers with numbers to encode machines and lines
machine_encoding = {"Makino-L1-Unit1-2013": 1, "Makino-L2-Unit1-2015": 2, "Makino-L3-Unit1-2015": 3}
line_encoding = {"Shopfloor-L1": 1, "Shopfloor-L2": 2, "Shopfloor-L3": 3}

# Replace machine ID names with numeric codes
downtime["Machine_ID"] = downtime["Machine_ID"].map(machine_encoding)

# Replace assembly line names with numeric codes
downtime["Assembly_Line_No"] = downtime["Assembly_Line_No"].map(line_encoding)

# Get correlation between these columns
print(f"Correlation -> {downtime['Machine_ID'].corr(downtime['Assembly_Line_No'])}")
Correlation -> 0.9999999999999999
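As an additional sanity check on the encoded columns (a quick sketch, run before dropping anything), a cross-tabulation should show each machine paired with exactly one assembly line:

# Cross-tabulate the encoded identifiers; a purely diagonal table
# confirms the one-to-one mapping between machines and assembly lines
print(pd.crosstab(downtime["Machine_ID"], downtime["Assembly_Line_No"]))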

They are indeed identical, so we can drop one of the columns, Assembly_Line_No for instance.

# Drop column
downtime.drop("Assembly_Line_No", axis=1, inplace=True)

Let’s check how balanced the target variable is:

# Plot
fig, ax = plt.subplots(figsize=(5, 3))
downtime["Downtime"].value_counts().plot(ax=ax, kind="bar")
ax.tick_params(axis="x", labelsize=10, rotation=45)
ax.set_ylabel("Counts")
sns.despine()
plt.show()
../_images/f462539947cbe59cceb51fb46a0ca4c9e075c8b315937f127b38504d296f904e.png

The classes are quite balanced. I will replace "Machine_Failure" with the numeric value 1 and "No_Machine_Failure" with 0.

# Replace target by numerical values
downtime["Downtime"] = downtime["Downtime"].apply(
    lambda x: 1 if x == "Machine_Failure" else 0
)

# Get dataframe info
downtime.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2381 entries, 2021-12-31 to 2022-02-01
Data columns (total 14 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Machine_ID                   2381 non-null   int64  
 1   Hydraulic_Pressure(bar)      2381 non-null   float64
 2   Coolant_Pressure(bar)        2381 non-null   float64
 3   Air_System_Pressure(bar)     2381 non-null   float64
 4   Coolant_Temperature          2381 non-null   float64
 5   Hydraulic_Oil_Temperature    2381 non-null   float64
 6   Spindle_Bearing_Temperature  2381 non-null   float64
 7   Spindle_Vibration            2381 non-null   float64
 8   Tool_Vibration               2381 non-null   float64
 9   Spindle_Speed(RPM)           2381 non-null   float64
 10  Voltage(volts)               2381 non-null   float64
 11  Torque(Nm)                   2381 non-null   float64
 12  Cutting(kN)                  2381 non-null   float64
 13  Downtime                     2381 non-null   int64  
dtypes: float64(12), int64(2)
memory usage: 279.0 KB

Numerical columns#

Apart from the target and Machine_ID, all features are of numerical float type. I will plot their value ranges to get a sense of their scales.

# Define feature list
features = downtime.select_dtypes(include="float64").columns

# Plot
rows = 3
cols = 4
fig, ax = plt.subplots(rows, cols, figsize=(11, 11))
i, j = 0, 0
for n, feature in enumerate(features):
    downtime[feature].plot(ax=ax[i, j], kind="box")
    j += 1  # Next col
    if (n + 1) % cols == 0:  # Next row, reset col
        i += 1
        j = 0

plt.show()
../_images/84e34968d51381852c4fd1195715e7355b7a8ebaba344f8b52f546d68a1ad90b.png
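A numeric summary complements the box plots (a small sketch reusing the same feature list):

# Summary statistics of the numerical features
print(downtime[features].describe().T[["mean", "std", "min", "max"]])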

Sort dates#

I will sort the dataframe by date (the index) and print the first and last recorded dates.

# Sort dataframe by date, which is the index
downtime.sort_index(inplace=True)

# Print
print(f"Data readings starting at: {downtime.index[0].strftime('%Y-%m-%d')}")
print(f"Data readings ending at: {downtime.index[-1].strftime('%Y-%m-%d')}")
Data readings starting at: 2021-11-24
Data readings ending at: 2022-06-19

Exploratory data analysis#

Downtime ratios#

By machine#

Which machine has the highest readings of machine downtime?

# Compare downtime metrics between machines
downtime.groupby("Machine_ID")["Downtime"].agg(["count", "sum", "mean"]).rename(
    columns={"sum": "downtimes", "mean": "downtime_ratio"}
)
count downtimes downtime_ratio
Machine_ID
1 830 440 0.530120
2 763 382 0.500655
3 788 409 0.519036

Machine 1 has the highest downtime ratio (it also has the most recorded readings).

Monthly#

# Group by month
downtime.groupby(downtime.index.month)["Downtime"].agg(["count", "sum", "mean"])
count sum mean
Date
1 170 84 0.494118
2 554 282 0.509025
3 1005 526 0.523383
4 512 274 0.535156
5 110 48 0.436364
6 7 3 0.428571
11 1 1 1.000000
12 22 13 0.590909

Months 11 (November) and 6 (June) have very few records, and 12 (December) only a few more, so their ratios should not be considered representative.

By machine and monthly#

# Group by month and by machine
downtime.pivot_table(
    index=downtime.index.month,
    columns="Machine_ID",
    values="Downtime",
    aggfunc=["count", "sum", "mean"],
)
count sum mean
Machine_ID 1 2 3 1 2 3 1 2 3
Date
1 61.0 54.0 55.0 33.0 22.0 29.0 0.540984 0.407407 0.527273
2 182.0 165.0 207.0 85.0 85.0 112.0 0.467033 0.515152 0.541063
3 349.0 323.0 333.0 191.0 166.0 169.0 0.547278 0.513932 0.507508
4 188.0 171.0 153.0 107.0 85.0 82.0 0.569149 0.497076 0.535948
5 39.0 40.0 31.0 16.0 18.0 14.0 0.410256 0.450000 0.451613
6 1.0 4.0 2.0 0.0 3.0 0.0 0.000000 0.750000 0.000000
11 NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.000000
12 10.0 6.0 6.0 8.0 3.0 2.0 0.800000 0.500000 0.333333
# Plot by month
fig, ax = plt.subplots(figsize=(6, 5))
sns.pointplot(
    ax=ax,
    x=downtime.index.month,
    y="Downtime",
    data=downtime,
    hue="Machine_ID",
    dodge=0.2,
)
ax.set_xlabel("Month (1: January - 12: December)")
ax.set_ylabel("Downtime Ratio")
sns.despine()
plt.show()
../_images/192c19a714ddc3cac0fb79afa21799ce6529f24bc86b3cfd20e6931ee53a38f7.png

As mentioned before, values in months 6, 11 and 12 are not representative.

By machine and day of the week#

# Plot by day of the week
fig, ax = plt.subplots(figsize=(6, 5))
sns.pointplot(
    ax=ax,
    x=downtime.index.dayofweek,
    y="Downtime",
    data=downtime,
    hue="Machine_ID",
    dodge=0.2,
)
ax.set_xlabel("")
ax.set_ylabel("Downtime Ratio")
ax.set_xticks(range(7), labels=list(calendar.day_abbr))
sns.despine()
plt.show()
../_images/219d8ad4c51124fc67995f6bb7a527ec39fd10ba563f0cfb1bbf4fb77926c330.png

The confidence intervals overlap, so there is no significant difference between days of the week.

Overall, the dates in the index of this dataframe do not appear to have any meaningful predictive power regarding machine downtime.
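To back this impression up, one could test whether the downtime frequency is independent of the calendar fields. The following is only a sketch, assuming scipy is available (it is not imported above); the resulting p-values would need to be inspected on the cleaned data.

from scipy.stats import chi2_contingency

# Chi-square test of independence between calendar fields and the downtime label
for name, values in [("day of week", downtime.index.dayofweek), ("month", downtime.index.month)]:
    contingency = pd.crosstab(values, downtime["Downtime"])
    chi2, p_value, dof, _ = chi2_contingency(contingency)
    print(f"{name}: chi2 = {chi2:.2f}, p-value = {p_value:.3f}")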

Correlations of features#

Let's calculate the correlation coefficients between the features.

# Get pearson correlation matrix
corr = downtime.corr()

# Create a mask to only show half of the matrix
mask = np.triu(np.ones_like(corr, dtype=bool))

# Create a diverging color palette between two HUSL colors.
cmap = sns.diverging_palette(h_neg=10, h_pos=240, as_cmap=True)

# Plot the heatmap
fig, ax = plt.subplots(figsize=(8, 7))

sns.heatmap(
    corr, ax=ax, mask=mask, center=0, cmap=cmap, linewidths=1, annot=True, fmt=".2f"
)

plt.show()
../_images/7dbe3e66155f38390720377ccb80f00e172b14aadca2af1c04fb65459e143c5d.png

According to their relatively high correlation with the target, the following features appear most connected to machine downtime:

  • Hydraulic Pressure (negatively)

  • Torque being generated by the machine (negatively)

  • Cutting force of the tool (positively)

Let's plot them!

# Define those correlated column names in a list
corr_columns = ["Hydraulic_Pressure(bar)", "Torque(Nm)", "Cutting(kN)"]

# Subset dataframe to rows with downtime = 1 and rows with downtime = 0
downtime_yes = downtime.loc[downtime["Downtime"] == 1, :]
downtime_no = downtime.loc[downtime["Downtime"] == 0, :]

# Plot the value of the feature in each downtime case
for column in corr_columns:
    # Plot
    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    downtime_no[column].plot(ax=ax[0], label="0")
    downtime_yes[column].plot(ax=ax[0], label="1")
    sns.boxplot(ax=ax[1], x="Downtime", y=column, data=downtime, hue="Downtime")
    ax[0].set_title(column)
    ax[0].set_xlabel("")
    ax[0].legend(title="Downtime")
    sns.despine()
    plt.show()
../_images/9494df6b3c567be1e9da388d81fb83c6c103f6d3b6be2636853785f7a562fd45.png ../_images/6e9e57c6b0e9f267f8de1bad00045d65715066ceda790c8a4408c9ad1a2ac3dd.png ../_images/e3112dfed9ddc537c25e3c8bd82d7643e114e08c0233c4a50e1ca95006463a81.png

Indeed, these plots show the correlations and their signs: downtime cases tend to have lower hydraulic pressure and torque, and higher cutting force.

Modelling a classifier#

Logistic Regression#

I will start with a Logistic Regression model.

# Define function for classification
def lr_classifier(X, y):

    # Split dataset into training and test set, and stratify
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )

    # Scale
    scaler = StandardScaler()
    X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
    X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

    # Instantiate model
    model = LogisticRegression()

    # Fit model to the training set
    model.fit(X_train_scaled, y_train)

    # Predict test set values
    y_pred = model.predict(X_test_scaled)

    # Print results
    print(f"{accuracy_score(y_test, y_pred):.2f} <- Test-set accuracy ")
    print(
        f"{accuracy_score(y_train, model.predict(X_train_scaled)):.2f} <- Train-set accuracy "
    )
    print(classification_report(y_test, y_pred))

    return model


# Define features and target
y = downtime["Downtime"]
X = downtime.drop("Downtime", axis=1)

# Classify
model = lr_classifier(X, y)
0.85 <- Test-set accuracy 
0.86 <- Train-set accuracy 
              precision    recall  f1-score   support

           0       0.84      0.85      0.84       345
           1       0.86      0.84      0.85       370

    accuracy                           0.85       715
   macro avg       0.85      0.85      0.85       715
weighted avg       0.85      0.85      0.85       715
# Plot the absolute values of the coefficients
plt.bar(X.columns, np.abs(model.coef_[0]))
plt.title("Logistic Regression model coefficients", fontsize=11)
plt.xticks(rotation=90)
sns.despine()
plt.show()
../_images/1801baab6e5ecd7083daf183d5c7499f52338f817fbbc5d051b4e34418a4623c.png

As expected, the coefficient magnitudes show that the three features identified earlier as most correlated with the target carry the most weight in the model.
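For a quick ranking of these drivers (a small sketch reusing the model and X objects from the cell above):

# Rank features by the absolute value of their logistic regression coefficients
coefficient_ranking = pd.Series(np.abs(model.coef_[0]), index=X.columns)
print(coefficient_ranking.sort_values(ascending=False).head())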

Modeling machines separately#

Let’s see results if we consider machines 1, 2 and 3 separately.

# Subset dataset for each machine
machine_1 = downtime.loc[downtime["Machine_ID"] == 1, :]
machine_2 = downtime.loc[downtime["Machine_ID"] == 2, :]
machine_3 = downtime.loc[downtime["Machine_ID"] == 3, :]

# Create list of those filtered datasets
machines = [machine_1, machine_2, machine_3]

# Create classifiers separately for each machine
for i, machine in enumerate(machines):
    # Define features and target
    y = machine["Downtime"]
    X = machine.drop("Downtime", axis=1)

    # Classify
    print(f"\nMachine {i + 1}:")
    model = lr_classifier(X, y)
Machine 1:
0.87 <- Test-set accuracy 
0.87 <- Train-set accuracy 
              precision    recall  f1-score   support

           0       0.88      0.84      0.86       117
           1       0.86      0.90      0.88       132

    accuracy                           0.87       249
   macro avg       0.87      0.87      0.87       249
weighted avg       0.87      0.87      0.87       249


Machine 2:
0.85 <- Test-set accuracy 
0.84 <- Train-set accuracy 
              precision    recall  f1-score   support

           0       0.86      0.84      0.85       114
           1       0.85      0.86      0.85       115

    accuracy                           0.85       229
   macro avg       0.85      0.85      0.85       229
weighted avg       0.85      0.85      0.85       229


Machine 3:
0.85 <- Test-set accuracy 
0.85 <- Train-set accuracy 
              precision    recall  f1-score   support

           0       0.90      0.78      0.84       114
           1       0.82      0.92      0.87       123

    accuracy                           0.85       237
   macro avg       0.86      0.85      0.85       237
weighted avg       0.86      0.85      0.85       237

Results are slightly better for machine 1 (0.87 test-set accuracy versus 0.85 for machines 2 and 3).

Random Forest Classifier#

Let's see if a Random Forest model can improve the overall results.

# Define features and target
y = downtime["Downtime"]
X = downtime.drop("Downtime", axis=1)

# Split dataset into training and test set, and stratify
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scale
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

# Instantiate model
model = RandomForestClassifier(n_estimators=100)

# Fit model to the training set
model.fit(X_train_scaled, y_train)

# Predict test set values
y_pred = model.predict(X_test_scaled)

# Print results
print(f"{accuracy_score(y_test, y_pred):.2f} <- Test-set accuracy ")
print(
    f"{accuracy_score(y_train, model.predict(X_train_scaled)):.2f} <- Train-set accuracy "
)
print(classification_report(y_test, y_pred))
0.99 <- Test-set accuracy 
1.00 <- Train-set accuracy 
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       345
           1       0.99      0.99      0.99       370

    accuracy                           0.99       715
   macro avg       0.99      0.99      0.99       715
weighted avg       0.99      0.99      0.99       715

Outstanding performance! These results show the Random Forest classifier's ability to capture complex, non-linear relationships in the data.
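With a perfect train-set score, it is worth checking that this is not an artifact of a single lucky split. The following is a minimal cross-validation sketch, assuming the scaler and the classifier are wrapped in a pipeline so the scaler is re-fit on each training fold (the random_state is added here only for reproducibility):

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# 5-fold cross-validation of a scaler + Random Forest pipeline on the full dataset
pipeline = make_pipeline(
    StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=42)
)
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"Cross-validated accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")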

# Plot feature importances
plt.bar(X.columns, model.feature_importances_)
plt.title("Random Forest feature importances", fontsize=11)
plt.xticks(rotation=90)
sns.despine()
plt.show()
../_images/3625c3810e25eca83dc7094509600029934679ad32e096d97bda9143b18ada1d.png

Conclusion#

Initially, I considered incorporating the time series data for predictive purposes. However, the exploratory analysis revealed no patterns associated with the dates, indicating that this was simply a feature-based classification problem.

While the Logistic Regression classifier performed reasonably well, the non-linearities in the dataset made the Random Forest classifier a more suitable model, yielding impressive results with minimal tuning.
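Finally, to illustrate how the chosen model could be combined with real-time operational data, the sketch below scores a single incoming reading. It assumes the objects from the Random Forest cell (model, scaler, X, X_test) are still in scope and simply reuses a test-set row as a stand-in for a live reading.

# Stand-in for a live sensor reading arriving from the shop floor
new_reading = X_test.iloc[[0]]

# Scale with the already-fitted scaler and estimate the probability of failure
new_reading_scaled = pd.DataFrame(scaler.transform(new_reading), columns=X.columns)
failure_probability = model.predict_proba(new_reading_scaled)[0, 1]
print(f"Estimated probability of machine failure: {failure_probability:.2f}")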