Traffic Volume

Traffic Volume#

Aug, 2025

Linear Regression

Background#

Many local businesses in my town close in mid-August to coincide with their customers’ holidays. This inspired the idea of creating a predictive model to determine the optimal time for such closures based on seasonal population changes. My search for relevant open data on public attendance only yielded traffic information. Though not ideal, I decided this data would be a sufficient proxy for the modeling exercise.

The data#

I sourced weekly traffic volume data for 2021-2025 from Gipuzkoa Irekia and the corresponding holiday calendars from Opendata Euskadi.

Using a Python script, I processed traffic CSV files to extract daily data for a single measurement point (number 9259) at coordinates (43.072638, -2.330442). For this location, I simply aggregated the vehicle volume, without breaking it down by more granular details like hourly data, direction-lane, or vehicle type (light-heavy).

Regarding the calendars, I took bank holidays for Euskadi and Gipuzkoa, not considering local festivities as an initial approach.

	volume
Fecha
2021-01-04	6929
2021-01-05	7398
2021-01-06	3466
2021-01-07	9676
2021-01-08	9781
...	...
2025-07-23	10831
2025-07-24	10627
2025-07-25	4769
2025-07-26	4943
2025-07-27	4490

1639 rows × 1 columns

Let’s check missing data (days without measurement).

Missing 27 out of 1666 days(2%)

Missing days -> DatetimeIndex(['2022-02-04', '2022-02-05', '2022-02-06', '2024-03-13',
               '2024-03-14', '2024-03-15', '2024-03-16', '2024-03-17',
               '2024-04-29', '2024-04-30', '2024-05-01', '2024-05-02',
               '2024-05-03', '2024-05-04', '2024-05-05', '2024-05-06',
               '2024-05-07', '2024-11-18', '2024-11-19', '2024-11-20',
               '2024-11-21', '2024-11-22', '2024-11-23', '2024-11-24',
               '2024-12-16', '2024-12-17', '2025-04-09'],
              dtype='datetime64[ns]', freq=None)

I will reindex the dataframe to cover the full daily calendar from 2021 to 2025, leaving missing values in place for exploratory analysis, since they do not affect the average calculations.

	volume
2021-01-01	NaN
2021-01-02	NaN
2021-01-03	NaN
2021-01-04	6929.0
2021-01-05	7398.0
...	...
2025-12-27	NaN
2025-12-28	NaN
2025-12-29	NaN
2025-12-30	NaN
2025-12-31	NaN

1826 rows × 1 columns

Exploratory analysis#

Trend#

../_images/8d43af65b38e6a989c7d3bab8a88c79190c5d4f75a876efb0bc2819f8f059f51.png

As expected, the data shows a clear 12-month seasonality. While yearly averages appear fairly stationary, the monthly moving average indicates a slight upward trend in recent months.

The augmented Dickey-Fuller test confirms that the monthly series is not stationary.

Monthly trend stationary? False

By month#

../_images/46de85f8ce66fe92f79260a7bdd99b60a3e1b070b8412ab0392839cf5e53b7f8.png

../_images/be8ebb6c07d6791402cd0e28fc117ae003d79ee4a15e872934799fbc8a0858e4.png

As expected, traffic volume drops in August. Across the years, there appears to be a post-pandemic effect in early 2021, with slightly lower traffic levels.

By day of the week#

../_images/5f9486d5367598452338144c5b66a2223658f79c67de11b6ced9fdff56ddc9c3.png

../_images/20ed40fdc51f5c6d5927c2120a434f8364a08ef856dc7914a0daf6d68b9fc3ce.png

A noticeable difference also appears in 2021, likely due to the pandemic.

Modelling#

Two approaches are possible:

Simple Classical Model (e.g., Linear Regression with Features)
Time Series-Focused Model (e.g., Seasonal ARIMA with External Factors)

(1) treats the problem as a standard regression task. As I know that the traffic volume depends heavily on the day of the week, month of the year, holidays and other forseeable half-holiday or special days, I can prepare this information as features. Traffic patterns are often explainable by these calendar-based factors, and linear regression can capture linear relationships efficiently.

(2) emphasizes the sequential, autocorrelated nature of the data (e.g., today’s traffic influences tomorrow’s). It has these components: AR (AutoRegressive): Captures dependence on past values; I (Integrated): Differences the data to make it stationary (e.g., remove trends); MA (Moving Average): Models error terms from past forecasts; Seasonal terms: Handle repeating patterns like weekly (period=7 for daily data) or yearly (period=365) cycles. Traffic data is inherently a time series with strong autocorrelation (e.g., high traffic on Fridays persists week to week) and seasonality, which SARIMA(X) is designed to model directly without manual lag engineering.

ARIMA-based models often outperform simple linear regression for time series like traffic volume, especially when seasonality is prominent, because linear regression assumes independence between observations—which is violated in time series data.

However, as the data lacks strong trends and has minimal autocorrelation, linear regression is much simpler and could be just enough. It also more interpretable.

Features#

I will just build these features, which clearly affect the traffic volume:

Month of the year, January to December: (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
Day of the week, Monday to Sunday: (0, 1, 2, 3, 4, 5, 6)
Holidays
Half-holidays

	volume	dayofweek	month
2021-01-04	6929.0	0	1
2021-01-05	7398.0	1	1
2021-01-06	3466.0	2	1
2021-01-07	9676.0	3	1
2021-01-08	9781.0	4	1
2021-01-09	3445.0	5	1
2021-01-10	2334.0	6	1

I will get holidays from the official calendar data.

	holiday
date
2021-01-01	1
2021-01-06	1
2021-03-19	1
2021-04-01	1
2021-04-02	1
...	...
2025-11-01	1
2025-12-06	1
2025-12-08	1
2025-12-25	1
2025-07-31	1

65 rows × 1 columns

	volume	dayofweek	month	holiday
2021-01-01	NaN	4	1	1
2021-01-02	NaN	5	1	0
2021-01-03	NaN	6	1	0
2021-01-04	6929.0	0	1	0
2021-01-05	7398.0	1	1	0

Half-holidays considered are days that are not weekend days nor holidays, and:

previous day is holiday and next day is weekend
previous day is weekend and next day is holiday
previous day is holiday and next day is holiday
is 24th December
is 31th December

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1826 entries, 2021-01-01 to 2025-12-31
Freq: D
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   volume        1639 non-null   float64
 1   dayofweek     1826 non-null   int32  
 2   month         1826 non-null   int32  
 3   holiday       1826 non-null   int32  
 4   half_holiday  1826 non-null   int32  
dtypes: float64(1), int32(4)
memory usage: 57.1 KB
None

	volume	dayofweek	month	holiday	half_holiday
2021-01-01	NaN	4	1	1	0
2021-01-02	NaN	5	1	0	0
2021-01-03	NaN	6	1	0	0
2021-01-04	6929.0	0	1	0	0
2021-01-05	7398.0	1	1	0	0
...	...	...	...	...	...
2025-12-27	NaN	5	12	0	0
2025-12-28	NaN	6	12	0	0
2025-12-29	NaN	0	12	0	0
2025-12-30	NaN	1	12	0	0
2025-12-31	NaN	2	12	0	1

1826 rows × 5 columns

Linear Regression#

Data from 2021, 2022, and 2023 will be used to train the model. The year 2024 will be reserved for testing its performance. Finally, the (incomplete) 2025 data will be used for validation.

Test 2024#

Model score:
	R^2 -> 0.91
	RMSE -> 846

The model’s performance appears reasonably good, with an average error of fewer than 1,000 vehicles—roughly under 10% of the daily traffic volume (about 10,000 vehicles).

../_images/13d2fd19f127fa149ed8f99738d2dcc1cfe2fb9811858537a45cadd1234805c8.png

As expected, holidays, partial holidays, weekends, and the month of August have a greater impact than the other variables.

Once we have the model, we can make the prediction through the whole 2024 calendar.

../_images/d5a410f4e9442c4ad6c81ed2a9765e290894b8575211dffd122bfffcfac0a091.png

../_images/27e367ed2dbec1e449aa46931cfc0dc8a5ca9580cb7d3a499d13f69fcaee2b27.png

../_images/32e3923776e09c208c6e0b4fed11d13ed6bed427237c574d71beece4e09b2a64.png

../_images/36031d936a6367e4dd5f08237862a5c026112a359d17e7ebe12ae63beea06c29.png

../_images/bb1397290ead5cbc9e11c754d525098fdccc039d1576a0372f17eda1a4eaa3ec.png

../_images/c8aa7da75b2cec08da345fee0a85b40ed29f01d7e72e16f86ee60f09772ef9bf.png

../_images/1479bf41235ba1be7b46c116380b8b7b77839c99abbb43e30f330125e6a471bb.png

../_images/7bd152a70623bf0424117d2171256c4ad8736d9a3676319276f6c26db64a5f23.png

../_images/886b45bd6e664572cc20d93ec8b974baae555410a18e0e32d3fc5ed697c3fc12.png

../_images/0d7e77c9148ee7d925f449cfb0600822c48c18999db8787fbe847a7d1d707f79.png

../_images/56aafae9a9dbecb8a8c786833b5bba8241e98b850d053ea27407ec1e1f2d8f1f.png

../_images/02e9c62b51b5349719d90698ef1561ccd849e29005cdf2fba4fdcbea7f8b8ce7.png

The model predicts fairly well considering its simplicity.

The prediction model struggles in August, likely due to the limitations of a linear model in capturing such pronounced seasonal pattern changes.

In general, for most days the predictions fall below the actual values, which may indicate two things:

The 2021 training data was still affected by the pandemic, leading to lower traffic levels.
There is a slight upward trend that the features fail to capture.

To account for the recent upward trend, a time-related feature should be added—an indicator that informs the model of this effect, which a time-based model would inherently capture.

Prediction 2025#

I will retrain the Logistic Regression model, excluding 2021 and including 2024, to forecast the entire 2025 calendar year. At this stage of the project, only data through August is available for performance evaluation.

../_images/36e3a4596864161387d9c5eb0b150a1f15e5c0d0914f15be7b9d28f8bd6ff5bc.png

../_images/327a0fa69f112916b871e6e502114eecae4c2ef26cce661e5c74e36142121af5.png

../_images/15a40f72d8e3f8ee956f0f59740da7e7523e7223ffde3b0796986a5dd2f4ab64.png

../_images/d7c05b7ed427a1d40c6659583e262e981f8c375523ed79af700492c21470f3db.png

../_images/a6ff597eed018dbc421cd88575821ce2eea8e68525646f0b1e17c237c8a5697f.png

../_images/0597afe0212db9216246e6a4d6df84fb1bfdc3e42a8699853e13e55a18a85f4f.png

../_images/7b005a47581f45d1c8fc856d72875ee5dd3157a2ef304f1bc236eae025620937.png

../_images/e4b84800efb81eaa67668cc0245f7baa6eec026ca6de7db4a9e36206fb26f9df.png

../_images/d88ea36e5fcb562a1a5b6eeee39384f983b9483af2de2da7a41e964387ec777b.png

../_images/0f1081e7a152576bc40de55faa6e745a8e9b1027688add7cb4ad9c5b3a0bb47e.png

../_images/ab972af6161ab5c30ad341d29156d713463ec955d07b158dd49559154a15221d.png

../_images/451cde011b251bbae12e2c0748254fc7436e2ed7845077dec6697267f50309a5.png

The predictions still tend to fall below the actual values, suggesting that a time-related feature is necessary.

Conclusion#

In this project, a simple model was developed to predict the outcome of a seasonal attendance problem—specifically, traffic volume at a given location. Although the possibility of using a time series–focused model was considered, a straightforward linear regression was chosen for its simplicity and interpretability, focusing on the features most strongly affecting the target. Because the time series is not stationary, it appears necessary to include a time-related feature in the model. Overall, the results were reasonable, but the model struggles to predict traffic in August, when patterns deviate sharply from the rest of the year.