Predicting Daily Road Collisions with Weather, Calendars & Machine Learning

An end-to-end analytics and modelling project that explores what drives daily road collision risk. It compares interpretable linear models with a TensorFlow deep neural network for forecasting collision frequency from calendar and weather signals.

Problem

Urban safety teams, emergency services and transport planners need reliable and explainable indicators to allocate resources, schedule enforcement and communicate risk. This project was designed to answer three practical questions:

Which factors correlate most strongly with daily road collision frequency?
Can a simple, transparent model forecast risk accurately enough for planning?
Do non-linear machine learning models add meaningful value over linear baselines?

Data

Time span: mid-2012 to 2019 (2012 partial year; COVID-era excluded)
Target variable: total daily collision count (normalised for modelling)
Calendar features: year, month, day-of-week, day-of-year, global day index
Weather features: temperature (avg/min/max), dew point, precipitation, visibility, wind speed, max wind speed, fog indicator, sea-level pressure

Data were joined at daily resolution and explored across years to assess stability, distributions and potential regime changes.

Feature Engineering & Preparation

One-hot encoding for months (12) and weekdays (7)
Creation of day-of-year (to overlay yearly patterns) and a global day index (to order days across the full series)
Normalisation of collision counts for cross-year comparability
80/20 train-test split with 20% of training held out for validation
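
The preparation steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the project's code: the helper names, the series start date, the scale-by-maximum normalisation and the seeded random shuffle are all hypothetical, since the write-up does not say how counts were normalised or how the split was drawn.

```python
from datetime import date
import random

# Hypothetical start date; the write-up only says the data begin in mid-2012.
SERIES_START = date(2012, 7, 1)


def calendar_features(d, start=SERIES_START):
    """Build the calendar features for one day (illustrative helper)."""
    month_onehot = [1 if m == d.month else 0 for m in range(1, 13)]    # 12 columns
    weekday_onehot = [1 if w == d.weekday() else 0 for w in range(7)]  # 7 columns, Monday first
    return {
        "year": d.year,
        "day_of_year": d.timetuple().tm_yday,  # 1..366, lines up the same date across years
        "global_day": (d - start).days,        # running index over the whole series
        "month_onehot": month_onehot,
        "weekday_onehot": weekday_onehot,
    }


def normalise_counts(counts):
    """Scale counts by their maximum (an assumed scheme, not confirmed by the source)."""
    peak = max(counts)
    return [c / peak for c in counts]


def split_rows(rows, seed=0):
    """80/20 train-test split, then hold out 20% of the training rows for validation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(0.2 * len(shuffled))
    test, train_full = shuffled[:n_test], shuffled[n_test:]
    n_val = round(0.2 * len(train_full))
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test
```

With this scheme the rows end up roughly 64% training, 16% validation and 20% test.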

Exploratory Findings

Weekly behaviour dominates: day of week shows the strongest correlation with collisions (r ≈ 0.53)
Weather effects are modest: temperature (r ≈ 0.23) and dew point (r ≈ 0.21) are weakly positive
Minimal impact: precipitation, fog, wind and visibility correlations are near zero
Year-to-year stability: collision distributions remain tightly banded across years, suggesting that routine and behaviour matter more than traditional seasonality
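
The r values above read like Pearson correlation coefficients, although the write-up does not name the statistic. Assuming that is what was used, the sketch below shows how such a coefficient is computed from paired daily values. The function name and the toy numbers are made up for illustration and are not taken from the project's data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up toy values for illustration only; not the project's data.
temperature = [2.0, 8.0, 15.0, 21.0, 27.0]
collisions = [0.40, 0.52, 0.47, 0.60, 0.58]
r = pearson_r(temperature, collisions)
```

One caveat: when a categorical variable such as day of week is coded as an integer, the resulting r depends on which day is assigned which number.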

Modelling Approach

Linear baselines (TensorFlow/Keras):
1. Day-only model → MAE (Mean Absolute Error) ≈ 0.144 (normalised)
2. Day + temperature + dew point → MAE ≈ 0.127
Deep Neural Network:
1. Architecture: Normalisation → Dense(48, ReLU) → Dense(48, ReLU) → Dense(1)
2. Inputs: year, temperature, dew point, one-hot months, one-hot weekdays
3. Training: Adam optimiser, MAE loss, 100 epochs
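
The architecture and training settings listed above could be expressed in Keras roughly as follows. This is a minimal sketch, not the project's actual code: the 22-column input width follows from the listed inputs (year, temperature, dew point, 12 month flags, 7 weekday flags), while the default Adam learning rate, the helper name build_dnn and the random placeholder arrays are assumptions made so the example runs on its own.

```python
import numpy as np
import tensorflow as tf

N_FEATURES = 22  # year + temperature + dew point + 12 month flags + 7 weekday flags


def build_dnn(train_features):
    """Normalisation -> Dense(48, ReLU) -> Dense(48, ReLU) -> Dense(1)."""
    normaliser = tf.keras.layers.Normalization(axis=-1)
    normaliser.adapt(train_features)  # learn per-feature mean and variance from training data
    model = tf.keras.Sequential([
        normaliser,
        tf.keras.layers.Dense(48, activation="relu"),
        tf.keras.layers.Dense(48, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    # The learning rate is not stated in the write-up; Keras' Adam default is assumed.
    model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mean_absolute_error")
    return model


# Random placeholder arrays standing in for the real feature table and target.
rng = np.random.default_rng(0)
x_train = rng.random((256, N_FEATURES)).astype("float32")
y_train = rng.random((256, 1)).astype("float32")

model = build_dnn(x_train)
history = model.fit(
    x_train, y_train,
    validation_split=0.2,  # hold out 20% of the training rows for validation
    epochs=100,
    verbose=0,
)
```

Keras takes the validation_split fraction from the end of the arrays before shuffling, so the row order of the training table affects which days land in the validation set.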

Results

DNN performance: MAE ≈ 0.123 on the held-out test set
Value of non-linearity: modest improvement over the linear baselines (MAE ≈ 0.123 vs ≈ 0.127 for the best linear model) by learning interactions (e.g., warm mid-week days vs weekends)
Interpretability preserved: predictions align with the exploratory findings, in which weekday traffic intensity dominates and weather fine-tunes risk

Example predictions showed Sundays in summer at consistently lower risk than mid-week days at similar temperatures, reinforcing the primacy of behavioural patterns.
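
To put the MAE figures quoted in this write-up side by side, the sketch below defines mean absolute error and the relative reduction between two models. The function and dictionary names are illustrative. The inputs are the rounded values reported here, and the write-up does not confirm that all three were measured on the same split, so the percentages are indicative only.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(baseline_mae, model_mae):
    """Fractional MAE reduction of a model relative to a baseline."""
    return (baseline_mae - model_mae) / baseline_mae


# Approximate MAE values as reported in this write-up (normalised units).
reported = {"day_only": 0.144, "day_temp_dew": 0.127, "dnn": 0.123}

gain_vs_day_only = relative_improvement(reported["day_only"], reported["dnn"])         # about 0.146
gain_vs_best_linear = relative_improvement(reported["day_temp_dew"], reported["dnn"])  # about 0.031
```

On these rounded figures the DNN reduces MAE by roughly 15% against the day-only model and roughly 3% against the best linear baseline.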

Practical Applications

Weekday-weighted staffing and enforcement planning for police and EMS
Calendar-aware public risk messaging during peak commute days
Policy evaluation by comparing interventions against predicted baselines
Foundations for insurance, fleet and city-safety risk scoring

Ethics & Limitations

Predictions are intended to support resource allocation, not to stigmatise communities
Uncertainty must be communicated clearly; human oversight remains essential
The dataset lacks exposure and context variables (traffic volume, holidays, events); richer data and time-aware validation are the most likely sources of future gains
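
As one possible reading of time-aware validation, the sketch below splits date-keyed rows chronologically so that the model is always evaluated on days later than those it was trained on. It is a hypothetical illustration, not something the project implemented: the function names, the row layout and the placeholder rows are all assumptions.

```python
from datetime import date, timedelta


def chronological_split(rows, test_fraction=0.2):
    """Split date-keyed rows so every test day falls after every training day."""
    ordered = sorted(rows, key=lambda row: row["date"])
    cut = len(ordered) - round(test_fraction * len(ordered))
    return ordered[:cut], ordered[cut:]


def expanding_window_folds(rows, n_folds=3):
    """Yield (train, validation) pairs in which training data always precedes validation."""
    ordered = sorted(rows, key=lambda row: row["date"])
    fold_size = len(ordered) // (n_folds + 1)
    for k in range(1, n_folds + 1):
        yield ordered[: k * fold_size], ordered[k * fold_size : (k + 1) * fold_size]


# Placeholder rows: ten consecutive days with dummy targets (not project data).
rows = [{"date": date(2019, 1, 1) + timedelta(days=i), "y": 0.5} for i in range(10)]
train, test = chronological_split(rows)
```

Unlike a shuffled split, this keeps future days out of the training set, which gives a more honest estimate of forecasting error when the series drifts over time.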