Predicting Daily Road Collisions with Weather, Calendars & Machine Learning
An end-to-end analytics and modelling project that explores what truly drives daily road collision risk, comparing interpretable linear models with a TensorFlow deep neural network to forecast collision frequency using calendar and weather signals.
Problem
Urban safety teams, emergency services and transport planners need reliable and explainable indicators to allocate resources, schedule enforcement and communicate risk. This project was designed to answer three practical questions:
- Which factors correlate most strongly with daily road collision frequency?
- Can a simple, transparent model forecast risk accurately enough for planning?
- Do non-linear machine learning models add meaningful value over linear baselines?
Data
- Time span: mid-2012 to 2019 (2012 partial year; COVID-era excluded)
- Target variable: total daily collision count (normalised for modelling)
- Calendar features: year, month, day-of-week, day-of-year, global day index
- Weather features: temperature (avg/min/max), dew point, precipitation, visibility, wind speed, max wind speed, fog indicator, sea-level pressure
Data were joined at daily resolution and explored across years to assess stability, distributions and potential regime changes.
Feature Engineering & Preparation
- One-hot encoding for months (12) and weekdays (7)
- Creation of day-of-year and global day index to stack yearly patterns
- Normalisation of collision counts for cross-year comparability
- 80/20 train-test split with 20% of training held out for validation
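The preparation steps above can be sketched in plain Python. This is a minimal illustration rather than the project's actual pipeline: the start date, the min-max normalisation, the shuffled split and all function names are assumptions, since the write-up does not specify them.

```python
import random
from datetime import date

# Hypothetical first day of the series; the write-up only says "mid-2012".
START = date(2012, 7, 1)

def calendar_features(day: date) -> dict:
    """Build the calendar features listed above for a single date."""
    return {
        "year": day.year,
        "day_of_year": day.timetuple().tm_yday,
        "global_day": (day - START).days,  # days elapsed since the series start
        "month_onehot": [1 if day.month == m else 0 for m in range(1, 13)],
        "weekday_onehot": [1 if day.weekday() == d else 0 for d in range(7)],  # Monday = 0
    }

def min_max_normalise(counts):
    """Scale counts to [0, 1]. One possible normalisation, assumed here."""
    lo, hi = min(counts), max(counts)
    return [(c - lo) / (hi - lo) for c in counts]

def split_indices(n, seed=0):
    """80/20 train-test split, then hold out 20% of the training rows for validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # shuffling is an assumption, not stated in the source
    n_test = round(0.2 * n)
    test, train_full = idx[:n_test], idx[n_test:]
    n_val = round(0.2 * len(train_full))
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, test

features = calendar_features(date(2015, 3, 4))  # a Wednesday
train, val, test = split_indices(1000)
```

With 1,000 rows this yields 640 training, 160 validation and 200 test rows, so the final model is fitted on 64% of the data.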
Exploratory Findings
- Weekly behaviour dominates: day of week shows the strongest correlation with collisions (r ≈ 0.53)
- Weather effects are modest: temperature (r ≈ 0.23) and dew point (r ≈ 0.21) are weakly positive
- Minimal impact: precipitation, fog, wind and visibility correlations are near zero
- Year-to-year stability: collision distributions remain tightly banded across years, suggesting that weekly routine and driver behaviour matter more than traditional seasonality
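For reference, the r values above are read here as Pearson correlation coefficients, which can be computed as in the sketch below. The toy values are invented purely for illustration and are not the project's data. The write-up does not state how day of week was encoded for its correlation, so that step is left out.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values, invented for illustration only.
temperature = [2.0, 8.0, 15.0, 21.0, 27.0, 18.0, 9.0]
collisions = [0.41, 0.52, 0.47, 0.60, 0.58, 0.49, 0.55]

r = pearson_r(temperature, collisions)
print(f"r = {r:.2f}")
```

Pearson's r only captures linear association, which is one reason a near-zero value for precipitation or fog does not by itself rule out an effect on particular days.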
Modelling Approach
- Linear baselines (TensorFlow/Keras):
  - Day-only model → MAE (Mean Absolute Error) ≈ 0.144 (normalised)
  - Day + temperature + dew point → MAE ≈ 0.127
- Deep Neural Network:
  - Architecture: Normalisation → Dense(48, ReLU) → Dense(48, ReLU) → Dense(1)
  - Inputs: year, temperature, dew point, one-hot months, one-hot weekdays
  - Training: Adam optimiser, MAE loss, 100 epochs
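The DNN described above could be expressed in Keras roughly as follows. This is a sketch under stated assumptions: the learning rate, the 22-column input layout, the random placeholder data and the function name `build_dnn` are not from the write-up, and only 2 epochs are run here to keep the example quick (the project reports 100).

```python
import numpy as np
import tensorflow as tf

# Assumed layout: year + temperature + dew point + 12 month flags + 7 weekday flags.
N_FEATURES = 22

def build_dnn(train_features: np.ndarray) -> tf.keras.Model:
    """Normalisation -> Dense(48, ReLU) -> Dense(48, ReLU) -> Dense(1), trained on MAE."""
    normaliser = tf.keras.layers.Normalization(axis=-1)
    normaliser.adapt(train_features)  # learn per-feature mean and variance from training data
    model = tf.keras.Sequential([
        normaliser,
        tf.keras.layers.Dense(48, activation="relu"),
        tf.keras.layers.Dense(48, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # learning rate is an assumption
        loss="mean_absolute_error",
    )
    return model

# Random placeholder data standing in for the real feature matrix and normalised target.
rng = np.random.default_rng(0)
x_train = rng.random((256, N_FEATURES)).astype("float32")
y_train = rng.random(256).astype("float32")

model = build_dnn(x_train)
history = model.fit(x_train, y_train, validation_split=0.2, epochs=2, verbose=0)
predictions = model.predict(x_train[:5], verbose=0)
```

A linear baseline in the same style would keep the normalisation layer and replace the two hidden layers with a single `Dense(1)`, which makes the comparison between the two model families like for like.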
Results
- DNN performance: MAE ≈ 0.123 on the held-out test set
- Value of non-linearity: modest improvement over linear baselines by learning interactions (e.g., warm mid-week days vs weekends)
- Interpretability preserved: predictions align with the exploratory findings: weekday traffic intensity dominates, and weather fine-tunes risk
Example predictions consistently placed summer Sundays at lower risk than mid-week days at similar temperatures, reinforcing the primacy of behavioural patterns.
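To put the reported errors in proportion, the sketch below defines MAE and computes the relative reduction between models from the figures quoted above. The helper names are illustrative. The write-up states that the DNN figure is on the held-out test set but does not say so for the linear baselines, so the comparison assumes all three were measured on the same data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def relative_reduction(baseline_mae, model_mae):
    """Fraction by which model_mae improves on baseline_mae."""
    return (baseline_mae - model_mae) / baseline_mae

# MAE values as reported in this write-up (normalised units).
reported = {"day_only": 0.144, "day_temp_dew": 0.127, "dnn": 0.123}

linear_gain = relative_reduction(reported["day_only"], reported["day_temp_dew"])
dnn_gain = relative_reduction(reported["day_temp_dew"], reported["dnn"])

print(f"Adding temperature and dew point: {linear_gain:.1%} lower MAE")  # 11.8%
print(f"DNN over best linear baseline:    {dnn_gain:.1%} lower MAE")     # 3.1%
```

On these numbers, adding two weather features to the linear model gives a larger gain than moving from the linear model to the DNN, which is consistent with describing the non-linear improvement as modest.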
Practical Applications
- Weekday-weighted staffing and enforcement planning for police and EMS
- Calendar-aware public risk messaging during peak commute days
- Policy evaluation by comparing interventions against predicted baselines
- Foundations for insurance, fleet and city-safety risk scoring
Ethics & Limitations
- Predictions are intended to support resource allocation, not to stigmatise communities
- Uncertainty must be communicated clearly; human oversight remains essential
- The dataset lacks exposure and context variables (traffic volume, holidays, events); richer data and time-aware validation are likely sources of future gains
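As one possible reading of "time-aware validation", the sketch below generates expanding-window splits in which every validation block comes strictly after its training data. The function name, fold count, minimum training length and series length are all hypothetical choices for illustration, not part of the project.

```python
def expanding_window_splits(n_days, n_folds=4, min_train=365):
    """Yield (train_idx, val_idx) ranges where validation always follows training in time."""
    fold_size = (n_days - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        val_end = train_end + fold_size
        yield range(0, train_end), range(train_end, val_end)

# n_days is a placeholder for the length of the daily series, ordered by date.
splits = list(expanding_window_splits(n_days=2700))

for train_idx, val_idx in splits:
    print(f"train days 0-{train_idx[-1]}, validate days {val_idx[0]}-{val_idx[-1]}")
```

Unlike a shuffled split, this never lets the model see days that come after the ones it is evaluated on, so the resulting error is closer to what a real forecast would experience.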