
Probability calibration: why 60% must really mean 6 matches out of 10

Published on November 18, 2025 | Updated on March 22, 2026

Tags: Calibration, Brier Score, LogLoss, Reliability, Uncertainty

Framework

A probability only makes sense if it is reliable. This article explains how to verify that a "60%" actually corresponds to an observed frequency close to 60%, and why this is central to any serious analysis.

The main dataset here is Foresportia's recalibrated 1X2 production pipeline: published probabilities, validation bins and historical rows from the probability layer itself. This is not a BTTS article and not a simple final-score table.

Accuracy != calibration

Accuracy measures the share of correct predictions. Calibration measures the statistical honesty of the probabilities themselves. A model can be accurate yet poorly calibrated, or the opposite.

  • Well calibrated: 70% announced -> ~70% observed
  • Overconfident: 70% -> ~63% observed
  • Underconfident: 70% -> ~76% observed

A simple thought experiment helps. Two models may post similar raw hit rates on a 1X2 dataset, but if one keeps publishing inflated 70% probabilities that behave like 60% outcomes, it is much harder to trust in practice. Calibration exists to measure that honesty gap.

Reliability metrics

Brier Score

The Brier Score measures the squared error between probability and actual outcome. It heavily penalizes unjustified certainty.
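To make that concrete, here is a minimal numpy sketch of the multi-class Brier Score for 1X2 outputs. The probability vectors and results below are purely illustrative, not data from the production pipeline.

```python
import numpy as np

# Illustrative 1X2 probability vectors (home / draw / away) for three matches.
probs = np.array([
    [0.60, 0.25, 0.15],
    [0.35, 0.30, 0.35],
    [0.20, 0.30, 0.50],
])
# What actually happened, as one-hot vectors: home win, away win, draw.
outcomes = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
])

# Multi-class Brier Score: mean squared gap between the announced vector
# and the observed outcome. Lower is better; unjustified certainty is costly.
brier = np.mean(np.sum((probs - outcomes) ** 2, axis=1))
print(f"Brier score: {brier:.3f}")  # ~0.553 on this toy sample
```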

LogLoss

LogLoss is particularly sensitive to highly confident mistakes. It is useful for detecting overconfidence.
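A small sketch, again with illustrative numbers, shows why LogLoss reacts so strongly to a confident miss: the penalty is the negative log of the probability given to the outcome that actually happened.

```python
import numpy as np

def log_loss_1x2(probs, outcomes, eps=1e-15):
    """Mean negative log of the probability assigned to the outcome that happened."""
    p_true = np.sum(np.clip(probs, eps, 1.0) * outcomes, axis=1)
    return float(np.mean(-np.log(p_true)))

# A confident miss costs far more than a cautious one.
away_win = np.array([[0, 0, 1]])
print(log_loss_1x2(np.array([[0.90, 0.05, 0.05]]), away_win))  # ~3.00
print(log_loss_1x2(np.array([[0.45, 0.25, 0.30]]), away_win))  # ~1.20
```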

Reliability curve

Probabilities are grouped into bins and compared to observed frequencies. The diagonal represents perfect calibration.

In practical terms: Brier looks at average probability error, LogLoss punishes unjustified certainty, and the reliability curve checks whether a published number behaves like its stated level inside the relevant 1X2 dataset.
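As a sketch of the mechanics described above, the helper below groups published probabilities into equal-width bins and compares each bin's average announced value with its observed frequency. The function and variable names are illustrative, not the site's pipeline.

```python
import numpy as np

def reliability_bins(p_pred, hit, n_bins=10):
    """Group published probabilities into equal-width bins and compare each
    bin's average announced probability with its observed hit rate."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.sum() == 0:
            continue
        rows.append((edges[b], edges[b + 1],
                     float(p_pred[mask].mean()),  # announced
                     float(hit[mask].mean()),     # observed
                     int(mask.sum())))            # cases in the bin
    return rows

# p_pred: probability published for one outcome, hit: 1 if that outcome happened.
# Plotting announced vs observed per bin gives the reliability curve; the diagonal
# (announced == observed) is perfect calibration.
```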

Why calibration must be tracked over time

Leagues evolve, playing styles change, and refereeing practices shift as well. Without regular recalibration, a model can remain "good" while gradually becoming misleading.

This logic is detailed in the algorithm journal.

This is also why raising a threshold is not the same thing as recalibrating. A threshold changes selection. Recalibration changes the honesty of the displayed percentage itself.

Readers often confuse a cleaner shortlist with a better probability engine. Filtering may improve comfort, but only calibration tells you whether the displayed number still behaves like its label over time.
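The difference can be seen in a few lines. The sketch below uses isotonic regression as one standard recalibration choice; it is a stand-in for illustration, not the adjuster actually used in the production pipeline, and the data is simulated to be overconfident on purpose.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Simulated validation set, built to be overconfident on purpose:
# the true hit rate is only 90% of the announced probability.
rng = np.random.default_rng(0)
p_raw = rng.uniform(0.30, 0.80, size=2000)
hit = (rng.uniform(size=2000) < 0.9 * p_raw).astype(int)

# A threshold only changes *which* matches survive; the numbers stay inflated.
shortlist = p_raw[p_raw >= 0.60]

# Recalibration changes the numbers themselves (isotonic regression here,
# as one standard choice, not necessarily the pipeline's own adjuster).
iso = IsotonicRegression(out_of_bounds="clip")
p_cal = iso.fit_transform(p_raw, hit)

print(f"{shortlist.size} matches pass the 0.60 filter, still displayed as >= 60%")
print(f"mean announced: raw {p_raw.mean():.3f} -> recalibrated {p_cal.mean():.3f}")
```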

How to read a probability properly

  • A probability is an expected frequency, not a promise.
  • Calibration asks whether the published percentage is honest, not whether the favorite wins often.
  • Recalibration can improve LogLoss and ECE (expected calibration error) without improving raw accuracy.

This is why calibration belongs to probability reading, not only to model engineering. If the percentage itself is misleading, every downstream interpretation becomes weaker, even when the match commentary sounds persuasive.

Concrete calibration example: metrics, bins and draw structure

On the 133,160-row 1X2 production pipeline, the adjuster moves LogLoss from 0.657 to 0.647 and ECE from 0.094 to 0.082. That is the core calibration proof: published probabilities become more statistically honest, even though raw accuracy barely moves.
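ECE summarizes all bin gaps in a single number. The textbook equal-width version is sketched below; the exact binning behind the 0.094 and 0.082 figures is not specified here, so treat this as the generic formula rather than the pipeline's implementation.

```python
import numpy as np

def expected_calibration_error(p_pred, hit, n_bins=10):
    """Equal-width ECE: the weighted average of |announced - observed|
    across probability bins. Generic version, for illustration only."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
    ece, n = 0.0, len(p_pred)
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        gap = abs(p_pred[mask].mean() - hit[mask].mean())
        ece += (mask.sum() / n) * gap
    return float(ece)
```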

Bin reading gives the next layer. On the validation split, the raw 0.50-0.60 band averages 0.545 for an observed rate of 0.630 on 189 cases. After recalibration, the same midrange moves to 0.549 announced for 0.576 observed on 205 cases. That is not perfection. It is a more honest bin.

League nuance matters too: draw calibration factors sit at 0.814 in Ligue 1, 0.992 in Serie A and 1.092 in Serie B. The same raw 1/X/2 therefore does not describe the same draw environment everywhere.
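One simple way such a factor could be applied, shown purely as an assumption for illustration, is to scale the draw probability by the league factor and renormalize the 1/X/2 vector; the actual adjuster may work differently.

```python
# Draw calibration factors quoted above; applying them as a multiplicative
# adjustment followed by renormalization is an assumption made for illustration.
DRAW_FACTOR = {"Ligue 1": 0.814, "Serie A": 0.992, "Serie B": 1.092}

def adjust_draw(p_home, p_draw, p_away, league):
    """Scale the draw probability by the league factor, then renormalize
    so the 1/X/2 vector still sums to 1."""
    p_draw *= DRAW_FACTOR[league]
    total = p_home + p_draw + p_away
    return p_home / total, p_draw / total, p_away / total

print(adjust_draw(0.40, 0.30, 0.30, "Ligue 1"))  # draw weight shrinks
print(adjust_draw(0.40, 0.30, 0.30, "Serie B"))  # draw weight grows
```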

Sample size matters here as well. A calibration bin with a few dozen cases can still move quickly. A bin with a few hundred cases already gives a much more credible reading of whether a displayed range is too high, too low or reasonably aligned.
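A quick binomial check helps judge whether a bin gap is noise or a real miscalibration; the sketch below applies it to the raw 0.50-0.60 band quoted above.

```python
import math

def bin_standard_error(observed_rate, n_cases):
    """Rough binomial standard error of an observed frequency inside one bin."""
    return math.sqrt(observed_rate * (1 - observed_rate) / n_cases)

# Raw 0.50-0.60 band from above: 0.545 announced, 0.630 observed, 189 cases.
se = bin_standard_error(0.630, 189)
gap = 0.630 - 0.545
print(f"standard error ~ {se:.3f}, gap {gap:.3f} ~ {gap / se:.1f} standard errors")
```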

Limits of calibration analysis

Calibration is necessary, but never sufficient on its own. A model can be globally calibrated and still weak in specific contexts: volatile leagues, sudden tactical shifts, or sparse data periods.

Good practice combines: calibration curves, segment-level checks by league, and ongoing drift monitoring. The goal is to detect where the model is honest, and where uncertainty should be explicitly increased.
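As a minimal illustration of the segment-level part, the helper below compares announced and observed frequencies per segment (league, season window, or any other slice). It is a sketch, not the site's monitoring code.

```python
import numpy as np

def gap_by_segment(p_pred, hit, segment):
    """Mean announced-minus-observed gap per segment (league, season window, ...),
    so that a good global curve cannot hide poor local reliability."""
    return {
        s: float(p_pred[segment == s].mean() - hit[segment == s].mean())
        for s in np.unique(segment)
    }

# A positive gap means the segment runs overconfident, a negative one underconfident.
```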

Common mistake: confusing threshold success with calibration

Saying that a threshold "works well" is useful for evaluating a filter. It is not the same as proving that each individual "60%" is well calibrated.

Thresholds measure a volume / precision trade-off. Calibration checks whether announced percentages remain statistically honest inside each band and each league.

Three frequent calibration traps

  • Too little volume per bin: apparent precision with unstable statistics.
  • Mixing heterogeneous leagues: good global curve, poor local reliability.
  • Ignoring time effects: old seasons hide current drift.

Avoiding these traps is what turns calibration from a reporting metric into a true decision-quality tool.

In other words, calibration is useful only when it stays connected to the right unit of analysis: published probabilities, enough observations per band, and league-aware interpretation. Otherwise the metric sounds serious while telling the reader very little.

What this changes when reading the site

  1. Start on results_by_date to read the full 1/X/2 structure.
  2. Validate historical behavior on past results before trusting a threshold.
  3. Connect calibration and interpretation with the guides on probability reading and league variability.

Without calibration, you read a standalone number. With calibration, you read a number attached to a dataset, a sample size and a league structure. That is what turns "60%" into something usable instead of something merely impressive.

That is also why this page should not replace the probability-interpretation guide. Calibration answers "is the number honest?", while interpretation answers "how should I use it on a real match page?". Keeping that separation clear makes the whole cluster stronger.

Conclusion

A probability is only useful if it is reliable. Calibration turns a percentage into actionable information, by making uncertainty readable rather than misleading.

Put differently, calibration does not make football predictable. It makes the displayed percentages more honest, which is exactly what a serious prediction page should aim for.

That honesty is what allows a reader to compare one 60% with another without confusing presentation with substance. A calibrated percentage does not promise a result, but it does protect the reader from inflated language, and that is why calibration improves reading quality even when football itself remains unpredictable.

Quick FAQ: probability calibration

What is a well-calibrated football probability?

A probability is well calibrated when announced percentages align with observed frequencies across comparable matches.

Why should 60% really mean around 6 matches out of 10?

Because a probability is an expected frequency, not a certainty. If matches announced at 60% win noticeably more or less often than 6 times out of 10, the model is underconfident or overconfident.

What is the difference between accuracy and calibration?

Accuracy counts correct picks; calibration checks whether announced percentages are statistically honest.

Where can I continue with related probability guides?

See the blog hub for connected articles on drift, thresholds, and practical match reading.
