
12,000 football matches analyzed: what probability models actually get right

Published on March 18, 2026


About this study

This analysis covers 12,337 football matches across 26 leagues, played between September 2023 and March 2026. Every number in this article comes directly from recorded predictions and verified final scores. Nothing has been cherry-picked or simulated.

When a model says 65%, it actually means 82%

We tracked 12,337 football predictions across 26 leagues over 2.5 years and found something we did not expect: the model systematically beats its own stated probabilities. When it assigns a 65-70% confidence to an outcome, that outcome actually happens 82.5% of the time. That is a 15 percentage point gap between what the model claims and what it delivers.

That finding flips the usual narrative. The common assumption is that prediction models are overconfident, that they promise more than they can deliver. This dataset tells the opposite story, at least above the 60% mark. Below that line, the picture is messier, and the data exposes precisely where models stop providing useful signal and start producing noise.

We went through every match, every probability, every final score. No simulations, no hypothetical backtests. Just 12,337 recorded predictions compared against verified results.

Key insight

The central finding of this analysis is a paradox: the more confident the model is, the more it underestimates itself. At the 65-70% confidence bin, observed accuracy is 82.5%. At 75-80%, it reaches 90.3%. The model is not just "pretty good" at high confidence. It is substantially better than it claims to be.

This gap is not accidental. It is a direct consequence of how the model is built. The probability pipeline includes deliberate conservatism layers: temperature scaling, historical shrinkage, and calibration adjustments that are all designed to pull probabilities toward the center rather than push them toward extremes. The system is engineered to understate confidence, not overstate it. A model that says 55% and is right 60% of the time is, by design, preferable to one that says 60% and is right 60% of the time. The first is trustworthy; the second is merely correct. That distinction drives every calibration decision in the pipeline.
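As a rough illustration, here is a minimal sketch of what such conservatism layers can look like, assuming a single temperature parameter and a blend toward historical 1X2 base rates. The parameter values and base rates below are illustrative placeholders, not the model's actual settings.

```python
import numpy as np

def soften(probs, temperature=1.3, shrink=0.10,
           base_rates=(0.45, 0.26, 0.29)):
    """Pull a raw 1X2 probability vector toward the center.

    temperature > 1 flattens the distribution (temperature scaling on
    log-probabilities); shrink blends the result with long-run
    home/draw/away frequencies (historical shrinkage). All values here
    are illustrative, not the production model's settings.
    """
    p = np.asarray(probs, dtype=float)
    # Temperature scaling: divide log-probabilities by T, then renormalize.
    scaled = np.exp(np.log(p) / temperature)
    p = scaled / scaled.sum()
    # Shrinkage: mix with assumed historical base rates.
    p = (1 - shrink) * p + shrink * np.asarray(base_rates)
    return p / p.sum()

# A raw 72% home-win estimate comes out noticeably more cautious.
print(soften([0.72, 0.16, 0.12]))   # roughly [0.62, 0.21, 0.17]
```

The effect is exactly the asymmetry described above: stated probabilities end up lower than the model's raw estimates, so observed accuracy tends to land above the stated number rather than below it.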

Most consumers of football probabilities take them at face value. A 65% probability feels like "maybe." The data says it is closer to "very probably." Stated confidence and true discriminative power are not the same thing. The gap between them is where the real story hides, and where most people misread what a prediction model is actually telling them.

But here is the counterweight. Beyond 80% stated confidence, accuracy flatlines around 90-92%. Football's irreducible randomness, from red cards to deflected shots, imposes a boundary that no model can cross. The last 8-10% of uncertainty in football is structural. It belongs to the sport, not to the algorithm.

Main takeaway:
A football prediction model that says "65% confidence" is actually right 82% of the time, but even at its most confident, it never breaks through the 92% ceiling. The model knows more than it admits, and less than you wish.

Key findings

  • 55.1% overall accuracy (most probable outcome)
  • 71.9% accuracy at 50%+ confidence (4,248 matches)
  • 88.2% accuracy at 70%+ confidence (727 matches)
  • 26 leagues tracked across 582 matchdays

What the data reveals, in short:

  • The 50% line is a cliff, not a slope. Accuracy jumps from 55.8% to 62.3% the moment the top prediction crosses 50% confidence. That 6.5pp jump is one of the sharpest breaks between adjacent confidence bins in the entire dataset.
  • When the model says 65%, reality says 82%. Above 60% confidence, observed accuracy consistently exceeds stated probability by 10-15 percentage points. The model does not overestimate its ability. It underestimates it.
  • The same model is 14 points more accurate in Norway than in Serie B. League predictability ranges from 64.0% (Norwegian Eliteserien) to 49.9% (Italian Serie B). That spread is a property of the competition, not the algorithm.
  • Draws happen in 1 out of 4 matches. The model predicts them in 1 out of 25. The 4.3% prediction rate against a 25.9% actual occurrence rate makes draws the largest structural blind spot in 1X2 forecasting.
  • Most football is statistically close to noise. 63% of matches have near-maximum entropy (>1.5 out of 1.585 bits), meaning the three-way probability distribution is nearly flat. High-confidence matches are the exception, not the norm.
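To make the entropy figure in that last point concrete, here is a minimal sketch of the calculation: the Shannon entropy of a three-way probability distribution in bits, which peaks at log2(3) ≈ 1.585 when the distribution is perfectly flat. The example probabilities are illustrative, not drawn from the dataset.

```python
import math

def entropy_bits(probs):
    """Shannon entropy of a 1X2 distribution in bits; the maximum is log2(3) ~= 1.585."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([1/3, 1/3, 1/3]))     # 1.585 -- no signal at all
print(entropy_bits([0.40, 0.27, 0.33]))  # ~1.57 -- a typical "near-noise" match
print(entropy_bits([0.75, 0.15, 0.10]))  # ~1.05 -- a rare high-confidence match
```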

Global performance

Across all 12,337 matches, the model picked the correct 1X2 outcome 55.1% of the time. That is well above the 33.3% baseline of random guessing on a three-way market, but it also means 45% of predictions miss. The raw number is underwhelming. The filtered numbers are not.

What matters is not overall accuracy. It is what happens when the model has conviction. The table below shows accuracy at different minimum confidence thresholds, and the progression is steeper than most people expect:

Minimum confidence | Matches | Accuracy
≥ 40% | 8,839 | 61.0%
≥ 45% | 6,159 | 66.9%
≥ 50% | 4,248 | 71.9%
≥ 55% | 2,889 | 76.5%
≥ 60% | 1,875 | 81.3%
≥ 65% | 1,178 | 86.0%
≥ 70% | 727 | 88.2%
≥ 75% | 395 | 91.1%
≥ 80% | 189 | 92.1%
≥ 85% | 69 | 95.7%

The trade-off is clear: higher thresholds give better accuracy, but you lose coverage fast. At 70%+ confidence you are right nearly 9 times out of 10, but that only covers 727 out of 12,337 matches (5.9%). At 50%+, you cover about a third of all matches with 72% accuracy. There is no free lunch.

The table reveals a non-obvious pattern: the accuracy gains slow down sharply above 70%. Going from 50% to 60% threshold buys you 9.4 percentage points of accuracy (71.9% to 81.3%). Going from 70% to 80% buys you only 3.9 points (88.2% to 92.1%). The model hits diminishing returns well before it reaches certainty. Chasing the highest-confidence predictions yields rapidly shrinking marginal gains. The "smart money" in this table is in the 55-65% range: still substantial volume, and accuracy already north of 76%.
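The threshold table is straightforward to reproduce from a prediction log. A minimal sketch, assuming each record carries the top prediction's stated confidence and whether it turned out correct; the record format is hypothetical, not the study's actual data schema.

```python
def threshold_accuracy(records, thresholds=(0.40, 0.50, 0.60, 0.70, 0.80)):
    """records: iterable of (confidence, correct) pairs for the top prediction.

    Returns, for each minimum confidence, how many predictions qualify and
    how often they were right -- the same filtering used in the table above.
    """
    rows = []
    for t in thresholds:
        qualifying = [correct for confidence, correct in records if confidence >= t]
        if qualifying:
            rows.append((t, len(qualifying), sum(qualifying) / len(qualifying)))
    return rows

# Illustrative toy data, not the study's dataset.
sample = [(0.42, False), (0.55, True), (0.63, True), (0.71, True), (0.81, True)]
for t, n, acc in threshold_accuracy(sample):
    print(f">= {t:.0%}: {n} matches, {acc:.1%} accuracy")
```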

Probability vs. accuracy

This is where it gets interesting. The table below groups the model's stated confidence into 5-point bins and sets it against the observed accuracy in each bin. A perfectly calibrated model would show observed accuracy equal to stated confidence in every bin.

Accuracy vs. model confidence (1X2, n=12,337):

Confidence bin | Matches | Observed accuracy
30-35% | 385 | 36.4%
35-40% | 3,113 | 40.5%
40-45% | 2,680 | 47.3%
45-50% | 1,911 | 55.8%
50-55% | 1,359 | 62.3%
55-60% | 1,014 | 67.6%
60-65% | 697 | 73.5%
65-70% | 451 | 82.5%
70-75% | 332 | 84.6%
75-80% | 206 | 90.3%
80-85% | 120 | 90.0%
85-90% | 46 | 100%

Look at where observed accuracy breaks away from stated confidence. Below 50%, the two track each other closely, with accuracy sometimes falling slightly short. The model is roughly honest in that range, perhaps even a touch overconfident. Then, starting at 60%, every bin overshoots its stated confidence by a wide and consistent margin. At 65-70%, the gap is +15pp. At 75-80%, it is +13pp. The model is not wrong when it says 65%. It is underselling itself.

The most valuable predictions are not the ones the model is most confident about. The 60-70% range delivers the best combination of accuracy (73-82%) and sample size (1,148 matches). Above that, accuracy barely improves: the 80-85% bin (90.0%) is only 5.4 points above the 70-75% bin (84.6%), despite requiring substantially more confidence. This plateau is not a modeling failure. It is the point where football's inherent chaos begins to dominate. The signal runs out before the model does.
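For readers who want to reproduce this comparison, a minimal sketch of the binning behind the table above: group predictions by stated confidence in 5-point bins and read off each bin's observed hit rate. The record format is hypothetical.

```python
from collections import defaultdict

def calibration_bins(records, width=0.05):
    """records: (confidence, correct) pairs for the top prediction.

    Groups predictions into 5-point confidence bins and returns, per bin,
    the number of matches and the observed accuracy -- the data behind a
    reliability comparison like the table above.
    """
    bins = defaultdict(list)
    for confidence, correct in records:
        lower = round(int(confidence / width) * width, 2)
        bins[lower].append(correct)
    return {
        lower: (len(hits), sum(hits) / len(hits))
        for lower, hits in sorted(bins.items())
    }

# Illustrative toy data only; the article's numbers come from 12,337 real matches.
toy = [(0.67, True), (0.68, True), (0.66, False), (0.52, True), (0.53, False)]
for lower, (n, acc) in calibration_bins(toy).items():
    print(f"{lower:.0%}-{lower + 0.05:.0%}: n={n}, observed accuracy {acc:.0%}")
```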

League analysis

Not all leagues are created equal when it comes to predictability. The table below ranks leagues by prediction accuracy (most probable outcome matches the actual result), limited to leagues with at least 200 matches in the dataset.

1X2 prediction accuracy by league (min. 200 matches; random-guess baseline: 33.3%):

League | Matches | Accuracy
Norwegian Eliteserien | 247 | 64.0%
Champions League | 486 | 63.8%
Chinese Super League | 253 | 60.9%
Ligue 1 | 539 | 60.3%
Liga Portugal | 542 | 59.6%
Bundesliga | 539 | 59.6%
Swedish Allsvenskan | 240 | 58.8%
Serie A | 670 | 58.5%
La Liga | 660 | 57.0%
MLS | 566 | 56.0%
Premier League | 680 | 55.7%
Ligue 2 | 549 | 54.1%
Bundesliga 2 | 540 | 53.0%
Brazilian Serie A | 759 | 52.7%
Championship | 1,007 | 52.2%
La Liga 2 | 794 | 52.0%
J-League | 381 | 50.7%
Serie B | 680 | 49.9%

The chart splits cleanly into two tiers. The top group (Norway through Bundesliga, 58-64%) shares a structural feature: dominant teams that produce skewed probability distributions. When one side regularly sits above 65%, the model has an easier target. Predictability is not a function of model quality. It is a function of how unequal the league is.

The bottom tier (the Championship down to Serie B, roughly 50-54%) tells a different story. In these competitions, the top prediction is often just 2-3 points above the second-best outcome. The model is not failing in Serie B or the Championship. Those leagues are structurally resistant to prediction. There is simply less signal available, and no algorithm can extract what does not exist.

The Premier League, arguably the most data-saturated football competition on the planet, sits at a modest 55.7%. That places it below Ligue 1, below the Bundesliga, below Liga Portugal. This contradicts the intuition that more data and more analysis should produce better predictions. They do not. The league's competitive depth actively works against forecasting accuracy.

Distribution of predictions

Where does the model actually end up in terms of confidence? The table below shows how many matches fall into each confidence bin for the top prediction.

Distribution of top-prediction confidence (n=12,337):

Confidence bin | Matches
30-35% | 385
35-40% | 3,113
40-45% | 2,680
45-50% | 1,911
50-55% | 1,359
55-60% | 1,014
60-65% | 697
65-70% | 451
70-75% | 332
75-80% | 206
80-85% | 120
85-90% | 46
90-95% | 21
95%+ | 2

This table might be the most important one in the article, because it shows what the model actually thinks about most football. Nearly half of all matches (47%) land in the 35-45% confidence zone: the model sees a slight favorite but cannot meaningfully separate it from the alternatives. This is not a "maybe." It is barely above a guess.

Only 5.9% of matches reach the 70%+ confidence band where accuracy exceeds 88%. A mere 0.5% go above 85%. For every match where the model is genuinely confident, there are 16 where it is essentially shrugging. Football does not produce many easy calls. The overwhelming majority of matches sit in a zone where prediction is possible but fragile, where a single unexpected event (a red card, an early injury, a goalkeeper mistake) can flip the outcome.

The model is not uncertain because it lacks information. It is uncertain because football is.

What this means

"Only bet on high-confidence predictions" sounds like wisdom. The data says otherwise. Restricting to 80%+ predictions does improve accuracy from 55% to 92%, but the marginal gain above 70% is negligible (88.2% to 92.1%), and the coverage collapse is brutal: from 12,337 matches to just 189. Most of the useful accuracy gain is already captured at the 55-60% threshold. Beyond that, you are sacrificing 98.5% of your volume for 3.9 extra points of accuracy.

If you read "65% home win" and expect 6.5 out of 10 such predictions to hit, you are wrong. The actual rate is 8.2 out of 10. The stated probability is not a calibrated frequency. It is a conservative lower bound. Anyone using these numbers for decision-making, whether for analysis, modeling, or just evaluating prediction quality, needs to account for this systematic and consistent gap. The model knows more than it says.

Competitive balance is the single strongest predictor of model difficulty. Not data quality. Not model complexity. The exact same model produces 64% accuracy in Norway and 50% in Serie B. No amount of feature engineering can overcome a league where the top prediction regularly sits at 38%. The league you choose to analyze sets the ceiling before you write a single line of code.

The draw problem is not solvable within the 1X2 framework. Draws occur in 25.9% of matches but are predicted as the most likely outcome in only 4.3% of cases. This is not a calibration failure. It reflects the fact that draws are inherently unstable: a single late goal eliminates or creates one. Any system that evaluates models on 1X2 accuracy is implicitly penalizing them for an outcome that is, by its nature, the hardest to forecast.

Football carries 8-10% irreducible uncertainty, and no model will ever eliminate it. Even at the highest confidence levels, accuracy never consistently exceeds 92%. That residual 8% is not noise in the algorithm. It is noise baked into the sport itself: own goals, red cards, penalty decisions, injuries in minute 3. This is the hard theoretical ceiling. It applies to every prediction system ever built for football, and every one that will be.

Methodology

The dataset consists of 12,337 matches from 26 football leagues, spanning September 19, 2023 to March 17, 2026 (582 matchdays). Each match has pre-game 1X2 probabilities generated by a Poisson-based model with Elo ratings, home advantage adjustment, and historical calibration layers.
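As a rough sketch of how a Poisson-based model turns expected goals into raw 1X2 probabilities: the expected-goal figures below are arbitrary, and the Elo, home-advantage, and calibration layers of the actual pipeline are deliberately not reproduced.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k goals given expected goals lam."""
    return lam ** k * exp(-lam) / factorial(k)

def one_x_two(home_xg, away_xg, max_goals=10):
    """Raw 1X2 probabilities from two independent Poisson goal distributions.

    A deliberately simplified sketch: the production model also feeds Elo
    ratings and home advantage into the expected goals, then applies the
    calibration layers described below.
    """
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_xg) * poisson_pmf(a, away_xg)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    return home, draw, away

print(one_x_two(1.7, 1.0))   # roughly (0.54, 0.24, 0.22) for these illustrative inputs
```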

"Accuracy" in this article means: the outcome assigned the highest probability was also the actual result. All scores are final results as recorded by official sources. Probability bins group matches by the model's stated confidence for its top prediction (e.g., 55-60% means the most probable outcome had between 55% and 60% probability). League accuracy uses the same top-prediction metric.

No matches were excluded from the analysis. The threshold-based accuracy table (section "Global Performance") filters by minimum confidence and counts how many of those filtered predictions were correct.

A note on calibration philosophy. The model's probability pipeline applies temperature scaling and historical shrinkage after computing raw probabilities. These are not corrections for error. They are deliberate design choices that prioritize robustness over aggressive confidence. The goal is to produce probabilities that are consistently safe to act on: a stated 60% should never be observed at 55% over a large sample. The trade-off is that stated probabilities at the high end understate the model's actual discriminative ability, as the data in this article confirms. This is the intended behavior, not a limitation.

Limitations

This analysis covers a single model architecture (Poisson-based with Elo inputs) over a specific time window (September 2023 to March 2026). Results may differ for models built on different principles (e.g., machine learning ensembles, market-implied probabilities). The 26 leagues included vary significantly in sample size: the Championship contributes 1,007 matches while Norway contributes 247. Smaller samples increase the confidence interval around league-level accuracy figures. Finally, the calibration gap observed at high confidence may partially reflect the specific temperature or shrinkage applied during this model's probability calibration phase, and should not be assumed to generalize to all football prediction systems.

Conclusion

Three numbers summarize 12,337 matches: 55%, 82%, and 92%. The model is right 55% of the time overall, delivers 82% when it says 65%, and hits a wall at 92% no matter how confident it gets. Those three numbers define the useful range, the hidden conservative bias, and the hard ceiling of football prediction.

Football resists prediction not because models are bad, but because the sport is, by construction, close to maximum entropy. Most matches are genuinely competitive on a three-way market. The games where a model can confidently pick a winner are the exception, not the rule.

If probabilities were perfectly reliable, high-confidence predictions would clearly dominate. They do not. The highest-confidence bin barely outperforms the mid-range. The model that appears most certain is not the one delivering the most marginal value. Football remains a noisy system, even for models that understand it well.

A probability model does not predict football. It measures how much football resists being predicted.

Frequently asked questions

What does "55% accuracy" mean exactly?

It means that across all 12,337 matches, the outcome with the highest assigned probability matched the actual result 55.1% of the time. Since there are three possible outcomes (home win, draw, away win), random guessing would hit 33.3%.

Why is the model "conservative" at high confidence?

This is a deliberate design decision, not a side effect. The probability pipeline includes temperature scaling and shrinkage layers that intentionally pull extreme probabilities toward the center. The reasoning: a model that says 55% and delivers 60% is more useful than one that says 60% and delivers 60%. The first builds trust over time because it never overpromises. The trade-off is that at the high end (65%+), stated confidence systematically understates true accuracy by 10-15 percentage points. The data in this article confirms this is working exactly as intended.

Why can't the model predict draws well?

Draws are structurally unstable events. A single goal changes the entire outcome classification. The data shows 25.9% of matches end in draws, but the model only assigns "draw" as the most probable outcome in 4.3% of cases. This is consistent across the entire prediction industry.

How many leagues were included?

26 leagues: Premier League, La Liga, Serie A, Bundesliga, Ligue 1, Liga Portugal, Championship, La Liga 2, Serie B, Bundesliga 2, Ligue 2, Liga Portugal 2, Belgian Pro League, Eredivisie, Turkish Super Lig, Swiss Super League, Danish Superliga, Norwegian Eliteserien, Swedish Allsvenskan, Finnish Veikkausliiga, Chinese Super League, J-League, K-League, Brazilian Serie A, MLS, and the Champions League.
