
Inside Foresportia

Goal markets rebuilt: why Foresportia started from scratch

Published on April 14, 2026

TL;DR

Goal markets (BTTS, Over/Under) are back on Foresportia — not because we flipped a switch, but because we built a separate engine dedicated to goals, decoupled from the match-outcome model. This article walks you through the why, the how, and the numbers — with full transparency.

Why goal markets were pulled in the first place

For a long time, Foresportia's BTTS and Over/Under probabilities were computed from the same engine that powers the 1X2 (home win / draw / away win) predictions.

That sounds logical — after all, goals determine the result. But in practice, the more we tuned the engine to predict who wins, the more we injected outcome-specific adjustments: Elo corrections, draw recalibration, dynamic confidence shifters. Each of those made the 1X2 more accurate, yet they quietly warped the underlying goal distribution.

The consequence was subtle but damaging: goal-market probabilities still looked plausible, but they no longer tracked reality. A displayed 65 % for BTTS might only materialise 50 % of the time. Rather than ship misleading numbers, we took them offline and set out to fix the root cause.

How goal markets are actually computed

The maths behind every goal market starts with one object: a score grid. Each cell in the grid holds the probability of a specific scoreline — 0–0, 1–0, 2–1, and so on.

Let $P(i,j)$ be the probability that the home side scores $i$ goals and the away side scores $j$. Every common goal market is simply a sum over the right cells:

$$P(\text{BTTS}) = \sum_{i \ge 1}\;\sum_{j \ge 1} P(i,j)$$

In plain English: add up every cell where both teams find the net at least once.

$$P(\text{Over 2.5}) = \sum_{i+j \;\ge\; 3} P(i,j)$$

Here, we sum every cell where the total reaches 3 goals or more. Under 2.5 is the complement: every cell where the total stays at 2 or below.

The same logic extends to Over/Under 1.5, 3.5, clean sheets (one team concedes zero), win-to-nil, and so on. Every goal market is derived from the same grid — just by summing different regions.

The coupling problem

In the old setup, the score grid was built by the 1X2 engine and then reweighted so its marginals matched the published match-outcome probabilities.

That guaranteed internal consistency: if we displayed "60 % home win", the most probable scorelines were compatible. But the consistency was misleading for goals.

Every time we improved the 1X2 with outcome-centric tweaks — ranking signals, Elo corrections, a sharper draw prior — the reweighted grid shifted in ways that had nothing to do with how many goals would actually be scored. The engine was optimised to answer "who wins?", not "how many goals will there be?".

This is the key insight: what helps predict the winner does not automatically help predict the number of goals. Coupling the two led to a model that improved on one task while quietly degrading on the other.

The fix: a dedicated goal engine

Rather than patching the old system, we separated the two responsibilities entirely:

  • A match-outcome engine, optimised to predict who wins (1X2)
  • A goal engine, with its own score grid, optimised to model the distribution of goals

The goal engine shares some useful signals with the match engine — advanced form, league-level scoring pace, a dampened home advantage — but it does not inherit the outcome-specific adjustments: no draw recalibration, no dynamic confidence shifters, no Elo overrides designed to sharpen match results.

The principle is straightforward: predicting who wins and predicting how many goals are scored are two different problems. They deserve two different engines, each free to learn from the signals that actually matter for its own task.

Going further: the selective filter

A dedicated goal engine is necessary but not sufficient. A raw probability — even from a well-designed model — is not always actionable.

On top of the goal engine sits a selective filter. Its job: work out whether the model genuinely "knows something" about a given match, or whether it is producing a bland, baseline probability that carries little information.

In practice, the filter cross-references several signals to gauge how confident the goal engine really is. It only publishes a probability when the signal is strong enough to be meaningful.

The trade-off is deliberate: fewer matches covered, but markedly better reliability when a number is shown.
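The article does not disclose which signals the filter cross-references, so the following is only a hypothetical sketch of the publish-or-withhold idea: the signal names (`model_agreement`, `data_coverage`) and thresholds are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GoalSignal:
    """Hypothetical filter inputs -- the real signal set is not public."""
    probability: float      # raw goal-engine probability for the market
    model_agreement: float  # 0..1, agreement across internal sub-models
    data_coverage: float    # 0..1, completeness of the input data

def publish(signal: GoalSignal,
            min_edge: float = 0.10,
            min_agreement: float = 0.7,
            min_coverage: float = 0.8) -> Optional[float]:
    """Return the probability only when every bar is cleared.

    Distance from 50% stands in for "the model knows something";
    the thresholds here are illustrative, not Foresportia's values.
    """
    informative = abs(signal.probability - 0.5) >= min_edge
    if informative and signal.model_agreement >= min_agreement \
            and signal.data_coverage >= min_coverage:
        return signal.probability
    return None  # withhold: fewer matches covered, higher reliability

print(publish(GoalSignal(0.66, 0.85, 0.9)))  # clears every bar
print(publish(GoalSignal(0.52, 0.85, 0.9)))  # too close to the baseline
```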

Measuring quality: a quick guide to the metrics

Before diving into the results, here is a plain-language tour of the four metrics we use. No data-science background required.

Brier Score (average error)

The Brier Score captures the average squared gap between what the model announced and what actually happened:

$$\text{Brier} = \frac{1}{N}\sum_{n=1}^{N} (p_n - y_n)^2$$

where $p_n$ is the announced probability and $y_n$ is 1 if the event occurred, 0 otherwise.

  • Lower is better.
  • A Brier of 0 is perfect — every call was spot-on.
  • A Brier of 0.25 is what you get by always predicting 50/50 — zero information.
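The formula is short enough to verify by hand:

```python
def brier_score(probs, outcomes):
    """Mean squared gap between announced probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Always predicting 50/50 yields the 0.25 "zero information" baseline:
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 1]))  # 0.25
```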

LogLoss (penalises overconfidence)

LogLoss is harsher than Brier when the model is confident and wrong. Announcing 90 % and being wrong costs far more than announcing 55 % and being wrong.

  • Lower is better.
  • It complements Brier by zeroing in on costly overconfidence.
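A minimal implementation makes the asymmetry visible:

```python
import math

def log_loss(probs, outcomes, eps=1e-15):
    """Average negative log-likelihood; confident misses are punished hard."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

# Being wrong at 90% costs far more than being wrong at 55%:
print(round(log_loss([0.9], [0]), 3))   # 2.303
print(round(log_loss([0.55], [0]), 3))  # 0.799
```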

ECE (Expected Calibration Error)

ECE checks whether probabilities mean what they say. Group all predictions in bins (say, everything between 55 % and 65 %), then compare the average announced probability to the actual hit rate.

  • Lower is better.
  • A high ECE means the displayed percentages are misleading: a stated 60 % does not correspond to a 60 % success rate.
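The binning logic described above, in code:

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin predictions, then compare announced probability to hit rate.

    ECE = sum over bins of (bin size / N) * |avg announced - hit rate|.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)
        hit_rate = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(avg_p - hit_rate)
    return ece

# A model that says 60% but hits only 30% of the time is badly calibrated:
print(expected_calibration_error([0.6] * 10, [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]))
```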

BSS (Brier Skill Score)

BSS benchmarks the model against a naive baseline — for instance, always predicting the league-wide average hit rate for that market.

  • Positive BSS = the model outperforms the baseline.
  • Negative BSS = the model does worse than simply predicting the average.
  • This is the toughest test: a positive BSS means the model genuinely adds information.
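As a concrete check, BSS is one minus the ratio of the model's Brier to the baseline's Brier — so a model that merely echoes the sample average scores exactly zero:

```python
def brier_skill_score(probs, outcomes):
    """Compare the model's Brier to always predicting the sample average."""
    n = len(probs)
    base_rate = sum(outcomes) / n
    brier_model = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / n
    brier_base = sum((base_rate - y) ** 2 for y in outcomes) / n
    return 1 - brier_model / brier_base

print(brier_skill_score([0.9, 0.2, 0.8, 0.7], [1, 0, 1, 1]))  # positive: beats baseline
print(brier_skill_score([0.75, 0.75, 0.75, 0.75], [1, 0, 1, 1]))  # ~0: adds nothing
```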

Results: three approaches compared

We evaluated three approaches on the same set of recent matches:

  • Legacy system: goal markets derived from the 1X2 engine (reweighted grid)
  • Dedicated engine alone: new goal grid, no selective filter
  • Dedicated engine + selective filter: the full system as published

| Market | Method | Brier | LogLoss | ECE | BSS |
|---|---|---|---|---|---|
| BTTS | Legacy | 0.268 | 0.731 | 0.105 | -0.073 |
| BTTS | Dedicated engine | 0.282 | 0.768 | 0.131 | -0.129 |
| BTTS | Engine + filter | 0.248 | 0.689 | 0.039 | +0.008 |
| Over 2.5 | Legacy | 0.267 | 0.732 | 0.100 | -0.071 |
| Over 2.5 | Dedicated engine | 0.288 | 0.788 | 0.152 | -0.156 |
| Over 2.5 | Engine + filter | 0.236 | 0.666 | 0.065 | +0.052 |
| Under 2.5 | Legacy | 0.267 | 0.732 | 0.100 | -0.071 |
| Under 2.5 | Dedicated engine | 0.288 | 0.788 | 0.152 | -0.156 |
| Under 2.5 | Engine + filter | 0.236 | 0.666 | 0.065 | +0.052 |

Key takeaways

  • Engine + selective filter dominates across all three markets and all four metrics.
  • It is the only approach with a positive BSS — the only one that genuinely outperforms a naive baseline.
  • The legacy and standalone-engine approaches both have negative BSS: on this sample, they were worse than always predicting the market average.
  • The full system's ECE of 0.039 for BTTS is excellent: when it says 60 %, observed reality is very close to 60 %.

The real test: do high-confidence calls actually hit more often?

Aggregate metrics are useful, but the question that matters most is: when the model is confident, does the hit rate follow?

This is where the gap becomes stark.

BTTS

  • Legacy → almost no high-confidence predictions. Above 60 %, only 2 matches.
  • Dedicated engine alone → at the 65 % threshold: 48 % hit rate. Not actionable.
  • Engine + filter → at 60 %: 65 % hit rate across 20 matches.

Over 2.5

  • Dedicated engine alone → 65 % threshold: 39 % hit rate. Calibration broken.
  • Engine + filter → 60 %: 69 %. At 65 %: 80 %.

Under 2.5

  • Legacy → 60 % threshold: 51 %. Overconfident.
  • Engine + filter → 55 %: 78 %. At 60 %: 82 %.

What this means

The old system was unable to make hit rates climb as the displayed probability climbed. Showing 70 % was no better than showing 50 %. The new system finally produces probabilities that genuinely discriminate — higher confidence corresponds to higher observed accuracy.
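The threshold analysis above boils down to one question per cut-off: among predictions at or above it, how often did the event occur? A minimal helper (the data here is synthetic, not Foresportia's evaluation set):

```python
def hit_rate_above(probs, outcomes, threshold):
    """Hit rate among predictions at or above the threshold.

    Returns (hit_rate, count); for a well-calibrated model the hit rate
    should track -- or exceed -- the threshold as it rises.
    """
    picked = [y for p, y in zip(probs, outcomes) if p >= threshold]
    if not picked:
        return None, 0
    return sum(picked) / len(picked), len(picked)

# Synthetic example: three calls at 60%+, all of which hit.
rate, n = hit_rate_above([0.7, 0.65, 0.55, 0.9, 0.4], [1, 1, 0, 1, 1], 0.60)
print(rate, n)
```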

Staying honest: what the model does not do yet

The numbers above are encouraging, but intellectual honesty demands a clear statement of limits.

  • The sample is still young. Results come from a few hundred matches. Enough to validate a direction, not enough to declare the problem solved.
  • Strongest evidence is on BTTS, Over 2.5 and Under 2.5. For other markets (Over 1.5, Over 3.5, clean sheets, etc.) proof is thinner at this stage.
  • Coverage is intentionally limited. The selective filter does not publish on every match. That is by design: fewer but more reliable predictions beats high volume with low signal.
  • The model is a work in progress. It provides a solid foundation for use, but it will keep evolving.

What comes next

The new architecture opens concrete avenues for further improvement:

Per-market calibration

Learn a dedicated calibration curve for each market (BTTS, Over 2.5, Under 2.5, etc.) so that a "70 % displayed" always corresponds to a "70 % observed".
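One simple way to learn such a curve is histogram binning: map each raw probability to the hit rate historically observed in its bin. This is only a sketch of the idea — isotonic regression or Platt scaling would be natural alternatives, and the article does not say which method Foresportia will adopt.

```python
def fit_histogram_calibrator(probs, outcomes, n_bins=10):
    """Learn a per-market map: raw probability -> observed hit rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append(y)
    # Empty bins fall back to the bin midpoint (identity mapping).
    table = [
        (sum(b) / len(b)) if b else (i + 0.5) / n_bins
        for i, b in enumerate(bins)
    ]

    def calibrate(p):
        return table[min(int(p * n_bins), n_bins - 1)]

    return calibrate
```

Fitted per market, such a map turns a raw "70 % displayed" into whatever that bin has actually delivered — for example, if past 60–70 % calls hit only 60 % of the time, the calibrator would display 60 % instead.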

League-level adjustments

Some leagues are structurally high-scoring (Eredivisie), others are tactically tight (Serie A). Tuning goal-engine parameters per league is a natural next step.

Automated quality monitoring

Integrate continuous tracking of Brier, ECE and hit rates by threshold directly into the production pipeline, so any quality drop on a market or league is flagged immediately.

Expanding to more markets

Next candidates: Over/Under 1.5, Over/Under 3.5, clean sheets and win-to-nil.

Conclusion

Goal markets are back on Foresportia — not because the match-outcome engine got better, but because we stopped making goals depend on it.

The recipe:

  • A dedicated goal engine with its own signals
  • A selective filter that only publishes when the signal is clear
  • Transparent evaluation with metrics you can verify

The model is not perfect. The sample is still young. But for the first time, when the system displays a high probability on BTTS or Over/Under 2.5, the observed hit rate follows. That is the foundation everything else can be built on.

Quick FAQ

Why were goal markets taken offline?

Because they were derived from the 1X2 engine. Improving match-outcome predictions inadvertently distorted goal estimates. The displayed probabilities no longer reflected reality.

What counts as a good Brier Score?

Lower is better. Below 0.25 (the coin-flip baseline) the model adds information. The current system sits at 0.236 for Over 2.5 and 0.248 for BTTS.

Why doesn't the filter cover every match?

Because the model does not have a reliable signal for every fixture. Rather than publishing a tepid default probability, we prefer showing nothing when the signal is too weak.

Will goal markets keep improving?

Yes. The current system is a solid but evolving foundation. We are working on per-league calibration, additional markets and automated quality monitoring.
