Probability Calibration: Why 60% Should Mean 60%

By Tactiq AI · 2026-05-05 · 7 min read · Methodology

If a weather forecaster says "60 percent chance of rain tomorrow", a sensible reader expects rain to happen on roughly 60 percent of the days the forecaster makes that prediction. Not 100 percent, not 30 percent. Sixty percent.

That property is called calibration. It is the statistical foundation of any probability that means anything. And it is the most misread concept in football analysis.

This article walks through what calibration is, why it matters more than accuracy, why most football pundits are not calibrated, and how Tactiq tracks it.

The definition that actually clarifies

A probability is calibrated when, over many predictions at the same level, the outcome happens at that frequency. The math is simple. Take all your predictions of "60 percent home win". Count how many of those fixtures the home side actually won. Divide by the total. If the result is 60 percent, you are calibrated.
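To make the arithmetic concrete, here is a minimal Python sketch of that count-and-divide check, grouping predictions into bins and comparing the stated probability with the observed hit rate. The fixtures and results are invented for illustration; this is not Tactiq's internal code.

```python
from collections import defaultdict

def calibration_table(predictions, outcomes, decimals=1):
    """Group predictions into bins (nearest 0.1 by default) and report, per bin,
    how many predictions landed there and how often the outcome actually happened."""
    bins = defaultdict(list)
    for prob, outcome in zip(predictions, outcomes):
        bins[round(prob, decimals)].append(outcome)
    return {b: (len(obs), sum(obs) / len(obs)) for b, obs in sorted(bins.items())}

# Hypothetical sample: ten fixtures predicted at 60% home win, six of which hit.
predictions = [0.60] * 10
outcomes    = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]   # 1 = home win, 0 = not

print(calibration_table(predictions, outcomes))  # {0.6: (10, 0.6)} -> stated 60%, observed 60%
```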

This sounds trivial. It is not. Most predictions, made by most people, are not calibrated.

Casual fans assigning rough probabilities ("I'd give it 75 percent") tend to overstate their certainty in fixtures they have a rooting interest in. The 75 percent usually corresponds, in practice, to maybe a 55 to 60 percent hit rate.

Pundits on television speak in confident tones that imply probabilities of 90 percent or more, but if you scored the actual hit rate of their stated reads, you would find them clustered around 65 to 70 percent. The confidence is theatre; the calibration is mediocre.

Even sharp analysts with strong process can miss on calibration if they have not been measuring it. The temptation to express conviction inflates the stated probability away from the underlying read.

Why calibration matters more than accuracy

Accuracy is the simpler measure: were you right or wrong? Calibration is the deeper measure: did the conviction you assigned match how often you actually turned out to be right?

Consider two analysts. Both predict 100 fixtures. Both get 60 right. Their accuracy is identical at 60 percent.

Analyst A predicted each home side at 60 percent. Their predictions and their hit rate matched exactly. They were calibrated.

Analyst B predicted each home side at 90 percent. Their predictions implied near-certainty, but their hit rate was 60 percent. They were dramatically over-confident.

The two analysts had identical accuracy but completely different probabilistic skill. Analyst A produced predictions that contained real information about each fixture's true difficulty. Analyst B produced predictions that systematically misled anyone who read them at face value.
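A quick numerical check makes the gap visible. The sketch below scores the two hypothetical analysts with the two-outcome form of the Brier score (home win versus not), so the numbers are not on the same scale as the three-way figures quoted later in this article, but the ordering is the point: the calibrated analyst scores better despite identical accuracy.

```python
def brier(prob, outcome):
    """Two-outcome Brier contribution for one prediction: (probability - outcome)^2."""
    return (prob - outcome) ** 2

# Both analysts back the home side in 100 fixtures; the home side wins 60 of them.
outcomes = [1] * 60 + [0] * 40   # 1 = home win, 0 = not

analyst_a = sum(brier(0.60, o) for o in outcomes) / len(outcomes)   # calibrated at 60%
analyst_b = sum(brier(0.90, o) for o in outcomes) / len(outcomes)   # over-confident at 90%

print(f"Analyst A Brier: {analyst_a:.2f}")   # 0.24 -- lower (better)
print(f"Analyst B Brier: {analyst_b:.2f}")   # 0.33 -- penalised for over-confidence
```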

Calibration is the property that makes a probability number useful. Accuracy alone tells you nothing about that.

Why football is hard to calibrate on

Football is one of the harder sports to calibrate on. Three reasons.

Outcomes are sparse. Each fixture resolves to a single result drawn from just three possibilities (home win, draw, away win). Compared with NBA basketball, where the score margin itself carries continuous information, football's lumpy outcome distribution makes calibration noisier to measure per fixture.

Variance is high. Football is a low-scoring sport where individual events (a single shot, a single mistake) can decide a fixture. The model can be perfectly calibrated and still see surprising individual outcomes. Calibration only emerges over hundreds of fixtures.

Stakes vary. Cup fixtures, derbies, and end-of-season matches do not behave like routine fixtures. A model calibrated on regular-season league play may be miscalibrated on these higher-stakes fixtures unless it explicitly models the difference.

The combination means football calibration takes a long sample to verify. Twenty fixtures is not enough; you need hundreds.
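A back-of-envelope calculation shows why. If the true hit rate at a given prediction level is 60 percent, the observed hit rate on a small sample wanders substantially through chance alone; the standard error of a proportion, sqrt(p(1-p)/n), shrinks only with the square root of the sample size. The sketch below is illustrative arithmetic, not Tactiq's verification code.

```python
import math

# How far can the observed hit rate drift from a true 60% just through sampling noise?
p = 0.60
for n in (20, 100, 400):
    se = math.sqrt(p * (1 - p) / n)        # standard error of a proportion
    low, high = p - 2 * se, p + 2 * se     # rough 95% band
    print(f"n = {n:>3}: observed hit rate roughly {low:.2f} to {high:.2f}")

# n =  20: roughly 0.38 to 0.82  -- far too wide to say anything about calibration
# n = 100: roughly 0.50 to 0.70
# n = 400: roughly 0.55 to 0.65  -- only now does a five-point miscalibration stand out
```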

How Tactiq verifies its calibration

Tactiq tracks the Brier score of the analysis output against actual outcomes across all decided fixtures in the 50 featured leagues. The Brier score is the standard error metric for probabilistic predictions (lower is better: 0 is a perfect score, and a uniform one-third guess across home win, draw, and away win scores 0.667).
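For readers who want the formula, the three-way Brier score for a single fixture is the sum of squared differences between the predicted probabilities for home win, draw, and away win and the 0/1 indicator of what actually happened; the league figure is the average over fixtures. A minimal sketch, using hypothetical probabilities rather than Tactiq's output:

```python
def brier_three_way(probs, result):
    """Three-way Brier score for one fixture.

    probs  -- predicted probabilities for (home win, draw, away win), summing to 1
    result -- index of what actually happened: 0 = home win, 1 = draw, 2 = away win
    """
    return sum((p - (1.0 if i == result else 0.0)) ** 2 for i, p in enumerate(probs))

print(brier_three_way((1.0, 0.0, 0.0), 0))               # 0.0   -- perfect prediction
print(round(brier_three_way((1/3, 1/3, 1/3), 0), 3))     # 0.667 -- uniform random guess
print(round(brier_three_way((0.55, 0.25, 0.20), 0), 3))  # 0.305 -- hypothetical fixture, home side wins
```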

The model's per-league Brier scores currently sit in the 0.18 to 0.24 band. The lower end (around 0.18) corresponds to leagues with the deepest historical data: Premier League, Bundesliga, La Liga. The upper end (0.24) corresponds to leagues with sparser historical samples or higher fixture variance.

A Brier score around 0.20 corresponds roughly to this level of calibration: when the model says 60 percent, the actual hit rate lands between 55 and 65 percent across a large sample. Deviations inside that band sit within the noise the underlying sport allows.

The model is recalibrated periodically against the most recent two seasons of data, so calibration is maintained as squad strength shifts and league dynamics evolve.

How user calibration is tracked

Premium users get a personal calibration tracker on the History page. The tracker uses the same Brier methodology to score the user's predictions (which, for most users, are equivalent to the model's predictions, since the user is reading the model's probabilities directly).

Users diverge from the model when they apply the simulator's overrides (lineup-out, motivation, recent-form). Their predictions then differ from the base analysis according to their judgment about the override inputs. The personal calibration score captures whether those judgments improve calibration or hurt it.

A user who applies overrides skillfully, only when they have specific information the model lacks, will see their personal calibration come out better than the model's. A user who applies overrides recklessly or out of hope will see it come out worse.

The score is a feedback signal. It is most useful interpreted as a long-run trend rather than a fixture-by-fixture measure.
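As a sketch of what the comparison looks like in principle (hypothetical numbers, not the History page's actual implementation), the same scoring can be applied twice over the same fixtures, once to the model's base probabilities and once to the user's overridden probabilities, and the two averages compared:

```python
def mean_brier(probs, outcomes):
    """Average two-outcome Brier score over a set of home-win probabilities."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)

# Hypothetical sample of five fixtures: base model probabilities, the user's
# overridden probabilities for the same fixtures, and the actual results.
model_probs = [0.60, 0.45, 0.70, 0.55, 0.30]
user_probs  = [0.70, 0.45, 0.60, 0.65, 0.30]   # overrides applied on three fixtures
results     = [1,    0,    1,    1,    0]      # 1 = home win, 0 = not

print(f"Model Brier: {mean_brier(model_probs, results):.3f}")   # 0.149
print(f"User  Brier: {mean_brier(user_probs,  results):.3f}")   # 0.133 -- lower, so the overrides helped here
```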

What calibrated probabilities cannot do

A calibrated probability tells you the long-run hit rate of predictions at that level. It does not tell you:

Which specific fixture will hit. A 60 percent home win prediction does not say this fixture will be won by the home side. It says this fixture is in a class of fixtures where the home side wins six times out of ten.

Whether the model has all the relevant inputs. A model can be calibrated against the inputs it has seen and still miss on fixtures with unusual context the model does not capture (manager change, unusual injuries, stadium relocation).

Whether you should act on the probability. Calibration is a property of the prediction. What you do with the prediction is a separate question, governed by your decision-making framework, your risk tolerance, and (for fans interested in market context) the prices being offered. Tactiq produces calibrated probabilities; what users do on top of them is up to them.

Putting it together

Calibration is the most precise way to measure whether a probabilistic prediction is meaningful. A calibrated 60 percent prediction is more valuable than an uncalibrated 90 percent prediction, even though the latter sounds more confident. Calibration is the foundation underneath every claim that "the probability says X percent".

Tactiq tracks model calibration at the league level and user calibration at the personal level. The two together give a Premium user a clear sense of whether the analyses they are reading are well-grounded and whether their own judgment about overrides is improving or hurting that grounding.

A 60 percent prediction should mean 60 percent. When it does, the probability is doing real work. When it does not, the number is decoration. Calibration is the difference.

Frequently Asked Questions

What does it mean for a probability to be calibrated?
A probability is calibrated when, over many predictions of the same probability, the outcome happens that fraction of the time. If you predict 60 percent home win 100 times and the home side wins 60 of those fixtures, you are calibrated. If they win 40, you are over-confident; if they win 80, you are under-confident.
Why does calibration matter more than accuracy?
Accuracy treats all predictions as binary (right or wrong). Calibration captures the strength of conviction. A 70 percent prediction that hits is more impressive than a 51 percent prediction that hits, because the 70 percent prediction took on more risk by being more specific. Calibration measures whether you backed your conviction at the right level.
Are most football analysts calibrated?
Most are not. Casual fans systematically over-predict outcomes they want to see. Pundits often predict at 80 to 90 percent confidence levels that, in practice, hit only 55 to 65 percent of the time. The calibration gap is one of the clearest tells for whether someone is reasoning probabilistically or just expressing confidence.
How does Tactiq verify its model is calibrated?
By tracking the Brier score of model output against actual outcomes across all decided fixtures in featured leagues. The model's Brier scores per league sit in a 0.18 to 0.24 band, well below the 0.667 random-guessing baseline. The model is meaningfully better calibrated than naive baselines.
What happens if I am consistently miscalibrated?
The personal calibration tracker in Tactiq's History page surfaces your gap. If you are systematically over-confident in home sides, your calibration score will be lower than if you read the model's probabilities at face value. Over time, the score gives you a self-correcting signal.
Can I trust the model's calibration enough to use it for decisions?
Tactiq does not produce decision recommendations. The model produces calibrated probabilities. What decisions you make on top of those probabilities is your responsibility. Tactiq does not provide betting advice, tips, or guarantees of outcome.