Brier Score Explained: How Football Forecasts Get Graded

By Tactiq AI · 2026-05-06 · 8 min read · AI & Football

Most football prediction apps market themselves on accuracy. "70% accurate" sounds impressive. "80% of our top picks came through" sounds more impressive. Accuracy claims dominate the space. They're also almost meaningless.

The right way to grade a forecaster is not accuracy. It's calibration. A forecaster whose 70% probability picks are right 70% of the time (not 90%, not 50%) is doing the job. A forecaster whose 70% picks are right 85% of the time is under-confident (probably valuable) but not calibrated. A forecaster whose 70% picks are right 55% of the time is loud (probably useless).

The Brier score is the metric that grades calibration. It's been standard in academic forecasting research for 75 years, and it's how any football forecaster worth listening to gets graded honestly.

This article walks through what Brier actually measures, how to compute it yourself, what the benchmarks are, and why calibration is the signal you should demand from any prediction tool.

What Brier actually measures

Brier is a squared-error score between forecast and reality. The lower, the better calibrated.

For three-way football outcomes (home win, draw, away win), each match produces three forecast numbers that sum to 1.0. The actual result produces three 0-or-1 numbers (1 for the outcome that happened, 0 for the others).

Per-match formula: Brier = Σ (forecast - actual)^2 / 3, where the sum runs over the three outcomes.

So a forecast of [0.60, 0.25, 0.15] for home/draw/away on a match that ended in home win:

  • Home: (0.60 - 1.00)^2 = 0.16
  • Draw: (0.25 - 0.00)^2 = 0.0625
  • Away: (0.15 - 0.00)^2 = 0.0225
  • Sum: 0.245
  • Divide by 3: 0.0817

A forecast that said [0.95, 0.03, 0.02] on the same match (home did win):

  • Sum: 0.0025 + 0.0009 + 0.0004 = 0.0038
  • Divide by 3: ≈ 0.0013

The confident correct forecast scores far better. But the confident wrong forecast gets punished hard:

  • [0.95, 0.03, 0.02] on a match that ended in draw:
  • Home: (0.95 - 0)^2 = 0.9025
  • Draw: (0.03 - 1)^2 = 0.9409
  • Away: (0.02 - 0)^2 = 0.0004
  • Sum: 1.8438
  • Divide by 3: ≈ 0.6146, a disaster

Confidence is rewarded when justified and punished when not. Averaged over hundreds of matches, Brier separates calibrated forecasters from loud ones automatically.
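
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the per-match calculation, run against the three worked examples above:

    def brier_per_match(forecast, actual):
        """Three-way Brier for one match: mean squared error over home/draw/away."""
        return sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(forecast)

    # Worked examples from above; actual = [1, 0, 0] means the home side won.
    print(brier_per_match([0.60, 0.25, 0.15], [1, 0, 0]))  # ~0.0817
    print(brier_per_match([0.95, 0.03, 0.02], [1, 0, 0]))  # ~0.0013
    print(brier_per_match([0.95, 0.03, 0.02], [0, 1, 0]))  # ~0.6146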

Why calibration matters more than accuracy

Consider two forecasters.

Forecaster A always says 95% home, 3% draw, 2% away on every home match of a top-six side. Gets about 60% of them right.

Forecaster B says 62% home, 24% draw, 14% away on the same fixtures. Gets 62% of them right.

Who's better? Accuracy-wise they're effectively level (60% vs 62%). Calibration-wise, B is vastly ahead. A's 95% picks go wrong 40% of the time, which is terrible. B's 62% picks go right 62% of the time, which is honest.

Brier scores tell you which one is reading the underlying signal correctly. A's Brier will be dreadful because every wrong 95% call incurs a huge squared error. B's Brier will be excellent because the probabilities match reality.
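
A short sketch of the same comparison, under the illustrative assumption that the true outcome split on these fixtures is 62% home, 24% draw, 14% away (so B is perfectly calibrated and A's always-home call lands at roughly the rate described above):

    # Illustrative assumption, not data from the text: outcomes on these
    # fixtures fall 62% home, 24% draw, 14% away.
    outcome_freq = [0.62, 0.24, 0.14]

    def expected_brier(forecast, freq):
        """Long-run average Brier if outcomes follow freq and the forecast never changes."""
        total = 0.0
        for i, p_true in enumerate(freq):
            actual = [1 if j == i else 0 for j in range(len(freq))]
            total += p_true * sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(freq)
        return total

    print(expected_brier([0.95, 0.03, 0.02], outcome_freq))  # ~0.235, Forecaster A
    print(expected_brier([0.62, 0.24, 0.14], outcome_freq))  # ~0.179, Forecaster B

Under those assumptions A averages about 0.235, worse than the no-information baseline covered below, while B averages about 0.179.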

This matters in three practical ways:

Risk calibration. If you use a prediction to make any decision downstream (even a casual "which match is most interesting to watch"), knowing how reliable the probability actually is matters. A 95% from a bad forecaster is worth less than a 62% from a good one.

Comparison between forecasters. You cannot compare two forecasters on raw accuracy. Someone who only picks favourites will look more "accurate" than someone who includes underdogs in their forecasts. Brier works regardless of the distribution of probabilities.

Honesty. Calibrated forecasters are less tempted to over-claim. A forecaster who knows they'll be Brier-scored doesn't boast. A forecaster who knows they'll only be accuracy-scored has incentive to only call favourites and pump the accuracy line.

Brier benchmarks for football

Rough benchmarks on Brier for three-way football outcomes (home / draw / away):

  • No-information baseline (says every match is 33/33/33): about 0.22
  • Randomly guessed probabilities: worse, roughly 0.28
  • Basic form-based model (wins-losses only): about 0.225
  • Decent public model using xG + form: 0.195 to 0.215
  • Bookmaker closing line: around 0.195
  • Elite model with event data + careful calibration: 0.185 to 0.195

Scores below 0.185 are rare; scores above 0.22 are barely beating, or losing to, the no-information baseline. Most serious analytical work lives in the 0.19 to 0.21 range, close to but usually not beating bookmaker markets (which have pricing pressure and sharp money as their calibration mechanism).
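
That no-information number is not a convention; it falls straight out of the formula, since a 33/33/33 forecast scores the same whatever happens:

    # No-information baseline: forecast 1/3 for every outcome of every match.
    # Whichever outcome occurs, the per-match score is identical.
    baseline = ((1 / 3 - 1) ** 2 + (1 / 3) ** 2 + (1 / 3) ** 2) / 3
    print(baseline)  # 0.2222..., the score any real model has to beat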

How Tactiq thinks about Brier and calibration

Tactiq runs internal calibration tracking across its analysis output to confirm that the confidence indicators on match cards correspond to real-world outcome frequencies at the expected rate. A confidence indicator that says "high confidence" should map to matches where the top probability genuinely plays out at a high rate. A confidence indicator that says "tight" should map to matches where outcomes are genuinely variable.

The specific Brier values, the calibration dashboards, and the re-tuning cadence stay within the product: published methodology tends to get copied and miscalibrated within weeks. What reaches the user is confidence-qualified analysis where the confidence indicator has been calibrated against actual outcomes rather than invented as a marketing signal.
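
For readers who want to run this kind of check on any forecaster's output, here is a generic calibration-bucket sketch. It illustrates the general technique, not Tactiq's internal tooling, and the function name and bucket width are made up for the example:

    from collections import defaultdict

    def calibration_table(forecasts, outcomes, bucket_width=0.10):
        """forecasts: list of (p_home, p_draw, p_away); outcomes: 0, 1 or 2."""
        buckets = defaultdict(lambda: [0.0, 0, 0])  # [sum of stated top probability, hits, count]
        for probs, outcome in zip(forecasts, outcomes):
            top = max(range(3), key=lambda i: probs[i])  # index of the forecaster's top pick
            b = int(probs[top] / bucket_width)           # e.g. 0.60-0.69 share a bucket
            buckets[b][0] += probs[top]
            buckets[b][1] += 1 if top == outcome else 0
            buckets[b][2] += 1
        for b in sorted(buckets):
            stated_sum, hits, n = buckets[b]
            print(f"stated ~{stated_sum / n:.2f}  observed {hits / n:.2f}  (n={n})")

On a calibrated forecaster the stated and observed columns track each other; on a loud one the stated column sits well above the observed one.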

What the user sees on the match card:

  • Probability triples for the outcome, with a visible confidence indicator that maps to a genuine calibration band.
  • Expected goals for each side.
  • A written analysis that explains the read in plain language.
  • No external market data anywhere. No redirects to third-party platforms. No virtual currency. Statistical analysis only.

The confidence indicator is the user-facing handle on calibration. "High confidence" means the signal is strong; "tight" means outcomes have been genuinely variable.

The takeaway

Brier score is how forecasters actually get graded. A tool that advertises accuracy without showing calibration is asking you to trust the loud-forecaster pattern. A tool that is willing to be graded on calibration, that surfaces confidence indicators which genuinely reflect outcome variability, is the one that will hold up over time.

You can compute Brier yourself on any forecaster's predictions, if they publish the probabilities alongside results. The formula is simple, the benchmarks are well-known, and the honest grade takes a few minutes of spreadsheet work.
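
As one way that spreadsheet work could look in code, here is a short Python sketch, assuming a hypothetical forecasts.csv with columns prob_home, prob_draw, prob_away and result ("H", "D" or "A"):

    import csv

    index = {"H": 0, "D": 1, "A": 2}
    scores = []
    with open("forecasts.csv", newline="") as f:
        for row in csv.DictReader(f):
            probs = [float(row["prob_home"]), float(row["prob_draw"]), float(row["prob_away"])]
            actual = [1 if i == index[row["result"]] else 0 for i in range(3)]
            scores.append(sum((p - a) ** 2 for p, a in zip(probs, actual)) / 3)

    print(sum(scores) / len(scores))  # average Brier: compare against the benchmarks above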

Tactiq builds calibration into the user-facing confidence indicator and validates it internally. The analysis reads each match with confidence that matches the underlying uncertainty, rather than making loud claims that don't survive scrutiny over a large sample. 1,200-plus competitions, 32-language localisation, free tier of eight analyses per day, no credit card required.

If you've been following the series, the metrics vocabulary now covers how AI predicts football matches, xG, xA, npxG, PPDA, Field Tilt, progressive actions, SCA/GCA, xPts and Elo ratings. Brier joins them as the meta-metric that grades every other metric's honesty.

Frequently Asked Questions

What is a Brier score in one sentence?
Brier score measures how far a forecaster's probability estimates are from reality, averaged across all their forecasts. Lower is better. A perfect forecaster gets a Brier score of 0; a forecaster who just says 33/33/33 on every match scores about 0.22 on three-way football outcomes.
How is it actually calculated?
For each match, take the forecaster's probability for each outcome (home, draw, away), and the actual result (1 for the outcome that happened, 0 for the others). Compute (forecast - actual)^2 for each outcome, sum them, and divide by the number of outcomes. Average that across all matches. The lower the resulting number, the tighter the forecaster's probabilities map to what actually happened.
Why is calibration more important than accuracy?
An 'accurate' forecaster might get the top pick right 60% of the time. But what you really want is a forecaster whose 70% picks are right 70% of the time and whose 45% picks are right 45% of the time. A forecaster who says '90%' on everything and is right 60% of the time is loud. A calibrated forecaster matches their confidence to reality.
Does Tactiq publish its Brier score?
Tactiq does run internal calibration tracking across its analysis output to confirm that the confidence indicators match real-world outcomes at the expected rate. The specific methodology and current Brier score values stay within the product. For a user, the effect shows up as a confidence indicator on each analysis that genuinely reflects how uncertain the read is.
What's a good Brier score for football?
For three-way match outcomes, the no-information 33/33/33 baseline scores about 0.22 Brier. A bookmaker market scores around 0.195. A well-built football model scores in the 0.19-0.21 range. Anything lower than 0.19 on a large sample is elite. Anything above 0.22 is underperforming. These are ballpark figures; exact thresholds depend on league mix and sample size.
Can I compute my own Brier score on predictions?
Yes. You need a list of forecasted probabilities (home/draw/away for each match) and the actual result. Apply the formula, average across matches, compare to the benchmarks above. If you do this regularly against any tool or tipster, you get an honest grade that doesn't rely on marketing claims.