Brier-Calibrated Success Score: What 78% Calibration Actually Means
When you open the History page in Tactiq Premium, you see a number called Prediction Calibration. It might say 78 percent (Good), or 82 percent (Good), or 67 percent (Average). It is colored according to which tier it falls in, and the categorical label appears in parentheses next to the percentage.
This article is about what that number actually means. Calibration is one of the most misread statistics in probabilistic analysis, and reading it correctly changes how you use the rest of the app.
Calibration is not accuracy
The most common misread is to assume Prediction Calibration is the same thing as accuracy. It is not.
Accuracy is binary. You said something would happen, and it either happened or it did not. The accuracy score is the percentage of times you were right.
Calibration is graded. You assigned a probability to something happening. Calibration measures whether your probability was close to the truth, not whether you happened to be on the right side of 50 percent.
A worked example. Suppose over 100 fixtures you say each home side has a 65 percent chance of winning. The home side wins 65 of those fixtures. Your accuracy is 65 out of 100, which sounds mediocre. But your calibration is perfect: the probability you assigned matched reality exactly.
Now suppose another analyst goes through the same 100 fixtures and says each home side has a 99 percent chance of winning. The home side wins 65 of those fixtures. Their accuracy is also 65 out of 100, but their calibration is terrible: they were wildly overconfident. Their predictions, while equally accurate by the binary measure, contained much less information about each fixture's actual difficulty.
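To make the distinction concrete, here is a minimal sketch of the two analysts above in code, using a simplified two-outcome Brier score (home win versus not) rather than the full three-outcome version described in the next section:

```python
# Illustration only: 100 fixtures, of which the home side wins 65.
outcomes = [1] * 65 + [0] * 35  # 1 = home win, 0 = no home win

def two_outcome_brier(prob_home: float, won: int) -> float:
    # Squared error on both outcomes: home win and its complement.
    return (prob_home - won) ** 2 + ((1 - prob_home) - (1 - won)) ** 2

for name, p in [("Analyst A (65%)", 0.65), ("Analyst B (99%)", 0.99)]:
    accuracy = sum((p > 0.5) == bool(won) for won in outcomes) / len(outcomes)
    avg_brier = sum(two_outcome_brier(p, won) for won in outcomes) / len(outcomes)
    print(f"{name}: accuracy {accuracy:.0%}, Brier {avg_brier:.3f}")
# Analyst A (65%): accuracy 65%, Brier 0.455
# Analyst B (99%): accuracy 65%, Brier 0.686
```

Same accuracy, very different Brier scores: the squared penalty punishes the overconfident analyst on every fixture the home side loses.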
Calibration is the deeper measure of probabilistic skill, and it is what Tactiq tracks.
The Brier score, briefly
The math underneath Prediction Calibration is the Brier score, a metric introduced in 1950 by the meteorologist Glenn Brier. The score is the average squared difference between predicted probability and actual outcome.
For a single fixture: take the predicted probability of home win, subtract whether the home side actually won (1 or 0), square the difference. Do the same for away win and draw. Sum them. That is the Brier score for that fixture. Average across all decided fixtures to get an overall Brier score.
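In code, a sketch of that computation (the function names and dictionary shape are illustrative, not Tactiq's internal API):

```python
def fixture_brier(probs: dict[str, float], result: str) -> float:
    """Three-outcome Brier score for one fixture.

    probs holds predicted probabilities for "home", "draw", "away"
    (they should sum to 1); result is the outcome that happened."""
    return sum((p - (1.0 if outcome == result else 0.0)) ** 2
               for outcome, p in probs.items())

def overall_brier(fixtures: list[tuple[dict[str, float], str]]) -> float:
    """Average the per-fixture scores across all decided fixtures."""
    return sum(fixture_brier(p, r) for p, r in fixtures) / len(fixtures)

# A 55/25/20 read on a fixture the home side wins:
# (0.55 - 1)^2 + 0.25^2 + 0.20^2 = 0.2025 + 0.0625 + 0.04
print(round(fixture_brier({"home": 0.55, "draw": 0.25, "away": 0.20}, "home"), 3))  # 0.305
```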
The math has three convenient properties:
- A perfect prediction (probability 1.0 for the outcome that actually happens) gets a Brier score of 0.
- A maximally wrong prediction (probability 0 for the outcome that happens, 1.0 for one that does not) gets a Brier score of 2.
- Random guessing (one third for each of the three outcomes) scores 2/3, about 0.667, on every single fixture, not just on average.
The Brier score has a clear top and bottom, and it weights misses by how badly they missed, not just whether they missed.
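Plugging the boundary cases into the fixture_brier sketch above confirms all three properties:

```python
# Perfect prediction: probability 1.0 on the actual outcome.
assert fixture_brier({"home": 1.0, "draw": 0.0, "away": 0.0}, "home") == 0.0

# Maximally wrong: all probability on an outcome that did not happen.
assert fixture_brier({"home": 0.0, "draw": 1.0, "away": 0.0}, "home") == 2.0

# Uniform guessing scores 2/3 on every fixture, whatever the result.
uniform = {"home": 1/3, "draw": 1/3, "away": 1/3}
print(round(fixture_brier(uniform, "away"), 3))  # 0.667
```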
Why Tactiq inverts and rescales
The raw Brier score is unintuitive for fans. "Your Brier score is 0.42" communicates nothing on its own, and most fans read "lower is better" backwards on first encounter.
Tactiq inverts and rescales the Brier score into a 0 to 100 percent calibration score using the formula round((1 - brier / 2) * 100), clamped to 0 to 100; a code sketch follows the mapping below. The mapping:
- Brier 0.0 (perfect) maps to calibration 100 percent.
- Brier 1.0 (poor) maps to calibration 50 percent.
- Brier 2.0 (worst case) maps to calibration 0 percent.
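A minimal sketch of that mapping (Tactiq's production code may differ in details such as rounding mode):

```python
def calibration_score(brier: float) -> int:
    """Map an average Brier score (0 to 2) to a 0-100 calibration percentage."""
    score = round((1 - brier / 2) * 100)
    return max(0, min(100, score))  # clamp to the displayable range

print(calibration_score(0.0))   # 100 (perfect)
print(calibration_score(1.0))   # 50
print(calibration_score(2.0))   # 0 (worst case)
print(calibration_score(0.44))  # 78, the score in this article's title
```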
A typical user's Brier score lands somewhere between 0.30 and 0.55, which corresponds to a calibration score of 73 to 85 percent. That range is the meat of the distribution.
The four-tier color labels in Tactiq map as follows (a threshold sketch in code follows this list):
- 85 percent and above (Very Good). Brier 0.30 or below. Genuinely strong calibration. Unusual without significant time spent reviewing analyses.
- 75 to 84 percent (Good). Brier 0.32 to 0.50. The most common range for engaged Premium users. Indicates the user's reads are reasonably calibrated against the model and against reality.
- 65 to 74 percent (Average). Brier 0.52 to 0.70. Calibration is meaningfully better than random but has clear room to improve.
- Below 65 percent (Needs Work). Brier above 0.70. Closer to random than to skilled. Worth reviewing which analyses went wrong and why.
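The tier boundaries translate directly into threshold checks. A sketch, with the labels as they appear in the app:

```python
def tier(calibration: int) -> str:
    """Map a 0-100 calibration score to its display tier."""
    if calibration >= 85:
        return "Very Good"
    if calibration >= 75:
        return "Good"
    if calibration >= 65:
        return "Average"
    return "Needs Work"

print(tier(78))  # Good
print(tier(67))  # Average
```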
The 10-fixture threshold
Tactiq does not show a calibration percentage until at least 10 decided fixtures are in your history. Until then, the History header shows "Not enough decided analyses yet".
The reason is statistical. With three or four decided fixtures, a single sharply wrong prediction can move the Brier score by 0.10 or more. The user sees their score swing from "Good" to "Needs Work" between two consecutive fixtures, even though their underlying skill has not changed.
Ten fixtures is enough of a sample to suppress most of that volatility. The score still moves between fixtures, but the moves are smaller and the user is no longer flipping between tiers. The threshold is documented in the app and was chosen specifically to avoid the early-history flip-flop experience.
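A quick illustration of that volatility, with assumed numbers rather than real user data: start from a Good-tier average of 0.45 and add one badly missed fixture scoring 1.8.

```python
def average_after_miss(avg: float, n: int, miss: float = 1.8) -> float:
    """Average Brier over n fixtures after one new fixture scoring `miss`."""
    return (avg * n + miss) / (n + 1)

for n in (3, 10, 50):
    print(f"n={n}: Brier 0.45 -> {average_after_miss(0.45, n):.2f}")
# n=3:  0.45 -> 0.79  (calibration 78 -> 61, two tiers down)
# n=10: 0.45 -> 0.57  (calibration 78 -> 71, a far smaller swing)
# n=50: 0.45 -> 0.48  (calibration 78 -> 76, barely visible)
```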
What calibration does not measure
Calibration measures whether your probabilistic reads track reality. It does not measure:
Edge. Calibration says nothing about whether your reads beat the closing market price. A user can be perfectly calibrated and still have no economic edge if their probabilities exactly match the market.
Selection. Calibration is computed across whatever fixtures you actually analyzed. It does not penalize you for analyzing easy fixtures. A user who only analyzes top vs bottom fixtures will likely have higher calibration than a user who analyzes tight midweek mid-table fixtures, but the lower-calibration user is doing the harder analytical work.
Consistency over time. Calibration is computed over your full history. A user whose calibration was 60 percent for 50 fixtures and then 90 percent for the next 50 fixtures will show 75 percent overall, even though they have clearly improved.
For these reasons, calibration is one input to your analytical self-assessment, not the whole picture. But it is the most precise single number Tactiq can give you about your probabilistic skill, and it is the one we surface.
How to use the score
The calibration score is most useful when you treat it as a feedback signal rather than a leaderboard.
If your score drops below your usual baseline by 5 points or more, that is a signal to review your last 10 to 20 analyses. Look for systematic biases: were you consistently too confident in home sides? Did you miss on a specific league? Were the misses concentrated on fixtures where you applied simulator overrides?
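One way to run that review systematically, assuming you keep your own log of per-fixture Brier scores (a hypothetical data shape; Tactiq does not expose this as an export):

```python
def calibration(brier_scores: list[float]) -> int:
    """Average Brier mapped to the 0-100 calibration scale."""
    avg = sum(brier_scores) / len(brier_scores)
    return max(0, min(100, round((1 - avg / 2) * 100)))

# Hypothetical per-fixture Brier scores, oldest first.
history = [0.35, 0.42, 0.51, 0.28, 0.60, 0.44, 0.90, 1.10, 0.85, 0.95]

baseline = calibration(history)     # full-history score: 68
recent = calibration(history[-4:])  # most recent analyses: 52
if baseline - recent >= 5:
    print(f"Recent form {recent}% is {baseline - recent} points below "
          f"your {baseline}% baseline: review those fixtures for a pattern.")
```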
If your score climbs above your usual baseline, do not assume you have suddenly become a sharper analyst. Variance moves Brier scores in both directions. Wait for 20 to 30 more fixtures before crediting the improvement to skill.
The score is most useful as a long-run mirror. Your calibration after 200 decided fixtures is a real read on your probabilistic skill. Your calibration after 12 is a starting point. The number's value compounds with time spent in the app.