Brier-Calibrated Success Score: What 78% Calibration Actually Means
When you open the History page in Tactiq Premium, you see a number called Prediction Calibration. It might say 78 percent (Good), or 82 percent (Good), or 67 percent (Average). It is colored according to which tier it falls in, and the categorical label appears in parentheses next to the percentage.
This article is about what that number actually means. Calibration is one of the most misread statistics in probabilistic analysis, and reading it correctly changes how you use the rest of the app.
Calibration is not accuracy
The most common misread is to assume Prediction Calibration is the same thing as accuracy. It is not.
Accuracy is binary. You said something would happen, and it either happened or it did not. The accuracy score is the percentage of times you were right.
Calibration is graded. You assigned a probability to something happening. Calibration measures whether your probability was close to the truth, not whether you happened to be on the right side of 50 percent.
A worked example. Suppose over 100 fixtures you say each home side has a 65 percent chance of winning. The home side wins 65 of those fixtures. Your accuracy is 65 out of 100, which sounds mediocre. But your calibration is perfect: the probability you assigned matched reality exactly.
Now suppose another analyst goes through the same 100 fixtures and says each home side has a 99 percent chance of winning. The home side wins 65 of those fixtures. Their accuracy is also 65 out of 100, but their calibration is terrible: they were wildly overconfident. Their predictions, while equally accurate by the binary measure, contained much less information about each fixture's actual difficulty.
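To make the distinction concrete, here is a minimal sketch of the two analysts above in code, using a simplified two-outcome Brier score (home win versus not) rather than the full three-outcome version described in the next section:

```python
# Illustration only: 100 fixtures, of which the home side wins 65.
outcomes = [1] * 65 + [0] * 35  # 1 = home win, 0 = no home win

def two_outcome_brier(prob_home: float, won: int) -> float:
    # Squared error on both outcomes: home win and its complement.
    return (prob_home - won) ** 2 + ((1 - prob_home) - (1 - won)) ** 2

for name, p in [("Analyst A (65%)", 0.65), ("Analyst B (99%)", 0.99)]:
    accuracy = sum((p > 0.5) == bool(won) for won in outcomes) / len(outcomes)
    avg_brier = sum(two_outcome_brier(p, won) for won in outcomes) / len(outcomes)
    print(f"{name}: accuracy {accuracy:.0%}, Brier {avg_brier:.3f}")
# Analyst A (65%): accuracy 65%, Brier 0.455
# Analyst B (99%): accuracy 65%, Brier 0.686
```

Same accuracy, very different Brier scores: the squared penalty punishes the overconfident analyst on every fixture the home side loses.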
Calibration is the deeper measure of probabilistic skill, and it is what Tactiq tracks.
The Brier score, briefly
The math underneath Prediction Calibration is the Brier score, a metric introduced in 1950 by the meteorologist Glenn Brier. The score is the average squared difference between predicted probability and actual outcome.
For a single fixture: take the predicted probability of home win, subtract whether the home side actually won (1 or 0), square the difference. Do the same for away win and draw. Sum them. That is the Brier score for that fixture. Average across all decided fixtures to get an overall Brier score.
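In code, a sketch of that computation (the function names and dictionary shape are illustrative, not Tactiq's internal API):

```python
def fixture_brier(probs: dict[str, float], result: str) -> float:
    """Three-outcome Brier score for one fixture.

    probs holds predicted probabilities for "home", "draw", "away"
    (they should sum to 1); result is the outcome that happened."""
    return sum((p - (1.0 if outcome == result else 0.0)) ** 2
               for outcome, p in probs.items())

def overall_brier(fixtures: list[tuple[dict[str, float], str]]) -> float:
    """Average the per-fixture scores across all decided fixtures."""
    return sum(fixture_brier(p, r) for p, r in fixtures) / len(fixtures)

# A 55/25/20 read on a fixture the home side wins:
# (0.55 - 1)^2 + 0.25^2 + 0.20^2 = 0.2025 + 0.0625 + 0.04
print(round(fixture_brier({"home": 0.55, "draw": 0.25, "away": 0.20}, "home"), 3))  # 0.305
```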
The math has three convenient properties:
- A perfect prediction (probability 1.0 for the outcome that actually happens) gets a Brier score of 0.
- A maximally wrong prediction (probability 0 for the outcome that happens, 1.0 for one that does not) gets a Brier score of 2.
- Random guessing (one third for each of the three outcomes) scores 2/3, about 0.667, on every single fixture, not just on average.
The Brier score has a clear top and bottom, and it weights misses by how badly they missed, not just whether they missed.
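Plugging the boundary cases into the fixture_brier sketch above confirms all three properties:

```python
# Perfect prediction: probability 1.0 on the actual outcome.
assert fixture_brier({"home": 1.0, "draw": 0.0, "away": 0.0}, "home") == 0.0

# Maximally wrong: all probability on an outcome that did not happen.
assert fixture_brier({"home": 0.0, "draw": 1.0, "away": 0.0}, "home") == 2.0

# Uniform guessing scores 2/3 on every fixture, whatever the result.
uniform = {"home": 1/3, "draw": 1/3, "away": 1/3}
print(round(fixture_brier(uniform, "away"), 3))  # 0.667
```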
Why Tactiq inverts and rescales
The raw Brier score is unintuitive for fans. "Your Brier score is 0.42" communicates nothing on its own, and most fans read "lower is better" backwards on first encounter.
Tactiq inverts and rescales the Brier score into a 0 to 100 percent calibration score using the formula round((1 - brier / 2) * 100), clamped to 0 to 100; a code sketch follows the mapping below. The mapping:
- Brier 0.0 (perfect) maps to calibration 100 percent.
- Brier 1.0 (poor) maps to calibration 50 percent.
- Brier 2.0 (worst case) maps to calibration 0 percent.
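A minimal sketch of that mapping (Tactiq's production code may differ in details such as rounding mode):

```python
def calibration_score(brier: float) -> int:
    """Map an average Brier score (0 to 2) to a 0-100 calibration percentage."""
    score = round((1 - brier / 2) * 100)
    return max(0, min(100, score))  # clamp to the displayable range

print(calibration_score(0.0))   # 100 (perfect)
print(calibration_score(1.0))   # 50
print(calibration_score(2.0))   # 0 (worst case)
print(calibration_score(0.44))  # 78, the score in this article's title
```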
A typical user's Brier score lands somewhere between 0.30 and 0.55, which corresponds to a calibration score of 73 to 85 percent. That range is the meat of the distribution.
The four-tier color labels in Tactiq map as follows (a threshold sketch in code follows this list):
- 85 percent and above (Very Good). Brier 0.30 or below. Genuinely strong calibration. Unusual without significant time spent reviewing analyses.
- 75 to 84 percent (Good). Brier 0.32 to 0.50. The most common range for engaged Premium users. Indicates the user's reads are reasonably calibrated against the model and against reality.
- 65 to 74 percent (Average). Brier 0.52 to 0.70. Calibration is meaningfully better than random but has clear room to improve.
- Below 65 percent (Needs Work). Brier above 0.70. Closer to random than to skilled. Worth reviewing which analyses went wrong and why.
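The tier boundaries translate directly into threshold checks. A sketch, with the labels as they appear in the app:

```python
def tier(calibration: int) -> str:
    """Map a 0-100 calibration score to its display tier."""
    if calibration >= 85:
        return "Very Good"
    if calibration >= 75:
        return "Good"
    if calibration >= 65:
        return "Average"
    return "Needs Work"

print(tier(78))  # Good
print(tier(67))  # Average
```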
The 10-fixture threshold
Tactiq does not show a calibration percentage until at least 10 decided fixtures are in your history. Until then, the History header shows "Not enough decided analyses yet".
The reason is statistical. With three or four decided fixtures, a single sharply wrong prediction can move the Brier score by 0.10 or more. The user sees their score swing from "Good" to "Needs Work" between two consecutive fixtures, even though their underlying skill has not changed.
Ten fixtures is enough of a sample to suppress most of that volatility. The score still moves between fixtures, but the moves are smaller and the user is no longer flipping between tiers. The threshold is documented in the app and was chosen specifically to avoid the early-history flip-flop experience.
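A quick illustration of that volatility, with assumed numbers rather than real user data: start from a Good-tier average of 0.45 and add one badly missed fixture scoring 1.8.

```python
def average_after_miss(avg: float, n: int, miss: float = 1.8) -> float:
    """Average Brier over n fixtures after one new fixture scoring `miss`."""
    return (avg * n + miss) / (n + 1)

for n in (3, 10, 50):
    print(f"n={n}: Brier 0.45 -> {average_after_miss(0.45, n):.2f}")
# n=3:  0.45 -> 0.79  (calibration 78 -> 61, two tiers down)
# n=10: 0.45 -> 0.57  (calibration 78 -> 71, a far smaller swing)
# n=50: 0.45 -> 0.48  (calibration 78 -> 76, barely visible)
```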
What calibration does not measure
Calibration measures whether your probabilistic reads track reality. It does not measure:
Edge. Calibration says nothing about whether your reads beat the closing market price. A user can be perfectly calibrated and still have no economic edge if their probabilities exactly match the market.
Selection. Calibration is computed across whatever fixtures you actually analyzed. It does not penalize you for analyzing easy fixtures. A user who only analyzes top vs bottom fixtures will likely have higher calibration than a user who analyzes tight midweek mid-table fixtures, but the lower-calibration user is doing the harder analytical work.
Consistency over time. Calibration is computed over your full history. A user whose calibration was 60 percent for 50 fixtures and then 90 percent for the next 50 fixtures will show 75 percent overall, even though they have clearly improved.
For these reasons, calibration is one input to your analytical self-assessment, not the whole picture. But it is the most precise single number Tactiq can give you about your probabilistic skill, and it is the one we surface.
How to use the score
The calibration score is most useful when you treat it as a feedback signal rather than a leaderboard.
If your score drops below your usual baseline by 5 points or more, that is a signal to review your last 10 to 20 analyses. Look for systematic biases: were you consistently too confident in home sides? Did you miss on a specific league? Were the misses concentrated on fixtures where you applied simulator overrides?
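One way to run that review systematically, assuming you keep your own log of per-fixture Brier scores (a hypothetical data shape; Tactiq does not expose this as an export):

```python
def calibration(brier_scores: list[float]) -> int:
    """Average Brier mapped to the 0-100 calibration scale."""
    avg = sum(brier_scores) / len(brier_scores)
    return max(0, min(100, round((1 - avg / 2) * 100)))

# Hypothetical per-fixture Brier scores, oldest first.
history = [0.35, 0.42, 0.51, 0.28, 0.60, 0.44, 0.90, 1.10, 0.85, 0.95]

baseline = calibration(history)     # full-history score: 68
recent = calibration(history[-4:])  # most recent analyses: 52
if baseline - recent >= 5:
    print(f"Recent form {recent}% is {baseline - recent} points below "
          f"your {baseline}% baseline: review those fixtures for a pattern.")
```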
If your score climbs above your usual baseline, do not assume you have suddenly become a sharper analyst. Variance moves Brier scores in both directions. Wait for 20 to 30 more fixtures before crediting the improvement to skill.
The score is most useful as a long-run mirror. Your calibration after 200 decided fixtures is a real read on your probabilistic skill. Your calibration after 12 is a starting point. The number's value compounds with time spent in the app.