Prediction market performance can be assessed in a variety of ways. Recently, SciCast researchers have been taking a closer look at the market's accuracy. A commonly used scoring rule is the Brier score, which functions much like the squared error between forecasts and question outcomes.
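As a minimal sketch (not SciCast's actual scoring code), a Brier score for binary questions can be computed like this:

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes.

    forecasts: probabilities in [0, 1]; outcomes: 0 or 1 resolutions.
    Lower is better; 0 is perfect.
    """
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Example with made-up numbers: confident, correct forecasts score near 0.
print(brier_score([0.9, 0.2, 0.6], [1, 0, 1]))  # 0.07
```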
One component of the Brier score is calibration, also known as reliability. Calibration refers to the agreement between the observed probabilities of outcomes and the predicted probabilities of outcomes. Well-calibrated forecasts are good indicators of the chances that events will occur. For example, if our market is well calibrated and reports that the chance of a new invention is 0.6, the chance must really be about 0.6. That invention might not occur, and the forecast would look inaccurate in some respects, but if we could observe the real probability and it were close to the estimated probability, we would still say the forecast is well calibrated.
Because we cannot observe the real probability of a one-time event, we cannot tease calibration apart from other forms of accuracy without grouping forecasts. After the questions in a selected group have resolved, we can estimate the chance that they would occur as the average of their resolutions. For binary questions, this is simply the proportion of questions in the group that resolved as 1 rather than 0. We then compute the average forecast on the group of questions and compare it to our estimate of the real probability.
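As a small illustration with hypothetical numbers (not real SciCast data):

```python
# A hypothetical group of five resolved binary questions.
avg_forecasts = [0.70, 0.80, 0.65, 0.75, 0.60]  # each question's average forecast
resolutions   = [1,    1,    0,    1,    1]     # how each question resolved

observed = sum(resolutions) / len(resolutions)            # 0.8, estimate of the real probability
mean_forecast = sum(avg_forecasts) / len(avg_forecasts)   # 0.7, the group's average forecast

# mean_forecast below observed would suggest underconfidence for this group.
```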
The grouping approach to estimating calibration is not without drawbacks, but it has become standard practice. Rather than create a single summary statistic, I'm presenting the calibration measurement in the graph below, which plots the observed probability of events as a function of the average forecast. Forecasts on continuous questions were rescaled onto the [0,1] interval so they could be treated like binary questions and the individual options of multiple-choice questions. To form the groups, I created 20 equally sized bins on the range of forecasts and placed each question in a bin based on its average forecast over time.
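A rough sketch of that binning step, assuming 20 equal-width bins over [0,1] (the exact bin boundaries and data handling in the real analysis may differ):

```python
import numpy as np

def calibration_curve(avg_forecast, resolution, n_bins=20):
    """Sketch of binned calibration points, assuming equal-width bins over [0, 1].

    avg_forecast: each question's forecast averaged over its run
    resolution:   each question's outcome, rescaled to [0, 1]
    """
    avg_forecast = np.asarray(avg_forecast, dtype=float)
    resolution = np.asarray(resolution, dtype=float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(avg_forecast, edges) - 1, 0, n_bins - 1)

    points = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            points.append((avg_forecast[mask].mean(),   # x: average forecast in the bin
                           resolution[mask].mean()))    # y: observed probability in the bin
    return points  # plot these points against the diagonal y = x
```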
The diagonal line represents perfect calibration: the forecast and the observed probability would be equal for every group of questions. For the other lines we should expect some noise unless we have thousands of resolved questions, but we should hope that the points aren't consistently on one side of that thin black line. However, the points for all types of questions tend to fall to the upper left of the diagonal, indicating underconfidence in the chances that the events in question will occur. That result is disappointing, but it could be worse: the market could be systematically overconfident, which tends to have more severe consequences.
A second result worth noting is that the underconfidence seems to increase as events become more likely. We have a problem: all of SciCast's long-run forecasts (on resolved questions) above 0.75 were for events that ultimately occurred. To be well calibrated, the market should have forecast close to 1.0 on those events, but there isn't a single question on which it did so. Few questions averaged forecasts above 0.85 over their runs, and none averaged above 0.95.
There is a reasonable possibility that the low end of the blue line would look more like the red line if some of the options on multiple-choice questions were evaluated independently. Because SciCast strives for clean resolutions, every possible outcome of a multiple-choice question must be listed, even when any sensible user recognizes that its chance of occurring is quite low. Maybe being well calibrated on all those small possibilities is simply easy; if it weren't, the graph would show over-estimation of low probabilities and under-estimation of high probabilities for both the blue and red lines.
I’d love to hear suggestions in the comments on how to improve our calibration without post-hoc corrections. Some of the underconfidence might come from questions that resolved positively early; as the preset resolution dates for more questions arrive, the observed probabilities for groups of questions will decrease. But that won’t eliminate all of the underconfidence.
Edit: At @sflicht’s request, a file with the forecasts on scaled, continuous questions is available: avg_forecasts_continuous.