FantasySCOTUS: Randomness, Accuracy, “Power Predictors,” and Implications

May 6th, 2011

Since we posted our article on FantasySCOTUS, we have received an outpouring of interest in a Supreme Court prediction market. We have also received a number of comments about how we can improve our analysis. We are in the process of refining our article to reflect these suggestions. In this blog post, we’d like to explain our strategy. First, we address how randomness affects our data. Second, we note how we selected our “experts,” or as we probably should have called them, “Power Predictors.” Third, we explain what we mean when we say our predictions are “accurate,” distinguishing between ex post aggregate analysis and ex ante individual case predictions. Fourth, we look at the implications of FantasySCOTUS.


Randomness, Flipping Coins, and Monkeys on Typewriters

A number of professors have asked us whether our results are distinguishable from random predictions of Supreme Court outcomes. More simply, is FantasySCOTUS more than a bunch of monkeys with typewriters? With a large number of predictors (we had nearly 5,000 members), some predictions would tend to be accurate due to mere chance. How do the top predictors’ accuracy rates compare to the expected distribution for predictors with a 50% chance of getting each case right, such as by flipping a coin? Or, invoking the Infinite Monkey Theorem, wouldn’t a never-ending room packed with prognosticating primates yield at least one “expert” who predicts every case correctly?

We have insulated our analysis from randomness as well as any statistical method can. Wherever we presented data in our analysis, we indicated whether it was significant, and at what confidence level (90%, 95%, or 99%). At these confidence levels, we can be confident that the results are not merely the product of chance.

To test this, we reran the tests as a 2×2 matrix, pitting the FantasySCOTUS “experts” against a hypothetical group that predicted the same number of cases but got half right and half wrong (50/50, the same accuracy as flipping a coin).

Here were our results:

  • FantasySCOTUS “Experts” v. Supreme Court Forecasting Project “Experts”: P-value of 0.00057436585, so we reject the null hypothesis that the difference is due to chance at a 99% confidence level.
  • FantasySCOTUS “Experts” v. Supreme Court Forecasting Project Decision Tree: P-value of 0.19, so we fail to reject the null hypothesis; we cannot rule out that this difference is due to chance.
  • FantasySCOTUS “Experts” v. Random Group: P-value of 8.57195568 × 10⁻⁸, so we reject the null hypothesis that the difference is due to chance at a 99% confidence level.

The last comparison is the key one. With a P-value as small as 8.57 × 10⁻⁸, results this lopsided would almost never arise by chance alone, so we can be quite confident that the Power Predictors’ edge over coin-flipping is real.
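For the statistically inclined, here is a minimal sketch of what a 2×2 test of this kind looks like, using a Fisher exact test as one standard choice for a 2×2 table. The counts below are illustrative placeholders, not our actual data.

```python
# A minimal sketch of the 2x2 comparison described above, using SciPy.
# The counts are illustrative placeholders, not our actual figures.
from scipy.stats import fisher_exact

#                correct  incorrect
table = [[647, 353],   # hypothetical: 1,000 predictions at ~64.7% accuracy
         [500, 500]]   # coin-flip group: 50/50 on the same 1,000 predictions

odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"One-sided P-value: {p_value:.3g}")  # tiny p => gap unlikely due to chance
```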

When attempting to generalize from statistical samples, we must first determine whether the trends we observe are due to chance. Although there are many ways to model random outcomes, the main idea is that chance alone produces a range of values clustered around a neutral baseline. A result is skewed when non-random factors push the results away from that baseline, even though those factors are not what is being measured.

It is important to distinguish between randomness via flipping a coin and randomness via an infinite number of monkeys making predictions on their iPads (much more useful than typewriters for FantasySCOTUS). During the October 2009 Term, there were two options for predicting the outcome: Affirm or Reverse/Remand (this term we added “Recuse” to account for Justice Kagan’s numerous recusals). Flipping a fair coin yields a 50% chance of heads and a 50% chance of tails, and a coin has no memory: one flip does not affect the next.

This model accurately describes potential random predictions in FantasySCOTUS. To count as accurate, a prediction system must beat the random coin flip; in other words, FantasySCOTUS must do better than the 50% chance of affirm or reverse achievable by flipping a coin.

Comparisons of FantasySCOTUS to the Infinite Monkey Theorem are not precise. First, the infinite monkeys are not truly random in the sense that possible outcomes are evenly distributed. Second, the idea of an infinite number of monkeys, or at least a really, really, really large number of monkeys, is, practically speaking, bananas. It would be impossible to ever assemble a sample size that large. FantasySCOTUS had approximately 5,000 members, and our expert group had about 30 members. Those numbers are nowhere near the order of magnitude needed to compare with all those prescient primates.
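To put a rough number on the primate scenario, here is a quick simulation sketch. The case count (~80 decided cases) and crowd size are ballpark assumptions for illustration, not our exact figures.

```python
# A rough simulation of the coin-flipping crowd: among 5,000 purely random
# predictors, how well does the single luckiest one do over a term of ~80
# cases? Both numbers are ballpark assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_cases, n_predictors, n_trials = 80, 5000, 1000

# Each predictor calls each case correctly with probability 0.5.
correct = rng.binomial(n_cases, 0.5, size=(n_trials, n_predictors))
best = correct.max(axis=1) / n_cases  # top accuracy in each simulated crowd

print(f"Best random accuracy: mean {best.mean():.1%}, max {best.max():.1%}")
```

Even the luckiest of 5,000 coin-flippers falls far short of a perfect term: the chance that any one of them calls all ~80 cases correctly is on the order of 5,000 × 2⁻⁸⁰, which is effectively zero.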

Experts as Power Predictors


It is important to stress that the FantasySCOTUS “experts” are experts in nomenclature only. Perhaps a better term, which we will use in future works, is “Power Predictors.” “Experts” suggests a credentialed level of expertise that is not present in the ranks of our top performers. Unlike the “experts” in the Forecasting Project, who were selected based on credentials and work experience, the FantasySCOTUS Power Predictors selected themselves by predicting more than 75% of the cases. When comparing the FantasySCOTUS Power Predictors with the Forecasting Project’s experts, we are not comparing two similar groups: the former is effectively a crowd, while the latter is a group of specialized experts.

Power Predictors were not selected on the basis of correctness, but rather on the total number of predictions made. Even a person who randomly predicted over 75% of the cases would have been considered an “expert.” This approach narrowed the pool of predictors so that our final group of “experts” was only about 30 people, versus the several thousand who constituted the crowd. We did not consider accuracy when selecting the “expert” group; any accuracy associated with the increased participation was not deliberate on our part. This considerably narrows the potential that we merely selected the best results out of thousands of predictions.
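The selection rule amounts to a single participation filter, as in the minimal sketch below; the table layout and column names are hypothetical.

```python
# A minimal sketch of the Power Predictor selection rule: membership turns
# only on participation (predicting >75% of cases), never on accuracy.
# Assume a hypothetical table with one row per (user, case) prediction.
import pandas as pd

def power_predictors(predictions: pd.DataFrame, n_cases: int) -> list[str]:
    """Return users who predicted more than 75% of the term's cases."""
    counts = predictions.groupby("user")["case"].nunique()
    return counts[counts > 0.75 * n_cases].index.tolist()
```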

Admittedly, our Power Predictors were selected ex post. This selection process would be faulty if their success were due to chance, but as demonstrated above and below, we can be fairly certain that our Power Predictors’ predictions were not based on chance. Most importantly, there was no way to select Power Predictors ex ante in the first season of FantasySCOTUS. For our second season, however, we will identify the top performers from season 1 who returned and designate them as our repeat Power Predictors, alongside the next generation of “Power Predictors” who predict over 75% of the cases. We are interested to see how repeat performers do.

Our Power Predictors represented a wide swath of the legal community. Who are they?

I reached out to the 30 members of the “expert” group. The composition of this cadre is quite varied. Only one has any experience arguing before the Supreme Court. A few others have written amicus briefs for the Supreme Court. A number have held appellate or district court clerkships. Some have no appellate experience at all and work in small, general practice law firms. Others did not attend law school and have political science backgrounds; one user in particular never attended law school and has no formal legal training, but taught himself constitutional law. These members lack the credentials that the “experts” from the Supreme Court Forecasting Project–mostly appellate litigators, Supreme Court clerks, and professors–possessed.

Additionally, in the Forecasting Project, the “experts” were subject matter experts–that is, they would make predictions in cases they were familiar with, such as corporate law, criminal law, constitutional law, etc. FantasySCOTUS Power Predictors made predictions across the board, for all cases, from noteworthy Second Amendment cases to less popular original jurisdiction water rights cases.


Defining Accuracy: Ex Post Aggregate Analysis v. Ex Ante Individual Case Predictions

How “accurate” were FantasySCOTUS predictions? FantasySCOTUS Power Predictors–who made predictions for more than 75% of the cases–correctly predicted 64.7% of cases. The Gold, Silver, and Bronze medalists in FantasySCOTUS scored accuracy rates of 80%, 75%, and 72% respectively (an average of 75.7%). One comment we received is that Power Predictors who get it wrong 35.3% of the time may not fairly be characterized as “accurate.” Specifically, the Supreme Court typically reverses about 70% of its cases (between 70% and 73% over the past few terms). During the October 2009 Term, for example, the Court reversed 72% of the cases decided on the merits. Someone who simply predicted “reverse” for every case would, in theory, have scored a 72% accuracy rate.

Admittedly, at the end of the term, the FantasySCOTUS Power Predictors’ accuracy rate roughly mirrors the overall reversal rate. But the unique power of FantasySCOTUS is not in providing an ex post aggregate analysis of the entire term; plenty of empirical works do this, such as the SCOTUSBlog Stat Pack.

The novel contribution of FantasySCOTUS, something no other product can do, is to provide real-time ex ante predictions during the term for individual cases.

It is important to distinguish between this overall reversal rate of 72% and our Power Predictors’ accuracy rate for individual cases. Simply concluding ex post that the Court reversed approximately 70% of all cases argued during a term provides no information about individual cases. In contrast, the FantasySCOTUS prediction tracker provides real-time predictions for each and every pending case, not just an aggregate prediction for the term. When we say that our Power Predictors had an average accuracy rate of 60-70%, that number consists of data points for each and every case, with an attendant confidence level of 90%, 95%, or 99%.

Further, the 72% overall reversal rate provides no information about which 72% of the docket will be reversed. Reversals do not cluster among the first or last cases decided; they are distributed throughout the term, and each turns on the merits of the particular case, not on the number of cases and outcomes remaining.

To put it another way, armed solely with the 72% aggregate reversal rate, a predictor has no way of knowing ex ante how an individual case will turn out. To say that any individual case has a 72% likelihood of reversal is a statistical fallacy; one would have to know the specifics of the case to make that type of estimate.

In contrast, with FantasySCOTUS data for each case, we are able to establish at a 90%, 95%, or even 99% confidence level how each case would be decided. Viewed this way, an accuracy rate of 60-70% becomes much more impressive. Comparing ex post and ex ante analyses is imprecise.
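To make the per-case idea concrete, here is a minimal sketch of one way to attach a confidence level to a single pending case: treat each member’s vote as an observation and test whether the crowd’s majority leaning is distinguishable from a 50/50 split. The vote tallies are hypothetical, and this is an illustration rather than our paper’s exact methodology.

```python
# A minimal sketch of a per-case confidence check: given the crowd's votes on
# one pending case, is the majority leaning distinguishable from a coin flip?
# Tallies are hypothetical; not the paper's exact methodology.
from scipy.stats import binomtest

reverse_votes, affirm_votes = 42, 18       # hypothetical crowd tallies
n = reverse_votes + affirm_votes

result = binomtest(reverse_votes, n, p=0.5, alternative="greater")
print(f"Majority predicts reverse; confidence vs. coin flip: {1 - result.pvalue:.1%}")
```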

In theory, in the aggregate, a user who predicted that every single case would be reversed would approach a 70% accuracy rate. However, the FantasySCOTUS point system ensured that this strategy (reverse in every case) would not succeed. We combed through the data and found that none of our Power Predictors predicted that all cases would be reversed in hopes of cashing in on the high reversal rate. Not one. The Power Predictors who achieved accuracy rates as high as 80% did so by making good faith, informed predictions for each case.

Why? Simply put, smart law nerds don’t like to lose. At its heart, FantasySCOTUS is a game. Players, often very competitive ones, want to win. None of our Power Predictors gamed the system by predicting all reversals.

The scoring structure was designed so that those who wanted to do well essentially had to make good faith predictions rather than adopt a blanket reversal policy. Although our analysis only looked at the predicted outcome (affirm or reverse/remand), users were asked to predict several elements of each case.

Members made predictions across 11 parameters. First, members predicted whether the Supreme Court would affirm or reverse/remand the lower court; one point was awarded for getting the outcome correct. Second, members predicted how the Court would split: 9-0 Affirm, 8-1 Affirm, 7-2 Affirm, 6-3 Affirm, 5-4 Affirm, 5-4 Reverse, 6-3 Reverse, 7-2 Reverse, 8-1 Reverse, 9-0 Reverse, or Other (including 4-1-4 splits and cases where fewer than 9 Justices vote); three points were awarded for correctly predicting the split. Third, members predicted whether each of the nine Justices was in the majority or in the dissent; one point was awarded for each correct prediction. For a single case, a member could earn up to 13 points.
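To put the mechanics in one place, here is a minimal sketch of the season-one scoring rules described above. The field names are our own; the site’s internal implementation surely differs.

```python
# A minimal sketch of the October 2009 Term scoring rules: 1 point for the
# outcome, 3 points for the exact split, 1 point per Justice's majority/
# dissent call -- a maximum of 13 points per case.
# Field names are illustrative, not the site's actual implementation.

JUSTICES = ["Roberts", "Stevens", "Scalia", "Kennedy", "Thomas",
            "Ginsburg", "Breyer", "Alito", "Sotomayor"]

def score_case(pred: dict, actual: dict) -> int:
    points = 0
    if pred["outcome"] == actual["outcome"]:   # Affirm or Reverse/Remand
        points += 1
    if pred["split"] == actual["split"]:       # e.g. "5-4 Reverse"
        points += 3
    for j in JUSTICES:                         # majority vs. dissent, per Justice
        if pred["votes"][j] == actual["votes"][j]:
            points += 1
    return points                              # 0..13
```

Under these rules, a blanket 9-0 Reverse guess pays off handsomely on a unanimous reversal (all 13 points) but bleeds points on every affirmance or divided decision.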

Under this system, it would not be enough simply to vote “reverse” in every case. A user would also have to select votes for the Justices and the split. To maximize the blanket strategy, a user would likely predict that a case would be reversed with all 9 Justices joining the opinion: a 9-0 Reverse.

However, no one employed this strategy across the board. A few of our Power Predictors, including Chief Justice Donoho, noted that they started their prediction process for each case by assuming a 9-0 Reverse, and then adjusted from there. When a user applies this assumption selectively, rather than mechanically in every case, the putative statistical benefit of picking 9-0 Reverse for each case (how can you know which cases will be among the 70% reversed?) is essentially eliminated.

To eliminate this problem in version 2.0 of FantasySCOTUS for the October 2010 Term, we revamped the scoring system to minimize the possibility of blanket all-reverse voting. Now, rather than selecting a single affirm or reverse option, users must select a vote (affirm, reverse/remand, or recuse) for each Justice. Even though the Court reverses roughly 70% of its cases each term, users have no way of selecting *just* a reverse; they must lock themselves into a reverse prediction for each of the 9 Justices, and predicting that every Justice will vote to reverse will likely yield a very weak score.


Implications of FantasySCOTUS

One item we did not spend much time discussing in our paper, but which has been a subject of keen interest, is the implications of FantasySCOTUS. FantasySCOTUS 1.0 provided us with a data set that allowed us to begin developing an analytical framework for an information market for the Supreme Court. FantasySCOTUS provides new insights into predicting individual Justices, not just cases. As we continue to gather data, we can see what this information teaches us about models of judicial decision making, and whether applying different models (e.g., the attitudinal model) yields different types of predictions. In learning how people predict the Justices will interact, we may learn something about how they actually interact, and thus something about the institution of the Court itself.

Further, collecting data over several terms allows us to develop chains of precedents. For example, this term we were able to make predictions for Schwarzenegger (now Brown) v. EMA based on data from a similar First Amendment case decided last term, United States v. Stevens. Similarly, we used data from Citizens United v. FEC to generate predictions for the consolidated Arizona campaign finance cases this term.

FantasySCOTUS also makes a unique contribution to the burgeoning literature on crowdsourcing and the wisdom of crowds. Our unique data set, which we intend to release openly at the end of the term, will be available to number crunchers and SCOTUS wonks around the world to find new patterns and insights in our data.

Season 2 of FantasySCOTUS, for the October 2010 Term, is shaping up to be fascinating. We now capture much more information about our players: where they went to law school and what year they are; what they do (practicing attorney, law student, non-lawyer, etc.); and their political ideology. With this data, and with additional members (our enrollment has doubled to about 10,000), we can track and correlate how these different factors, ideology in particular, affect a user’s predictions.
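As a minimal sketch of that analysis, assuming a hypothetical table with one row per prediction and a column per demographic factor:

```python
# A minimal sketch of the season-two analysis: does a self-reported factor
# such as ideology correlate with prediction accuracy?
# Column names are hypothetical; assume one row per (user, case) prediction.
import pandas as pd

def accuracy_by_factor(preds: pd.DataFrame, factor: str) -> pd.Series:
    """Mean prediction accuracy per group of a demographic factor."""
    hits = preds["predicted"] == preds["actual"]
    return hits.groupby(preds[factor]).mean().sort_values()

# e.g., accuracy_by_factor(df, "ideology") or accuracy_by_factor(df, "occupation")
```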

In the future, FantasySCOTUS could prove useful in helping attorneys–both civil and criminal litigators–make decisions that turn on the resolution of pending Supreme Court cases.
This post was co-authored by Josh Blackman, Adam Aft, & Corey Carpenter.