Some thoughts about the future of legal research

December 26th, 2011

I’ve spent the better part of today (other than getting some Chinese food, as is tradition) thinking about the state of legal research, and where I can fit in.

So at the most abstract level, what I want to accomplish is to amass a single database that contains every single item on every single docket in every single federal and state court. Not just opinions, but the briefs as well. The purpose of amassing this data set would be to permit some to-be-determined algorithm to sort through the opinions, understand them, index them (automatically), and create a basis on which to make informed predictions–by knowing what courts have done, it can know what courts will do. Plus, it will determine what makes good arguments.

There are two primary obstacles. (Well, really many more than two–most entrepreneurial ventures fail, but people still try anyway, and I guess I’m one of them–but I’ll focus on two here.) First is technology. I don’t think technology is close enough today to accomplish what I want to do.

The second obstacle (and probably the more insurmountable one) is acquiring the data. All of the information I need for the federal courts resides in PACER. (Information for the Supreme Court is available at the Supreme Court Database, but the cool stuff is in the lower courts). As you know (I’m sure), PACER charges $0.08 a page (this rate will increase to $0.10 a page in April of 2012). They also don’t offer some sort of API or easy-to-access public interface to download this information.

WEXIS no doubt has some kind of proprietary means of accessing this information, which is likely very complicated and labor-intensive. Some other services–such as Justia, FindLaw (owned by West), Fastcase, (Run by CALI), Cornell LII, Google Scholar, and others–offer access to opinions for free, but they don’t offer an easy way to bulk-import the information.

The primary repository for publicly available opinions is the one organized by Carl Malamud, with bulk access available to all federal reported cases.

For 2011 and 2012, there is also a service, RECOP, courtesy of Fastcase, that indexes all opinions from courts every week. This will be excellent bulk data for these two years. This gets opinions, but not briefs and dockets.

And of course, RECAP, run at the Center for Information Technology Policy at Princeton, has that cool plugin for Firefox (where is the Chrome plugin???) that lets you automagically store all PACER docs you purchase. Their search feature is rudimentary, and I’m sure their database is spotty and incomplete. CourtListener, developed by Michael Lissner, pulls all opinions off the circuit court websites daily. It’s a smart, back-door way of acquiring the data. But again, this only gets opinions, not dockets.

Malamud’s efforts to get the federal courts to open up have been heroic, but unsuccessful, even with some big names behind them (Balkin, Lemley, Lessig, Wu, Ohm, and others). It strikes me that if these luminaries, backed with big money from donors (Google among others), weren’t able to get the courts to open up, I won’t be able to either. So I’m not even going to try. I’ll take a different route.

In any event, I see this process as consisting of several phases.

In the near term, the first phase is to get a team in place. I have been very fortunate, and am quite grateful, that so many people have taken an interest in me and my work. I hope that a number of these individuals would care to join me on this ride.

The second phase is to work with what we’ve got. Between Malamud’s bulk repository, RECAP, and CourtListener, there is a wealth of data. I would be interested in amassing it all together in one place, to try to get a sense of how complete the set is.

The third phase would be to get it online. This data would be publicly available, sorted by docket number. It could be drilled down in the same manner as PACER. At this stage, it would be nothing more than a watered-down version of Justia or FindLaw. I would run this as a Harlan Institute project. All free, no ads. Perhaps we find a better way to navigate through the cases (Justia and FindLaw are not particularly user-friendly), but the search feature, at first, will no doubt be weak. Offering some cool graphical tools like the LegalLanguageExplorer could be a way to differentiate it, but it’s just important to get something up at a low cost. Also, displaying everything in HTML format, rather than PDF, would be good. Some SEO-style URLs would be good too. Features like following a case or court or judge through RSS & Twitter would be relatively easy to manage.
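To make the SEO-style URL idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names `slugify` and `case_url`, the court/docket/case-name URL layout, and the sample docket number are hypothetical, not a description of any existing site.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    """Turn arbitrary text into a lowercase, hyphen-separated URL slug."""
    # Decompose accented characters, then drop anything that is not ASCII.
    normalized = unicodedata.normalize("NFKD", text)
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower())
    return slug.strip("-")


def case_url(court: str, docket_number: str, case_name: str) -> str:
    """Build a readable path: /court/docket-number/case-name (hypothetical layout)."""
    return f"/{slugify(court)}/{slugify(docket_number)}/{slugify(case_name)}"


# Made-up example values, purely for illustration.
print(case_url("S.D. Tex.", "4:11-cv-01234", "Smith v. Jones"))
```

Putting the court and docket number ahead of the case name keeps the paths sortable by docket, which matches the drill-down navigation described above.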

All of the above tasks can be accomplished at a relatively low price.  The tech would use Amazon Web Services EC2, so it could ramp up quickly as we get more data.

Now, for the long term, we would have to start on the business model, modeled somewhat after Google’s. I see the product as two-tiered. The primary goal is to amass as much information as possible and make it publicly available. For the more premium stuff (which only law firms could possibly want), you charge. The money from the latter would help fund the former, and build up good will.

With this kind of funding, we start working on getting more data. I read somewhere that the courts have erected a veritable Tower of Babel to keep people out, other than Westlaw and Lexis and a few others. I imagine it would be very expensive, and I would have to learn from the industry leaders how to get in. Fastcase seems to have a way in, and offered data to Resource.Org. Perhaps Ed Walters could be a help. Perhaps we could work with RECAP to index their stuff smarter. Having a complete repository, open-sourced, is the key.

A data-driven model is totally different from what WEXIS (and even the new entrant Bloomberg) offer. We won’t have research librarians. We won’t have people poring over the cases analyzing them. Algorithms trained to understand and index cases will do it automagically (I said long-term, didn’t I?).

If we have the data, and can future-crunch it, we could do some cool stuff.

Now for some of the fascinating metrics we could calculate:

  • Jurisdiction Profiles- What do we know about a specific court? What is the breakdown of cases decided in that court by category? What is the breakdown of parties who litigate? How often are plaintiffs or defendants victorious in certain cases? What are the most common dispositions? Summary judgment? Motion to dismiss? What about speed of the docket–how long does it take from complaint to motion to dismiss to summary judgment to trial? What about trials? Jury selection?
  • Judge Profiles- Every judge has his or her own quirks. Law firms and local practitioners have huge files on all judges with this kind of information. There’s no reason why this information could not be amassed from data about that court. How does the court handle things? Summary judgment? Motion to dismiss? Discovery disputes? Trials? Etc. What kinds of arguments are more persuasive? Does the judge have a pattern of ruling a certain way in a certain case?
  • Attorney Profiles- If you know all of the appearances and briefs an attorney submits in the federal system, you could compile a summary of him or her. WEXIS provides something like this with links to the cases he or she participated in, and any known affiliations. That is fine to start, but what does that tell you, other than who the clients are? Wouldn’t it be cool to calculate what the attorney actually did? Did he get a jury verdict? If so, how much? Was the case tossed on summary judgment or a motion to dismiss? If there was a settlement, was it on the record? If so, how much was it for? Were there any discovery disputes that required the court’s attention? Or–and this is cool–even assuming the lawyer was victorious, did the judge buy his arguments? Did the arguments in the briefs make it into the opinion? (And this will feed into the broader goal of determining what makes effective advocacy–assisted decision making.)
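As a rough illustration of the jurisdiction-profile idea, here is a minimal Python sketch that computes a disposition breakdown and a median time-to-close for one court. The `DocketRecord` fields, the `court_profile` function, and the sample records are all assumptions made up for this example; real docket data would be far messier and would need many more fields.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta
from statistics import median


@dataclass(frozen=True)
class DocketRecord:
    """One closed case (hypothetical, heavily simplified schema)."""
    court: str
    category: str
    disposition: str
    filed: date
    closed: date


def court_profile(records, court):
    """Summarize categories, disposition shares, and docket speed for one court."""
    rows = [r for r in records if r.court == court]
    if not rows:
        return {"cases": 0, "categories": {}, "dispositions": {},
                "median_days_to_close": None}
    dispositions = Counter(r.disposition for r in rows)
    return {
        "cases": len(rows),
        "categories": dict(Counter(r.category for r in rows)),
        # Share of cases ending in each disposition.
        "dispositions": {d: n / len(rows) for d, n in dispositions.items()},
        "median_days_to_close": median((r.closed - r.filed).days for r in rows),
    }


# Invented sample data, purely for illustration.
start = date(2011, 1, 3)
sample = [
    DocketRecord("S.D. Tex.", "contract", "motion to dismiss", start, start + timedelta(days=30)),
    DocketRecord("S.D. Tex.", "contract", "motion to dismiss", start, start + timedelta(days=100)),
    DocketRecord("S.D. Tex.", "civil rights", "summary judgment", start, start + timedelta(days=200)),
    DocketRecord("S.D. Tex.", "civil rights", "trial", start, start + timedelta(days=400)),
    DocketRecord("N.D. Cal.", "patent", "settlement", start, start + timedelta(days=90)),
]
print(court_profile(sample, "S.D. Tex."))
```

The same grouping could be keyed on a judge or an attorney instead of a court to get the other two kinds of profile.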

The coolest part would be what I call assisted-decision-making. One of the virtues of analyzing briefs and opinions is that you can determine what constitutes good advocacy.

For each jurisdiction/court/judge, we can determine what kinds of arguments, suits, parties, litigation tactics, etc. are successful/unsuccessful. Figure out this information at any stage of the litigation–when a client proposes a case; if a complaint is filed against your client; before MTD; before MSJ; before trial; etc. This is the kind of intelligence which people can only assemble anecdotally. And attitudinal models can only determine these matters, roughly, at a very late stage in the game. Now, the wisdom of the crowds–so to speak, as it is really data, and not crowdsourcing–can provide you with the answers you need.
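One very crude way to picture this kind of intelligence is a grant rate per judge and motion type. The sketch below is only an assumption about how one might start: the `grant_rates` function, the tuple layout, and the judge names are hypothetical, and the add-one smoothing (so that a judge with a single ruling does not show up as 0% or 100%) is my own choice, not something the idea requires.

```python
from collections import defaultdict


def grant_rates(rulings, prior_weight=1.0):
    """Estimate how often each (judge, motion type) pair results in a grant.

    `rulings` is an iterable of (judge, motion_type, granted) tuples.
    The estimate is smoothed toward 50% by `prior_weight` pseudo-counts
    on each side, which matters most when a pair has very few rulings.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [granted, total]
    for judge, motion_type, granted in rulings:
        tally = counts[(judge, motion_type)]
        tally[0] += int(granted)
        tally[1] += 1
    return {
        key: (granted + prior_weight) / (total + 2 * prior_weight)
        for key, (granted, total) in counts.items()
    }


# Invented rulings for two hypothetical judges.
rulings = [
    ("Judge A", "MTD", True),
    ("Judge A", "MTD", True),
    ("Judge A", "MTD", True),
    ("Judge A", "MTD", False),
    ("Judge B", "MSJ", False),
]
for key, rate in sorted(grant_rates(rulings).items()):
    print(key, round(rate, 2))
```

A real system would have to extract the motion type and outcome from the docket text itself, which is the hard part this sketch skips entirely.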

Ideally, legal research would transform from what we know today–searching for key words and hoping the cases Westlaw or Lexis give you are relevant and on point–to being guided to the answer you need, in much the same way that Siri guides you to the right answer. I have blogged about such a tool that I call Harlan here, here, and here (and generally here).

The timing of this post is particularly poignant because of the tragic and sudden passing of Larry Ribstein. Larry inspired me like few others. Many of the ideas I discuss in this post would never have come to fruition without Larry’s brilliant work. I was hoping to ask him to get involved with these projects. There were so many touching tributes–this one from Andrew Morriss is one of my favorites: “I suspect he’s already been named Associate Archangel for Research in heaven and doubled scholarly output there.” I hope Larry, from above, can continue to work with us, and guide us in the way only he could.

Much more to come. Stay tuned.