Redacting Private Information from Public Court Documents

January 22nd, 2012

Court documents, unless under seal, are public records. Yet they frequently contain what we would think of as private information, mainly Social Security numbers. CourtListener takes these steps to strip the documents of such information:

In both of these scenarios, we have taken a middle ground that we hope strikes a balance between the public’s need for court documents and an individual’s desire or need for privacy. Instead of either proactively blocking search engines from indexing cases or keeping cases in search results against a party’s request, our current policy is to block search engines from indexing a web page as each request comes in. We currently have 190 cases that are blocked from search results, and the number increases regularly.

Where we do take proactive measures to block cases from search results is where we have discovered unredacted or improperly redacted social security numbers in a case. Taking a cue from the now-defunct Altlaw, whenever a case is added, we look for character strings that appear to be social security numbers, tax ID numbers or alien ID numbers. If we find any such strings, we replace them with x’s, and we try to make sure the unredacted document does not appear in search results outside of CourtListener.
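The find-and-replace step described above might look roughly like the sketch below. The three patterns and the `redact` helper are assumptions made for illustration; the quoted post does not say which expressions CourtListener actually uses, and real documents would need looser matching (spaces instead of hyphens, for example).

```python
import re

# Hypothetical patterns, not taken from CourtListener's code.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")    # e.g. 123-45-6789
TAX_ID_RE = re.compile(r"\b\d{2}-\d{7}\b")       # e.g. 12-3456789
ALIEN_ID_RE = re.compile(r"\bA\d{8,9}\b")        # e.g. A12345678


def redact(text):
    """Replace ID-like strings with x's.

    Returns (redacted_text, found); `found` is True when at least one
    string was replaced, so the caller can also block the page from
    search engines.
    """
    found = False

    def mask(match):
        nonlocal found
        found = True
        # Keep separators such as hyphens; turn letters and digits into x.
        return re.sub(r"[0-9A-Za-z]", "x", match.group(0))

    for pattern in (SSN_RE, TAX_ID_RE, ALIEN_ID_RE):
        text = pattern.sub(mask, text)
    return text, found
```

Returning the `found` flag alongside the text lets one pass over the document drive both the redaction and the decision to keep the page out of search results.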

The methods we have used to block cases from appearing in search results have evolved over time, and I’d like to share what we’ve learned so others can give us feedback and learn from our experiences. There are five technical measures we use to keep a case out of search results:

  1. robots.txt file
  2. HTML meta noindex tags
  3. HTTP X-Robots-Tag headers
  4. sitemaps.xml files
  5. The webmaster tools provided by the search engines themselves
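To make the first three measures concrete, here is a minimal sketch of what each one emits for a blocked case. The function names and the example path are hypothetical; the `Disallow` rule, the `robots` meta tag and the `X-Robots-Tag: noindex` header are the standard forms, but this is not CourtListener's implementation.

```python
def robots_txt(blocked_paths):
    """Build a robots.txt body that disallows each blocked case path."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in blocked_paths]
    return "\n".join(lines) + "\n"


def noindex_meta(blocked):
    """Return the meta tag to place in the page's <head>, if blocked."""
    return '<meta name="robots" content="noindex">' if blocked else ""


def response_headers(blocked):
    """Return HTTP headers for a case page, adding X-Robots-Tag if blocked."""
    headers = {"Content-Type": "text/html; charset=utf-8"}
    if blocked:
        headers["X-Robots-Tag"] = "noindex"
    return headers


# Hypothetical path, for illustration only.
print(robots_txt(["/example/blocked-case/"]))
print(noindex_meta(True))
print(response_headers(True))
```

The header variant is the only one of the three that also covers non-HTML files such as PDFs, which cannot carry a meta tag.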

And from Freedom to Tinker, redacting data in PACER:

To summarize, out of 6208 redacted documents, there are 4315 Social Security numbers that can be redacted automatically by machine, 449 addresses whose redaction doesn’t seem to be required by the rules of procedure, and 419 “trade secrets” whose release will typically only harm the party who fails to redact it.

That leaves around 1000 documents that would expose risky confidential information if not properly redacted, or about 0.05 percent of the 1.8 million documents I started with. A thousand documents is worth taking seriously (especially given that there are likely to be tens of thousands in the full PACER corpus). The courts should take additional steps to monitor compliance with the redaction rules and sanction parties who fail to comply with them, and they should explore techniques to automate the detection of redaction failures in these categories.
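The arithmetic behind "around 1000" and "about 0.05 percent" can be checked with the figures quoted above. The variable names are only illustrative.

```python
# Figures quoted from the Freedom to Tinker excerpt.
total_redacted = 6208
social_security = 4315
addresses = 449
trade_secrets = 419
corpus_size = 1_800_000

# Documents left once the three lower-risk categories are set aside.
remaining = total_redacted - social_security - addresses - trade_secrets
share_percent = remaining / corpus_size * 100

print(remaining)                # 1025
print(round(share_percent, 3))  # 0.057
```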