On Technology-Assisted Electronic Discovery

January 10th, 2012

The past year’s most seminal article on technology-assisted review (commonly known as “automated document classification” or “predictive coding”) was Maura Grossman and Gordon Cormack’s law review piece, which effectively debunked the notion that manual review offers an unimpeachable gold standard. The authors succinctly summarized their statistically validated findings as follows:

This article offers evidence that . . . technology-assisted processes, while indeed more efficient, can also yield results superior to those of exhaustive manual review, as measured by recall and precision.

Maura R. Grossman & Gordon Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective And More Efficient Than Exhaustive Manual Review, XVII Rich. J.L. & Tech. 11 (2011). Anne Kershaw and Joe Howie agree. In a survey of 11 e-discovery vendors who use technology-assisted review in the form of predictive coding, they found that technology-assisted review outpaced not only what they aptly term “brute force [human] linear review of electronic data,” but also technologies that were in use in the not-so-distant past. They write:

The results report that, on average, predictive coding saved 45% of the costs of normal review – beyond the savings that could be obtained by duplicate consolidation and email-threading. Seven respondents reported that in individual cases the savings were 70% or more.

Anne Kershaw & Joe Howie, Crash or Soar: Will The Legal Community Accept “Predictive Coding?” (Law Technology News Oct. 2010).

From a purely pragmatic standpoint, the volume of electronically stored information now doubles every 18-24 months. Forrester Research maintains that 70% of e-discovery costs are spent on processing, analysis, review, and production. These costs are not abating. Moreover, reducing costs isn’t just a monetary concern, but also a strategic one. As Chris Dale points out, if technology-assisted review “can save significant costs without significantly reducing accuracy then the burden falls on its opponents to point out its flaws.” Chris Dale, Having The Acuity to Determine Relevance with Predictive Coding (e-Disclosure Information Project Oct. 15, 2010).

Coding cases, something research assistants are wont to do, is quite ineffective and imprecise. Technology-assisted case-coding would be a game changer.

So where does human expertise fit in?

The question thus becomes: What are the new roles and responsibilities for human expertise in this paradigm? The answer is that humans will continue to apply their insights and intelligence strategically to guide the technology. Automated document review technology is a tool like any other with potential that cannot be realized fully without the worldly knowledge and creativity that only humans can bring to bear in solving complex problems.

Statistical algorithms for text classification are capable of amazing feats when it comes to detecting and quantifying meaningful patterns amongst large data sets, but they are not capable of making the type of subjective qualitative assessments that constitute the art of discovery.

Chris Dale aptly points out that “[n]one of this technology solves the problem on its own. It needs a brain, and a legally trained brain at that . . . to [meet] the clients’ objective . . .  [of] disposing of a dispute in the shortest time by the most cost-effective method.” Chris Dale, Having The Acuity (supra); see Fed. R. Civ. P. 1 (“These rules . . . should be construed and administered to secure the just, speedy, and inexpensive determination of every action and proceeding.”).

Accordingly, humans will continue to define the methodology deemed so critical in the judicial guidance discussed above. For defensibility considerations, it will be less important to dissect the technology than it will be to scrutinize the ongoing involvement of experts—e.g., lawyers, linguists, and statisticians—who must attempt to optimize technology-assisted review to (i) maximize precision and recall, (ii) find the appropriate balance between the two, and (iii) ensure that technology-generated results meet the unique demands of a given matter, regardless of what the quantitative picture alone may indicate.
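To make the precision and recall trade-off mentioned above concrete, here is a minimal sketch of how the two measures, and a single balancing score (the F-beta measure), are computed from review counts. The class name, function names, and document counts are hypothetical and chosen only for illustration; they do not come from any of the cited studies.

```python
from dataclasses import dataclass


@dataclass
class ReviewCounts:
    """Counts from comparing a review's calls against a reference coding."""
    true_positives: int   # responsive documents the review retrieved
    false_positives: int  # non-responsive documents the review retrieved
    false_negatives: int  # responsive documents the review missed


def precision(c: ReviewCounts) -> float:
    """Share of retrieved documents that are actually responsive."""
    retrieved = c.true_positives + c.false_positives
    return c.true_positives / retrieved if retrieved else 0.0


def recall(c: ReviewCounts) -> float:
    """Share of all responsive documents that the review retrieved."""
    relevant = c.true_positives + c.false_negatives
    return c.true_positives / relevant if relevant else 0.0


def f_beta(c: ReviewCounts, beta: float = 1.0) -> float:
    """Weighted harmonic mean; beta > 1 favors recall, beta < 1 favors precision."""
    p, r = precision(c), recall(c)
    denom = beta**2 * p + r
    return (1 + beta**2) * p * r / denom if denom else 0.0


# Hypothetical counts, for illustration only.
counts = ReviewCounts(true_positives=800, false_positives=200, false_negatives=400)
print(f"precision = {precision(counts):.3f}")
print(f"recall    = {recall(counts):.3f}")
print(f"F1        = {f_beta(counts):.3f}")
print(f"F2        = {f_beta(counts, beta=2.0):.3f}")
```

Which value of beta is appropriate is exactly the kind of matter-specific judgment the passage assigns to the human team: a matter where missing a responsive document is costly might weight recall more heavily, while a matter dominated by review cost might favor precision.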

Humans still play a role, whether as lawyers, statisticians, or linguists:

Only attorneys can make the type of subjective determinations required for assessment of proportionality and reasonableness in e-discovery. They also play an essential role in guiding the assessments of any technology-assisted review; they are typically the sole source of coding decisions for training sets; and they are ultimately responsible for certifying the quality of the review’s results. In this sense, their active involvement forms the bedrock upon which every aspect of automated classification is built and validated.

However, relying upon technology and legal and subject matter knowledge alone—without the support of any additional expertise—will rarely allow attorneys to achieve the best possible results, and it may weaken the overall defensibility of the approach. Given that most technology-assisted review is founded on statistical algorithms and linguistic pattern detection, empowering these systems with the expertise of linguists and statisticians results in much greater flexibility and often higher quality and more readily defensible results in less time. It also enables a more effective allocation of resources, since statisticians and linguists can develop protocols for attorneys’ sampled reviews, perform in-depth data analyses, generate reports and summaries of findings, and implement innovative solutions that would, at best, be distractions for attorneys, who should ideally be free to focus their attention on case strategy. With each team member playing to his or her talents and training, the review effort realizes greater efficiency, higher quality results, and reduced production time and costs.

Statisticians, in addition to serving as a resource for the generation of sound performance metrics, provide a wealth of data-mining tools and techniques that can be utilized to supplement and enhance built-in classification algorithms for more tailored results. Linguists, meanwhile, have specialized analytic skills that make them especially well-suited to the task of leveraging patterns in language to expedite and improve the quality of document classification. Both linguists and statisticians bring unique perspectives and a rich set of tools to the automated document classification process that provide attorney teams with options and alternatives from which they may not otherwise benefit. . . .

Considering further the type of “rare event” documents described above, linguists and statisticians would certainly take steps to train the system to recognize these materials more readily. However, these documents are often by their very nature idiosyncratic and difficult to generalize based on statistical frequencies alone. Important documents of this type, though, present an ideal opportunity for the application of linguistic modeling techniques. Linguistic modeling offers more flexibility and greater precision for targeting special topics of particular interest that are low in frequency but high in importance. In this way, linguists and statisticians, collaborating closely with attorneys, can offer additional assurance that the most critical documents in their review will be discovered, even when relying upon an expedited technology-assisted approach to review.

Finally, the modeling techniques and algorithms that perform best for any given matter will vary, but it is often the case that multiple inputs generate outcomes that are superior to results generated by any single algorithm. Identifying which techniques to utilize and the specific weighting principles that will be used to synthesize them for final results generation requires special skills and on-demand experimentation. A team that includes statisticians and linguists will have the proper resources to engage in this type of real-time analysis for fully optimized results, whereas an attorney team alone may not.
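The idea of synthesizing multiple inputs with weighting principles can be illustrated with a simple weighted average of per-document scores. This is only a sketch under stated assumptions: the model names ("statistical", "linguistic"), the scores, the weights, and the 0.5 cutoff are all hypothetical, and a real review platform may combine its inputs quite differently.

```python
from typing import Dict, List


def combine_scores(scores: Dict[str, List[float]],
                   weights: Dict[str, float]) -> List[float]:
    """Return the weighted average of per-document scores across models."""
    if not scores:
        raise ValueError("at least one model's scores are required")
    n_docs = len(next(iter(scores.values())))
    if any(len(vals) != n_docs for vals in scores.values()):
        raise ValueError("every model must score the same documents")
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(weights[name] * scores[name][i] for name in scores) / total
        for i in range(n_docs)
    ]


# Hypothetical responsiveness scores (0 to 1) for three documents.
scores = {
    "statistical": [0.9, 0.2, 0.4],
    "linguistic": [0.7, 0.1, 0.9],
}
# Hypothetical weights; in practice these would be tuned per matter.
weights = {"statistical": 0.6, "linguistic": 0.4}

combined = combine_scores(scores, weights)
flagged = [score >= 0.5 for score in combined]  # illustrative cutoff
print([round(score, 2) for score in combined])
print(flagged)
```

In this toy example the third document scores low on the statistical model alone but is flagged once the linguistic model's score is blended in, which is the kind of low-frequency, high-importance case the passage describes. Choosing the weights and the cutoff is the experimentation the passage says requires special skills.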