Are people the weakest link in technology-assisted review?

OpenText

November 19, 2018

In mid-October 2018, our friend Michael Quartararo wrote a post for Above the Law asking whether people are the weakest link in technology-assisted review (TAR). Michael offered some thoughts on whether that may be the case, but he didn't really answer the question. So, we have to ask:

Why aren’t more people using TAR?

One answer is that there is still a lot of confusion about different types of TAR and how they work. Unfortunately, it appears that Michael’s post may have added to the confusion because he did not differentiate between legacy TAR 1.0 and TAR 2.0. Our first thought was to let the article pass without rejoinder or correction. To our surprise, however, it has been cited and reposted as authoritative by several others. To that end, we want to help clarify a number of Michael’s points. We will quote from his post.

“I’m not aware of any scientific study demonstrating that any particular TAR software or algorithm is dramatically better or, more importantly, significantly more accurate, than any other.”

We were surprised by this statement. There have been numerous studies focused on the differences between TAR protocols. Without exception, TAR 2.0 with continuous active learning (CAL) outperformed the one-time training inherent in TAR 1.0, often by large margins. There is a reason why the industry has abandoned TAR 1.0, at least in its marketing materials.

Let’s start with the landmark (and peer-reviewed) study, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” in which Gordon Cormack and Maura Grossman tested TAR 1.0 and 2.0 protocols and found that CAL outperformed TAR 1.0 in every test. They have published at least a half dozen follow-on papers exploring different aspects of a CAL review, including “Multi-Faceted Recall of Continuous Active Learning for Technology-Assisted Review” and “Continuous Active Learning for TAR.”

There are plenty of other studies we might mention, and not a single one suggests that all algorithms and methods achieve the same results.

“In the end, it seems to me that the only real problem with TAR software—all of them—is the people who use it.”

This statement lumps different types of TAR systems into one batch, with the notion that they all work the same way and integrate people in the same way. That isn’t the case. TAR 1.0 requires a lot of expertise to make it work; TAR 2.0 does not. It is simple and error tolerant. People are not the problem with TAR 2.0, at least not the TAR users themselves.

TAR 1.0: The hallmark of TAR 1.0 is one-time training, whose effectiveness relies heavily on humans. In essence, a senior lawyer or subject matter expert (SME) tags a reference set and then reviews a few thousand documents to train the algorithm against that set. The trained algorithm then ranks the whole document population, and the team reviews the portion deemed likely relevant. The algorithm does no further training and cannot take advantage of the review team’s tags to get smarter about the unseen population.

In TAR 1.0 systems, the control set functions as the “gold standard” for testing the ranking/classification algorithm, on the assumption that these documents have been correctly tagged as relevant or non-relevant. If the control set hasn’t been tagged correctly, bias can be injected into the entire process. There is a second and somewhat questionable assumption here as well: that the text of these documents is representative of the larger population. The SME continues the training process by reviewing batches selected randomly or chosen by the system. The training rounds typically involve a review of between 1,500 and 5,000 documents dished up to the SME. This training takes time and focus: it could take the SME 65 hours to review and tag approximately 4,000 documents (a 500-document control set, 3,000 documents for training, and another 500 for final testing).
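To make the workflow concrete, here is a minimal sketch of a TAR 1.0-style, one-time-training process in Python. It is an illustration only, not any vendor’s implementation: the `sme_label` function is a hypothetical stand-in for the SME’s relevance judgments, and scikit-learn’s logistic regression stands in for whatever proprietary classifier a given product actually uses.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def tar1_review_order(documents, sme_label, n_control=500, n_training=3000):
    """One-time training: the model never sees the review team's judgments."""
    X = TfidfVectorizer().fit_transform(documents)

    # Randomly draw the control set and the one-time training set.
    rng = np.random.default_rng(0)
    shuffled = rng.permutation(len(documents))
    control_idx = shuffled[:n_control]                      # held-out "gold standard"
    training_idx = shuffled[n_control:n_control + n_training]

    # The SME tags both sets (sme_label is hypothetical; 1 = relevant, 0 = not).
    # Assumes both classes appear in the training set, or fitting will fail.
    control_labels = [sme_label(documents[i]) for i in control_idx]
    training_labels = [sme_label(documents[i]) for i in training_idx]

    # Train once, check against the control set, then freeze the model.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[training_idx], training_labels)
    print("control-set accuracy:", model.score(X[control_idx], control_labels))

    # Rank the whole population; the review team works down this fixed list.
    scores = model.predict_proba(X)[:, 1]
    return np.argsort(-scores)
```

Note that the ranking returned at the end is fixed for the rest of the project: nothing the review team learns afterward ever flows back into the model.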

Even the author might concede that there can be room for human error in a TAR 1.0 system.

TAR 2.0: Continuous active learning is the hallmark of a TAR 2.0 protocol. TAR 2.0 solves the “human element” problem commonly associated with early TAR protocols. A CAL system doesn’t have a separate training process, and it is error tolerant. It continually learns as the review progresses, and regularly reranks the document population based on what it has learned. As a result, the algorithm gets smarter and the team reaches its goal sooner, reviewing fewer documents than would otherwise be the case with one-time training.
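By way of contrast, here is a minimal sketch of a CAL loop, again in illustrative Python rather than any particular product’s code. The `reviewer_label` function is a hypothetical stand-in for the review team’s judgments; the key difference from the sketch above is that there is no control set and no separate training phase, and every judgment feeds straight back into the next ranking.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cal_review(X, reviewer_label, seed_idx, batch_size=20, max_batches=500):
    """Continuous active learning: review is training, and training is review."""
    # Assumes the seed set contains at least one relevant and one
    # non-relevant document so the first model can be fit.
    labeled = list(seed_idx)
    labels = [reviewer_label(i) for i in labeled]
    seen = set(labeled)
    unseen = [i for i in range(X.shape[0]) if i not in seen]
    model = LogisticRegression(max_iter=1000)

    for _ in range(max_batches):
        if not unseen:
            break
        model.fit(X[labeled], labels)                 # retrain on every judgment so far
        scores = model.predict_proba(X[unseen])[:, 1]
        batch = [unseen[j] for j in np.argsort(-scores)[:batch_size]]
        for i in batch:                               # review the top-ranked documents
            labeled.append(i)
            labels.append(reviewer_label(i))
            unseen.remove(i)
        # In practice the loop ends when a recall target is reached; that
        # stopping decision is its own problem and is elided here.
    return labeled, labels
```

Because any single mistaken tag is just one judgment among thousands and is diluted as the review progresses, this design is far more tolerant of reviewer error than a process that hinges on a small, fixed control set.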

“I’ve managed my share of TAR projects. I’ve used or seen used the various flavors of TAR and the outcomes these products produce. To be clear, none of them are perfect and not all of them exceed all expectations in all circumstances.”

This misses the point and does so in a way that is a real problem in our industry. You simply can’t tell how any TAR system is doing unless you have something to compare it to. Seeing how a TAR system does in a single project tells you nothing other than, perhaps, that it did a lot better than linear review.

Scientists learned long ago that the best way to know something is through testing. In the information retrieval space, most use the Cranfield method, which, to be overly simplistic, keeps all but one variable constant and then compares results over a number of tests. Without comparative testing, which we have been doing for many years, you can’t make such a statement. You just can’t.
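For readers who want to see what “something to compare it to” looks like in practice, here is an illustrative sketch of the basic measurement: recall as a function of documents reviewed, computed against a fixed set of relevance judgments so that two protocols can be compared on the same collection. The function name and inputs are ours, not from any standard toolkit.

```python
def recall_curve(review_order, is_relevant):
    """Recall after each reviewed document, against fixed relevance judgments.

    review_order: document ids in the order a protocol surfaced them.
    is_relevant:  0/1 ground-truth judgments indexed by document id
                  (assumes at least one relevant document exists).
    """
    total_relevant = sum(is_relevant)          # relevant docs in the whole collection
    found, curve = 0, []
    for doc in review_order:
        found += is_relevant[doc]
        curve.append(found / total_relevant)
    return curve

# Two protocols can then be compared by the review effort each needs to
# reach the same recall target, e.g. 75%:
# effort = next(n for n, r in enumerate(recall_curve(order, truth), 1) if r >= 0.75)
```

Holding the collection and the judgments constant while only the protocol varies is exactly the Cranfield-style discipline the paragraph above describes.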

“At this point, I have read I think most of the literature, the majority of which, by the way, does not originate in the legal industry…”

The majority of the research on TAR for e-discovery actually has originated in the legal industry. While scientists have been writing about what they call “relevance feedback” for decades, the application of these algorithms to legal documents is a relatively recent phenomenon. And a lot of the literature has come from lawyers and people actively involved in the legal world. Think Ralph Losey, Herb Roitblat, Maura Grossman, Gordon Cormack, Gareth Evans, Tom Gricks, Andrew Bye and John Tredennick.

“TAR is not artificial intelligence. I know, I know, some folks have taken to generally lumping TAR under the general umbrella of AI-related tools. And I get it. But when you cut through the chaff of the marketing hype, TAR is machine learning—nothing more, nothing less.”

This was an interesting comment because — well, yes, it is. TAR is based on supervised machine learning, which is a classic form of artificial intelligence. Don’t take our word for it; just check out Wikipedia:

Machine learning, a fundamental concept of AI research since the field’s inception, is the study of computer algorithms that improve automatically through experience. Unsupervised learning is the ability to find patterns in a stream of input, without requiring a human to label the inputs first. Supervised learning includes both classification and numerical regression, which requires a human to label the input data first.

[Image: Types of artificial intelligence]

The post also suggests that TAR is “the same machine learning that’s been used since the 1960s to analyze documents in other industries.” It is true that scientists have been experimenting with “relevance feedback” since the early sixties, but our TAR algorithms, and those of at least some others, were written specifically to meet the needs of the legal industry. Indeed, we created contextual diversity and our special QC algorithm just for e-discovery.

More to the point, why does it even matter that machine learning has been around since the 1960s?

“Perhaps the single most important component of any TAR process is the thoughtful, deliberate, and consistent input provided to the TAR software by human reviewers. If anything goes wrong in this ‘training’ process, one could not realistically expect a satisfactory outcome.”

It’s critical to point out that this is true only of TAR 1.0. When early TAR systems hit the market, they were based on one-time training covering a few thousand documents. In that world, training seemed important. The fear was that if the training was inconsistent, the whole process would fail.

That could be true, although we have never seen studies to back the point. However, it isn’t meaningful in a TAR 2.0 world, which nearly every vendor out there has adopted. Or, at least claims to have adopted.

In a CAL world, training is review, and review is training. These modern algorithms don’t use a control set and they don’t limit training to a few thousand documents. Rather, they use every judgment and do so in a way that is fault tolerant, similar to the “Wisdom of Crowds.”

Subject matter experts have not been shown to be any more effective at judging documents than review teams. To the contrary, we have published research, along with others, that strongly suggests the opposite. Why, you might ask? We are just guessing here, but it seems that experts often overthink things, tagging based on finely hewn distinctions. Those distinctions may make sense to a senior lawyer charged with arguing the case, but they might not be good for teaching the algorithm how to find relevant documents.

“That’s not just the opinion of a somewhat cynical operations guy. It’s true. And I would not write it if it weren’t.”

This sums up a major impediment to implementing TAR or other methods to improve legal processes. We are legal professionals. We have been trained to use logic to develop hypotheses, distinguish facts and make arguments. We believe what we say is true and sometimes it is. But not always.

There is a term called “truthiness,” which we need to better understand in the legal profession. Truthiness is:

the quality of seeming to be true according to one’s intuition, opinion, or perception without regard to logic, factual evidence, or the like.

We legal professionals like truthiness because it flows naturally from our logic training and, frankly, feels good. But without testing, how can one know? We need more truth in this industry and less truthiness. As Mark Twain once famously said, “It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.”

Are people the weakest link?

In the end, people may just be the weakest link in TAR, but for a different reason. When articles are put out into the blogosphere as truth but don’t appropriately differentiate between various types of technologies, they prevent legal professionals from understanding why, and more importantly how, new technologies can provide massive cost savings and greater control over the entire discovery process and the business of law. For more thoughtful information on TAR 2.0, we encourage you to check out the OpenText™ eBook TAR for Smart People.
