How do you know what you don’t know? This is a classic problem when searching a large volume of documents in litigation or an investigation.
In technology-assisted review (TAR) software, a key concern is whether the algorithm has missed important relevant documents, especially those you may know nothing about at the outset of the review. Most modern TAR systems rely exclusively on relevance feedback: the system feeds you the unreviewed documents that are likely to be the most relevant because they most resemble what you have already coded as relevant. In other words, what is ranked highly depends on the documents that were tagged previously.
When you train a TAR algorithm using documents with which you are already familiar, or documents you located using a focused keyword search, the algorithm assumes you know the full scope of your review. The TAR tool assumes you generally know what topics, concepts and themes to look for.
But what about other relevant documents you didn’t find? Maybe they arrived in a rolling collection. Or maybe they existed all along but no one knew to look for them. How would you find them based on your initial terms? When there are unexpected documents, concepts or terms in the collection, you could miss them simply because you don’t know to search for them.
This introduces the potential for what is called review bias, that is, looking only for documents with concepts that you know, and essentially ignoring potentially relevant documents that you don’t know anything about.
Contextual diversity, which can be used as part of a continuous active learning (CAL) process, is a powerful tool to combat the risk of missing pockets of potentially relevant documents, by finding documents that are different from those already seen and used for training. It ensures that reviewers aren’t missing documents that are relevant but different from the mainstream of documents being reviewed.
Below we provide an overview of contextual diversity, a brief summary of how it works and its important use cases and benefits in many types of reviews.
What is contextual diversity?
A typical TAR 1.0 (i.e., first-generation TAR) workflow involves a subject matter expert (SME), often a senior lawyer, reviewing several thousand documents for training purposes before the TAR algorithm can rank the remainder of the population. It is an iterative process that entails significant human time, effort and cost for training and re-training the system (particularly when new documents are added to the collection). The review team can’t begin until the SME completes the training, so depending on the SME’s inclination to sit and review random documents, the review can be held up for days or weeks.
In a TAR system based on CAL, we continuously use all the judgments of the review team to make the algorithm smarter (which means that you find relevant documents faster). Documents ranked high for relevance are fed to the review team, who uses their judgments to train the system. The CAL approach can also include contextual diversity, which improves performance, combats potential bias, and ensures topical coverage.
Contextual diversity refers to documents that are different from the ones already seen and judged by human reviewers. Because the system ranks all of the documents on a continual basis, we know a lot about documents—both those the review team has seen but also (and more importantly) those the review team has not yet seen. The contextual diversity algorithm identifies documents based on how significant and how different they are from the ones already seen, and then selects training documents that are the most representative of those unseen topics for human review.
It’s important to note that the algorithm doesn’t know what those topics mean or how to rank them for relevance. But it can see that these topics need human judgments, so it selects the most representative documents it can find and presents them to reviewers to assess for relevance.
This process is iterative: the selection is re-computed every time the TAR system re-ranks your collection, which is often several times an hour during active review. Another way to think about contextual diversity is as “continuous active exploration.” As the review progresses and more documents are reviewed, the algorithm explores deeper into smaller and smaller pockets of different, unseen documents.
The system feeds in enough of the contextual diversity documents to ensure that the review team gets a balanced view of the document population.
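To make the mechanics concrete, here is a simplified sketch of diversity-based seed selection. This is not the vendor’s actual algorithm (which is not disclosed here); it is a hypothetical farthest-point heuristic over bag-of-words vectors, with illustrative function names, that captures the core idea: surface the unreviewed document least similar to anything reviewers have already judged.

```python
from collections import Counter
import math

def vectorize(text):
    """Toy bag-of-words vector: token counts from a lowercased split."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def pick_diversity_seed(reviewed, unreviewed):
    """Return the unreviewed document whose nearest reviewed neighbor is
    least similar, i.e., the doc most unlike anything already seen
    (a farthest-point heuristic, for illustration only)."""
    seen = [vectorize(d) for d in reviewed]
    best_doc, best_score = None, None
    for doc in unreviewed:
        nearest = max((cosine(vectorize(doc), s) for s in seen), default=0.0)
        if best_score is None or nearest < best_score:
            best_doc, best_score = doc, nearest
    return best_doc
```

In this sketch, a document about an entirely unfamiliar topic scores near-zero similarity to everything reviewed and is therefore selected first, which mirrors how a diversity engine surfaces unseen pockets for human judgment.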
Here’s why this is so important in a TAR review.
Practical ways contextual diversity is used
Contextual diversity serves a number of important purposes and can substantially reduce costs in almost any review in which TAR is used. In addition to the training efficiency discussed above, here are some practical benefits and use cases.
1. Rolling collections
Rolling collections provide one of the best examples of how contextual diversity helps. TAR 1.0 systems typically train against a static reference set, which means the system must be retrained every time new documents are added to the collection.
With contextual diversity, you can integrate rolling document uploads into the review process. When you add new documents to the mix, they simply join the ranking process and become part of the review. Depending on whether the new documents are similar to or different from documents already in the population, they may integrate into the rankings immediately or fall to the bottom. In the latter case, the contextual diversity algorithm pulls samples from the new documents for review. As those samples are reviewed, the new documents integrate further into the ranking. This process of seeking out contextually diverse groups of documents continues throughout the review, penetrating deeper and deeper into the collection to locate groups of unknown documents.
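The rolling-upload behavior can be sketched in the same spirit: a simplified, hypothetical triage step that splits a new batch into documents similar enough to blend into the existing ranking and documents different enough to queue as diversity samples. The function names and the similarity threshold are assumptions for illustration, not a real product API.

```python
from collections import Counter
import math

def bow(text):
    """Toy bag-of-words vector from a lowercased split."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def triage_upload(reviewed, new_batch, threshold=0.2):
    """Split a rolling upload: docs resembling reviewed material join the
    ranking; docs unlike anything seen go to the diversity-sample queue.
    The 0.2 threshold is an arbitrary illustrative choice."""
    seen = [bow(d) for d in reviewed]
    ranked, diversity_queue = [], []
    for doc in new_batch:
        nearest = max((cosine(bow(doc), s) for s in seen), default=0.0)
        (ranked if nearest >= threshold else diversity_queue).append(doc)
    return ranked, diversity_queue
```

A document on a topic already covered by the review lands in `ranked`, while a document on a wholly new topic lands in `diversity_queue`, to be sampled for human judgment in an upcoming batch.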
2. Proving a negative in an investigation
Proving to a government agency that you simply don’t have any responsive documents can be a costly proposition; by adding contextual diversity as a strategy, you maximize the breadth of your search. If you still fail to locate any document of value, you have essentially shown there are no responsive documents in the collection and any further review would not be worth the effort. Contextual diversity, in a sense, is another set of eyes looking for the requested documents.
3. More thoroughly reviewing an opponent’s rolling productions
Finding relevant documents in an opposing party’s production is rarely easy. When those productions are large and arrive on a rolling basis (and the opposing party is trying to bury revealing or damaging documents within a large, late production), the search becomes even more cumbersome, costly and time-consuming, and key documents may not be noticed for some time, if ever. With a contextual diversity engine re-ranking and analyzing the entire document set on every pass, however, a pocket of new documents unlike anything reviewers have seen before is recognized immediately, and exemplars from those pockets are pulled as contextual diversity seeds and put in front of reviewers in the very next batch of documents to be reviewed.
4. Satisfying your obligation to make a reasonable inquiry in responding to discovery
Finally, employing contextual diversity helps establish that you’ve met the standard of reasonable inquiry by using every tool at your disposal to ensure that your search for responsive documents was reasonable, thorough and proportional.
Of course, contextual diversity also improves results in early case assessment (ECA) and other searches where finding themes, topics and concepts matters more than achieving high recall. In those situations, you want the best examples of all the different things the documents can tell you, rather than every document of a certain type. After you’ve searched for the topics and themes you know to look for, you can use contextual diversity sampling to efficiently review whatever topics remain unseen.
In sum, when a TAR system with continuous active learning includes a companion contextual diversity algorithm, the system is generally better at locating responsive documents than a TAR system that doesn’t have this feature. In our experience, contextual diversity allows the algorithm to penetrate much more deeply into the collection to effectively let you know precisely what you don’t know.
For a more thorough discussion on contextual diversity, download the free OpenText™ eBook TAR for Smart People.