Have you ever wondered why so many of your database records look the same even after they've been de-duplicated? In our efforts to reduce redundant content, we employ many tricks of the trade: de-duplication, near-duplicate identification, end-of-branch email identification, document comparisons, and so on. But the most fundamental of all of these is identifying and filtering out duplicate records. Why is removing redundant content so important? The easiest answer is that it saves time and reduces costs.
Hash values and why they are important
What makes a duplicate a true duplicate? To answer this question, you first need to understand hash values. A hash value is produced by a mathematical algorithm that computes a fixed string of characters from the contents of an electronic file or a family group. This value is like a fingerprint: it can be used as a checksum to verify data hasn't been corrupted when copied or transmitted via SFTP, and it can also be used to identify true duplicates.
Hash values used in eDiscovery are commonly generated with either the MD5 or SHA-1 algorithm. For the purpose of identifying duplicates, the two are functionally equivalent, but because MD5 is simpler it computes faster, making it the more frequent choice for duplicate identification. Below is an example of what these values look like when generated:
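A minimal Python sketch of how such values are computed; the sample digests in the comments are the well-known MD5 and SHA-1 hashes of the phrase "The quick brown fox jumps over the lazy dog", and the file path is whatever you point it at:

```python
import hashlib

def file_hashes(path: str) -> tuple[str, str]:
    """Compute the MD5 and SHA-1 digests of a file's contents."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):  # stream large files
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()

# MD5 digests are 32 hex characters, e.g. 9e107d9d372bb6826bd81d3542a419d6
# SHA-1 digests are 40 hex characters, e.g. 2fd4e1c67a2d28fced849ee1bb76e7391b93eb12
```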
How duplicates are identified
How does the hash value play into identifying duplicates? That depends on the type of data. If you are talking about free-standing native files, each file is assigned a hash value. But emails are different, because they are essentially exported metadata from another system, such as MS Outlook. In this case, selected metadata fields (e.g., From, To, CC, etc.) are individually hashed, and a final hash value is created by combining those values. Finally, a separate hash value is created for each family (e.g., a parent with attachments). Below is a common breakdown of this process.
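In code terms, the flow looks roughly like this; the exact field list and the way the field hashes are combined vary by product, so treat both as illustrative assumptions:

```python
import hashlib

def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Hash each selected metadata field, then hash the combined field hashes
# to get one value per email (the field list is an illustrative assumption).
def email_hash(email: dict, fields=("From", "To", "CC", "Subject", "Sent", "Body")) -> str:
    field_hashes = [md5_hex(email.get(f, "")) for f in fields]
    return md5_hex("".join(field_hashes))

# One additional value per family (e.g., a parent email plus its attachments).
def family_hash(member_hashes: list[str]) -> str:
    return md5_hex("".join(sorted(member_hashes)))
```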
A good eDiscovery product will let you customize how it identifies email duplicates.
How to remove redundant content
Now that you have unique hash values for all of your documents and their families, the next step is to apply strategies that remove redundant content from the review process while still supporting a complete document production.
Strategy 1: De-duplicate at a family level at publish. This is where you leave duplicates behind in your culling or processing software and promote only unique, family-complete records to your review database.
Why de-duplicate only at the family level? At production time, it is generally expected that you produce family-complete records. In other words, if an attachment to an email were removed as a duplicate of another email attachment or stand-alone file, that email would be produced without the attachment included. There are other strategies that can reduce review of redundant content without compromising family completeness (see Strategies 2 and 3 below).
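Reduced to its essentials, this strategy keeps one record per family hash. A minimal sketch, assuming each record already carries the family hash described earlier:

```python
def promote_unique_families(records: list[dict]) -> list[dict]:
    """Promote only the first record seen for each family hash;
    the remaining duplicates stay behind in the culling database."""
    seen: set[str] = set()
    promoted = []
    for record in records:
        if record["family_hash"] not in seen:
            seen.add(record["family_hash"])
            promoted.append(record)
    return promoted
```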
Enrichment and de-duplication
Enrichment is the process of adding valuable metadata to fields associated with our documents. At de-duplication time, it is commonly used to append any unique metadata from the duplicate records to the surviving record. This allows fields like custodian and folder name to contain the unique values from both the published record and every de-duplicated record. That content is preserved and can be included with the metadata file at production time. See an illustration of this process below:
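A simplified sketch of the idea, assuming each record tracks its custodians and folder names as lists (the field names are hypothetical):

```python
def enrich(published: dict, duplicates: list[dict]) -> dict:
    """Fold unique custodian and folder values from every de-duplicated
    record into the surviving, published record."""
    for field in ("custodians", "folders"):
        merged = set(published.get(field, []))
        for dup in duplicates:
            merged.update(dup.get(field, []))
        # The preserved values can be included in the production metadata file.
        published[field] = sorted(merged)
    return published
```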
This strategy isn’t employed by all eDiscovery products – in fact, many still just associate duplicate record IDs to the so-called primary record as reference. This arbitrary assignment of first-in records isn’t practical if you employ a one-to-many strategy with your culling/ingestion database (i.e., one culling database associated with many review databases containing different subsets of content). It also often involves manual overlays of any enriched values required at production time.
Strategy 2: Use search and filtering strategies to remove redundant content prior to batching documents for review. This option is useful if your eDiscovery product doesn't include a means to review the records in context without redundant content (see Strategy 3 below). In OpenText™ Axcelerate Review & Analysis™ ("Axcelerate"), there are multiple options to filter out redundant content.
For example, using the Restrictions options on the Associations fly-in will allow you to filter down to only end-of-branch emails.
Another example is running an expanded search by relationships. This allows you to keep, drop, or expand your search based on duplicates, near duplicates, email threads, end-of-branch emails, parent/child relationships, or custom associations. For instance, the query * KEEP duplicate.primary will filter out all duplicates from your current result list.
Strategy 3: Utilize automation that identifies redundant content during review, such as Axcelerate's Review-in-Context ("RiC") or a similar eDiscovery product feature. RiC offers a user interface that lets reviewers focus primarily on unique content before coding the set as a whole. For example, a collection of email threads is distilled down to the end-of-branch email plus any unique attachments found within the entire chain. This ensures that all unique content is reviewed while giving the reviewer the option to apply tagging across the redundant content.
Strategy 4: Use expanded search strategies to ensure consistent coding of documents with redundant content after document review. This is particularly useful with rolling data collections, where it isn't practical to wait for all of the data to load before beginning review (which is most of the time 😊). For example, the query * DROP duplicates.withIdenticalValues(FIELD1, FIELD2, FIELD3) will remove documents with consistent coding from your result list and allow you to double-check the remaining inconsistencies.
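The same consistency check can be expressed generically outside of any one product: group documents by their hash value and flag groups whose coding disagrees. A sketch, with hypothetical field names:

```python
from collections import defaultdict

def inconsistent_groups(docs: list[dict],
                        coding_fields=("responsive", "privileged")) -> list[list[dict]]:
    """Return groups of duplicates whose coding values are not identical."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["hash"]].append(doc)
    flagged = []
    for members in groups.values():
        codings = {tuple(d.get(f) for f in coding_fields) for d in members}
        if len(members) > 1 and len(codings) > 1:
            flagged.append(members)  # hand these back to the review team
    return flagged
```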
Documents that look alike but really aren’t the same
There are common reasons why documents with the same text inside may not be true duplicates (remember the hash values mentioned earlier). Below are the top three, followed by a short demonstration:
- Free-standing documents with different created, modified, or last-edited dates. This often occurs because of how the documents were copied and saved during the normal course of business. Utilizing near-duplicate detection may help address these items effectively.
- Email threads with exactly the same content, where the individual messages within the body of the text reflect different time zones. This is common when emails are exchanged across time zones. Redundant email threads can effectively be removed from review by utilizing end-of-branch detection.
- Pre-processed data containing duplicate records. It is usual and customary to load data processed by other vendors or parties without attempting to identify duplicates, primarily because the entire set is assumed to need to stay intact. It is also common for these data sets to lack natives or sufficient metadata to properly de-duplicate the records.
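A toy demonstration of why these look-alikes survive hash de-duplication: changing a single embedded character, such as one digit of a date stamp, yields a completely different digest.

```python
import hashlib

doc_a = b"Quarterly report ... last edited 2023-01-05"
doc_b = b"Quarterly report ... last edited 2023-01-06"  # one character differs

print(hashlib.md5(doc_a).hexdigest())
print(hashlib.md5(doc_b).hexdigest())
# The two digests bear no resemblance to each other, so hash-based
# de-duplication treats these as distinct documents even though a
# reviewer would see essentially the same text.
```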
Key takeaways
While there are multiple ways we can use technology to remove redundant content from review, the effort always needs to be balanced, because minor differences may actually be relevant. For instance, the date a key witness downloaded a copy of a Word document from a repository may be meaningful under the right set of circumstances and can only be identified by the different last-edited date. Or an email thread that isn't end of branch but has a critical document attached might bear some extra scrutiny. My personal recommendation is to use all the tricks in the bag to gain efficiencies, vet any inconsistencies with the legal team, and document your process.