E-Discovery Update: To Globally Dedupe or Not ... That is the Question

When a document review vendor asks lawyers whether they want to dedupe “globally” or “by custodian,” the lawyers may not appreciate the difference between the two methods or its impact on the case, in terms of both discovery costs and the consistent treatment of documents in the review.

Deduping globally means that exact duplicate families are removed from the document review population across all custodians. A “family” refers to a parent document (such as an e-mail) together with its attachments. Documents dedupe out of the review population only if the entire family is an exact duplicate match. When data is collected from multiple employees at a company, the data set commonly has a high rate of exact duplicates because, for example, employees often copy multiple people on e-mails, so each of those employees has the same e-mails in his or her inbox.
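The family-level matching rule can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the function name family_hash is hypothetical, the hash is computed over raw file contents, and attachment order is treated as significant. Processing vendors typically hash a defined set of fields and normalized content, so an actual tool may behave differently.

```python
import hashlib


def family_hash(parent, attachments=()):
    """Return one hash for a whole family (parent plus attachments).

    Hypothetical helper: combines the MD5 of each member, in order, so two
    families match only if the parent and every attachment match.
    """
    digest = hashlib.md5()
    for member in (parent, *attachments):
        digest.update(hashlib.md5(member).digest())
    return digest.hexdigest()


# Same e-mail and same attachment: an exact family duplicate.
same = family_hash(b"email A", [b"report"]) == family_hash(b"email A", [b"report"])

# Same attachment under a different parent e-mail: not a family duplicate.
different = family_hash(b"email A", [b"report"]) == family_hash(b"email B", [b"report"])

print(same, different)
```

Under these assumptions, the first comparison is a match and the second is not, which is why a shared attachment alone does not cause a family to dedupe out.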

“Globally” deduping means that only one copy of each unique document family will remain in the set that is loaded to the review platform. The vendor can, however, record each of the “deduped” custodians in a “duplicate custodian” field, so that the custodians who held the duplicate families are documented on the review platform. The only potential for lost metadata in this process is that the original file path generally is not pulled into a “duplicate filepath” field. If that metadata field is important to a particular case, the vendor can go back to the original data set and collect the original file path for the deduped custodians.
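A minimal Python sketch of this bookkeeping follows. It assumes each family has already been reduced to a single family-level hash, and the names dedupe_globally, custodian and duplicate_custodians are hypothetical labels chosen for illustration, not the fields of any particular vendor or review platform.

```python
def dedupe_globally(records):
    """Keep one copy of each family and note who else held it.

    records: iterable of (custodian, family_hash) pairs in load order.
    Returns a dict keyed by family hash. The first custodian seen keeps
    the document; later holders go in the "duplicate_custodians" list.
    """
    kept = {}
    for custodian, fam_hash in records:
        if fam_hash not in kept:
            kept[fam_hash] = {"custodian": custodian, "duplicate_custodians": []}
            continue
        entry = kept[fam_hash]
        already_listed = custodian in entry["duplicate_custodians"]
        if custodian != entry["custodian"] and not already_listed:
            entry["duplicate_custodians"].append(custodian)
    return kept


# Hypothetical collection: three employees hold family "h1".
records = [("Alice", "h1"), ("Bob", "h1"), ("Carol", "h1"), ("Bob", "h2")]
for fam_hash, entry in dedupe_globally(records).items():
    print(fam_hash, entry["custodian"], entry["duplicate_custodians"])
```

In this sketch only two families are loaded, but the record for the first still shows that Bob and Carol also held it.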

Global deduping is preferable in most cases. It can identify a substantial amount of data that does not need to be loaded and hosted, which reduces data hosting and other processing costs, and it also reduces review and production costs. In addition, it reduces the likelihood of inconsistent treatment of duplicates.

By contrast, deduping “by custodian” will remove only exact duplicate families within one employee’s set of data. A family held by several custodians therefore remains in the review set once for each custodian, so the cost savings from custodian deduping are much smaller than those from global deduping.
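The difference in scope can be seen by counting what survives each method. The Python sketch below uses made-up data and hypothetical function names, and it assumes every family has already been assigned a family-level hash. The figures illustrate the mechanics only and say nothing about the duplication rate of a real collection.

```python
def count_after_custodian_dedupe(records):
    """Families left when duplicates are removed within each custodian only."""
    return len(set(records))  # unique (custodian, family_hash) pairs


def count_after_global_dedupe(records):
    """Families left when duplicates are removed across all custodians."""
    return len({fam_hash for _custodian, fam_hash in records})


# Hypothetical data: family "h1" is held by three custodians, twice by Alice.
records = [
    ("Alice", "h1"),
    ("Alice", "h1"),
    ("Bob", "h1"),
    ("Carol", "h1"),
    ("Bob", "h2"),
]

print(count_after_custodian_dedupe(records))  # 4
print(count_after_global_dedupe(records))     # 2
```

Custodian deduping removes only Alice's second copy of "h1", while global deduping also removes the copies held by Bob and Carol.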

One urban legend among those with only passing involvement in e-discovery is that deduping globally will remove all duplicate documents and, therefore, no duplicates will remain in the data set. This is not true. Deduping prior to loading to the review platform eliminates only exact family duplicates; it does not eliminate all duplicate documents. If the same document is attached to two different e-mail strings – meaning the attachments are exact duplicates of each other and have the same “md5Hash” – both copies will remain in the review population because the parent e-mails are different. We have seen duplication rates as high as 50 to 60 percent in collections that have been globally deduped. Thus, it is important to use a review platform that allows automatic tagging of duplicates (i.e., applies certain tags to duplicates). Otherwise, reviewers will review the same documents more than once, which increases review costs and the risk that reviewers will mark duplicates with inconsistent responsiveness and privilege tags.
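As a rough illustration of how duplicate tagging might work, the Python sketch below groups individual documents by the MD5 hash of their contents and copies a reviewer's tag to the exact duplicates. The function names, document IDs and tag values are hypothetical, and the rule for conflicting tags (report the group rather than overwrite anything) is an assumption made for this example, not a description of any specific review platform.

```python
import hashlib
from collections import defaultdict


def group_by_md5(documents):
    """Map each MD5 hash to the IDs of the documents that share it.

    documents: dict of document ID -> file contents (bytes).
    """
    groups = defaultdict(list)
    for doc_id, content in documents.items():
        groups[hashlib.md5(content).hexdigest()].append(doc_id)
    return groups


def propagate_tags(documents, coded):
    """Copy a tag to exact duplicates; report groups with conflicting tags.

    coded: dict of document ID -> tag already applied by a reviewer.
    Returns (tags after propagation, list of conflicting duplicate groups).
    """
    result = dict(coded)
    conflicts = []
    for doc_ids in group_by_md5(documents).values():
        tags_seen = {coded[d] for d in doc_ids if d in coded}
        if len(tags_seen) == 1:
            tag = next(iter(tags_seen))
            for d in doc_ids:
                result[d] = tag
        elif len(tags_seen) > 1:
            conflicts.append(sorted(doc_ids))
    return result, conflicts


# Hypothetical attachments: att-1 and att-2 are the same file under
# different parent e-mails, so both survived global deduping.
documents = {"att-1": b"contract", "att-2": b"contract", "att-3": b"memo"}
tags, conflicts = propagate_tags(documents, {"att-1": "privileged"})
print(tags, conflicts)
```

In this example the reviewer codes att-1 once and att-2 receives the same tag automatically, while att-3 is left for review because it has no coded duplicate.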