What (I Think) I Know About De-duplication – Corrections Welcomed



This is one of the most important tools in the ESI processing arsenal.  There are essentially three methods used to effectively de-duplicate document collections.  They each present their own set of strengths and weaknesses.


Hash Values Generally

  • Hash values are unique alphanumeric codes generated by a number of available algorithmic tools.
  • The hash value for a particular document, or collection of documents, is based on the layout of the bit content of that document or collection, including the Application Metadata contained within a given document or collection of documents.
    • Hash values can be generated for a single sentence of text, a paragraph, a document in its entirety, a family of e-mails and attachments, or for all the data stored on a single hard drive or server.
    • If even one bit is changed within the specified data, an entirely new value would be assigned by the same algorithmic tool.
      • Bit flipping is when a single bit changes from 0 to 1, or vice versa, without having been manipulated to do so.  Even this seemingly small change will dramatically affect the previously assigned hash value.
      • The two most common algorithmic tools used to generate hash values are MD5 and SHA-1.
        • While these algorithms may now have been “broken,” they have been broken only in terms of creating a collision with two different documents.
        • No one has been able to reverse engineer the content of a document from a known hash value at this point.
        • Different tools will generate different hash values for the same exact document or collection of documents.
          • Using the same tool on the same document should always produce identical hash values, regardless of the party running hashing the document.  The use of “identical” here includes all application metadata, not just what is seen on the face of the document.


Custodial (Vertical) De-duplication vs. Global (Horizontal) De-duplication

  • Global de-duplication will retain the first copy of all loose electronic documents, duplicate e-mails or document families encountered during the process.
  • Subsequent copies of those documents or families with matching hash values will be removed from the database or will not be added to the review database during processing, depending upon when de-duplication occurs.
    • It would seem the more cost effective approach must be to de-duplicate prior to some of the other processes and loading the data into a review database.
    • Global de-duplication is not the common practice at this point.
    • Custodial de-duplication will likely result in there being multiple copies of the same document across the various custodial collection in the database.


Whole Document Hash Value De-duplication (WDHVD – exact duplicates)

  • The constant and almost entirely unique nature of hash values for unaltered documents allows for the use of those hash values to locate and remove exact duplicate documents from a review database.
  • For this method of de-duplication all documents in a collection are run through the selected algorithmic tool and an index of hash values is created.
  • WDHVD will typically be able to identify many exact duplicates of loose electronic documents in a collection based upon duplicate hash values.
    • Even though these documents are technically “exact duplicates,” additional steps should be taken to ensure that critical data is not missed during review.
      • Example – documents with different file names can still be identical duplicates for the purposes of obtaining a hash value.
      • The file name is system metadata that is not part of the standard hash value encryption / analysis.
      • This could be a cause for concern where a document with a damning file name is excluded as a duplicate based on the hash value of a document encountered earlier in the de-duplication process, but later somehow discovered by opposing counsel.
  • Additional metadata fields should be included in the hash value encryption / analysis to ensure that such items are not cast out of the collection as duplicates.
  • Additional metadata fields should also be in included in the WDHVD analysis of document families (e-mail families, zip files, etc.).
    • Family member should never be eliminated as duplicates based upon standalone document hash values
    • A family member is only truly considered a duplicate if the exact family exists elsewhere in the collection.
    • Generating hash values for families of documents likely decreases the chances of successfully detecting exact duplicate families.
    • Analyzing bundled document likely equals more variables where even the slightest difference (even just an extra space in one document) will result in a dramatically different hash value.
    • Review of these duplicate families is much better than some of the imaged alternatives though.
    • I suspect there are few true duplicate families when de-duping vertically,, as opposed to horizontally.
    • Strengths – very reliable and exact form of de-duplication.
    • Weaknesses – lacks flexibility in its rigid determination of exact bit by bit duplicates.

Document Segment Hash Value De-duplication (DSHVD – near duplicates)

  • This method functions much like the WDHVD approach, but with a bit more flexibility.
  • I suppose the proper definition of this process would be any attempt to de-duplicate a collection of documents where you select some subset of the application metadata that is less than the full available set.
    • An example of this would be to exclude the Sent On Time metadata for a set of e-mail because Sent On times can vary by a few tenths or hundredths of a second.
    • In the strictest sense, even those small differences mean the documents are not duplicates.
    • Selecting all other commonly used e-mail metadata fields (i.e., text of document, e-mail subject, sent on date, etc.) for hash analysis will provide reasonable certainty that the documents are in fact duplicates.
    • If you are reduced to examining only the text of the document then I believe you have crossed the line in the text comparison de-duplication category.
    • One of the major weaknesses of WDHVD is its rigidity in determining duplicates.
    • DSHVD is a bit more flexible in that it analyzes only specified segments of the document structure and creates a hash value based on the bits of those collective segment.


Text Comparison De-duplication (near duplicates)

  • This approach involves analysis of the text of the document field(s) to determine the level of similarity between documents.
  • Two approaches are common: (1) comparison of “chunks” of text and (2) counting shingles.
    • For the first approach you can divide the number of similar words found in both documents by the total number of words found in the longer document to provide you with a value of similarity between 0 and 1.
      • An example would be to compare two document.  The first longer has 99 words.  The shorter has 54 words.  The common chunk of text they share is 23 words.
      • [23 / 99 = 0.23]
      • The weakness here is that the similarity value does not depend on the length of the shorter document.
      • If multiple documents have the same common text, the similarity value will be the same when comparing the longest document to either of the two shorter ones.
      • This fails to account for one of the shorter documents containing more unique (not seen in the longest document) text than the other.
  • The second approach is much like the first, but employs a slightly different equation to determine similarity.
    • The equation divides the total number of words the documents have in common (23) by the total number of word found in both document (153) minus the words in common. [23 / ((99 + 54) – 23) = 0.18]
    • This a less optimistic approach than the option outlined above.
  • A third approach is counting “shingles” of similar text that appear within two documents (http://en.wikipedia.org/wiki/MinHash)
    • This approach entails setting a (usually small) minimum requirement for the number of consecutive words two document must have in common to have a matching “shingle” (e.g., five consecutive matching words).
    • The similarity of two given documents is determined by dividing the number of shingles the two documents have in common by the total number of unique shingles found across both documents (http://en.wikipedia.org/wiki/W-shingling#Resemblance)
    • Short document generally give very low similarity scores when counting shingles.
    • The MinHash algorithm discards duplicate shingles encountered within a document.
    • Determining similarity with the MinHash algorithm is closer to an information matching approach, than a literal matching approach.
    •  MinHash reduces overall workload by selecting a sub group of shingles for hashing and comparison.  This comparison provides more of an approximation of similarity, than other methods fully analyzing all common shingles.
  • Consider Literal and Information matching approaches for all of these options.
    • Literal matching requires one to one matching of similar text blocks / shingles as a condition of determining similarity.
    • Information matching allows for one vs. all matching of duplicate text blocks or shingles.  A document with only one block of repeating text will be 100% similar to another document as long as the full matching block appears within the longer document, even if it appears only once in the longer document.
    • Information matching seems like a more sensible approach.  In the case smaller shingles, it allows for two overlapping shingles to be counted as 100% duplicates, while literal matching will only find one matching shingle because the words remaining after the first match don’t meet the minimum shingles size (e.g., 5 word minimum shingle size, 9 words in common).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s