What is a near-dupe, really?

When you try to quantify how similar near-duplicates are, several subtleties arise.  This article looks at three reasonable, but different, ways of defining the near-dupe similarity between two documents.  It also explains the popular MinHash algorithm, and shows that its results may surprise you in some circumstances.

Near-duplicates are documents that are nearly, but not exactly, the same.  They could be different revisions of a memo where a few typos were fixed or a few sentences were added.  They could be an original email and a reply that quotes the original and adds a few sentences.  They could be a Microsoft Word document and a printout of the same document that was scanned and OCRed, with a few words not matching due to OCR errors.  For e-discovery we want to group near-duplicates together, rather than discarding near-dupes as we might discard exact duplicates, since the small differences…

