What is a near-dupe, really?

Clustify Blog - eDiscovery, Document Clustering, Predictive Coding, Information Retrieval, and Software Development

When you try to quantify how similar near-duplicates are, there are several subtleties that arise.  This article looks at three reasonable, but different, ways of defining the near-dupe similarity between two documents.  It also explains the popular MinHash algorithm, and shows that its results may surprise you in some circumstances.document_comparison_toolNear-duplicates are documents that are nearly, but not exactly, the same.  They could be different revisions of a memo where a few typos were fixed or a few sentences were added.  They could be an original email and a reply that quotes the original and adds a few sentences.  They could be a Microsoft Word document and a printout of the same document that was scanned and OCRed with a few words not matching due to OCR errors.  For e-discovery we want to group near-duplicates together, rather than discarding near-dupes as we might discard exact duplicates, since the small differences…

View original post 3,787 more words

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s