What is a near-dupe, really?

Clustify Blog - eDiscovery, Document Clustering, Predictive Coding, Information Retrieval, and Software Development

When you try to quantify how similar near-duplicates are, there are several subtleties that arise. This article looks at three reasonable, but different, ways of defining the near-dupe similarity between two documents. It also explains the popular MinHash algorithm, and shows that its results may surprise you in some circumstances.

Near-duplicates are documents that are nearly, but not exactly, the same. They could be different revisions of a memo where a few typos were fixed or a few sentences were added. They could be an original email and a reply that quotes the original and adds a few sentences. They could be a Microsoft Word document and a printout of the same document that was scanned and OCRed with a few words not matching due to OCR errors. For e-discovery we want to group near-duplicates together, rather than discarding near-dupes as we might discard exact duplicates, since the small differences…
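For readers who have not met MinHash before, here is a minimal Python sketch of the general idea. It is not taken from the original post, and it is not necessarily one of the three similarity definitions that post compares. It assumes documents are split into overlapping three-word "shingles", measures exact similarity as the Jaccard index of the two shingle sets, and then estimates that value with MinHash. The function names, the shingle size, and the use of seeded SHA-1 hashes are all illustrative assumptions.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in text (k=3 is an arbitrary choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, shingle):
    """One member of a family of hash functions, selected by seed (an assumption)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(shingle_set, num_hashes=200):
    """For each hash function keep the smallest hash value over the set.

    Assumes shingle_set is non-empty.
    """
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def minhash_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates the Jaccard index."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"

set_a, set_b = shingles(doc_a), shingles(doc_b)
exact = jaccard(set_a, set_b)  # 6 shared shingles out of 8 distinct ones
estimate = minhash_similarity(minhash_signature(set_a), minhash_signature(set_b))
print(f"exact Jaccard: {exact:.2f}, MinHash estimate: {estimate:.2f}")
```

The estimate only approximates the exact value, and it gets closer as `num_hashes` grows. A production tool would typically use cheaper hash functions and work on much longer documents.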

View original post 3,787 more words

EDiscovery leaders and career opportunities highlighted by US legal publications

eDisclosure Information Project

Electronic Discovery / eDisclosure is a new discipline. It has passed the Wild West stage but it is still new enough and small enough that the contribution of its founding members can be recognised with the perspective of time. Three US legal publishing companies have recently produced lists of individuals whose contributions have helped shape the industry.

What value lies in reciting other people’s lists? I hear you ask. Well for one thing, my coverage is selective, with a bias towards people I know or have some connection with. For another, two of these articles are not readily available and it is not open to me to just link to them. Third and more importantly, I am keen to encourage people to see a promising career path in eDiscovery; we can’t all be Laura Kibbe or Andrew Sieja, but we can see opportunities in this young and growing business…

View original post 1,309 more words

“Derogation of the Search for Truth”

Ball in your Court

In my last post, I addressed why search terms used to cull data sets in discovery should not be protected as attorney work product. Today, I want to distinguish an attorney’s “investigative queries” (for case assessment, to hone searches or to identify privileged content) from “culling queries” (to generate data sets meeting a legal obligation, whether conceived by an attorney, client, vendor or expert). I contend culling queries warrant no work product protection from disclosure.

Let’s assume a producing party has a sizable collection of potentially responsive electronic information. Producing party concludes that it would be too costly, slow or unreliable to segregate the ESI by reading everything and, instead, decides to examine just those items that contain particular words or phrases. Keyword queries thus serve to divide the ESI into two piles: one that will be reviewed by counsel and another that no one…

View original post 1,544 more words