Fair Comparison of Predictive Coding Performance

Clustify Blog - eDiscovery, Document Clustering, Predictive Coding, Information Retrieval, and Software Development

Understandably, vendors of predictive coding software want to show off numbers indicating that their software works well.  It is important for users of such software to avoid drawing wrong conclusions from performance numbers.

Consider the two precision-recall curves below (if you need to brush up on the meaning of precision and recall, see my earlier article):

[Figure: precision-recall curves for two different tasks]

The one on the left is incredibly good, with 97% precision at 90% recall.  The one on the right is not nearly as impressive, with 17% precision at 70% recall, though you could still find 70% of the relevant documents with no additional training by reviewing only the highest-rated 4.7% of the document population (excluding the documents reviewed for training and testing).
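
To make the arithmetic behind those numbers concrete, here is a minimal Python sketch. It is not the code used to produce the curves; the function names and the toy data are hypothetical, chosen only for illustration. The first function computes precision and recall when you review the top of a ranked list. The second rearranges the same definitions to show what prevalence (the fraction of the population that is relevant) is implied by a given precision, recall, and review fraction. The excerpt does not state the prevalence, so the value it prints for the right-hand curve is an inference from the three quoted figures, not a number from the original post.

```python
def precision_recall_at_cutoff(ranked_labels, cutoff):
    """Precision and recall when reviewing the top `cutoff` documents.

    ranked_labels: list of booleans (True = relevant), ordered from the
    highest-rated document to the lowest. Hypothetical input format.
    """
    reviewed = ranked_labels[:cutoff]
    found = sum(reviewed)                # relevant documents in the reviewed set
    total_relevant = sum(ranked_labels)  # relevant documents in the whole list
    precision = found / len(reviewed) if reviewed else 0.0
    recall = found / total_relevant if total_relevant else 0.0
    return precision, recall


def implied_prevalence(precision, recall, fraction_reviewed):
    """Prevalence implied by the definitions of precision and recall.

    relevant found = precision * fraction_reviewed * N
    relevant found = recall * prevalence * N
    so prevalence = precision * fraction_reviewed / recall.
    """
    return precision * fraction_reviewed / recall


if __name__ == "__main__":
    # Toy ranking: 3 relevant documents out of 10, best-rated first.
    toy = [True, True, False, True, False, False, False, False, False, False]
    print(precision_recall_at_cutoff(toy, 4))  # (0.75, 1.0)

    # Right-hand curve: 17% precision, 70% recall, top 4.7% reviewed.
    print(implied_prevalence(0.17, 0.70, 0.047))  # roughly 0.011
```

If that inference holds, relevant documents make up only about 1% of the population for the right-hand task, which is worth keeping in mind when comparing its precision to the left-hand curve.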

Why are the two curves so different?  They come from the same algorithm applied to the same document population with the same features (words) analyzed and the exact same random sample…


A Guide to Forms of Production

Ball in your Court

Semiannually, I compile a primer on some key aspect of electronic discovery.  In the past, I’ve written on, inter alia, computer forensics, backup systems, metadata and databases. For 2014, I’ve completed the first draft of the Lawyers’ Guide to Forms of Production, intended to serve as a primer on making sensible and cost-effective specifications for production of electronically stored information.  It’s the culmination and re-purposing of much that I’ve written on forms heretofore along with new material extolling the advantages of native and near-native forms.

Reviewing the latest draft, there is still so much I want to add and re-organize; accordingly, it will be a work-in-progress for months to come.  Consider it something of a “public comment” version.  The linked document includes exemplar verbiage for requests and model protocols for your adaption and adoption.  I plan to add more forms and examples over time.


The Importance of Cybersecurity to the Legal Profession and Outsourcing as a Best Practice – Part Two

e-Discovery Team ®

This is part two of this article. Please read part one first.

4. Risk. Risks of error are inherent in Lit-Support Department activities. What they do is often complex and technical, just like any e-discovery vendor. So too are risks of data breach. There is always a danger of hacker intrusions. Just ask Target.

Do you know what your exposure is for a data breach? What damages could be caused by the accidental loss or disclosure of your client’s e-discovery data? How many terabytes of client data are you holding right now? How much of that is confidential? What if there is an ESI processing error? What if attorney-client emails were not processed and screened properly?

Mistakes can happen, especially when a law firm is operating outside of its core competency. What if an error requires a complete re-do of a project? What will that cost the firm? You cannot…


The Importance of Cybersecurity to the Legal Profession and Outsourcing as a Best Practice – Part One

e-Discovery Team ®

Cybersecurity should be job number one for all attorneys. Why? Because we handle confidential computer data, usually secret information that belongs to our clients, not us. We have an ethical duty to protect this information under Rule 1.6 of the ABA Model Rules of Professional Conduct. If we handle big cases, or big corporate matters, then we also handle big collections of electronically stored information (ESI). The amount of ESI involved is growing every day. That is one reason that cybersecurity is a hard job for law firms. The other is the ever-increasing threat of computer hackers.

The threat is increasing rapidly because there are now criminal gangs of hackers, including the Chinese government, that have targeted this ESI for theft. These bad hackers, known as crackers, have learned that when they cannot get at a company’s data directly, usually because it is too well defended, or too…
