Guest Blog: Talking Turkey

e-Discovery Team ®

EDITOR’S NOTE: This is a guest blog by Gordon V. Cormack, Professor, University of Waterloo, and Maura R. Grossman, Of Counsel, Wachtell, Lipton, Rosen & Katz. The views expressed herein are solely those of the authors and should not be attributed to Maura Grossman’s law firm or its clients.

This guest blog constitutes the first public response by Professor Cormack and Maura Grossman, J.D., Ph.D., to articles published by one vendor, and others, that criticize their work. In the Editor’s opinion, the criticisms are replete with misinformation and thus unfair. For background on the Cormack Grossman study in question, Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, and the Editor’s views on this important research, see Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part One and Part Two and Part Three. After remaining…


Latest Grossman and Cormack Study Proves Efficacy of Multimodal Search for Predictive Coding Training Documents and the Folly of Random Search – Part Two

e-Discovery Team ®

This is a continuation of my earlier blog with the same title: Latest Grossman and Cormack Study Proves Efficacy of Multimodal Search for Predictive Coding Training Documents and the Folly of Random Search – Part One. 

Latest Grossman Cormack Study

The information scientist behind this study is Gordon V. Cormack, Professor, University of Waterloo. He has a long history as a search expert outside of legal search, including special expertise in spam searches. The lawyer who worked with Gordon on this study is Maura R. Grossman, Of Counsel, Wachtell, Lipton, Rosen & Katz. In addition to her J.D., she has a PhD in psychology, and has been a tireless advocate for effective legal search for many years. Their work is well known to everyone in the field.

The primary purpose of their latest study was not to test the effectiveness of training based on random samples. That was a…


Predictive Coding Confusion

Clustify Blog - eDiscovery, Document Clustering, Technology-Assisted Review (Predictive Coding), Information Retrieval, and Software Development

This article looks at a few common misconceptions and mistakes related to predictive coding and confidence intervals.

Confidence intervals vs. training set size:  You can estimate the percentage of documents in a population having some property (e.g., is the document responsive, or does it contain the word “pizza”) by taking a random sample of the documents and measuring the percentage having that property.  The confidence interval tells you how much uncertainty there is due to your measurement being made on a sample instead of the full population.  If you sample 400 documents, the 95% confidence interval is +/- 5%, meaning that 95% of the time the range from -5% to +5% around your estimate will contain the actual value for the full population.  For example, if you sample 400 documents and find that 64 are relevant (16%), there is a 95% chance that the range 11% to 21% will…
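The arithmetic behind the +/- 5% figure can be sketched as follows. This is a minimal illustration, assuming the usual normal approximation for a sampled proportion with z = 1.96 for 95% confidence; the excerpt does not say which formula the author used, and the function name `margin_of_error` is hypothetical.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Normal-approximation margin of error for a sampled proportion.

    proportion=0.5 gives the worst case (widest interval);
    z=1.96 is the assumed multiplier for 95% confidence.
    """
    return z * math.sqrt(proportion * (1.0 - proportion) / sample_size)

# Worst-case margin for a 400-document sample: roughly +/- 5%.
worst_case = margin_of_error(400)

# The example from the text: 64 relevant documents out of 400 (16%).
estimate = 64 / 400
low, high = estimate - worst_case, estimate + worst_case

print(f"worst-case margin: {worst_case:.3f}")
print(f"interval: {low:.2f} to {high:.2f}")
```

Under these assumptions the worst-case margin comes out at about 0.049, which rounds to the 5% quoted above and reproduces the 11% to 21% range. Plugging the observed 16% into `proportion` instead of 0.5 would give a somewhat narrower interval, so the +/- 5% figure is best read as a conservative bound.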


Latest Grossman and Cormack Study Proves Efficacy of Multimodal Search for Predictive Coding Training Documents and the Folly of Random Search – Part One

e-Discovery Team ®

There is a well-known joke found in most cultures of the world about a fool looking for something. This anecdote has been told for thousands of years because it illustrates a basic trait of human psychology, now commonly called after the joke itself, the Streetlight Effect. This is a type of observational bias where people search for something only where it is easiest to look. This human frailty, when pointed out in the right way, can be funny. One of the oldest known forms of pedagogic humor illustrating the Streetlight Effect comes from the famous stories of Nasrudin, aka, Nasreddin, an archetypal wise fool from 13th Century Sufi traditions. Here is one version of this joke attributed to Nasreddin:

One late evening Nasreddin found himself walking home. It was only a very short way and upon arrival he can be seen to be upset about something. Alas, just then a…


Less Sales, More Guidance in Predictive Coding and Analytics

A sales person can tell you all about how great the view is from up on the mountain, but only a guide who knows the way can get you to the top.

Guide Us To The Top

Like most attorneys interested in the field of eDiscovery, a few years back I became quite curious about a series of new litigation support product offerings most commonly referred to as predictive coding or technology assisted review.  It was hard not to get excited about a “new” technology that was reportedly offering so much promise in an area of the law that so desperately needed it.  Being the type of person who is not very likely to just take someone’s word for it (i.e., I learn most of my lessons the “hard way”), I set out to understand the technology behind these new offerings, including understanding some of the not-so-basic mathematical concepts that support the various predictive coding engines.

Over the years I have jotted down many of the predictive coding nuggets I have gathered, including some of my thoughts and opinions on offered information that I felt was not entirely accurate.  Some of those thoughts and opinions may not be well received by some product developers or providers, especially when it comes to how I feel the role of metadata in the predictive coding process is properly described.  I write this post to share those nuggets and opinions knowing well that I have plenty to learn about the process and the technology, and hoping that other eDiscovery nerds out there will be willing to step in and offer thoughtful and explanatory corrections where my opinions have gone astray.

Before I begin sharing, I think it is important to frame what my thought process was when jotting down most of these points.  As everyone now knows, the hype surrounding predictive coding’s arrival in the eDiscovery industry was vastly overblown.  This became quite clear to me a couple years ago when a vendor proposed the use of a home-grown Latent Semantic Indexing (“LSI”) based analytics tool to assist our team of attorneys with the identification and review of the most relevant documents in a corpus of about 350,000 that the vendor had been hosting for us for a couple months.  They made their case and, like many other predictive coding pitches, it sounded too good to be true.  In this case, it was.

I had become very familiar with the characteristics of this particular corpus, and it was a bit of an odd set.  Many of the most important documents in the case went back as far as the ‘50s and ‘60s.  The OCR text for those documents was mostly poor, and occasionally fair.  Beyond understanding the corpus, I had done a small amount of research on LSI and analytics on Content Analyst’s website, by reviewing some of their older white papers.  My understanding of that research told me that LSI was unreliable in cases where text of a given document is degraded beyond 35 percent (i.e., 35 percent or more of the characters cannot be identified reliably through OCR).

I came into that pitch meeting armed with these two very important pieces of information, and thankfully so.  I am fairly sure that we would have wasted some valuable time and money experimenting with the offered tool had I not been able to articulate my concerns regarding text degradation and the known limitations of the technology behind the vendor’s analytic tool.  It was on that day that I decided I would do my best to understand this “new” technology, and do my best to break it down into the simplest terms so that I could help others understand it as well.  I was motivated primarily by the unfortunate idea that most attorneys at that time would have entered that meeting entirely unprepared to escape the allure of an offering so full of promise but, unbeknownst to the vendor sales team, entirely ineffective under the circumstances.

Despite all the promise that predictive coding brings to eDiscovery, most lawyers are still reluctant to embrace it.  I feel very strongly that this is because no one takes the time to explain how it really works in terms that attorneys can understand.  I am not saying that all attorneys need to be able to translate calculus algorithms into Matlab code.  What I am saying is that attorneys would more readily embrace the technology if vendors spent more time explaining how the most critical components of the tool work.  Instead, I have listened to many presentations where a sales team glosses over how their base LSI, PLSI or SVM algorithms work to classify the documents, in favor of spending most of the meeting focusing on the less essential bells and whistles of their tool.

I see two possible reasons for this.  The first is that the sales people making the pitch don’t have a thorough understanding of the core technology behind their product.  Instead, they have a script to follow, and struggle whenever they have to deviate.  The second is that they understand it, but don’t bother to explain for fear that the attorneys in the audience would become lost and disinterested.  These are both unfortunate situations that I fear happen all too often.  We should all know by now that attorneys are not very good at adopting anything that they don’t understand, and rightfully so.  They are the ones who are going to have to defend the use of it when things go sideways.

I feel that what is most needed is an honest accounting of the strengths and weaknesses of the core technology, in practical terms that can allay the logical (or in some cases illogical) fears of the attorneys who need to use these tools.  The various tools are all too often pitched as a one-size-fits-all solution by sales people who really don’t understand the practical applications and limitations of the science as applied to document review.  This was understandable initially, as everyone was in a rush to get to the market with their offering.  What I can’t understand is why this problem persists after a few years of trying to sell it.

One of my goals for the past few years has been to remedy this problem as much as personally possible.  I have tried to do this by going outside of my comfort zone to investigate the technology behind the tools, and then combine that knowledge with my practical eDiscovery experience in an effort to identify functional strengths and weaknesses.  With that frame of mind and goal, I submit the following points. I feel that these points are some of the best practical guidance I have received over the past few years, with my occasional thoughts and opinions mixed in.  Please have a look and, especially if you disagree, share your thoughts and some educational nuggets with the rest of us.

Predictive Coding Technology and Training the Machine

  • The most fundamental action of the algorithms behind the various predictive coding tools is to analyze only the text on the face of a given document when attempting to determine relationships between that document, seed set documents, and the other non-categorized documents in the review corpus.
  • Dates and other similar data consisting of numbers on the face of a document cannot be considered by the machine when attempting to determine relevance.
  • Non-text-of-document metadata can be used to filter and cull document sets, but it is not part of the core algorithmic analysis of any given document.
  • I have heard many claims that metadata analysis is part of predictive coding “technology”, but I have yet to hear anything to convince me that its role is anything more than that of a filter applied as part of the predictive coding “process.”
  • I voiced this opinion when speaking with an “expert” last year and he agreed that, while metadata filtering can certainly be incorporated into the overall process, anyone suggesting that metadata is part of the algorithmic analysis of a given document misunderstands how the core algorithmic analysis occurs.
  • I feel very strongly that the role of non-text-of-document metadata needs to be crystal clear so attorneys understand what factors should be considered when training the machine for relevance and non-relevance.  Without a crystal clear understanding of which data fields the machine is studying to determine conceptual relationships, one cannot possibly hope to effectively and consistently train the machine.
  • The text of the documents to be analyzed via predictive coding must be indexed prior to analysis.
  • Only documents that serve as “good” examples (i.e., lots of text) should be used to train the machine during the predictive coding process.
  • Relevance of a document has nothing to do with whether it is a “good” example for training the machine.  Whether a document is a “good” example is determined by the richness of text in that document.  It helps to have “good” examples of both relevant and non-relevant documents for training.
  • When assessing whether a document is a “good” training example, you should always compare what you see in the “viewer” of a given tool with what you see in the extracted text view.  What you see in a viewer may actually be an image of a document, and thus no text exists for analysis unless that image text is captured via OCR.
  • Numbers are not indexed for the purposes of analyzing and determining conceptual relationships in predictive coding.
  • Headers and footers (i.e., boilerplate at the bottom of an e-mail) are also generally not indexed for analysis.
  • Spreadsheets and other similar documents need to be considered very carefully during the training process.  Spreadsheets that contain mostly numbers are “bad” training examples.  Spreadsheets with lots of text may possibly be “good” examples, but keep in mind that the formatting of a document is not considered when assessing contextual relationships between words within a document via predictive coding.  This can lead to some “confusion” for the machine where large, text-heavy spreadsheets are broken up into cells or tabs of varied subject matter.
  • You’ll need to do some digging early on to decide if it is possible to make a categorical decision about whether spreadsheets can be effectively reviewed by the machine, or whether you will need to review them manually.
  • When using an analytics tool like Relativity’s Excerpt Text, you should limit it to instances where you have twenty or more words that can be used to train during the machine learning process.  Selection of key words or phrases will not provide enough text for tools like Relativity’s Excerpt Text feature to work effectively.
  • A combination of random sampling and keyword searching should be used to gather seed set documents to train the machine.  Training on documents gathered through random sampling will help to avoid the “more like this” effect where you are getting only a few document types back.  Supplementing random sampling with documents gathered through keyword search will likely speed up the process, and will likely reduce the number of iterations required.
  • Culling data sets that will be analyzed with predictive coding tools through the use of non-text-of-document metadata or keyword filtering is not something to be done lightly.  My understanding is that culling swaths of non-relevant documents and training on only relevant or interesting documents will limit the type and subject matter of relevant documents returned, and will lead to an incomplete understanding of the case and available evidence.
  • That is not to say that custodial collections should never be trimmed with metadata filtering.  Assuming a corpus of sufficient size will remain, narrowing a collection based upon overly safe date ranges is not likely to have a negative impact on training since predictive coding algorithms do not understand dates for training purposes.  This type of filtering may also lessen the chance of a trainer mistakenly labeling documents as not relevant based on things like document date, rather than correctly assessing relevance based on the facial text of a document.
  • It is critical that the subject matter experts of a case, who are responsible for training the various predictive coding systems, do not confuse responsiveness and relevance during training.
  • Consideration of whether a document is “responsive” during a particular review will often not be limited to just the facial text of that document.  Responsiveness determinations will also usually be subject to several secondary considerations, including document date.
  • Treating a content-relevant document as irrelevant because it is outside the responsiveness date range will do significant harm.  It will confuse the machine by contradicting similar content you have previously, and rightfully, instructed the machine to understand as relevant.
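The idea of screening for “good” training examples by richness of text can be sketched in a few lines. This is only an illustration of the points above, not a description of how any particular predictive coding tool indexes documents: the names `indexable_words` and `is_good_training_example` are hypothetical, and the twenty-word threshold is an assumption borrowed from the excerpt-text guidance in the list.

```python
import re

# Assumed threshold, borrowed from the twenty-word excerpt guidance above.
MIN_WORDS = 20

def indexable_words(extracted_text):
    """Return word tokens from extracted text, skipping any token that
    contains a digit (a rough stand-in for 'numbers are not indexed')."""
    tokens = re.findall(r"[A-Za-z0-9']+", extracted_text)
    return [t for t in tokens if not any(ch.isdigit() for ch in t)]

def is_good_training_example(extracted_text, min_words=MIN_WORDS):
    """True when the extracted text has enough non-numeric words.
    Relevance plays no part in this check."""
    return len(indexable_words(extracted_text)) >= min_words

# Always test the extracted text, not what the viewer displays.
memo_text = (
    "The parties met to discuss the pending supply agreement and agreed "
    "that the delivery schedule would be revised before the next "
    "quarterly review meeting"
)
spreadsheet_text = "Q1 1200 3400 Q2 1500 2900 Total 9000"
image_only_text = ""  # scanned image with no OCR text captured

for name, text in [("memo", memo_text),
                   ("spreadsheet", spreadsheet_text),
                   ("image", image_only_text)]:
    print(name, is_good_training_example(text))
```

In this sketch the memo passes, while the mostly numeric spreadsheet and the image with no captured text do not. A real workflow would want to tune the threshold and token rules against the specific tool in use.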

Fair Comparison of Predictive Coding Performance

Clustify Blog - eDiscovery, Document Clustering, Technology-Assisted Review (Predictive Coding), Information Retrieval, and Software Development

Understandably, vendors of predictive coding software want to show off numbers indicating that their software works well.  It is important for users of such software to avoid drawing wrong conclusions from performance numbers.

Consider the two precision-recall curves below (if you need to brush up on the meaning of precision and recall, see my earlier article):

[Figure: precision-recall curves for two different tasks]

The one on the left is incredibly good, with 97% precision at 90% recall.  The one on the right is not nearly as impressive, with 17% precision at 70% recall, though you could still find 70% of the relevant documents with no additional training by reviewing only the highest-rated 4.7% of the document population (excluding the documents reviewed for training and testing).
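The relationship between precision, recall, and the share of the population that must be reviewed can be worked through with a short sketch. It assumes only the standard definitions (documents reviewed = relevant documents found / precision, and relevant documents found = recall × prevalence × population); the function names are hypothetical, and the prevalence it derives is implied by the quoted figures, not stated in the excerpt.

```python
def review_fraction(prevalence, precision, recall):
    """Fraction of the population to review to reach `recall`
    at the given `precision`."""
    return prevalence * recall / precision

def implied_prevalence(fraction_reviewed, precision, recall):
    """Invert the relation above to recover prevalence."""
    return fraction_reviewed * precision / recall

# Figures quoted for the right-hand curve: 17% precision at 70% recall,
# reached by reviewing the highest-rated 4.7% of the population.
prevalence = implied_prevalence(0.047, precision=0.17, recall=0.70)
print(f"implied prevalence: {prevalence:.2%}")

# Round trip: the same prevalence gives back the 4.7% review fraction.
print(f"review fraction: {review_fraction(prevalence, 0.17, 0.70):.1%}")
```

On these numbers, roughly 1.1% of the remaining population would be relevant. That is worth keeping in mind when reading the 17% precision figure, since precision depends heavily on how rare relevant documents are in the first place.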

Why are the two curves so different?  They come from the same algorithm applied to the same document population with the same features (words) analyzed and the exact same random sample…


A Guide to Forms of Production

Ball in your Court

Semiannually, I compile a primer on some key aspect of electronic discovery.  In the past, I’ve written on, inter alia, computer forensics, backup systems, metadata and databases. For 2014, I’ve completed the first draft of the Lawyers’ Guide to Forms of Production, intended to serve as a primer on making sensible and cost-effective specifications for production of electronically stored information.  It’s the culmination and re-purposing of much that I’ve written on forms heretofore along with new material extolling the advantages of native and near-native forms.

Reviewing the latest draft, there is still so much I want to add and re-organize; accordingly, it will be a work-in-progress for months to come.  Consider it something of a “public comment” version.  The linked document includes exemplar verbiage for requests and model protocols for your adaption and adoption.  I plan to add more forms and examples over time.


The Importance of Cybersecurity to the Legal Profession and Outsourcing as a Best Practice – Part Two

e-Discovery Team ®

This is part two of this article. Please read part one first.

4. Risk. Risks of error are inherent in Lit-Support Department activities. What they do is often complex and technical, just like any e-discovery vendor. So too are risks of data breach. There is always a danger of hacker intrusions. Just ask Target.

Do you know what your exposure is for a data breach? What damages could be caused by the accidental loss or disclosure of your client’s e-discovery data? How many terabytes of client data are you holding right now? How much of that is confidential? What if there is an ESI processing error? What if attorney-client emails were not processed and screened properly?

Mistakes can happen, especially when a law firm is operating outside of its core competency. What if an error requires a complete re-do of a project? What will that cost the firm? You cannot…


The Importance of Cybersecurity to the Legal Profession and Outsourcing as a Best Practice – Part One

e-Discovery Team ®

Cybersecurity should be job number one for all attorneys. Why? Because we handle confidential computer data, usually secret information that belongs to our clients, not us. We have an ethical duty to protect this information under Rule 1.6 of the ABA Model Rules of Professional Conduct. If we handle big cases, or big corporate matters, then we also handle big collections of electronically stored information (ESI). The amount of ESI involved is growing every day. That is one reason that Cybersecurity is a hard job for law firms. The other is the ever increasing threat of computer hackers.

The threat is now increasing rapidly because there are now criminal gangs of hackers, including the Chinese government, that have targeted this ESI for theft. These bad hackers, known as crackers, have learned that when they cannot get at a company’s data directly, usually because it is too well defended, or too…
