A sales person can tell you all about how great the view is from up on the mountain, but only a guide who knows the way can get you to the top.
Like most attorneys interested in the field of eDiscovery, a few years back I became quite curious about a series of new litigation support product offerings most commonly referred to as predictive coding or technology assisted review. It was hard not to get excited about a “new” technology that was reportedly offering so much promise in an area of the law that so desperately needed it. Being the type of person who is not very likely to just take someone’s word for it (i.e., I learn most of my lessons the “hard way”), I set out to understand the technology behind these new offerings, including understanding some of the not-so-basic mathematical concepts that support the various predictive coding engines.
Over the years I have jotted down many of the predictive coding nuggets I have gathered, including some of my thoughts and opinions on offered information that I felt was not entirely accurate. Some of those thoughts and opinions may not be well received by some product developers or providers, especially when it comes to how I feel the role of metadata in the predictive coding process is properly described. I write this post to share those nuggets and opinions knowing well that I have plenty to learn about the process and the technology, and hoping that other eDiscovery nerds out there will be willing to step in and offer thoughtful and explanatory corrections where my opinions have gone astray.
Before I begin sharing, I think it is important to frame what my thought process was when jotting down most of these points. As everyone now knows, the hype surrounding predictive coding’s arrival in the eDiscovery industry was vastly overblown. This became quite clear to me a couple of years ago when a vendor proposed the use of a home-grown Latent Semantic Indexing (“LSI”) based analytics tool to assist our team of attorneys with the identification and review of the most relevant documents in a corpus of about 350,000 that the vendor had been hosting for us for a couple of months. They made their case and, like many other predictive coding pitches, it sounded too good to be true. In this case, it was.
I had become very familiar with the characteristics of this particular corpus, and it was a bit of an odd set. Many of the most important documents in the case went back as far as the ’50s and ’60s. The OCR text for those documents was mostly poor, and occasionally fair. Beyond understanding the corpus, I had done a small amount of research on LSI and analytics on Content Analyst’s website, by reviewing some of their older white papers. My understanding of that research told me that LSI was unreliable in cases where the text of a given document is degraded beyond 35 percent (i.e., 35 percent or more of the characters cannot be identified reliably through OCR).
I came into that pitch meeting armed with these two very important pieces of information, and thankfully so. I am fairly sure that we would have wasted some valuable time and money experimenting with the offered tool had I not been able to articulate my concerns regarding text degradation and the known limitations of the technology behind the vendor’s analytic tool. It was on that day that I decided I would do my best to understand this “new” technology, and do my best to break it down into the simplest terms so that I could help others understand it as well. I was motivated primarily by the unfortunate idea that most attorneys at that time would have entered that meeting entirely unprepared to escape the allure of an offering so full of promise but, unbeknownst to the vendor sales team, entirely ineffective under the circumstances.
Despite all the promise that predictive coding brings to eDiscovery, most lawyers are still reluctant to embrace it. I feel very strongly that this is because no one takes the time to explain how it really works in terms that attorneys can understand. I am not saying that all attorneys need to be able to translate calculus algorithms into Matlab code. What I am saying is that attorneys would more readily embrace the technology if vendors spent more time explaining how the most critical components of the tool work. Instead, I have listened to many presentations where a sales team glosses over how their base LSI, PLSI or SVM algorithms work to classify the documents, in favor of spending most of the meeting focusing on the less essential bells and whistles of their tool.
I see two possible reasons for this. The first is that the sales people making the pitch don’t have a thorough understanding of the core technology behind their product. Instead, they have a script to follow, and struggle whenever they have to deviate. The second is that they understand it, but don’t bother to explain for fear that the attorneys in the audience would become lost and uninterested. These are both unfortunate situations that I fear happen all too often. We should all know by now that attorneys are not very good at adopting anything that they don’t understand, and rightfully so. They are the ones who are going to have to defend the use of it when things go sideways.
I feel that what is most needed is an honest accounting of the strengths and weaknesses of the core technology, in practical terms that can allay the logical (or in some cases illogical) fears of the attorneys who need to use these tools. The various tools are all too often pitched as a one-size-fits-all solution by sales people who really don’t understand the practical applications and limitations of the science as applied to document review. This was understandable initially, as everyone was in a rush to get to the market with their offering. What I can’t understand is why this problem persists after a few years of trying to sell it.
One of my goals for the past few years has been to remedy this problem as much as personally possible. I have tried to do this by going outside of my comfort zone to investigate the technology behind the tools, and then combine that knowledge with my practical eDiscovery experience in an effort to identify functional strengths and weaknesses. With that frame of mind and goal, I submit the following points. I feel that these points are some of the best practical guidance I have received over the past few years, with my occasional thoughts and opinions mixed in. Please have a look and, especially if you disagree, share your thoughts and some educational nuggets with the rest of us.
Predictive Coding Technology and Training the Machine
- The most fundamental action of the algorithms behind the various predictive coding tools is to analyze only the text on the face of a given document when attempting to determine relationships between that document, seed set documents, and the other non-categorized documents in the review corpus.
- Dates and other similar data consisting of numbers on the face of a document cannot be considered by the machine when attempting to determine relevance.
- Non-text-of-document metadata can be used to filter and cull document sets, but it is not part of the core algorithmic analysis of any given document.
- I have heard many claims that metadata analysis is part of predictive coding “technology”, but I have yet to hear anything to convince me that its role is anything more than that of a filter applied as part of the predictive coding “process.”
- I voiced this opinion when speaking with an “expert” last year and he agreed that, while metadata filtering can certainly be incorporated into the overall process, anyone suggesting that metadata is part of the algorithmic analysis of a given document misunderstands how the core algorithmic analysis occurs.
- I feel very strongly that the role of non-text-of-document metadata needs to be crystal clear so attorneys understand what factors should be considered when training the machine for relevance and non-relevance. Without a crystal clear understanding of which data fields the machine is studying to determine conceptual relationships, one cannot possibly hope to effectively and consistently train the machine.
- The text of the documents to be analyzed via predictive coding must be indexed prior to analysis.
- Only documents that serve as “good” examples (i.e., lots of text) should be used to train the machine during the predictive coding process.
- Relevance of a document has nothing to do with whether it is a “good” example for training the machine. Whether a document is a “good” example is determined by the richness of text in that document. It helps to have “good” examples of both relevant and non-relevant documents for training.
- When assessing whether a document is a “good” training example, you should always compare what you see in the “viewer” of a given tool with what you see in the extracted text view. What you see in a viewer may actually be an image of a document, and thus no text exists for analysis unless that image text is captured via OCR.
- Numbers are not indexed for the purposes of analyzing and determining conceptual relationships in predictive coding.
- Headers and footers (e.g., boilerplate at the bottom of an e-mail) are also generally not indexed for analysis.
- Spreadsheets and other similar documents need to be considered very carefully during the training process. Spreadsheets that contain mostly numbers are “bad” training examples. Spreadsheets with lots of text may possibly be “good” examples, but keep in mind that the formatting of a document is not considered when assessing contextual relationships between words within a document via predictive coding. This can lead to some “confusion” for the machine where large, text-heavy spreadsheets are broken up into cells or tabs of varied subject matter.
- You’ll need to do some digging early on to decide if it is possible to make a categorical decision about whether spreadsheets can be effectively reviewed by the machine, or whether you will need to review them manually.
- When using an analytics tool like Relativity’s Excerpt Text, you should limit it to instances where you have twenty or more words that can be used to train during the machine learning process. Selection of key words or phrases will not provide enough text for tools like Relativity’s Excerpt Text feature to work effectively.
- A combination of random sampling and keyword searching should be used to gather seed set documents to train the machine. Training on documents gathered through random sampling will help to avoid the “more like this” effect where you are getting only a few document types back. Supplementing random sampling with documents gathered through keyword search will likely speed up the process, and will likely reduce the number of iterations required.
- Culling data sets that will be analyzed with predictive coding tools through the use of non-text-of-document metadata or keyword filtering is not something to be done lightly. My understanding is that culling swaths of non-relevant documents and training on only relevant or interesting documents will limit the type and subject matter of relevant documents returned, and will lead to an incomplete understanding of the case and available evidence.
- That is not to say that custodial collections should never be trimmed with metadata filtering. Assuming a corpus of sufficient size will remain, narrowing a collection based upon overly safe date ranges is not likely to have a negative impact on training since predictive coding algorithms do not understand dates for training purposes. This type of filtering may also lessen the chance of a trainer mistakenly labeling documents as not relevant based on things like document date, rather than correctly assessing relevance based on the facial text of a document.
- It is critical that the subject matter experts of a case, who are responsible for training the various predictive coding systems, do not confuse responsiveness and relevance during training.
- Consideration of whether a document is “responsive” during a particular review will often not be limited to just the facial text of that document. Responsiveness determinations will also usually be subject to several secondary considerations, including document date.
- Treating a content-relevant document as irrelevant because it is outside the responsiveness date range will do significant harm. It will confuse the machine by contradicting similar content you have previously, and rightfully, instructed the machine to understand as relevant.
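
For the more technically curious, here is how I would sketch a few of the points above in Python. It is a toy illustration and not how any particular vendor's engine works. Every name in it (Doc, index_terms, is_good_example, build_seed_set, is_responsive) is hypothetical and of my own invention. The letters-only tokenizer is a crude stand-in for whatever indexing a real tool performs, and the twenty-word threshold is simply borrowed from the excerpt rule of thumb mentioned above, applied here to whole documents as an assumption. The sketch shows three things: numbers drop out before analysis, a "good" example is judged on text richness rather than relevance, and the date check belongs to responsiveness rather than to the training label.

```python
import random
import re
from dataclasses import dataclass
from datetime import date

# Assumption: the twenty-word rule of thumb reused as a whole-document threshold.
MIN_WORDS = 20

# Assumption: a crude tokenizer that keeps runs of letters and ignores digits.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


@dataclass
class Doc:  # hypothetical record, not any real tool's data model
    doc_id: str
    extracted_text: str  # what the engine sees, not what the viewer shows
    doc_date: date       # metadata: useful for filtering, not for training


def index_terms(text):
    """Return lowercase letter-only tokens; digits never reach the index."""
    return [token.lower() for token in WORD_RE.findall(text)]


def is_good_example(doc, min_words=MIN_WORDS):
    """Judge a training example on text richness alone, never on relevance."""
    return len(index_terms(doc.extracted_text)) >= min_words


def build_seed_set(docs, keywords, sample_size, seed=0):
    """Merge a random sample with single-word keyword hits, text-rich docs only."""
    candidates = [d for d in docs if is_good_example(d)]
    rng = random.Random(seed)
    sampled = rng.sample(candidates, min(sample_size, len(candidates)))
    wanted = {k.lower() for k in keywords}
    hits = [d for d in candidates if wanted & set(index_terms(d.extracted_text))]
    # Key by doc_id so a document found both ways appears only once.
    merged = {d.doc_id: d for d in sampled + hits}
    return list(merged.values())


def is_responsive(doc, content_relevant, start, end):
    """Responsiveness layers a date-range check on top of content relevance."""
    return content_relevant and start <= doc.doc_date <= end
```

The point of keeping is_responsive separate is that the content_relevant label you give the machine never changes because of a date. A 1962 memo can stay relevant for training purposes even though it falls outside the production date range.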