What is Continuous Active Learning (CAL), Really? – Part One

 

Ever since the March 2, 2015 Rio Tinto opinion and order, there has been a lot of buzz in eDiscovery around the phrase “Continuous Active Learning” (CAL). Judge Peck briefly mentioned CAL while summarizing the available case law around seed-set sharing and transparency. For the sake of clarity, the term seed-set in this post refers to the initial group of training documents used to kick off a Technology Assisted Review (TAR) project. We refer to the review sets that follow as training sets. The point of Judge Peck’s mention of CAL, as I understood it, was to alert readers to the possibility that seed-set selection and disclosure disputes may become much less necessary as TAR tools and protocols continue to evolve.

Judge Peck pointed to recent research and a law review article by Maura Grossman and Gordon Cormack to support that notion. Those works made two important points about seed-set documents. First, they asserted that the selection and coding of seed-set documents is less likely to define the ultimate success of TAR projects employing a true CAL protocol. The general theory there is that the influence of misclassified seed documents is fleeting, since the classifier used to identify successive training set documents is recreated after each round, rather than simply revised or refitted. Second, they argued that seed-set transparency is not the guaranteed path to TAR project completeness, since neither the producing nor receiving party has a true understanding of the breadth of the concepts / information types in a collection.
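To make that distinction concrete, here is a minimal sketch of a CAL-style loop in Python, using scikit-learn purely as a stand-in for the classifiers inside commercial tools. The names, batch size, and round count are hypothetical; the point it illustrates is that the model is rebuilt from scratch each round from all coding decisions gathered so far, which is why the influence of any misclassified seed documents fades.

```python
# A minimal sketch of a CAL-style loop (illustrative only; not any vendor's implementation).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_review(documents, seed_labels, review_fn, batch_size=100, rounds=10):
    """documents: list of extracted text; seed_labels: {index: 0 or 1} from the
    seed set (assumed to contain both relevant and non-relevant examples);
    review_fn: callable returning a reviewer's 0/1 decision for a document index."""
    features = TfidfVectorizer().fit_transform(documents)
    labels = dict(seed_labels)

    for _ in range(rounds):
        # Rebuild the classifier from scratch each round, rather than revising the
        # previous model; this is what lets early miscoding wash out over time.
        train_idx = list(labels)
        model = LogisticRegression(max_iter=1000)
        model.fit(features[train_idx], [labels[i] for i in train_idx])

        # Rank all unreviewed documents by predicted likelihood of relevance.
        unreviewed = [i for i in range(len(documents)) if i not in labels]
        if not unreviewed:
            break
        scores = model.predict_proba(features[unreviewed])[:, 1]
        ranked = sorted(zip(unreviewed, scores), key=lambda pair: pair[1], reverse=True)

        # The next training set is simply the top of the ranking, coded by reviewers.
        for idx, _ in ranked[:batch_size]:
            labels[idx] = review_fn(idx)

    return labels
```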

The fact that Judge Peck cited the work of Grossman and Cormack as the basis for his statement is important, because the definition of CAL asserted in those publications is different from what the makers of many TAR tools would offer – even those that claim to be CAL capable.

Read more at the Altep blog: What is Continuous Active Learning (CAL), Really? – Part One


Creative Analytics – Part 3: The Toolbox


By Sara Skeens and Josh Tolles

Welcome to part three of our Creative Analytics series. Part one provided a suggested roadmap for getting more comfortable with analytics tools and exploring more creative uses. In part two, we discussed some of the challenges common to the presentation phase of the EDRM, which require us to look for creative solutions. This brings us to part three – the solutions. In this post we provide more detail on a few key tools and techniques that we deploy to overcome those common challenges. This final installment is intended to serve as the closing primer for our co-hosted webinar with kCura taking place tomorrow, Wednesday, July 13th – Leveraging Analytics for Depo & Trial Prep. Please tune in then, as we will put things into a more visual, workflow-based perspective.

Narrowing The Field – Making The Most of Your Time 

Deposition and trial preparation typically begins as production review ends (in some cases the two processes overlap, adding further complications). It is here that you are usually faced with making sense of two distinct data sets: the documents you produced and the productions you received. Traditional fact-finding efforts involve simply leveraging reviewer coding and supplemental keyword searches. These techniques are a great place to start, but they can be highly time-inefficient and almost always suffer in terms of completeness.

One helpful early approach is to limit your fact-finding data set to unique content as much as possible. Analyzing duplicate content is a painful drain on resources. Whether a document is a false keyword hit or a true hot document, you generally need only one good look within its four corners to assess its value. This can be a bit counterintuitive, especially if you have been working with family-based coding guidelines during your review efforts. However, it is best to start small when time is of the essence. Identify key individual documents as quickly as possible, and then build context around those items later.
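One way to get there quickly, sketched below in Python purely as an illustration, is to fingerprint each document's extracted text and keep a single representative per fingerprint, while retaining a map back to the duplicates so that family and custodian context can be rebuilt around key documents later. The field names are assumptions, not any particular platform's schema.

```python
# A minimal sketch of text-hash deduplication (illustrative; field names are hypothetical).
import hashlib
import re

def text_fingerprint(extracted_text):
    """Normalize whitespace and case before hashing so trivial formatting
    differences do not defeat the duplicate check."""
    normalized = re.sub(r"\s+", " ", extracted_text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def unique_documents(documents):
    """documents: iterable of dicts with 'doc_id' and 'text' keys (an assumed
    schema). Returns one representative per fingerprint plus a map back to the
    duplicates, so context can be rebuilt around key documents later."""
    representatives, duplicates = {}, {}
    for doc in documents:
        fingerprint = text_fingerprint(doc["text"])
        if fingerprint not in representatives:
            representatives[fingerprint] = doc
            duplicates[fingerprint] = []
        else:
            duplicates[fingerprint].append(doc["doc_id"])
    return list(representatives.values()), duplicates
```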

Read more at the Altep blog: Creative Analytics – Part 3: The Toolbox

Creative #Analytics: Solving Challenges in the Presentation Phase


This post is Part 2 of a series – you can also watch a video of the related webinar, or read Part 1, on the kCura Blog.

By Joshua Tolles and Sara Skeens

In our last post, we discussed the value of looking at analytics in e-Discovery with a creative mindset, and a few steps that you can take to expand your problem-solving horizons. As we noted there, analytics is most commonly thought of as a tool to be applied during the review phase of the EDRM to control data sizes; however, we'd like to change that. At Altep, we frequently use analytics to solve many more problems than just those found in the production review arena. With a firm grasp on the technology, plenty of curiosity, and a healthy passion for "building a better mousetrap," we have found quite a few areas where analytics can help turn the eDiscovery rat race into a more methodical and scalable process.

The presentation phase of the EDRM is one such area. While the EDRM roadmap tells us that analysis occurs in conjunction with review and production, much of the real analysis work is done post-production, in the time leading up to presentation. Cases are often made or broken at deposition, and most certainly at trial. Thorough preparation and a crystal-clear understanding of the facts and available evidence are essential to success. However, you may encounter any of several potential pitfalls as you meet your discovery deadline and begin preparations.

Read more at the Altep blog: Creative Analytics – Part 2: The Presentation Phase

3 Steps to More Creative e-Discovery Analytics

By: Sara Skeens and Joshua Tolles


Flexibility and adaptability are two of the more important traits of any highly successful legal professional. Those traits are rarely more in demand than right now, when growing data volumes mean we continue to see and solve new and different discovery challenges—many of which would have seemed impossible or too difficult to resolve just a few years ago.

As the average case size continues to grow, and the definition of “unduly burdensome” continues to develop, a premium has been placed on discovery strategies that are both defensible and cost-effective. Both new and existing technologies have become the key to addressing this evolution.

Step 1: Build a strong foundation.

A strong understanding of technology provides the foundation for innovation. A deep understanding of what’s available, when combined with flexibility, creativity, and an understanding of how the tools can work together, opens up a world of problem-solving possibilities. Armed with technical know-how and a drive to think outside the box, you are not limited by when and how these tools are most commonly used. Playing off the strengths and weaknesses of each tool and its ability to solve a problem can fill in gaps and increase efficiency.

Analytics as a technology was primarily introduced in e-discovery to solve the challenge of growing data sizes and associated, often prohibitively high, review costs. Applications such as clustering, categorization, technology-assisted review (TAR), email threading, and near-duplicate detection are now implemented on a daily basis to do just that. Growing acceptance means these tools often come with templated workflows that make getting started much easier. Starting small by putting these workflows into practice can help you build enough expertise to identify new opportunities for tackling your most complex projects with customized, combined, and creative workflows ideated with your unique goals in mind.

Step 2: Explore each analytics feature on its own merit.

Did you know the same technology supports both categorization and technology-assisted review? In both use cases, the technology is trained by users’ decisions to organize documents based on content. There are, however, differences in the training and quality control methodologies that allow each of these options to be more applicable in certain situations. Categorization might be a useful exercise for QC purposes when it’s performed in conjunction with a privilege review, for instance, while a TAR project can help accelerate the earliest stages of reviewing your data.

These differences in approach exemplify the ability to use the same underlying technology to solve various challenges. As you begin implementing analytics in your projects, get to know each feature and how it can benefit different use cases. What makes email threading valuable? What about clustering?
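As a rough illustration of that point, the sketch below (hypothetical names, with scikit-learn standing in for the underlying engine) trains one text classifier on example coding decisions and then uses it two ways: to tag documents into a category when the model is confident, and to rank documents for prioritized review.

```python
# Illustrative sketch: one trained classifier, two workflows (categorization vs. prioritization).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_classifier(example_texts, example_labels):
    """Train one text classifier on reviewers' example decisions (1 = in the
    category / relevant, 0 = not)."""
    vectorizer = TfidfVectorizer()
    model = LogisticRegression(max_iter=1000)
    model.fit(vectorizer.fit_transform(example_texts), example_labels)
    return vectorizer, model

def categorize(vectorizer, model, texts, threshold=0.7):
    """Categorization-style use: tag only documents the model is confident about,
    for example as a QC overlay on a privilege review."""
    scores = model.predict_proba(vectorizer.transform(texts))[:, 1]
    return [i for i, score in enumerate(scores) if score >= threshold]

def prioritize(vectorizer, model, texts):
    """TAR-style use: hand reviewers the likeliest relevant documents first."""
    scores = model.predict_proba(vectorizer.transform(texts))[:, 1]
    return sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
```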

Step 3: Explore analytics features in different combinations and stages.

The beauty of analytics is that it is more a method than a tool. “The method of logical analysis” is how Merriam-Webster defines analytics, and its very nature makes it ideal for flexing and adapting to new use cases.

Once you’ve gotten comfortable with each analytics tool and the benefits it can present, you’ll have the confidence to start combining features in the same workflows to see how one tool’s results can improve another’s. Maybe just one feature will do the job for the small case that just came through your door. But maybe a combination of features is required for the next big case. For example, how does email threading cut down noise in a TAR project? How might foreign language identification make your team’s approach to clustering results of international data more efficient? The possibilities are endless.

With a little education and expert guidance, you can apply analytics tools creatively, without limiting your team to a single, go-to approach that may not be up to snuff for solving your most complex e-discovery challenges.

You can even apply analytics beyond the scope of initial review. We at Altep will be releasing a series of posts on the Altep blog discussing how to use analytics in e-discovery in new and exciting ways, leading up to a thought-provoking webinar discussing its use in the presentation phase of the EDRM. There has been little focus on leveraging analytics during this phase, though it can benefit teams faced with organizing sets of produced data while courtroom deadlines loom.

By creatively leveraging analytics during deposition and trial, you can considerably cut time and costs during this phase, as well as locate key information more quickly to increase your chances for success. The techniques at your disposal are certainly a departure from traditional strategies, but they are tested and proven solutions that work.

Sara Skeens is a consultant for advanced review and analytics with Altep’s litigation consulting group. She has over 10 years of experience providing solutions and workflow guidance to case teams and enterprise clients in the areas of preservation, review, analysis, production, and presentation. She is a Relativity Certified Expert and has held positions in law firms, government, and providers working in both criminal and civil litigation, as well as investigations.

Joshua Tolles is a senior consultant for advanced review and analytics with Altep’s litigation consulting group. In this role, he provides process, solutions, and workflow guidance to case teams and enterprise clients in the areas of preservation, collections, processing, review, analysis, and production. Joshua is a licensed attorney in Washington State and the District of Columbia, and a Relativity Certified Expert.

Also available via the kCura blog: 3 Steps to More Creative e-Discovery Analytics

Less Sales, More Guidance in Predictive Coding and Analytics

A sales person can tell you all about how great the view is from up on the mountain, but only a guide who knows the way can get you to the top.

Guide Us To The Top

Like most attorneys interested in the field of eDiscovery, a few years back I became quite curious about a series of new litigation support product offerings most commonly referred to as predictive coding or technology assisted review.  It was hard not to get excited about a “new” technology that was reportedly offering so much promise in an area of the law that so desperately needed it.  Being the type of person who is not very likely to just take someone’s word for it (i.e., I learn most of my lessons the “hard way”), I set out to understand the technology behind these new offerings, including understanding some of the not-so-basic mathematical concepts that support the various predictive coding engines.

Over the years I have jotted down many of the predictive coding nuggets I have gathered, including some of my thoughts and opinions on information offered to me that I felt was not entirely accurate. Some of those thoughts and opinions may not be well received by some product developers or providers, especially when it comes to how I feel the role of metadata in the predictive coding process should properly be described. I write this post to share those nuggets and opinions knowing well that I have plenty to learn about the process and the technology, and hoping that other eDiscovery nerds out there will be willing to step in and offer thoughtful and explanatory corrections where my opinions have gone astray.

Before I begin sharing, I think it is important to frame what my thought process was when jotting down most of these points. As everyone now knows, the hype surrounding predictive coding's arrival in the eDiscovery industry was vastly overblown. This became quite clear to me a couple of years ago when a vendor proposed the use of a home-grown Latent Semantic Indexing ("LSI") based analytics tool to assist our team of attorneys with the identification and review of the most relevant documents in a corpus of about 350,000 documents that the vendor had been hosting for us for a couple of months. They made their case and, like many other predictive coding pitches, it sounded too good to be true. In this case, it was.

I had become very familiar with the characteristics of this particular corpus, and it was a bit of an odd set. Many of the most important documents in the case went back as far as the '50s and '60s. The OCR text for those documents was mostly poor, and occasionally fair. Beyond understanding the corpus, I had done a small amount of research on LSI and analytics on Content Analyst's website, by reviewing some of their older white papers. My understanding of that research was that LSI is unreliable in cases where the text of a given document is degraded beyond 35 percent (i.e., 35 percent or more of the characters cannot be identified reliably through OCR).
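As an aside for the more technical reader, here is a rough Python sketch of how one might screen a corpus against that kind of threshold. The dictionary-based heuristic is my own assumption for estimating how much of the OCR text is unreliable, not Content Analyst's method; it simply flags documents that appear to cross the 35 percent mark.

```python
# A rough, illustrative proxy for OCR degradation (my heuristic, not Content
# Analyst's method): treat the share of tokens that are not recognizable words
# as a stand-in for the share of characters that OCR could not identify reliably.
import re

DEGRADATION_THRESHOLD = 0.35  # the 35 percent figure noted above

def estimated_degradation(ocr_text, dictionary):
    """dictionary: a set of known lowercase words (assumed to be supplied)."""
    tokens = re.findall(r"[A-Za-z]+", ocr_text)
    if not tokens:
        return 1.0  # no usable text at all
    unknown = sum(1 for token in tokens if token.lower() not in dictionary)
    return unknown / len(tokens)

def suitable_for_lsi(ocr_text, dictionary):
    """Flag documents whose estimated degradation stays under the threshold."""
    return estimated_degradation(ocr_text, dictionary) < DEGRADATION_THRESHOLD
```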

I came into that pitch meeting armed with these two very important pieces of information, and thankfully so. I am fairly sure that we would have wasted valuable time and money experimenting with the offered tool had I not been able to articulate my concerns regarding text degradation and the known limitations of the technology behind the vendor's analytics tool. It was on that day that I decided I would do my best to understand this "new" technology and break it down into the simplest terms so that I could help others understand it as well. I was motivated primarily by the unfortunate idea that most attorneys at that time would have entered that meeting entirely unprepared to escape the allure of an offering so full of promise but, unbeknownst to the vendor sales team, entirely ineffective under the circumstances.

Despite all the promise that predictive coding brings to eDiscovery, most lawyers are still reluctant to embrace it.  I feel very strongly that this is because no one takes the time to explain how it really works in terms that attorneys can understand.  I am not saying that all attorneys need to be able to translate calculus algorithms into Matlab code.  What I am saying is that attorneys would more readily embrace the technology if vendors spent more time explaining how the most critical components of the tool work.  Instead, I have listened to many presentations where a sales team glosses over how their base LSI, PLSI or SVM algorithms work to classify the documents, in favor of spending most of the meeting focusing on the less essential bells and whistles of their tool.

I see two possible reasons for this. The first is that the sales people making the pitch don't have a thorough understanding of the core technology behind their product. Instead, they have a script to follow, and they struggle whenever they have to deviate from it. The second is that they understand it, but don't bother to explain it for fear that the attorneys in the audience would become lost and uninterested. These are both unfortunate situations that I fear happen all too often. We should all know by now that attorneys are not very good at adopting anything that they don't understand, and rightfully so. They are the ones who will have to defend the use of it when things go sideways.

I feel that what is most needed is an honest accounting of the strengths and weaknesses of the core technology, in practical terms that can allay the logical (or in some cases illogical) fears of the attorneys who need to use these tools. The various tools are all too often pitched as a one-size-fits-all solution by sales people who really don't understand the practical applications and limitations of the science as applied to document review. This was understandable initially, as everyone was in a rush to get to market with their offering. What I can't understand is why this problem persists after a few years of trying to sell it.

One of my goals for the past few years has been to remedy this problem as much as personally possible.  I have tried to do this by going outside of my comfort zone to investigate the technology behind the tools, and then combine that knowledge with my practical eDiscovery experience in an effort to identify functional strengths and weaknesses.  With that frame of mind and goal, I submit the following points. I feel that these points are some of the best practical guidance I have received over the past few years, with my occasional thoughts and opinions mixed in.  Please have a look and, especially if you disagree, share your thoughts and some educational nuggets with the rest of us.

Predictive Coding Technology and Training the Machine

  • The most fundamental action of the algorithms behind the various predictive coding tools is to analyze only the text on the face of a given document when attempting to determine relationships between that document, seed set documents, and the other non-categorized documents in the review corpus.
  • Dates and other similar data consisting of numbers on the face of a document cannot be considered by the machine when attempting to determine relevance.
  • Non-text-of-document metadata can be used to filter and cull document sets, but it is not part of the core algorithmic analysis of any given document.
  • I have heard many claims that metadata analysis is part of predictive coding “technology”, but I have yet to hear anything to convince me that its role is anything more than that of a filter applied as part of the predictive coding “process.”
  • I voiced this opinion when speaking with an “expert” last year and he agreed that, while metadata filtering can certainly be incorporated into the overall process, anyone suggesting that metadata is part of the algorithmic analysis of a given document misunderstands how the core algorithmic analysis occurs.
  • I feel very strongly that the role of non-text-of-document metadata needs to be crystal clear so attorneys understand what factors should be considered when training the machine for relevance and non-relevance.  Without a crystal clear understanding of which data fields the machine is studying to determine conceptual relationships, one cannot possibly hope to effectively and consistently train the machine.
  • The text of the documents to be analyzed via predictive coding must be indexed prior to analysis.
  • Only documents that serve as “good” examples (i.e., lots of text) should be used to train the machine during the predictive coding process.
  • Relevance of a document has nothing to do with whether it is a “good” example for training the machine.  Whether a document is a “good” example is determined by the richness of the text in that document (see the sketch after this list).  It helps to have “good” examples of both relevant and non-relevant documents for training.
  • When assessing whether a document is a “good” training example, you should always compare what you see in the “viewer” of a given tool with what you see in the extracted text view.  What you see in a viewer may actually be an image of a document, and thus no text exists for analysis unless that image text is captured via OCR.
  • Numbers are not indexed for the purposes of analyzing and determining conceptual relationships in predictive coding.
  • Headers and footers (i.e., boilerplate at the bottom of an e-mail) are also generally not indexed for analysis.
  • Spreadsheets and other similar documents need to be considered very carefully during the training process.  Spreadsheets that contain mostly numbers are “bad” training examples.  Spreadsheets with lots of text may possibly be “good” examples, but keep in mind that the formatting of a document is not considered when assessing contextual relationships between words within a document via predictive coding.  This can lead to some “confusion” for the machine where large, text-heavy spreadsheets are broken up into cells or tabs of varied subject matter.
  • You’ll need to do some digging early on to decide if it is possible to make a categorical decision about whether spreadsheets can be effectively reviewed by the machine, or whether you will need to review them manually.
  • When using an analytics feature like Relativity’s Excerpt Text, limit its use to instances where you have twenty or more words that can be used to train during the machine learning process.  Selecting only a few key words or phrases will not provide enough text for the feature to work effectively.
  • A combination of random sampling and keyword searching should be used to gather seed set documents to train the machine.  Training on documents gathered through random sampling will help to avoid the “more like this” effect, where you get back only a few document types.  Supplementing random sampling with documents gathered through keyword search will likely speed up the process and reduce the number of iterations required.
  • Culling data sets that will be analyzed with predictive coding tools through the use of non-text-of-document metadata or keyword filtering is not something to be done lightly.  My understanding is that culling swaths of non-relevant documents and training on only relevant or interesting documents will limit the type and subject matter of relevant documents returned, and will lead to an incomplete understanding of the case and available evidence.
  • That is not to say that custodial collections should never be trimmed with metadata filtering.  Assuming a corpus of sufficient size will remain, narrowing a collection based upon overly safe date ranges is not likely to have a negative impact on training, since predictive coding algorithms do not understand dates for training purposes.  This type of filtering may also lessen the chance of a trainer mistakenly labeling documents as not relevant based on things like document date, rather than correctly assessing relevance based on the facial text of a document.
  • It is critical that the subject matter experts on a case, who are responsible for training the various predictive coding systems, do not confuse responsiveness and relevance during training.
  • Consideration of whether a document is “responsive” during a particular review will often not be limited to just the facial text of that document.  Responsiveness determinations will also usually be subject to several secondary considerations, including document date.
  • Treating a content-relevant document as irrelevant because it falls outside the responsiveness date range will do significant harm.  It will confuse the machine by contradicting similar content that you have previously, and rightly, instructed it to treat as relevant.
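Pulling several of these points together, the following Python sketch shows one way to screen documents for training suitability: strip boilerplate, ignore numbers, and require a minimum amount of remaining text. The footer patterns and the twenty-word floor are illustrative assumptions, not settings from any particular tool.

```python
# Illustrative "good training example" screen; patterns and thresholds are assumptions.
import re

# Hypothetical boilerplate patterns; real email footers vary by organization.
BOILERPLATE_PATTERNS = [
    r"(?im)^this e-?mail .*confidential.*$",
    r"(?im)^sent from my .*$",
]
MIN_TRAINING_WORDS = 20  # illustrative floor, echoing the Excerpt Text guidance above

def trainable_word_count(extracted_text):
    """Strip boilerplate, then count alphabetic tokens; numbers and dates are
    ignored because they do not inform the conceptual analysis."""
    text = extracted_text
    for pattern in BOILERPLATE_PATTERNS:
        text = re.sub(pattern, " ", text)
    return len(re.findall(r"[A-Za-z]{2,}", text))

def is_good_training_example(extracted_text):
    """Judge training suitability by text richness, not by relevance."""
    return trainable_word_count(extracted_text) >= MIN_TRAINING_WORDS
```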