What is Continuous Active Learning (CAL), Really? – Part One


Ever since the March 2, 2015 Rio Tinto opinion and order, there has been a lot of buzz in eDiscovery around the phrase “Continuous Active Learning” (CAL). Judge Peck briefly mentioned CAL while summarizing the available case law around seed-set sharing and transparency. For the sake of clarity, the term seed-set in this post refers to the initial group of training documents used to kick off a Technology Assisted Review (TAR) project. We refer to the review sets that follow as training sets. The point of Judge Peck’s mention of CAL, as I understood it, was to alert readers to the possibility that seed-set selection and disclosure disputes may become much less necessary as TAR tools and protocols continue to evolve.

Judge Peck pointed to recent research and a law review article by Maura Grossman and Gordon Cormack to support that notion. Those works made two important points about seed-set documents. First, they asserted that the selection and coding of seed-set documents is less likely to define the ultimate success of TAR projects employing a true CAL protocol. The general theory there is that the influence of misclassified seed documents is fleeting, since the classifier used to identify successive training set documents is recreated after each round, rather than simply revised or refitted. Second, they argued that seed-set transparency is not the guaranteed path to TAR project completeness, since neither the producing nor receiving party has a true understanding of the breadth of the concepts / information types in a collection.

The fact that Judge Peck cited the work of Grossman and Cormack as the basis for his statement is important, because the definition of CAL asserted in those publications is different from what the makers of many TAR tools would offer – even those that claim to be CAL capable.

Read More at the Altep blog: What is Continuous Active Learning (CAL), Really? – Part One

Creative Analytics – Part 3: The Toolbox


By Sara Skeens and Josh Tolles

Welcome to part three of our Creative Analytics series. Part one provided a suggested road map for getting more comfortable with analytics tools and exploring more creative uses. In part two, we discussed some of the challenges common to the presentation phase of the EDRM, which require us to look for creative solutions. This brings us to part three – the solutions. In this post we will provide more detail on a few key tools and techniques that we deploy to overcome those common challenges. This final installment is intended to serve as the closing primer for our co-hosted webinar with kCura taking place tomorrow, Wednesday, July 13 – Leveraging Analytics for Depo & Trial Prep. Please tune in as we put things into a more visual, workflow-based perspective.

Narrowing The Field – Making The Most of Your Time 

Deposition and trial preparation typically begins as production review ends (in some cases the two processes overlap, adding further complications). It is here that you are usually faced with making sense of two distinct data sets – your produced documents and the productions you have received. Traditional fact-finding efforts involve simply leveraging reviewer coding and supplemental keyword searches. These techniques are a great place to start, but they can be highly time-inefficient and almost always suffer in terms of completeness.

One helpful early approach is to limit your fact-finding data set to unique content as much as possible. Analyzing duplicate content is a painful drain on resources. Whether a document is a false keyword hit or a true hot document, you generally need only one good look within its four corners to assess its value. This can be a bit counterintuitive, especially if you have been working with family coding guidelines during your review efforts. However, it is best to start small when time is of the essence. Identify key individual documents as quickly as possible, and then build context around those items later.

Read more at the Altep blog: Creative Analytics – Part 3: The Toolbox

Creative #Analytics: Solving Challenges in the Presentation Phase


This post is Part 2 of a series – you can also watch a video of the related webinar, or read Part 1, on the kCura Blog.

By Joshua Tolles and Sara Skeens

In our last post, we discussed the value of looking at analytics in e-Discovery with a creative mindset, and a few steps that you can take to expand your problem solving horizons. As we noted there, analytics is most commonly thought of as a tool to be applied during the review phase of the EDRM to control data sizes; however, we’d like to change that. At Altep, we frequently use analytics to solve many more problems than just those found in the production review arena. With a firm grasp on the technology, plenty of curiosity, and a healthy passion for “building a better mouse trap,” we have found quite a few areas where analytics can help turn the eDiscovery rat race into a more methodical and scalable process.

The presentation phase of the EDRM is one such area. While the EDRM roadmap tells us that analysis occurs in conjunction with review and production, much of the real analysis work is done post-production, in the time leading up to presentation. Cases are often made or broken at deposition, and most certainly at trial. Thorough preparation and a crystal clear understanding of the facts and available evidence are essential to success. However, you may encounter any of several potential pitfalls as you meet your discovery deadline and begin preparations.

Read more at the Altep blog: Creative Analytics – Part 2: The Presentation Phase

3 Steps to More Creative e-Discovery Analytics

By: Sara Skeens and Joshua Tolles


Flexibility and adaptability are two of the more important traits of any highly successful legal professional. Those traits are rarely more in demand than right now, when growing data volumes mean we continue to see and solve new and different discovery challenges—many of which would have seemed impossible or too difficult to resolve just a few years ago.

As the average case size continues to grow, and the definition of “unduly burdensome” continues to develop, a premium has been placed on discovery strategies that are both defensible and cost-effective. Both new and existing technologies have become the key to addressing this evolution.

Step 1: Build a strong foundation.

A strong understanding of technology provides the foundation for innovation. A deep understanding of what’s available, when combined with flexibility, creativity, and an understanding of how the tools can work together, opens up a world of problem-solving possibilities. Armed with technical know-how and a drive to think outside the box, you are not limited by when and how these tools are most commonly used. Playing off the strengths and weaknesses of each tool and its ability to solve a problem can fill in gaps and increase efficiency.

Analytics as a technology was primarily introduced in e-discovery to solve the challenge of growing data sizes and associated, often prohibitively high, review costs. Applications such as clustering, categorization, technology-assisted review (TAR), email threading, and near-duplicate detection are now implemented on a daily basis to do just that. Growing acceptance means these tools often come with templated workflows that make getting started much easier. Starting small by putting these workflows into practice can help you build enough expertise to identify new opportunities for tackling your most complex projects with customized, combined, and creative workflows ideated with your unique goals in mind.

Step 2: Explore each analytics feature on its own merit.

Did you know the same technology supports both categorization and technology-assisted review? In both use cases, the technology is trained by users’ decisions to organize documents based on content. There are, however, differences in the training and quality control methodologies that allow each of these options to be more applicable in certain situations. Categorization might be a useful exercise for QC purposes when it’s performed in conjunction with a privilege review, for instance, while a TAR project can help accelerate the earliest stages of reviewing your data.

These differences in approach exemplify the ability to use the same underlying technology to solve various challenges. As you begin implementing analytics in your projects, get to know each feature and how it can benefit different use cases. What makes email threading valuable? What about clustering?

Step 3: Explore analytics features in different combinations and stages.

The beauty of analytics is that it is more a method than a tool. “The method of logical analysis” is how Merriam-Webster defines analytics, and its very nature makes it ideal for flexing and adapting to new use cases.

Once you’ve gotten comfortable with each analytics tool and the benefits it can present, you’ll have the confidence to start combining features in the same workflows to see how one tool’s results can improve another’s. Maybe just one feature will do the job for the small case that just came through your door. But maybe a combination of features is required for the next big case. For example, how does email threading cut down noise in a TAR project? How might foreign language identification make your team’s approach to clustering results of international data more efficient? The possibilities are endless.

With a little education and expert guidance, you can apply analytics tools creatively, without limiting your team to a single, go-to approach that may not be up to snuff for solving your most complex e-discovery challenges.

You can even apply analytics beyond the scope of initial review. We at Altep will be releasing a series of posts on the Altep blog discussing how to use analytics in e-discovery in new and exciting ways, leading up to a thought-provoking webinar discussing its use in the presentation phase of the EDRM. There has been little focus on leveraging analytics during this phase, though it can benefit teams faced with organizing sets of produced data while courtroom deadlines loom.

By creatively leveraging analytics during deposition and trial, you can considerably cut time and costs during this phase, as well as locate key information more quickly to increase your chances for success. The techniques at your disposal are certainly a departure from traditional strategies, but they are tested and proven solutions that work.

Sara Skeens is a consultant for advanced review and analytics with Altep’s litigation consulting group. She has over 10 years of experience providing solutions and workflow guidance to case teams and enterprise clients in the areas of preservation, review, analysis, production, and presentation. She is a Relativity Certified Expert and has held positions in law firms, government, and providers working in both criminal and civil litigation, as well as investigations.

Joshua Tolles is a senior consultant for advanced review and analytics with Altep’s litigation consulting group. In this role, he provides process, solutions, and workflow guidance to case teams and enterprise clients in the areas of preservation, collections, processing, review, analysis, and production. Joshua is a licensed attorney in Washington State and the District of Columbia, and a Relativity Certified Expert.

Also available via the kCura blog: 3 Steps to More Creative e-Discovery Analytics

Less Sales, More Guidance in Predictive Coding and Analytics

A sales person can tell you all about how great the view is from up on the mountain, but only a guide who knows the way can get you to the top.

Guide Us To The Top

Like most attorneys interested in the field of eDiscovery, a few years back I became quite curious about a series of new litigation support product offerings most commonly referred to as predictive coding or technology assisted review.  It was hard not to get excited about a “new” technology that was reportedly offering so much promise in an area of the law that so desperately needed it.  Being the type of person who is not very likely to just take someone’s word for it (i.e., I learn most of my lessons the “hard way”), I set out to understand the technology behind these new offerings, including understanding some of the not-so-basic mathematical concepts that support the various predictive coding engines.

Over the years I have jotted down many of the predictive coding nuggets I have gathered, including some of my thoughts and opinions on offered information that I felt was not entirely accurate.  Some of those thoughts and opinions may not be well received by some product developers or providers, especially when it comes to how I feel the role of metadata in the predictive coding process is properly described.  I write this post to share those nuggets and opinions knowing well that I have plenty to learn about the process and the technology, and hoping that other eDiscovery nerds out there will be willing to step in and offer thoughtful and explanatory corrections where my opinions have gone astray.

Before I begin sharing, I think it is important to frame what my thought process was when jotting down most of these points. As everyone now knows, the hype surrounding predictive coding’s arrival in the eDiscovery industry was vastly overblown. This became quite clear to me a couple years ago when a vendor proposed the use of a home-grown Latent Semantic Indexing (“LSI”) based analytics tool to assist our team of attorneys with the identification and review of the most relevant documents in a corpus of about 350,000 that the vendor had been hosting for us for a couple months. They made their case and, like many other predictive coding pitches, it sounded too good to be true. In this case, it was.

I had become very familiar with the characteristics of this particular corpus, and it was a bit of an odd set.  Many of the most important documents in the case went back as far as the ‘50s and ‘60s.  The OCR text for those documents was mostly poor, and occasionally fair.  Beyond understanding the corpus, I had done a small amount of research on LSI and analytics on Content Analyst’s website, by reviewing some of their older white papers.  My understanding of that research told me that LSI was unreliable in cases where text of a given document is degraded beyond 35 percent (i.e., 35 percent or more of the characters cannot be identified reliably through OCR).

I came into that pitch meeting armed with these two very important pieces of information, and thankfully so.  I am fairly sure that we would have wasted some valuable time and money experimenting with the offered tool had I not been able to articulate my concerns regarding text degradation and the known limitations of the technology behind the vendor’s analytic tool.  It was on that day that I decided I would do my best to understand this “new” technology, and do my best to break it down into the simplest terms so that I could help others understand it as well.  I was motivated primarily by the unfortunate idea that most attorneys at that time would have entered that meeting entirely unprepared to escape the allure of an offering so full of promise but, unbeknownst to the vendor sales team, entirely ineffective under the circumstances.

Despite all the promise that predictive coding brings to eDiscovery, most lawyers are still reluctant to embrace it.  I feel very strongly that this is because no one takes the time to explain how it really works in terms that attorneys can understand.  I am not saying that all attorneys need to be able to translate calculus algorithms into Matlab code.  What I am saying is that attorneys would more readily embrace the technology if vendors spent more time explaining how the most critical components of the tool work.  Instead, I have listened to many presentations where a sales team glosses over how their base LSI, PLSI or SVM algorithms work to classify the documents, in favor of spending most of the meeting focusing on the less essential bells and whistles of their tool.

I see two possible reasons for this.  The first is that the sales people making the pitch don’t have a thorough understanding of the core technology behind their product.  Instead, they have a script to follow, and struggle whenever they have to deviate.  The second is that they understand it, but don’t bother to explain for fear that the attorneys in the audience would become lost and disinterested.  These are both unfortunate situations that I fear happen all too often.  We should all know by now that attorneys are not very good at adopting anything that they don’t understand, and rightfully so.  They are the ones who are going to have to defend the use of it when things go sideways.

I feel that what is most needed is an honest accounting of the strengths and weaknesses of the core technology, in practical terms that can allay the logical (or in some cases illogical) fears of the attorneys who need to use these tools. The various tools are all too often pitched as a one-size-fits-all solution by sales people who really don’t understand the practical applications and limitations of the science as applied to document review. This was understandable initially, as everyone was in a rush to get to market with their offering. What I can’t understand is why this problem persists after a few years of trying to sell it.

One of my goals for the past few years has been to remedy this problem as much as personally possible.  I have tried to do this by going outside of my comfort zone to investigate the technology behind the tools, and then combine that knowledge with my practical eDiscovery experience in an effort to identify functional strengths and weaknesses.  With that frame of mind and goal, I submit the following points. I feel that these points are some of the best practical guidance I have received over the past few years, with my occasional thoughts and opinions mixed in.  Please have a look and, especially if you disagree, share your thoughts and some educational nuggets with the rest of us.

Predictive Coding Technology and Training the Machine

  • The most fundamental action of the algorithms behind the various predictive coding tools is to analyze only the text on the face of a given document when attempting to determine relationships between that document, seed set documents, and the other non-categorized documents in the review corpus.
  • Dates and other similar data consisting of numbers on the face of a document cannot be considered by the machine when attempting to determine relevance.
  • Non-text-of-document metadata can be used to filter and cull document sets, but it is not part of the core algorithmic analysis of any given document.
  • I have heard many claims that metadata analysis is part of predictive coding “technology”, but I have yet to hear anything to convince me that its role is anything more than that of a filter applied as part of the predictive coding “process.”
  • I voiced this opinion when speaking with an “expert” last year and he agreed that, while metadata filtering can certainly be incorporated into the overall process, anyone suggesting that metadata is part of the algorithmic analysis of a given document misunderstands how the core algorithmic analysis occurs.
  • I feel very strongly that the role of non-text-of-document metadata needs to be crystal clear so attorneys understand what factors should be considered when training the machine for relevance and non-relevance.  Without a crystal clear understanding of which data fields the machine is studying to determine conceptual relationships, one cannot possibly hope to effectively and consistently train the machine.
  • The text of the documents to be analyzed via predictive coding must be indexed prior to analysis.
  • Only documents that serve as “good” examples (i.e., lots of text) should be used to train the machine during the predictive coding process.
  • Relevance of a document has nothing to do with whether it is a “good” example for training the machine.  Whether a document is a “good” example is determined by the richness of text in that document.  It helps to have “good” examples of both relevant and non-relevant documents for training.
  • When assessing whether a document is a “good” training example, you should always compare what you see in the “viewer” of a given tool with what you see in the extracted text view.  What you see in a viewer may actually be an image of a document, and thus no text exists for analysis unless that image text is captured via OCR.
  • Numbers are not indexed for the purposes of analyzing and determining conceptual relationships in predictive coding.
  • Headers and footers (i.e., boilerplate at the bottom of an e-mail) are also generally not indexed for analysis.
  • Spreadsheets and other similar documents need to be considered very carefully during the training process.  Spreadsheets that contain mostly numbers are “bad” training examples.  Spreadsheets with lots of text may possibly be “good” examples, but keep in mind that the formatting of a document is not considered when assessing contextual relationships between words within a document via predictive coding.  This can lead to some “confusion” for the machine where large, text-heavy spreadsheets are broken up into cells or tabs of varied subject matter.
  • You’ll need to do some digging early on to decide if it is possible to make a categorical decision about whether spreadsheets can be effectively reviewed by the machine, or whether you will need to review them manually.
  • When using an analytics feature like Relativity’s Excerpt Text, limit its use to instances where you have twenty or more words available for training. Selecting only a key word or short phrase will not provide enough text for a feature like Excerpt Text to work effectively.
  • A combination of random sampling and keyword searching should be used to gather seed set documents to train the machine.  Training on documents gathered through random sampling will help to avoid the “more like this” effect, where you get back only a few document types.  Supplementing random sampling with documents gathered through keyword search will likely speed up the process, and will likely reduce the number of iterations required.
  • Culling data sets that will be analyzed with predictive coding tools through the use of non-text-of-document metadata or keyword filtering is not something to be done lightly.  My understanding is that culling swaths of non-relevant documents and training on only relevant or interesting documents will limit the type and subject matter of relevant documents returned, and will lead to an incomplete understanding of the case and available evidence.
  • That is not to say that custodial collections should never be trimmed with metadata filtering.  Assuming a corpus of sufficient size will remain; narrowing a collection based upon overly safe date ranges is not likely to have a negative impact on training since predictive coding algorithms do not understand dates for training purposes.  This type of filtering may also lessen the chance of a trainer mistakenly labeling documents as not relevant based on things like document date, rather than correctly assessing relevance based on the facial text of a document.
  • It is critical that the subject matter experts of a case, who are responsible for training the various predictive coding systems, do not confuse responsiveness and relevance during training.
  • Consideration of whether a document is “responsive” during a particular review will often not be limited to just the facial text of that document.  Responsiveness determinations will also usually be subject to several secondary considerations, including document date.
  • Treating a content-relevant document as irrelevant because it is outside the responsiveness date range will do significant harm.  It will confuse the machine by contradicting similar content you have previously, and rightfully, instructed the machine to understand as relevant.
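The points above about numbers, dates, and facial text can be illustrated with a minimal Python sketch. This is a deliberate simplification, not any particular engine's implementation: real conceptual indexes apply additional filtering, and the tokenizer below is hypothetical.

```python
import re

def index_tokens(text):
    """Crude stand-in for a conceptual index's tokenizer:
    keep alphabetic terms only, so numbers and dates drop out."""
    return re.findall(r"[a-z]+", text.lower())

doc = "Invoice 4512, dated 03/14/2009: payment of $1,200 for consulting."
print(index_tokens(doc))
# the invoice number, date, and dollar amount all vanish;
# only the conceptual terms remain for training
```

Because the machine only ever sees what survives this kind of tokenization, a trainer who codes documents based on their dates is supplying signals the algorithm cannot perceive – which is exactly how contradictory training arises.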

What (I Think) I Know About De-duplication – Corrections Welcomed



This is one of the most important tools in the ESI processing arsenal.  There are essentially three methods used to effectively de-duplicate document collections.  They each present their own set of strengths and weaknesses.


Hash Values Generally

  • Hash values are unique alphanumeric codes generated by a number of available algorithmic tools.
  • The hash value for a particular document, or collection of documents, is based on the layout of the bit content of that document or collection, including the Application Metadata contained within a given document or collection of documents.
    • Hash values can be generated for a single sentence of text, a paragraph, a document in its entirety, a family of e-mails and attachments, or for all the data stored on a single hard drive or server.
    • If even one bit is changed within the specified data, an entirely new value would be assigned by the same algorithmic tool.
      • Bit flipping is when a single bit changes from 0 to 1, or vice versa, without having been manipulated to do so.  Even this seemingly small change will dramatically affect the previously assigned hash value.
      • The two most common algorithmic tools used to generate hash values are MD5 and SHA-1.
        • While these algorithms may now have been “broken,” they have been broken only in the sense that researchers can deliberately engineer a collision between two different documents.
        • No one has been able to reverse engineer the content of a document from a known hash value at this point.
        • Different tools will generate different hash values for the same exact document or collection of documents.
          • Using the same tool on the same document should always produce identical hash values, regardless of the party running the hashing tool.  “Identical” here includes all application metadata, not just what is seen on the face of the document.
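These properties are easy to demonstrate with Python's standard hashlib module, which provides both MD5 and SHA-1:

```python
import hashlib

data = b"Quarterly report, final draft."

# Same input, same algorithm -> the same digest, every time
print(hashlib.md5(data).hexdigest())
print(hashlib.sha1(data).hexdigest())

# Flip a single bit of the input and the digest changes completely
flipped = bytes([data[0] ^ 1]) + data[1:]
print(hashlib.md5(flipped).hexdigest())
```

Note that MD5 and SHA-1 digests have different lengths (32 and 40 hex characters, respectively), which is one obvious way the same document yields different values under different tools.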


Custodial (Vertical) De-duplication vs. Global (Horizontal) De-duplication

  • Global de-duplication will retain the first copy of all loose electronic documents, duplicate e-mails or document families encountered during the process.
  • Subsequent copies of those documents or families with matching hash values will be removed from the database or will not be added to the review database during processing, depending upon when de-duplication occurs.
    • It would seem the more cost-effective approach is to de-duplicate prior to the other processing steps and before loading the data into a review database.
    • Global de-duplication is not the common practice at this point.
    • Custodial de-duplication will likely result in there being multiple copies of the same document across the various custodial collections in the database.
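The global-versus-custodial distinction comes down to the key used to detect duplicates. A minimal Python sketch (the document records and field names here are hypothetical, for illustration only):

```python
import hashlib

docs = [
    {"custodian": "smith", "content": b"Meeting notes 1/5"},
    {"custodian": "jones", "content": b"Meeting notes 1/5"},  # cross-custodian duplicate
    {"custodian": "jones", "content": b"Budget draft"},
    {"custodian": "jones", "content": b"Budget draft"},       # same-custodian duplicate
]

def dedupe(docs, scope="global"):
    seen, kept = set(), []
    for d in docs:
        h = hashlib.md5(d["content"]).hexdigest()
        # Global (horizontal) dedup keys on the hash alone;
        # custodial (vertical) dedup keys on custodian + hash.
        key = h if scope == "global" else (d["custodian"], h)
        if key not in seen:
            seen.add(key)
            kept.append(d)
    return kept

print(len(dedupe(docs, "global")))     # 2 - the cross-custodian copy is removed
print(len(dedupe(docs, "custodial")))  # 3 - smith and jones each keep a copy
```

The first-encountered copy wins in both modes, which is why processing order matters when deciding whose copy of a shared document survives.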


Whole Document Hash Value De-duplication (WDHVD – exact duplicates)

  • The constant and almost entirely unique nature of hash values for unaltered documents allows for the use of those hash values to locate and remove exact duplicate documents from a review database.
  • For this method of de-duplication all documents in a collection are run through the selected algorithmic tool and an index of hash values is created.
  • WDHVD will typically be able to identify many exact duplicates of loose electronic documents in a collection based upon duplicate hash values.
    • Even though these documents are technically “exact duplicates,” additional steps should be taken to ensure that critical data is not missed during review.
      • Example – documents with different file names can still be identical duplicates for the purposes of obtaining a hash value.
      • The file name is system metadata that is not part of the standard hash computation / analysis.
      • This could be a cause for concern where a document with a damning file name is excluded as a duplicate based on the hash value of a document encountered earlier in the de-duplication process, but later somehow discovered by opposing counsel.
  • Additional metadata fields should be included in the hash computation / analysis to ensure that such items are not cast out of the collection as duplicates.
  • Additional metadata fields should also be included in the WDHVD analysis of document families (e-mail families, zip files, etc.).
    • Family members should never be eliminated as duplicates based upon standalone document hash values.
    • A family member is only truly considered a duplicate if the exact family exists elsewhere in the collection.
    • Generating hash values for families of documents likely decreases the chances of successfully detecting exact duplicate families.
    • Analyzing bundled documents introduces more variables, where even the slightest difference (even just an extra space in one document) will result in a dramatically different hash value.
    • Review of these duplicate families is much better than some of the imaged alternatives though.
    • I suspect there are few true duplicate families when de-duping vertically, as opposed to horizontally.
    • Strengths – very reliable and exact form of de-duplication.
    • Weaknesses – lacks flexibility in its rigid determination of exact bit by bit duplicates.
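One way to hash a family as a unit – a hypothetical sketch, not any particular processing tool's method – is to fold the ordered hashes of each member into a single family-level digest:

```python
import hashlib

def family_hash(member_texts):
    """Hash an e-mail family as a unit: the family is a duplicate
    only if every member matches, in the same order."""
    h = hashlib.sha1()
    for text in member_texts:
        h.update(hashlib.sha1(text).digest())
    return h.hexdigest()

family_a = [b"Parent e-mail body", b"attachment.xlsx bytes"]
family_b = [b"Parent e-mail body", b"attachment.xlsx bytes "]  # one extra space

print(family_hash(family_a) == family_hash(family_b))  # False
```

This illustrates the rigidity noted above: a single extra space in one attachment makes the whole family a non-duplicate, even though the parent e-mails match exactly.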

Document Segment Hash Value De-duplication (DSHVD – near duplicates)

  • This method functions much like the WDHVD approach, but with a bit more flexibility.
  • I suppose the proper definition of this process would be any attempt to de-duplicate a collection of documents by hashing some selected subset of the application metadata that is less than the full available set.
    • An example of this would be to exclude the Sent On Time metadata for a set of e-mail because Sent On times can vary by a few tenths or hundredths of a second.
    • In the strictest sense, even those small differences mean the documents are not duplicates.
    • Selecting all other commonly used e-mail metadata fields (i.e., text of document, e-mail subject, sent on date, etc.) for hash analysis will provide reasonable certainty that the documents are in fact duplicates.
    • If you are reduced to examining only the text of the document then I believe you have crossed the line in the text comparison de-duplication category.
    • One of the major weaknesses of WDHVD is its rigidity in determining duplicates.
    • DSHVD is a bit more flexible in that it analyzes only specified segments of the document structure and creates a hash value based on the bits of those collective segments.
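The segment-selection idea can be sketched in a few lines of Python. The field names and e-mail records below are hypothetical; the point is simply that hashing a chosen subset of fields lets near-identical e-mails collide while a full-field hash would not:

```python
import hashlib

def segment_hash(doc, fields):
    """Hash only the chosen metadata segments, e.g. everything
    except the sent-on time."""
    h = hashlib.md5()
    for f in fields:
        h.update(str(doc.get(f, "")).encode())
        h.update(b"\x00")  # separator so field values cannot run together
    return h.hexdigest()

email_1 = {"subject": "Re: Q3", "body": "See attached.", "sent": "10:01:03.120"}
email_2 = {"subject": "Re: Q3", "body": "See attached.", "sent": "10:01:03.450"}

fields = ["subject", "body"]  # deliberately exclude the sent-on time
print(segment_hash(email_1, fields) == segment_hash(email_2, fields))  # True
```

Including the sent-on time in the field list would make the two hashes differ, reproducing the WDHVD rigidity described above.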


Text Comparison De-duplication (near duplicates)

  • This approach involves analysis of the text of the document field(s) to determine the level of similarity between documents.
  • Common approaches include comparing “chunks” of shared text (with a couple of scoring variants) and counting shingles.
    • For the first approach you can divide the number of similar words found in both documents by the total number of words found in the longer document to provide you with a value of similarity between 0 and 1.
      • An example would be to compare two documents: the longer has 99 words, the shorter has 54 words, and the common chunk of text they share is 23 words.
      • [23 / 99 = 0.23]
      • The weakness here is that the similarity value does not depend on the length of the shorter document.
      • If multiple documents have the same common text, the similarity value will be the same when comparing the longest document to either of the two shorter ones.
      • This fails to account for one of the shorter documents containing more unique (not seen in the longest document) text than the other.
  • The second approach is much like the first, but employs a slightly different equation to determine similarity.
    • The equation divides the total number of words the documents have in common (23) by the total number of words found in both documents (153) minus the words in common. [23 / ((99 + 54) – 23) = 0.18]
    • This is a less optimistic approach than the option outlined above.
  • A third approach is counting “shingles” of similar text that appear within two documents (http://en.wikipedia.org/wiki/MinHash)
    • This approach entails setting a (usually small) minimum requirement for the number of consecutive words two documents must have in common to have a matching “shingle” (e.g., five consecutive matching words).
    • The similarity of two given documents is determined by dividing the number of shingles the two documents have in common by the total number of unique shingles found across both documents (http://en.wikipedia.org/wiki/W-shingling#Resemblance)
    • Short documents generally give very low similarity scores when counting shingles.
    • The MinHash algorithm discards duplicate shingles encountered within a document.
    • Determining similarity with the MinHash algorithm is closer to an information matching approach, than a literal matching approach.
    • MinHash reduces overall workload by selecting a subgroup of shingles for hashing and comparison.  This provides more of an approximation of similarity than other methods that fully analyze all common shingles.
  • Consider Literal and Information matching approaches for all of these options.
    • Literal matching requires one to one matching of similar text blocks / shingles as a condition of determining similarity.
    • Information matching allows for one vs. all matching of duplicate text blocks or shingles.  A document with only one block of repeating text will be 100% similar to another document as long as the full matching block appears within the longer document, even if it appears only once in the longer document.
    • Information matching seems like a more sensible approach.  In the case of smaller shingles, it allows two overlapping shingles to be counted as 100% duplicates, while literal matching will find only one matching shingle because the words remaining after the first match don’t meet the minimum shingle size (e.g., 5-word minimum shingle size, 9 words in common).
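The worked numbers above, plus a tiny shingling example, can be checked in a few lines of Python (the sample sentences are made up for illustration):

```python
def chunk_similarity(common, longer):
    # First approach: common words over the longer document's length
    return common / longer

def jaccard_similarity(common, len_a, len_b):
    # Second approach: common words over the union of both documents
    return common / (len_a + len_b - common)

print(round(chunk_similarity(23, 99), 2))        # 0.23
print(round(jaccard_similarity(23, 99, 54), 2))  # 0.18

def shingles(words, k=5):
    """Consecutive k-word shingles; duplicates collapse into a set,
    mirroring MinHash's discarding of repeated shingles."""
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

a = "the quick brown fox jumps over the lazy dog tonight".split()
b = "a quick brown fox jumps over the lazy dog today".split()
common = shingles(a) & shingles(b)
union = shingles(a) | shingles(b)
print(round(len(common) / len(union), 2))  # 0.5
```

As the two scores for the same 23-word overlap show, the union-based formula is always the more pessimistic of the two, since its denominator can only grow as the shorter document adds unique text.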