About Joshua

Attorney and eDiscovery Consultant

Reviewing Native Excel Files, Part 1: Detecting Inconsistent Formulas

Looking forward to seeing all of these posts…

Excel Esquire

This is the first in a series of posts about reviewing native Excel files produced by parties in litigation.  We’ve finally reached a tipping point in litigation where the production of native Excel files (rather than inscrutable thousand-page printouts) is the rule rather than the exception.  Discovery stipulations now routinely contain a provision that calls for Excel files to be produced natively (does yours?), and Magistrate Judge Facciola famously observed that tiffing out electronic documents such as spreadsheets is “madness” (Covad Commc’ns. Co. v. Revonet, Inc., 2009 WL 2595257 (D.D.C. Aug. 25, 2009)).  The question for practicing lawyers today is how to review those files, and how to exploit the wealth of information they often contain.

Today we look at Excel’s built-in feature that flags inconsistent formulas, and see how that feature can call attention to potentially critical information lurking beneath the surface.

Suppose your client is a plaintiff in…

View original post 352 more words
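Excel's own consistency check lives inside the application, but the idea behind it can be approximated outside Excel. Here is a minimal sketch of my own (not the Excel Esquire author's method): strip row numbers from cell references so that formulas filled down a column share one pattern, then flag any cell that breaks the majority pattern. Function names containing digits would need extra handling; this is illustration only.

```python
import re
from collections import Counter

def formula_pattern(formula):
    # Replace the row number in each cell reference (A1, $B$2, C10)
    # with a placeholder so formulas filled down a column look alike.
    return re.sub(r'(\$?[A-Z]{1,3}\$?)\d+', r'\1#', formula)

def flag_inconsistent(formulas):
    # Return the indexes of formulas that deviate from the majority pattern.
    patterns = [formula_pattern(f) for f in formulas]
    majority, _ = Counter(patterns).most_common(1)[0]
    return [i for i, p in enumerate(patterns) if p != majority]

column = ["=A1+B1", "=A2+B2", "=A3+100", "=A4+B4"]
print(flag_inconsistent(column))  # [2] -- the hard-coded value breaks the fill pattern
```

The hard-coded `100` in the third cell is exactly the kind of quiet override that Excel's green-triangle warning surfaces during a native review.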

Can You Really Compete in TREC Retroactively?

Another great post from Bill Dimm.

Clustify Blog - eDiscovery, Document Clustering, Technology-Assisted Review (Predictive Coding), Information Retrieval, and Software Development

I recently encountered a marketing piece where a vendor claimed that their tests showed their predictive coding software demonstrated favorable performance compared to the software tested in the 2009 TREC Legal Track for Topic 207 (finding Enron emails about fantasy football).  I spent some time puzzling about how they could possibly have measured their performance when they didn’t actually participate in TREC 2009.

One might question how meaningful it is to compare to performance results from 2009 since the TREC participants have probably improved their software over the past six years.  Still, how could you do the comparison if you wanted to?  The stumbling block is that TREC did not produce a yes/no relevance determination for all of the Enron emails.  Rather, they did stratified sampling and estimated recall and prevalence for the participating teams by producing relevance determinations for just a few thousand emails.

Stratified sampling means that the…

View original post 1,680 more words
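For readers who want the arithmetic behind a stratified estimate, here is a minimal sketch (the strata sizes and counts are invented for illustration, not TREC 2009's actual numbers): each stratum's sample relevance rate is scaled up to that stratum's population, and the scaled counts are summed to estimate the total number of relevant documents, from which prevalence follows.

```python
# Each stratum: (population_size, sample_size, relevant_found_in_sample).
# Numbers are hypothetical, chosen only to show the mechanics.
strata = [
    (100_000, 500, 5),    # low-ranked documents: rarely relevant
    (10_000, 500, 50),    # mid-ranked documents
    (1_000, 500, 400),    # top-ranked documents: mostly relevant
]

def estimate_total_relevant(strata):
    # Scale each sample's relevance rate up to its stratum population.
    return sum(N * r / n for N, n, r in strata)

total_relevant = estimate_total_relevant(strata)
population = sum(N for N, _, _ in strata)
print(total_relevant)               # 2800.0
print(total_relevant / population)  # prevalence of roughly 0.025
```

The stumbling block the post describes follows directly: the estimate covers team-level totals, but a vendor arriving later has no per-document yes/no answer for the hundreds of thousands of emails that were never sampled.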

The Kind of Stuff I Think About Late At Night…

The most critical component of the predictive coding exercise is training of the system.  The whole point of this component is to separate relevant content from non-relevant content.  The point is most definitely not to separate the responsive documents from the non-responsive documents.  These are two very different standards.  Separating responsive documents from non-responsive documents usually requires not only identification of non-relevant content, but also dissecting relevant content to meet responsiveness requirements.  The latter is all too often where the training process goes wrong.

One of the more beneficial goals of using predictive coding software is the ability to accurately identify and eliminate non-relevant documents from the review universe.  With that in mind, system trainers need to remember that they should avoid dismissing relevant content because it does not meet responsiveness requirements.  I know that has been said thousands of times, but it needs to be said again and again.

Responsiveness in your case may hinge upon whether a particular widget that was manufactured in Seattle was red.  If a system trainer then dismisses an e-mail as non-relevant because it discusses blue widgets made in Seattle, they are confusing the system and hindering the process.  To truly get the most out of the process you must include the blue Seattle widgets discussion as relevant, and likely also include discussion about widgets of other colors manufactured in other cities.  Discussion of the manufacture of widgets is relevant content.  Whether they were made in Seattle and whether they were red will determine whether the document is responsive.
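To make the two standards concrete, here is a toy sketch (simple keyword rules standing in for a trained model): the relevance concept the system learns is broad, and responsiveness is a narrower filter applied on top of it, not baked into the training labels.

```python
import re

def is_relevant(doc):
    # Broad training concept: any discussion of widget manufacturing.
    d = doc.lower()
    return "widget" in d and "manufactur" in d

def is_responsive(doc):
    # Narrower production standard: red widgets made in Seattle.
    # \b avoids matching the "red" inside "manufactured".
    d = doc.lower()
    return is_relevant(d) and "seattle" in d and re.search(r"\bred\b", d) is not None

blue = "Report on blue widgets manufactured in Seattle"
red = "Memo: red widgets manufactured in Seattle"
print(is_relevant(blue), is_responsive(blue))  # True False
print(is_relevant(red), is_responsive(red))    # True True
```

The blue-widget email trains the relevance concept even though it will never be produced; collapsing the two tests into one is exactly the training mistake described above.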

The Path to E-Mail Production IV, Revisited

Ball in your Court

This is the ninth in a series revisiting Ball in Your Court columns and posts from the primordial past of e-discovery–updating and critiquing in places, and hopefully restarting a few conversations.  As always, your comments are gratefully solicited.

The Path to Production: Are We There Yet?

(Part IV of IV)

[Originally published in Law Technology News, January 2006]

The e-mail’s assembled and accessible.  You could begin review immediately, but unless your client has money to burn, there’s more to do before diving in: de-duplication. When Marge e-mails Homer, Bart and Lisa, Homer’s “Reply to All” goes in both Homer’s Sent Items and Inbox folders, and in Marge’s, Bart’s and Lisa’s Inboxes.  Reviewing Homer’s response five times is wasteful and sets the stage for conflicting relevance and privilege decisions.

Duplication problems compound when e-mail is restored from backup tape.  Each tape is a snapshot of e-mail at a moment…

View original post 776 more words
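The deduplication Ball describes is typically implemented by hashing the identifying fields of each message and keeping only the first occurrence of each hash. A minimal sketch, with the caveat that the field choice here is illustrative; production tools key on Message-ID and/or hashes of normalized metadata:

```python
import hashlib

def message_key(msg):
    # The same email collected from five mailboxes should collapse to
    # one key, so hash identifying message fields, not the container file.
    basis = "|".join([msg["from"], msg["date"], msg["subject"], msg["body"]])
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()

def dedupe(messages):
    # Keep the first copy of each unique message, drop the rest.
    seen, unique = set(), []
    for m in messages:
        key = message_key(m)
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique

reply = {"from": "homer", "date": "2006-01-05", "subject": "Re: dinner", "body": "D'oh"}
# Homer's "Reply to All" shows up in five mailboxes; one copy survives review.
print(len(dedupe([reply] * 5)))  # 1
```

One review decision per unique message is the payoff: no conflicting relevance or privilege calls on the same text.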

Detecting Hidden Data In Office Documents — Soapbox (Part 1 of 5)

This is the first part of a five-part entry with practical tips for identifying hidden or difficult-to-detect data in Microsoft Office documents.  Part one is where I spend a bit of time on my soapbox talking about why this issue is important to me.  Parts two through five will discuss some quick practical tips for identifying and reviewing difficult data that I have picked up over the years while doing and managing the work.  I am certainly not the first to write a post on this topic (see Accessing hidden metadata in Office documents for eDiscovery and There’s Hidden, and Very Hidden Data in Excel), and I certainly don’t claim to bring any groundbreaking information to the game.  I just felt like writing about something, and this topic seemed like a good place to start.  I hope you enjoy these posts and find them to be of some use.

When I was a manager of document review projects, there was one thing that kept me awake at night more than any other.  It wasn’t the sometimes impossible client- or partner-driven deadlines.  It wasn’t the difficulty of managing groups of attorneys, most of whom no longer had the will or morale to be concerned with improving their eDiscovery skills or knowledge after years of soulless and thankless work.  That particular problem was certainly the second most concerning issue I encountered, and it led me to become very passionate about how people should be managed, developed, and mentored.  But that is a topic for another time and another post in the near future.

The concern that kept me awake more than any other was the fear that someone on my team would miss something, a critical fact for the case or some indicator of privilege, because they didn’t know how to spot the signs of hidden content, and didn’t know how to root it out when they did.  As anyone who has been in the industry well knows, review is a delicate balance of speed and precision, and you sometimes need to favor one at the expense of the other depending upon the circumstances of a particular assignment.  Speed is always appreciated, and reviewers not keeping pace will be dropped.  A blatant lack of precision in favor of speed usually becomes obvious after a short time, and those reviewers will also be dropped.  These are the nuts and bolts of grading document reviewers, and it is easy to get lost in the all too unreliable metrics.  But the metrics are a somewhat superficial evaluation of talent, and they more or less establish acceptable minimums rather than evaluate top-end performers.  The numbers make it easy to put people in boxes with regard to their performance, and that makes us feel like we are managing.  There is a need to get beyond those numbers to see who really “gets it” and to leverage that understanding to inform the decisions of the rest of the team.  It is no easy task, however, when most cases are working under very difficult timelines, and with a temporary project goal as the focus.

As I alluded to above, the world of document review is also a very transient one.  Project assignments are often short-term, and often on short notice.  I worked in the industry for about nine years, and I worked with at least a few hundred attorneys on various projects during that time.  I know that might not sound like much, but I was on only a couple of temporary assignments before having the good fortune to land a permanent position with a great firm for seven of those nine years.  Even with a permanent position, there were always many attorneys coming and going.  The constant exposure to all of these different project groups really opened my eyes to the unbelievably wide range of computer literacy and Microsoft Office skills among contract attorneys in the industry.  While the quick and necessary answer to a lack of computer literacy was often to try to teach the individual enough to get them functioning and ready for review, I became increasingly obsessed with the contradiction of having these same individuals review documents that all too often require some specialized knowledge just to make sure they can see what the document is really saying.  Whether idealistic or not, I became fixated during the last few years of my time in that role upon the idea that these attorneys must be taught the skills necessary to effectively review documents.

Now, I am not saying that every document has some kind of secret hidden code embedded where only a select few know how to find it.  What I am saying is that on any given day, on any given project, a reviewer is likely to encounter Word, PowerPoint, or Excel documents where some significant amount of data is hidden or obscured from view.  The intent in “hiding” this data is not always something sinister.  In fact, it is usually quite the opposite.  Data within a spreadsheet is typically hidden from view with the very simple intention of presenting a cleaner and more attractive document that is not full of distractions.  I am also not here to entertain the argument that data processing tools make it so that reviewers don’t need to worry much about this issue since almost all content can be “un-hidden” during processing.  As someone who has been there and done that, I can say with certainty that this assertion is false.  Some of the work can be done during processing, but a reviewer still must be able to at least spot signs of potentially concerning hidden data.
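As a concrete example of the “very hidden” variety: an .xlsx file is just a zip archive, and sheet visibility is recorded in xl/workbook.xml as state="hidden" or state="veryHidden" (the latter cannot be un-hidden from Excel’s right-click menu at all).  Here is a quick sketch that surfaces those flags without opening Excel; it assumes, as Excel typically writes it, that the name attribute precedes state in each sheet element.

```python
import re
import zipfile

def find_hidden(workbook_xml):
    # Match <sheet ... name="..." ... state="hidden|veryHidden" ...> entries.
    pattern = r'<sheet\b[^>]*name="([^"]+)"[^>]*state="(hidden|veryHidden)"'
    return re.findall(pattern, workbook_xml)

def hidden_sheets(xlsx_path):
    # An .xlsx is a zip; the workbook part lists every sheet and its state.
    with zipfile.ZipFile(xlsx_path) as z:
        return find_hidden(z.read("xl/workbook.xml").decode("utf-8"))

sample = ('<sheet name="Summary" sheetId="1" r:id="rId1"/>'
          '<sheet name="Backup" sheetId="2" state="hidden" r:id="rId2"/>'
          '<sheet name="Secrets" sheetId="3" state="veryHidden" r:id="rId3"/>')
print(find_hidden(sample))  # [('Backup', 'hidden'), ('Secrets', 'veryHidden')]
```

A reviewer who knows only the rendered view of this workbook would see the Summary sheet and nothing else.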

It was with these thoughts rattling around my head that I set out to develop a presentation a couple of years ago to teach our team members about the dangers of hidden data in Office documents, and how to mitigate the associated risks through effective identification and review of that data.  I considered that initial presentation to be a success, even though it definitely bored some in the audience to sleep over the course of those 50 minutes.  However, that presentation, in both format and substance, was not really something that could be shared publicly for a variety of reasons.  So, I figured it is about time that I take what I consider to be the most important aspects of the presentation and distill them down into something fit for public consumption.  That is my goal for parts two through five of this entry.  If I have not bored you to sleep yet with my rambling, please do stay tuned and I will try to get that entry in order.  This really is a critical eDiscovery topic in my mind, and I think it is issues like this one that are at the heart of many of the opinions authored in recent years expressing concern over attorney eDiscovery competency.  Some small part of me wants to think that this should be a non-issue these days, but just like the inability to effectively image Excel documents, I am afraid it lives on.  Thanks again for reading thus far.

Stay tuned…

Visualizing Data in a Predictive Coding Project

e-Discovery Team ®

This blog will share a new way to visualize data in a predictive coding project. I only include a brief description this week. Next week I will add a full description of this project. Advanced students should be able to predict the full text from the images alone. Study the text and try to figure out the details of what is going on.

Soon all good predictive coding software will include visualizations like this to help searchers to understand the data. The images can be automatically created by computer to accurately visualize exactly how the data is being analyzed and ranked. Experienced searchers can use this kind of visual information to better understand what they should do next to efficiently meet their search and review goals.

For a game try to figure out how the high and low number of relevant documents that you must find in this review project to…

View original post 560 more words

Use Excel to Count the Number of Emails in Each Email Chain

Excel Esquire

Courts and litigants have long struggled with the question of how to describe email chains on a privilege log.  Should you log only the most recent email, or log every email in the chain–or something in between?  New York has recently adopted a potentially burdensome rule on this topic–one that cries out for an Excel solution.

Effective September 2, 2014, Commercial Division Rule 11-b imposes new obligations on litigants in New York Supreme who create document-by-document privilege logs, as opposed to the now-preferred “categorical privilege logs.” See here to read the rule.  Among other things, entries for email chains should now indicate “the number of e-mails within the dialogue.”  Rule 11-b (b)(3)(iii).  That means you can log only the most recent email in a given chain, but you need to also disclose how many emails are in the chain.

How, exactly, does one figure out the number of emails in every email…

View original post 751 more words
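Without giving away the original post’s Excel formula, the underlying idea can be sketched in a few lines: each embedded message in a quoted chain begins with its own header block, so counting lines that start with “From:” approximates the number of e-mails in the dialogue. (Assumption on my part: the extracted text preserves those header lines; the original post works the same trick with Excel text functions.)

```python
import re

def emails_in_chain(chain_text):
    # Each quoted message in the chain begins with a "From:" header line.
    return len(re.findall(r"^\s*From:", chain_text, flags=re.MULTILINE))

chain = """From: Marge
Subject: Re: Re: dinner
...

From: Homer
Subject: Re: dinner
...

From: Marge
Subject: dinner
..."""
print(emails_in_chain(chain))  # 3
```

That count is exactly the figure Rule 11-b (b)(3)(iii) asks you to log alongside the most recent e-mail in the chain.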

Preserving Gmail for Dummies

Ball in your Court

I posted here a year ago laying out a detailed methodology for collection and preservation of the contents of a Gmail account in the static form of a standard Outlook PST.  Try as I might to make it foolproof, downloading Gmail using IMAP and Outlook is tricky.  Happily since my post, the geniuses at Google introduced a truly simple, no-cost way to collect Gmail and other Google content for preservation and portability.  It sets a top-flight example for online service providers, and presages how we may use the speed, power and flexibility of Google search as a culling mechanism before exporting for e-discovery.

View original post 630 more words