Detecting Hidden Data In Office Documents — Soapbox (Part 1 of 5)

This is the first part of a five-part entry on practical tips for identifying hidden or difficult-to-detect data in Microsoft Office documents.  Part one is where I spend a bit of time on my soapbox talking about why this issue is important to me.  Parts two through five will discuss some quick, practical tips for identifying and reviewing difficult data that I have picked up over the years while doing and managing the work.  I am certainly not the first to write a post on this topic (see Accessing hidden metadata in Office documents for eDiscovery and There’s Hidden, and Very Hidden Data in Excel), and I certainly don’t claim to bring any groundbreaking information to the game.  I just felt like writing about something, and this topic seemed like a good place to start.  I hope you enjoy these posts and find them to be of some use.

When I was a manager of document review projects, one thing kept me awake at night more than any other.  It wasn’t the sometimes impossible client- or partner-driven deadlines.  It wasn’t the difficulty of managing groups of attorneys, most of whom no longer had the will or morale to be concerned with improving their eDiscovery skills or knowledge after years of soulless and thankless work.  That particular problem was certainly the second most concerning issue I encountered, and it led to me becoming very passionate about how people should be managed, developed, and mentored.  But that is a topic for another time and another post in the near future.

The concern that kept me awake more than any other was the fear that someone on my team would miss something, a critical fact for the case or some indicator of privilege, because they didn’t know how to spot the signs of hidden content, and did not know how to root it out when they did.  As anyone who has been in the industry well knows, document review is a delicate balance of speed and precision, and you sometimes need to prioritize one at the expense of the other depending upon the circumstances of a particular assignment.  Speed is always appreciated, and reviewers not keeping pace will be dropped.  A blatant lack of precision in favor of speed usually becomes obvious after a short time, and those reviewers will also be dropped.  These are the nuts and bolts of grading document reviewers, and it is easy to get lost in the all-too-unreliable metrics.  But the metrics are a somewhat superficial evaluation of talent, and more or less establish acceptable minimums rather than identify top-end performers.  The numbers make it easy to put people in boxes with regard to their performance, and that makes us feel like we are managing.  There is a need to get beyond those numbers to see who really “gets it” and to leverage that understanding to inform the decisions of the rest of the team.  It is no easy task, however, when most cases operate under very difficult timelines, with a temporary project goal as the focus.

As I alluded to above, the world of document review is also a very transient one.  Project assignments are often short-term, and on short notice.  I worked in the industry for about nine years, and I worked with at least a few hundred attorneys on various projects during that time.  I know that might not sound like much, but I was on only a couple of temporary assignments before having the good fortune to land a permanent position with a great firm for seven of those nine years.  Even with a permanent position, there were always many attorneys coming and going.  The constant exposure to all of these different project groups really opened my eyes to the unbelievably wide-ranging computer literacy and Microsoft Office skills of the average contract attorney in the industry.  While the quick and necessary answer to a lack of computer literacy was often to try to teach the individual enough to get them functioning and ready for review, I became increasingly obsessed with the contradiction that these same individuals were expected to effectively review documents that all too often require some specialized knowledge just to make sure they can see what the document is really saying.  Whether idealistic or not, I became fixated during the last few years in that role on the idea that these attorneys must be taught the skills necessary to effectively review documents.

Now, I am not saying that every document has some kind of secret hidden code embedded in it that only a select few know how to find.  What I am saying is that on any given day, on any given project, a reviewer is likely to encounter Word, PowerPoint, or Excel documents where some significant amount of data is hidden or obscured from view.  The intent in “hiding” this data is not always something sinister.  In fact, it is usually quite the opposite.  Data within a spreadsheet is typically hidden from view with the very simple intention of presenting a cleaner and more attractive document that is not full of distractions.  I am also not here to entertain the argument that data processing tools make it so that reviewers don’t need to worry much about this issue since almost all content can be “un-hidden” during processing.  As someone who has been there and done that, I can say with certainty that this assertion is false.  Some of the work can be done during processing, but a reviewer still must be able to at least spot signs of potentially concerning hidden data.

It was with these thoughts rattling around my head that I set out a couple of years ago to develop a presentation to teach our team members about the dangers of hidden data in Office documents, and how to mitigate the associated risks through effective identification and review of that data.  I considered that initial presentation to be a success, even though it definitely bored some in the audience to sleep over the course of those 50 minutes.  However, that presentation, in both format and substance, was not really something that could be shared publicly for a variety of reasons.  So, I figured it is about time that I take what I consider to be the most important aspects of the presentation and distill them down into something fit for public consumption.  That is my goal for parts two through five of this entry.  If I have not bored you to sleep yet with my rambling, please do stay tuned and I will try to get that entry in order.  This really is a critical eDiscovery topic in my mind, and I think it is issues like this one that are at the heart of many of the opinions authored in recent years expressing concern over attorney eDiscovery competency.  Some small part of me wants to think that this should be a non-issue these days, but just like the inability to effectively image Excel documents, I am afraid it lives on.  Thanks again for reading this far.

Stay tuned…

Visualizing Data in a Predictive Coding Project

e-Discovery Team ®

This blog will share a new way to visualize data in a predictive coding project. I only include a brief description this week. Next week I will add a full description of this project. Advanced students should be able to predict the full text from the images alone. Study the text and try to figure out the details of what is going on.

Soon all good predictive coding software will include visualizations like this to help searchers to understand the data. The images can be automatically created by computer to accurately visualize exactly how the data is being analyzed and ranked. Experienced searchers can use this kind of visual information to better understand what they should do next to efficiently meet their search and review goals.

For a game, try to figure out the high and low number of relevant documents that you must find in this review project to…

View original post 560 more words

Use Excel to Count the Number of Emails in Each Email Chain

Excel Esquire

Courts and litigants have long struggled with the question of how to describe email chains on a privilege log.  Should you log only the most recent email, or log every email in the chain–or something in between?  New York has recently adopted a potentially burdensome rule on this topic–one that cries out for an Excel solution.

Effective September 2, 2014, Commercial Division Rule 11-b imposes new obligations on litigants in New York Supreme Court who create document-by-document privilege logs, as opposed to the now-preferred “categorical privilege logs.”  See here to read the rule.  Among other things, entries for email chains should now indicate “the number of e-mails within the dialogue.”  Rule 11-b (b)(3)(iii).  That means you can log only the most recent email in a given chain, but you must also disclose how many emails are in the chain.

How, exactly, does one figure out the number of emails in every email…
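The excerpt cuts off before the original post’s Excel formula, but the general approach can be sketched in code.  This is an illustrative sketch only, not the post’s actual solution: it assumes each message in an email chain begins with a “From:” header line and counts those markers; the function name and marker choice are my own.

```python
def count_emails_in_chain(chain_text, marker="From:"):
    """Rough count of emails in a chain by counting header markers.

    Assumes each message in the chain starts with a 'From:' line.
    A chain with no markers still counts as one email.
    """
    return max(1, chain_text.count(marker))

# A hypothetical three-message chain, most recent email first.
chain = (
    "From: alice@example.com\nSubject: Re: Deal\n...\n"
    "From: bob@example.com\nSubject: Re: Deal\n...\n"
    "From: alice@example.com\nSubject: Deal\n...\n"
)
print(count_emails_in_chain(chain))  # 3
```

In Excel, the analogous trick is to compare the length of the cell’s text with its length after removing the marker, though the exact formula the post recommends is not shown in this excerpt.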

View original post 751 more words

Preserving Gmail for Dummies

Ball in your Court

I posted here a year ago laying out a detailed methodology for collection and preservation of the contents of a Gmail account in the static form of a standard Outlook PST.  Try as I might to make it foolproof, downloading Gmail using IMAP and Outlook is tricky.  Happily, since my post, the geniuses at Google introduced a truly simple, no-cost way to collect Gmail and other Google content for preservation and portability.  It sets a top-flight example for online service providers, and presages how we may use the speed, power and flexibility of Google search as a culling mechanism before exporting for e-discovery.

View original post 630 more words

Guest Blog: Talking Turkey

e-Discovery Team ®

EDITOR’S NOTE: This is a guest blog by Gordon V. Cormack, Professor, University of Waterloo, and Maura R. Grossman, Of Counsel, Wachtell, Lipton, Rosen & Katz. The views expressed herein are solely those of the authors and should not be attributed to Maura Grossman’s law firm or its clients.

This guest blog constitutes the first public response by Professor Cormack and Maura Grossman, J.D., Ph.D., to articles published by one vendor, and others, that criticize their work. In the Editor’s opinion the criticisms are replete with misinformation and thus unfair. For background on the Cormack Grossman study in question, Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, and the Editor’s views on this important research, see Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part One and Part Two and Part Three. After remaining…

View original post 4,281 more words

Latest Grossman and Cormack Study Proves Efficacy of Multimodal Search for Predictive Coding Training Documents and the Folly of Random Search – Part Two

e-Discovery Team ®

This is a continuation of my earlier blog with the same title: Latest Grossman and Cormack Study Proves Efficacy of Multimodal Search for Predictive Coding Training Documents and the Folly of Random Search – Part One. 

Latest Grossman Cormack Study

The information scientist behind this study is Gordon V. Cormack, Professor, University of Waterloo. He has a long history as a search expert outside of legal search, including special expertise in spam searches. The lawyer who worked with Gordon on this study is Maura R. Grossman, Of Counsel, Wachtell, Lipton, Rosen & Katz. In addition to her J.D., she has a Ph.D. in psychology, and has been a tireless advocate for effective legal search for many years. Their work is well known to everyone in the field.

The primary purpose of their latest study was not to test the effectiveness of training based on random samples. That was a…

View original post 2,865 more words

Predictive Coding Confusion

Clustify Blog - eDiscovery, Document Clustering, Technology-Assisted Review (Predictive Coding), Information Retrieval, and Software Development

This article looks at a few common misconceptions and mistakes related to predictive coding and confidence intervals.

Confidence intervals vs. training set size:  You can estimate the percentage of documents in a population having some property (e.g., is the document responsive, or does it contain the word “pizza”) by taking a random sample of the documents and measuring the percentage having that property.  The confidence interval tells you how much uncertainty there is due to your measurement being made on a sample instead of the full population.  If you sample 400 documents, the 95% confidence interval is +/- 5%, meaning that 95% of the time the range from -5% to +5% around your estimate will contain the actual value for the full population.  For example, if you sample 400 documents and find that 64 are relevant (16%), there is a 95% chance that the range 11% to 21% will…
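The arithmetic behind the excerpt’s numbers can be sketched with the standard normal approximation for a proportion.  The sketch below (my own illustration, not the post’s code) shows where the “+/- 5% for 400 documents” rule of thumb comes from: it is the worst-case half-width at p = 0.5, while the interval for the specific 16% estimate is somewhat tighter.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """95% confidence interval for a proportion (normal approximation)."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Worst-case half-width for n = 400 (at p = 0.5) yields the "+/- 5%" rule of thumb.
worst = 1.96 * math.sqrt(0.25 / 400)
print(f"worst-case half-width: {worst:.3f}")  # 0.049

# The sample-specific interval for 64 relevant out of 400 (16%) is a bit tighter.
lo, hi = proportion_ci(64, 400)
print(f"estimate 16%, CI: {lo:.3f} to {hi:.3f}")  # 0.124 to 0.196
```

Note the excerpt’s “11% to 21%” uses the conservative worst-case +/- 5%; the interval computed from the observed 16% is roughly 12% to 20%.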

View original post 1,626 more words

Latest Grossman and Cormack Study Proves Efficacy of Multimodal Search for Predictive Coding Training Documents and the Folly of Random Search – Part One

e-Discovery Team ®

There is a well-known joke found in most cultures of the world about a fool looking for something. This anecdote has been told for thousands of years because it illustrates a basic trait of human psychology, now commonly named after the joke itself: the Streetlight Effect. This is a type of observational bias where people look for whatever they are seeking only where it is easiest to look. This human frailty, when pointed out in the right way, can be funny. One of the oldest known forms of pedagogic humor illustrating the Streetlight Effect comes from the famous stories of Nasrudin, aka Nasreddin, an archetypal wise fool from 13th Century Sufi traditions. Here is one version of this joke attributed to Nasreddin:

One late evening Nasreddin found himself walking home. It was only a very short way, and upon arrival he could be seen to be upset about something. Alas, just then a…

View original post 2,133 more words

Fair Comparison of Predictive Coding Performance

Clustify Blog - eDiscovery, Document Clustering, Technology-Assisted Review (Predictive Coding), Information Retrieval, and Software Development

Understandably, vendors of predictive coding software want to show off numbers indicating that their software works well.  It is important for users of such software to avoid drawing wrong conclusions from performance numbers.

Consider the two precision-recall curves below (if you need to brush up on the meaning of precision and recall, see my earlier article):

[Figure: precision-recall curves for two different tasks]

The one on the left is incredibly good, with 97% precision at 90% recall.  The one on the right is not nearly as impressive, with 17% precision at 70% recall, though you could still find 70% of the relevant documents with no additional training by reviewing only the highest-rated 4.7% of the document population (excluding the documents reviewed for training and testing).

Why are the two curves so different?  They come from the same algorithm applied to the same document population with the same features (words) analyzed and the exact same random sample…

View original post 636 more words