Predictive Coding Confusion

Clustify Blog - eDiscovery, Document Clustering, Predictive Coding, Information Retrieval, and Software Development

This article looks at a few common misconceptions and mistakes related to predictive coding and confidence intervals.

Confidence intervals vs. training set size:  You can estimate the percentage of documents in a population having some property (e.g., is the document responsive, or does it contain the word “pizza”) by taking a random sample of the documents and measuring the percentage having that property.  The confidence interval tells you how much uncertainty there is due to your measurement being made on a sample instead of the full population.  If you sample 400 documents, the 95% confidence interval is +/- 5%, meaning that 95% of the time the range from -5% to +5% around your estimate will contain the actual value for the full population.  For example, if you sample 400 documents and find that 64 are relevant (16%), there is a 95% chance that the range 11% to 21% will…

View original post 1,626 more words

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s