This article looks at a few common misconceptions and mistakes related to predictive coding and confidence intervals.

**Confidence intervals vs. training set size**: You can estimate the percentage of documents in a population having some property (e.g., is the document responsive, or does it contain the word “pizza”) by taking a random sample of the documents and measuring the percentage having that property. The confidence interval tells you how much uncertainty there is due to your measurement being made on a sample instead of the full population. If you sample 400 documents, the 95% confidence interval is about +/- 5% (the worst case, which occurs when the true percentage is near 50%), meaning that 95% of the time the range from 5 percentage points below to 5 percentage points above your estimate will contain the actual value for the full population. For example, if you sample 400 documents and find that 64 are relevant (16%), there is a 95% chance that the range 11% to 21% will…
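
To make the arithmetic behind the 400-document figure concrete, here is a minimal sketch. It assumes the usual normal approximation for a sample proportion, with z = 1.96 for 95% confidence. The function names `margin_of_error` and `confidence_interval` are hypothetical helpers written for this illustration, not part of any predictive coding tool.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion.

    n: sample size
    p: proportion used for the width (0.5 is the worst case, giving the widest interval)
    z: z-score for the confidence level (1.96 is assumed here for 95%)
    """
    return z * math.sqrt(p * (1 - p) / n)


def confidence_interval(hits, n, z=1.96, conservative=True):
    """Return (low, high) for the population proportion, clipped to [0, 1].

    conservative=True uses p = 0.5 for the width (the simple "+/- 5% at 400" rule);
    conservative=False plugs in the observed proportion, giving a narrower interval
    when the observed proportion is far from 50%.
    """
    p_hat = hits / n
    p_for_width = 0.5 if conservative else p_hat
    moe = margin_of_error(n, p_for_width, z)
    return max(0.0, p_hat - moe), min(1.0, p_hat + moe)


if __name__ == "__main__":
    # Worst-case margin for a sample of 400 documents: about 0.049, i.e. roughly 5%.
    print(round(margin_of_error(400), 3))

    # 64 relevant documents out of 400 (16%), using the conservative width.
    low, high = confidence_interval(64, 400)
    print(round(low, 3), round(high, 3))
```

With the conservative width, 64 relevant documents out of 400 gives roughly 11% to 21%. Passing `conservative=False` uses the observed 16% instead of 50% and yields a narrower range, which shows that the +/- 5% rule is an upper bound for this sample size, not an exact figure.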
