This is Part 3 of a multi-part blog series that discusses some practical applications of sampling in electronic discovery, step by step.

- View Practical Applications of Random Sampling, Part 1 – Introduction
- View Practical Applications of Random Sampling, Part 2 – Estimating Prevalence
- Download the related white paper “Practical Applications of Sampling in eDiscovery”

As discussed previously, the size of the sample you should take is dictated by the strength of the measurement you want to achieve, the size of your dataset and the expected prevalence of relevant material within the dataset.

**DESIRED MEASUREMENT STRENGTH**

The strength of the measurement is expressed through two values: confidence level and confidence interval.

- Confidence Level is expressed as a percentage, and is a measure of how certain you are about the results you get. Or, said another way: if you took the same size sample the same way 100 times, how many times out of 100 would the resulting range capture the true value? Typically, you will be seeking a confidence level of 90%, 95% or 99%.

- Confidence Interval is also expressed as a percentage, and is a measure of how precise your results are. Or, said differently, how much uncertainty there is in your results. Typically you will be seeking a confidence interval between +/-2% (which is a total range of 4%) and +/-5% (which is a total range of 10%).
- The term confidence interval is sometimes used interchangeably with the term “margin of error.” The margin of error, however, is stated as one half of the confidence interval, just as a radius is one half of a diameter. For example, a margin of error of 2% refers to a confidence interval of +/- 2% (a 4% range).

For example, you might choose to take a measurement with a confidence level of 95% and a confidence interval of +/-2% to estimate prevalence. That measurement strength has been referenced in a variety of cases and articles as a potentially acceptable standard. If review of your sample revealed a prevalence of 50%, you could be 95% confident that the true prevalence in the full dataset falls between 48% and 52%. Put another way, if you repeated the test 100 times, the ranges produced by about 95 of those tests would be expected to contain the true prevalence.
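As a rough illustration of the arithmetic behind that example, the sketch below computes a range around an observed prevalence using the common normal-approximation formula. The function name `prevalence_range` and the sample size of 2,401 documents are assumptions made for illustration; they do not come from the cases or articles referenced above, and other interval methods exist.

```python
from math import sqrt
from statistics import NormalDist


def prevalence_range(observed, sample_size, confidence_level=0.95):
    """Return (low, high) around an observed prevalence.

    Uses the normal approximation, which is one of several possible methods.
    """
    # Two-sided z-score, e.g. about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z * sqrt(observed * (1 - observed) / sample_size)
    return observed - margin, observed + margin


# Hypothetical review: 2,401 sampled documents, half of them relevant.
low, high = prevalence_range(observed=0.50, sample_size=2401)
print(f"{low:.1%} to {high:.1%}")
```

With those assumed inputs, the range works out to roughly 48% to 52%, matching the +/-2% interval described above.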

Strength of measurement affects sample sizes in two ways.

- First, the higher the confidence level you desire, the larger the sample you will need to take.
- Second, the smaller the margin of error you desire, the larger the sample you will need to take. Figure 1 below illustrates how sample sizes increase as the confidence level rises and the confidence interval narrows.

*FIGURE 1 – SAMPLE SIZE VARIABILITY WITH CONFIDENCE LEVEL AND INTERVAL*
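The relationship between measurement strength and sample size can also be sketched numerically. The snippet below uses the standard textbook formula for estimating a proportion in a very large dataset, with a conservative assumed prevalence of 50%. The function name `sample_size` is hypothetical, and the formula is offered as a common approximation rather than as the method used to produce the figure.

```python
from math import ceil
from statistics import NormalDist


def sample_size(confidence_level, margin_of_error, prevalence=0.5):
    """Approximate sample size for a very large dataset.

    No adjustment is made here for the size of the dataset.
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return ceil(z ** 2 * prevalence * (1 - prevalence) / margin_of_error ** 2)


# Compare a few commonly discussed measurement strengths.
for level in (0.90, 0.95, 0.99):
    for margin in (0.05, 0.02):
        n = sample_size(level, margin)
        print(f"{level:.0%} +/-{margin:.0%}: {n:,} documents")
```

Raising the confidence level or tightening the margin of error each pushes the result upward, and tightening the margin has the larger effect because it enters the formula squared.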

**SIZE OF THE DATASET (AKA THE SAMPLING FRAME)**

Sample sizes also increase with the size of the sampling frame, but only up to a point. Beyond that point, the required sample size levels off. For example, the sample size needed for 100,000 documents is roughly the same as the sample size needed for 1,000,000 documents. Figure 2 illustrates how sample size increases with sampling frame size.

*FIGURE 2 – SAMPLE SIZE VARIABILITY WITH SAMPLE FRAME SIZE*

Understanding this can produce significant cost savings. A traditional 5% sample of 1,000,000 documents would be 50,000 documents, but a simple random sample of only about 2,400 documents is actually sufficient to estimate prevalence at a 95% confidence level with a +/-2% confidence interval, and to accomplish other useful investigatory tasks.
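To see the leveling-off effect in numbers, the sketch below applies a commonly used finite population correction to the standard sample size formula. It assumes a 95% confidence level, a +/-2% interval and 50% prevalence; the function name `sample_size_for_frame` and the list of frame sizes are hypothetical choices for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_frame(frame_size, confidence_level=0.95,
                          margin_of_error=0.02, prevalence=0.5):
    """Approximate sample size adjusted for the sampling frame size."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Size needed if the sampling frame were effectively unlimited.
    n0 = z ** 2 * prevalence * (1 - prevalence) / margin_of_error ** 2
    # Finite population correction: shrinks n0 for smaller frames.
    return ceil(n0 / (1 + (n0 - 1) / frame_size))


for frame_size in (1_000, 10_000, 100_000, 1_000_000):
    n = sample_size_for_frame(frame_size)
    print(f"{frame_size:>9,} documents -> sample of {n:,}")
```

Under these assumptions the required sample grows quickly for small sampling frames and then changes very little between 100,000 and 1,000,000 documents.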

**EXPECTED PREVALENCE**

Prevalence also affects the required sample size. However, it will not yet be known when “prevalence” itself is what you are sampling to estimate. In that case, you should use the most conservative value, which is the one resulting in the largest sample size. Assuming a prevalence of 50%, *i.e.* that half of the sampling frame is relevant and half is not, requires the largest sample size. Sample size decreases as prevalence moves away from 50% in either direction. See Figure 3 for a visualization of how sample size fluctuates with prevalence.

*FIGURE 3 – SAMPLE SIZE VARIABILITY WITH PREVALENCE*
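The reason 50% is the most conservative assumption is that the standard formula multiplies prevalence by its complement, and that product peaks at 50%. The sketch below shows this for a few assumed prevalence values at a 95% confidence level and a +/-2% interval, ignoring any adjustment for dataset size. The function name `sample_size_at` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_at(prevalence, confidence_level=0.95, margin_of_error=0.02):
    """Approximate sample size for an assumed prevalence (large dataset)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # prevalence * (1 - prevalence) is largest when prevalence is 0.5.
    return ceil(z ** 2 * prevalence * (1 - prevalence) / margin_of_error ** 2)


for prevalence in (0.1, 0.3, 0.5, 0.7, 0.9):
    n = sample_size_at(prevalence)
    print(f"assumed prevalence {prevalence:.0%}: {n:,} documents")
```

The results are symmetric around 50%, so an assumed prevalence of 10% calls for the same sample size as 90%.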

When estimating prevalence, there is no “correct” strength of measurement to take. As noted above, several orders and articles have referenced a 95% confidence level and a +/- 2% confidence interval, but that is persuasive authority at best. You may not feel comfortable with anything less than 99% +/- 1%, or you may be fine at 90% +/- 5%. It depends on your specific circumstances.

In Part 4 of this series, we will look at how to calculate your sample size based on these variables and how to use that sample to estimate prevalence.
