Thursday, September 28, 2023

# Sample Size (Contd.)

This is part of a 3-part series on the topic. Please read the posts in order for maximum clarity and context. In this blog, we’ll use actual numbers to work out concrete sample sizes. The series then finishes with how to write a program in Python to do the sample size calculations for any need (at different confidence levels, and varying margins of error). Let’s get started with real examples.

## Scenario 1:

We want to find out how many people in our city would favor a sugar tax. We want the margin of error to be no more than +/-5%, and we want to be 95% confident about the result. In other words, how many people do we need to survey to stay within that margin of error at our desired confidence level? We don’t need to know the true population proportion (P) at all!

We don’t know P, so we’ll work with the sample proportion p (aka p hat) instead.

We’ll apply the following formula to find the sample size n.

$$n = \frac{z^2 \, p(1-p)}{M^2}$$

Where n = sample size we want to find, z = z score (confidence level will be used here), p = sample proportion, M = Margin of error.

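Since we have no specific data or reason for bias (yet) in the population proportion, we’ll keep p at 50-50. In other words, we assume the population is equally likely to pick either option. So, p = 0.5. This is also the most conservative choice: p(1-p) is largest at p = 0.5, so it yields the largest sample size.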

z depends on the confidence level. The confidence level is the likelihood that the true value falls within the stated range (our estimate plus or minus the margin of error). Since we want to be 95% confident here, we can map the 95% confidence level to a z score using a table (you can easily find such a table on the internet or in books, but I’ll revisit that when I implement a Python program to do this in a following blog).
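As an alternative to a printed table, here is a minimal sketch of the lookup using only Python's standard library. The helper name `z_score` is hypothetical, and it assumes a two-sided interval, so the leftover probability is split evenly between the two tails.

```python
from statistics import NormalDist


def z_score(confidence: float) -> float:
    """Two-sided critical value z for a confidence level such as 0.95."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Split the leftover probability equally between the two tails.
    tail = (1 - confidence) / 2
    # inv_cdf of the standard normal gives the z value for that cumulative area.
    return NormalDist().inv_cdf(1 - tail)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} -> z = {z_score(level):.3f}")
```

For 95% this should land on the familiar 1.96 once rounded to two decimals.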

At confidence level 95%, z = 1.96 (this is known as the “critical value”). As stated, we’ll set margin of error at 5% or M = 0.05

So, we can find:

$$n = \frac{(1.96)^2 \times 0.5\,(1-0.5)}{0.05^2} = 384.16$$

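Or, n = 384 (rounded to a whole number, since we can’t survey a fraction of a person; rounding up to 385 is the stricter convention).

So, to achieve 95% confidence at a 5% margin of error, the smallest sample we need is about 384 (people, tests, etc.).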
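The arithmetic for this scenario can be checked with a few lines of Python. This is only a sketch: the function name `sample_size` and its parameter names are made up for illustration, and z = 1.96 is typed in by hand to match the numbers used here.

```python
import math


def sample_size(z: float, p: float, margin: float) -> float:
    """Unrounded sample size n = z^2 * p * (1 - p) / M^2."""
    if margin <= 0:
        raise ValueError("margin of error must be greater than 0")
    return z ** 2 * p * (1 - p) / margin ** 2


n = sample_size(z=1.96, p=0.5, margin=0.05)
print(f"raw n      = {n:.2f}")
print(f"rounded    = {round(n)}")      # nearest whole number
print(f"rounded up = {math.ceil(n)}")  # never dips below the target precision
```

The raw value is 384.16. `round` gives 384, while `math.ceil` gives 385, which is the safer pick if the margin of error must not exceed 5%.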

## Call-Outs:

At 2% margin of error, the sample size would be much higher: 3.8416 * 0.25 /0.0004 = 2,401

At 1% margin of error and 99.9% confidence (z ≈ 3.29), we will need a sample size of about 27,060.

At 0% margin of error, the sample size would be UNDEFINED: 3.8416 * 0.25 / 0 cannot be computed (division by zero). Meaning, we’ll have to survey each and every one in the true population.
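These call-outs can be reproduced with a small sketch. The `sample_size` helper is hypothetical, and z ≈ 3.29 for 99.9% confidence is an assumed, rounded critical value. A zero margin is rejected explicitly rather than left to fail with a division error.

```python
def sample_size(z: float, p: float, margin: float) -> float:
    """Unrounded sample size n = z^2 * p * (1 - p) / M^2."""
    if margin <= 0:
        # A 0% margin would mean dividing by zero: no finite sample is enough.
        raise ValueError("margin of error must be greater than 0")
    return z ** 2 * p * (1 - p) / margin ** 2


print(round(sample_size(1.96, 0.5, 0.02)))  # 2% margin, 95% confidence
print(round(sample_size(3.29, 0.5, 0.01)))  # 1% margin, 99.9% confidence

try:
    sample_size(1.96, 0.5, 0.0)
except ValueError as exc:
    print("0% margin:", exc)
```

Halving the margin of error roughly quadruples the sample size, because M is squared in the denominator.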

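Another common way to represent the formula is to solve it for the margin of error, which is then added to and subtracted from p (hat) to give the interval:

$$M = z\sqrt{\frac{p(1-p)}{n}}$$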
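Reading the result of this form as the margin of error M, it can be used to go the other way: given a sample size, what margin does it buy? A minimal sketch, with the hypothetical helper name `margin_of_error`:

```python
import math


def margin_of_error(z: float, p: float, n: float) -> float:
    """Margin of error M = z * sqrt(p * (1 - p) / n)."""
    if n <= 0:
        raise ValueError("sample size must be greater than 0")
    return z * math.sqrt(p * (1 - p) / n)


# Plugging the earlier sample sizes back in should return the margins we asked for.
print(f"n = 384  -> M = {margin_of_error(1.96, 0.5, 384):.4f}")
print(f"n = 2401 -> M = {margin_of_error(1.96, 0.5, 2401):.4f}")
```

With n = 384 the margin comes back as roughly 0.05, and with n = 2,401 as 0.02, which is the round trip we would expect.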

## But Wait!

We did not use the actual population size at all to determine the sample size! Does the actual population matter?

The population size matters little in most cases. Meaning, if the sample is only a small fraction of a large population, then the population size matters so little that it can be ignored. (Don’t confuse this with p = 0.5, which is the assumed proportion of answers, not the fraction of the population we sample.) However, if the sample is a sizable percentage of the actual population, then it could matter. We could adjust or “correct” for it as follows using a correction factor:

If N is the actual population size and n is the sample size, then the correction factor is:

$$\sqrt{\frac{N-n}{N-1}}$$

(In textbooks this factor is applied to the standard error; here we multiply it by our sample size n as a rough way to find a corrected n for a finite population.)

Still using p = 0.5, for 300,000,000 people (say, a national survey), with 95% confidence and a 5% error margin, the sample size is n = 384 (as calculated above), and the correction factor is:

$$\sqrt{\frac{300{,}000{,}000 - 384}{300{,}000{,}000 - 1}} = \sqrt{\frac{299{,}999{,}616}{299{,}999{,}999}} = \sqrt{0.9999987233\ldots} = 0.99999\ldots$$

Or, n*0.99999… = 384*0.99999… = 384 still!
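To see when the correction starts to bite, here is a hedged sketch that applies the factor above to a few population sizes. `fpc_factor` follows the formula as written in this post. `adjusted_sample_size` is an assumption brought in for comparison: it is the commonly quoted n / (1 + (n - 1) / N) adjustment, not something derived here. Both function names and the smaller population sizes are invented for illustration.

```python
import math


def fpc_factor(population: int, n: float) -> float:
    """Correction factor sqrt((N - n) / (N - 1)) as written in the post."""
    if population < 2 or not 0 < n <= population:
        raise ValueError("need population >= 2 and 0 < n <= population")
    return math.sqrt((population - n) / (population - 1))


def adjusted_sample_size(population: int, n: float) -> float:
    """Assumption: the widely used adjustment n / (1 + (n - 1) / N)."""
    return n / (1 + (n - 1) / population)


n = 384
for population in (2_000, 100_000, 300_000_000):
    scaled = n * fpc_factor(population, n)
    adjusted = adjusted_sample_size(population, n)
    print(f"N = {population:>11,}: n * factor = {scaled:7.2f}, adjusted n = {adjusted:7.2f}")
```

For 300,000,000 people both versions still round to 384. The numbers only drop noticeably once the sample becomes a visible share of the population, as in the 2,000-person case.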

In essence, the required sample size is essentially the same whether the population is a few million people or 300,000,000. This seems counter-intuitive, but is “correct”.

## Scenario 2:

Let’s say that, based on some previous surveys/experiments, 15% of the population were left-handed (e.g. there was a survey of 500 people, and 75 were left-handed). We want to apply a 5% error margin with 95% confidence to find the minimal sample size. In this case, we would use p = 0.15 (instead of the perfectly balanced 0.5, since we have reason to support a different population proportion). Applying the same calculation, we’d get n = 3.8416 * 0.15 * 0.85 / 0.0025 = 195.92 or, 196.
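The same calculation for this scenario, sketched in Python. The helper `sample_size` and the variable `p_prior` are hypothetical names; the 75-out-of-500 figures come from the example above.

```python
import math


def sample_size(z: float, p: float, margin: float) -> float:
    """Unrounded sample size n = z^2 * p * (1 - p) / M^2."""
    if margin <= 0:
        raise ValueError("margin of error must be greater than 0")
    return z ** 2 * p * (1 - p) / margin ** 2


p_prior = 75 / 500  # proportion observed in the earlier survey
n = sample_size(z=1.96, p=p_prior, margin=0.05)
print(math.ceil(n))  # rounds up to 196
```

Because p(1-p) is smaller at 0.15 than at 0.5, the required sample is roughly half of what the 50-50 assumption demanded.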

Guess what happens if you surveyed just 2 people and published your findings without telling anyone the sample size? 🙂 What confidence should we have, and what margin of error would that expose?

The truth is: at a 99% confidence level (z ≈ 2.576) and p = 0.5, the formula gives a margin of error of about 91% when you survey just 2 people 🙂
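To put a number on the two-person survey, this sketch plugs n = 2 into the margin-of-error form of the formula. It assumes p = 0.5 and a two-sided 99% confidence level, and the function name `margin_of_error` is hypothetical. The normal approximation behind the formula is not really valid for samples this small, so treat the output as an illustration of how wide the margin gets rather than a rigorous interval.

```python
import math
from statistics import NormalDist


def margin_of_error(confidence: float, p: float, n: int) -> float:
    """Margin of error z * sqrt(p * (1 - p) / n) for a two-sided interval."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)


# Watch the margin shrink as the sample grows by factors of ten.
for n in (2, 20, 200, 2000):
    print(f"n = {n:>4}: margin = {margin_of_error(0.99, 0.5, n):.1%}")
```

With two respondents the margin spans nearly the whole range, so the published percentage tells us almost nothing.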

## Next Step:

In my next blog, I’ll demonstrate a simple Python program to do the calculation easily, using any parameter values we want (confidence level and error margin). Note that Excel also has built-in functions to do something similar and more (look up the STANDARDIZE() function).