A friend of mine asked me an interesting question today. He has one million data points in his experiment, yet the p-value of a t-test comparing the average metric value across the two populations was not smaller than 0.05. He was wondering: can he safely assume the two populations have the same average metric?

The truth is, a p-value above 0.05 can be merely a consequence of your experiment not having enough power, meaning the samples were not large enough to reject the null hypothesis. To make this clearer, picture an extreme case in which instead of one million users we have only 3. If the p-value is not smaller than 0.05, we certainly cannot claim the two populations have the same mean.

Statisticians tend to look at the statistical power. The statistical power is the probability of rejecting the null hypothesis when it is in fact false: Power = P(Reject H0 | H0 is false). In other words, the power of the test tells you how likely it is to detect a real effect when one exists.
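This definition can be checked with a two-line simulation: generate many data sets for which H0 is known to be false, and count how often the t-test rejects it. A minimal sketch (the effect size of 0.5 and the group size of 100 are arbitrary choices for illustration, not from my friend's experiment):

```r
set.seed(42)
# Simulate 1000 experiments in which the true difference in means is 0.5
# (sd = 1), with 100 observations per group, and record whether each
# t-test rejects H0 at the 5% level.
reject <- replicate(1000, t.test(rnorm(100, mean = 0.5), rnorm(100))$p.value < 0.05)

# Empirical power: the fraction of simulations in which H0 was rejected.
mean(reject)

# It should agree closely with the theoretical value:
power.t.test(n = 100, delta = 0.5, sd = 1, sig.level = 0.05)$power
```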

Let’s demonstrate this with a quick R script. My friend told me that the difference between the two means was just 0.10%, and since the p-value of his t-test on one million data points was not significant, he suggested there was no real effect and any difference was just noise. So I constructed a little experiment with one million data points in which the true difference of means was 0.10%, ran it 500 times, and counted how many times I correctly rejected the null hypothesis. It turns out there is about a 90% chance that the p-value comes out non-significant even though we know there truly is a difference in the means.

```r
# rm(list = ls())
n <- 10^6
delta <- 0.001
sd <- 1
sig.level <- 0.05

# Theoretical power of the two-sample, two-sided t-test
power <- power.t.test(n = n, delta = delta, sd = sd, sig.level = sig.level,
                      type = "two.sample", alternative = "two.sided")$power

# Empirical power: repeat the experiment and track the running rejection rate
p <- c()
iPower <- c()
for (i in 1:500){
  a <- rnorm(n, 1 + delta, sd)
  b <- rnorm(n, 1, sd)
  p <- c(p, t.test(a, b, paired = FALSE)$p.value < sig.level)
  iPower <- c(iPower, length(which(p[1:i] == TRUE))/length(p[1:i]))
}

library(ggplot2)
data <- data.frame(x = 1:500, y = iPower)
q <- ggplot(data, aes(x, y)) +
  geom_point() +
  geom_abline(intercept = power, slope = 0, color = "red", linetype = "dashed") +
  ylab("Percentage of cases H0 was rejected") +
  xlab("Simulations")
plot(q)
```

The command `power.t.test()` calculates the statistical power of my friend’s test, which turns out to be slightly above 10%. The simulation in the loop estimates the power empirically: after 500 iterations the value converges to about 0.118, while the theoretical value is 0.105. With 10 million data points the power would increase from 10.5% to 61%, and with about 100 million points it reaches roughly 99.99%. Estimating how many data points you need to get a statistically significant result (assuming a true effect exists) is usually referred to as power analysis and sample size determination.
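Sample size determination is simply the inverse calculation: instead of asking `power.t.test()` for the power at a given n, leave n out and supply a target power, and it solves for the required sample size. A quick sketch, using the same delta and sd as in the script above and a conventional 80% power target:

```r
# Per-group sample size needed to detect a 0.10% difference (delta = 0.001,
# sd = 1) with 80% power at the 5% significance level.
n.needed <- power.t.test(delta = 0.001, sd = 1, sig.level = 0.05,
                         power = 0.80, type = "two.sample",
                         alternative = "two.sided")$n
n.needed  # on the order of 15-16 million observations per group
```

Note that the returned n is per group, so the total experiment would need roughly twice that many data points.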

So next time your friend tells you the p-value was not significant, you can tell them to check whether their test had enough power, and to increase the sample size if it did not. Any other conclusion might be reading too much into the data.