Photo by Taras Chernus on Unsplash

If you think you have seen something interesting, you need evidence to show that it isn’t a one-off occurrence.

I have a lucky hand, and recently I have been winning a lot. Every number I bet on turns out to be the winning number at the roulette wheel. Have I discovered something interesting? What are the chances that I just got randomly lucky with all my recent wins, assuming I am still as much of a loser as I was before?

In statistics, the p-value is the answer to that last question. Ok, now that hurt! How do I prove that my winning intuition is more than just pure luck? Of course, I begin experimenting and gamble more. The more I bet and the more I win, the more evidence I build in my favor. The lower the “p-value”, the stronger the evidence.
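To make that concrete, here is a minimal sketch of the calculation in Python. The numbers are purely illustrative assumptions: say I won 4 of my last 20 single-number bets on a European wheel, where each bet wins with probability 1/37.

```python
# Sketch: the p-value for my "lucky hand" claim. Under the null hypothesis
# that I have no special skill, each single-number bet wins with p = 1/37.
from scipy.stats import binom

n_bets, n_wins = 20, 4      # illustrative numbers, not real data
p_win = 1 / 37              # European roulette, single-number bet

# Probability of winning at least n_wins times out of n_bets by pure luck
p_value = binom.sf(n_wins - 1, n_bets, p_win)
print(f"p-value: {p_value:.5f}")   # a tiny value means strong evidence against "pure luck"
```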

Can I actually manipulate the p-value to make myself look good in front of my friends, without falsifying the results? What would you think if you found out that all my winning bets were placed at the same casino? Do you have any information about my bets at other casinos? Have I won just as much money there?


Unfortunately, the p-value can be easily manipulated, and it happens a lot in real life, in both scientific studies and business research. For example, p-hacking allegations have led to further questioning of research studies after they were published, because their results were not reproducible. With more such high-profile examples surfacing, p-hacking has even found its way into pop culture.

“Is science bullshit? No, but there is a lot of bullshit currently masquerading as science” — John Oliver

However you manipulate p-values, it leads to bad decisions with expensive consequences, not to mention the loss of the researcher’s credibility. If we are testing a hypothesis, a p-value above our chosen significance level indicates that there is not enough evidence to reject the counterargument to our hypothesis, the null hypothesis. That doesn’t mean that our own hypothesis is wrong. However, this does create a huge grey area in interpreting p-values, one that is widely exploited in both research and business.

How do we dodge the p-value bullet?

We need to understand that as data scientists and researchers, our job is to ask questions and make recommendations, and not go out on a limb to justify that a given hypothesis is correct. Before you read any further though, I highly recommend going through the basics about testing and experimentation.

Not having enough evidence to validate a hypothesis does not mean that the hypothesis is wrong. However, this fact has been twisted a lot in order to publish misleading results from research studies in both academia and business.

Before drawing conclusions from our own research studies, or from some other body of work, we should always ask ourselves these questions.

Are we accepting a result only because it aligns with our own biases?

One of the common reasons why p-hacking occurs, whether intentional or not, is the exploitation of researcher degrees of freedom. This term refers to the flexibility researchers have when designing and analyzing statistical experiments, flexibility that can be used to make the results they want to publish turn out to be statistically significant. Usually, these design decisions come down to factors such as which variables to include in the experiment, when to stop collecting data, how the data is sampled, and so on.
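As a quick illustration (not from the original article), here is a small simulation of one of those degrees of freedom: peeking at an A/B test after every batch of data and stopping as soon as the result looks significant. Even when there is truly no effect, this inflates the false-positive rate well above the nominal 5%.

```python
# Sketch: "peek and stop early" on pure noise. Both arms are drawn from the
# same distribution, yet stopping at the first p < 0.05 triggers far more
# false positives than the nominal 5%.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, n_peeks, batch_size, alpha = 2000, 10, 100, 0.05
false_positives = 0

for _ in range(n_sims):
    a = rng.normal(size=n_peeks * batch_size)   # group A: noise only
    b = rng.normal(size=n_peeks * batch_size)   # group B: noise only
    for peek in range(1, n_peeks + 1):
        n = peek * batch_size
        if ttest_ind(a[:n], b[:n]).pvalue < alpha:   # stop at the first "significant" peek
            false_positives += 1
            break

print(f"False-positive rate with peeking: {false_positives / n_sims:.1%}")
# Well above 5%; a single pre-planned test at the end would stay near 5%.
```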

As data scientists, our job is to objectively validate our hypotheses, and not defend them.

Imagine that we own a chain of supermarket stores, and notice that in one of our stores, beer and diapers are always bought together. Does this mean that the correlation between beer and diaper sales is real? Is the correlation statistically significant? What if we manipulated the experiment, intentionally or not, to force a statistically significant result? For example (a small simulation of this kind of cherry-picking follows the list),

  • We selected only those transactions that contained purchases of baby products.
  • We ran our study on only a handful of stores in a neighborhood that randomly happens to have a lot of young couples with new babies.
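Here is a sketch, on entirely synthetic, made-up data, of how that kind of cherry-picking can manufacture significance. Beer and diaper purchases are generated independently in every store, yet if we hunt for the store with the best-looking result, we will usually find a “significant” association somewhere by chance.

```python
# Sketch: synthetic transactions where beer and diapers are independent by
# construction. Testing every store separately and reporting the best one
# is a recipe for a spurious "discovery".
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(7)
n_stores, n_per_store = 40, 500
p_values = []
for _ in range(n_stores):
    beer   = rng.random(n_per_store) < 0.15    # independent of diapers by design
    diaper = rng.random(n_per_store) < 0.10
    table = [[np.sum(beer & diaper),  np.sum(beer & ~diaper)],
             [np.sum(~beer & diaper), np.sum(~beer & ~diaper)]]
    p_values.append(chi2_contingency(table)[1])

print("smallest per-store p-value (cherry-picked):", round(min(p_values), 4))
print("stores 'significant' at 0.05 by pure chance:", sum(p < 0.05 for p in p_values))
```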

Instead, as good data scientists, we want to take further steps to ensure that we are running a reliable experiment. There are more variables at play than just sales of beer and diapers. For example (a sketch of the honest version follows the list),

  • We should run the experiment in other markets and verify if the results are reproducible.
  • When selecting data for the experiment, go for random sampling. That is the gold standard.
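For contrast, here is a sketch of the honest version on the same kind of synthetic data: one pre-planned test on a simple random sample drawn across all stores, rather than the best-looking subset.

```python
# Sketch: the honest check on synthetic data where beer and diapers are
# truly independent. A single pre-planned test on a random sample across
# all stores gives an unremarkable p-value far more often than not.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(7)
n_total = 40 * 500                      # all transactions across all stores
beer   = rng.random(n_total) < 0.15
diaper = rng.random(n_total) < 0.10

idx = rng.choice(n_total, size=2_000, replace=False)   # simple random sample
b, d = beer[idx], diaper[idx]
table = [[np.sum(b & d),  np.sum(b & ~d)],
         [np.sum(~b & d), np.sum(~b & ~d)]]
print("p-value on the random sample:", round(chi2_contingency(table)[1], 4))
# Most of the time this is nowhere near significant, and a result that only
# appears in one hand-picked store should not survive this check.
```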

Drawing conclusions from spurious results that we believe are reliable can prove to be extremely expensive. We surely don’t want those dollars down the drain.

At what significance threshold are we willing to accept the results?

In other words, what odds are we willing to take to Vegas? Our hypothesis is that beer and diapers go hand in hand; the counterargument is that they do not. This counterargument is the null hypothesis. What would it take to reject it? The p-value is meant to be that mentor who is honest and blunt, and without prejudice. How do we interpret its advice to make a rational decision? We make a plan. We draw a line in the sand before running the experiment: if the p-value lands above it, we fail to reject the null hypothesis; if it lands below, we reject it. This line in the sand is called the significance level, or alpha.
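In code, the plan is almost embarrassingly simple; the discipline is in fixing alpha before looking at the data. The values below are hypothetical placeholders, not results from a real test.

```python
# Sketch of the decision rule. alpha is chosen *before* the experiment;
# the p-value comes from the test itself (hypothetical number here).
alpha = 0.05      # the pre-committed line in the sand (assumed, conventional choice)
p_value = 0.032   # hypothetical outcome of the beer-and-diapers test

if p_value < alpha:
    print("Reject the null hypothesis at the chosen risk level.")
else:
    print("Fail to reject the null hypothesis: not enough evidence at this risk level.")
```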

A conservative significance threshold doesn’t necessarily make a test reliable. It only indicates the level of risk we are willing to accept before making decisions.

So, does a lower, more conservative significance threshold mean that our test has solid results? No. It’s just an indicator of the amount of risk we are willing to take; it says nothing about the quality of our research. The significance level can be more conservative or relaxed depending on the domain.

The threshold to reject the null hypothesis can change from domain to domain, and can be either conservative or relaxed depending on the merit of the situation. That does not change the quality of the research, or the reliability of the observed p-value itself.

For example, the significance level can be more relaxed for a marketing study than for a clinical trial of a new drug. However, tightening or relaxing this threshold will not change the merit of the study and the reliability of the p-value statistic itself. Our judgement should be based on the quality of the research, and changing the significance level to make the outcome look good won’t change how good it actually is.


Just because something looks good, doesn’t mean that it is good, or necessarily correct.

Of course, there are a lot of things to consider when setting up an experiment. We need to get a few basics right, like figuring out which metric we want to measure, how large the test needs to be, and how long we need to run it. Dr. Mircea Davidescu explains some of these in detail in this brilliant article. However, once we figure out the basics, our job as data scientists is to be objective with the results and not manipulate them.
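As one small illustration of the “how big and how long” question, here is a sketch of a sample-size calculation for a two-proportion A/B test using the standard normal approximation. The baseline rate, target lift, alpha, and power are all assumed, illustrative numbers.

```python
# Sketch: how many users per arm for an A/B test, using the normal
# approximation for comparing two proportions. All inputs are illustrative.
from scipy.stats import norm

p1, p2 = 0.10, 0.12          # baseline conversion and the lift we care about
alpha, power = 0.05, 0.80    # two-sided significance level and desired power

z_alpha = norm.ppf(1 - alpha / 2)
z_beta  = norm.ppf(power)

n_per_arm = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(f"About {round(n_per_arm):,} users per arm")   # roughly 3,800 per arm
```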

After all, we don’t owe the results of any experiment to anyone. We report what we observe, and recommend ways to move forward.

#p-value #experimentation #p-hacking #data-science #a-b-testing

Protecting Yourself From P-value Manipulation