Science as a pursuit has always had reproducibility at its core. After all, if a claim is made about the physical world and the evidence does not support it, it doesn’t matter how much ideology or vested interest is pushing the idea; there’s no reason for you to believe it. In the seemingly post-truth world we live in, where politicians, the media, and voices on social media propagate information that is often varying shades of dishonest, it pays dividends for your integrity to make reproducible claims. It’s part and parcel of your job as a data scientist.

I think reproducibility in data science is less well understood than reproducibility in more established fields of science. In those fields, a study might test one or two simple claims about the mean difference between two or more groups. Examples include the following (a short code sketch of such a test comes after the list):

  1. Does treatment A make a statistically significant difference compared with placebo B?
  2. Do groups exposed to differing lengths of stimuli exhibit varying outcomes?
  3. What is the effect size of a treatment?
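To make claims like the first and third concrete, here is a minimal sketch in Python using NumPy and SciPy. The group means, spreads, sample sizes, and seed here are illustrative assumptions, not values from any real study:

```python
import numpy as np
from scipy import stats

# Fix the random seed so the simulated data, and therefore the
# result, is reproducible end to end.
rng = np.random.default_rng(42)

# Hypothetical outcomes: treatment A vs. placebo B (n = 50 each).
treatment = rng.normal(loc=5.3, scale=1.2, size=50)
placebo = rng.normal(loc=5.0, scale=1.2, size=50)

# Two-sample t-test for the mean difference between the groups.
t_stat, p_value = stats.ttest_ind(treatment, placebo)

# Cohen's d as a simple effect-size estimate (pooled standard deviation).
pooled_sd = np.sqrt((treatment.var(ddof=1) + placebo.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - placebo.mean()) / pooled_sd

print(f"t = {t_stat:.3f}, p = {p_value:.3f}, d = {cohens_d:.3f}")
```

Because the seed is fixed, anyone who runs this script gets identical numbers, which is the code-level analogue of a replicable experiment.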

Since there is generally a publication bias toward statistically significant results, replication studies, whose goal is to repeat what other studies have done, often go unpublished. When they are performed, however, and they do not reach the same conclusion under similar inputs, they cast doubt on the original claims. The research has not been reproduced.

In the field of structural engineering (my first career), we used a form of reproducibility to validate designs produced by other engineers. Often an engineer would be tasked with designing a bridge, which is an awfully complex hunk of concrete and steel. In case you’ve never been outside, here is a picture of one.

[Image: a bridge]

