India is a nation whose main livelihood is agriculture, and we live peacefully in our diversity.

**P.C. Mahalanobis** was a pioneer in using statistics and data as the foundation for decision-making in India, including by government policymakers.

His statistical work in agriculture, and his use of data to understand India’s diversity, put statistics to work for the nation.

P.C. Mahalanobis’ birthday, 29 June, has been celebrated in India as National Statistics Day every year since 2007.

*PS: This is not to be confused with World Statistics Day on 20th October.*

What is Statistics?

How was the journey of Statistics in India?

Why Statistics?

To celebrate National Statistics Day in my own small way, I will share P.C. Mahalanobis’ views on these questions, one article at a time.

In this short article, I will share Mahalanobis’ view of Statistics.

In my day-to-day data work, I routinely find myself running a lot of `for` loops. These can take minutes to complete, which isn’t necessarily a long time, but loops whose iterations are independent are embarrassingly parallel. We can do better.
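To make "embarrassingly parallel" concrete, here is a minimal sketch that uses only the standard library's `concurrent.futures`, not Dask. The function `slow_square` and the input list are hypothetical stand-ins for one independent loop iteration and its inputs.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Hypothetical stand-in for the body of one independent loop iteration."""
    return x * x


inputs = list(range(10))

# Serial version: one iteration at a time.
serial_results = [slow_square(x) for x in inputs]

# Concurrent version: no iteration depends on another, so the work can be
# handed to a pool of workers. pool.map preserves the input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_results = list(pool.map(slow_square, inputs))

print(serial_results == parallel_results)
```

Threads mainly help when each iteration waits on I/O. For CPU-bound loops, the work generally has to be spread across processes or machines to see a speed-up.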

In this article, I will discuss how to make more efficient use of your time when working in Python. Whether you work on a laptop or a high-performance computer (HPC), you can speed up your workflow by taking full advantage of all the computing power available to you. This can be achieved with the `Dask` and `Dask-jobqueue` libraries. This post will discuss how to create and use a `dask` cluster on your local computer and on an HPC.

`Dask` is a Python library for parallel computing, and `dask-jobqueue` lets you interact with job schedulers, such as Slurm, from a Jupyter Notebook. `Dask` makes simple things easier and complex things possible, and its `numpy`- and `pandas`-like API makes writing code familiar to Python data practitioners.

- Installation
- Set up a Dask cluster on a laptop
- Set up a Dask cluster on an HPC
- Submit work to the cluster
- Dask LabExtension
- Final thoughts

Hypothesis testing is a procedure in which researchers state a precise claim and then collect evidence to try to falsify it. This precise claim is called the null hypothesis. If the evidence against the null hypothesis is strong enough, we reject it and adopt the alternative hypothesis. This is the basic idea of hypothesis testing.

There are two distinct types of errors that can occur in formal hypothesis testing. They are:

Type I: A Type I error occurs when the null hypothesis is true but the test rejects it. This is called a false positive.

Type II: A Type II error occurs when the null hypothesis is false but the test fails to reject it. This is called a false negative.

Most hypothesis testing procedures control the Type I error rate well (typically at 5%) under ideal conditions. That may give the false impression that there is only a 5% probability that the reported findings are wrong. But it’s not that simple. When a test’s assumptions are violated, the probability can be much higher than 5%.
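A small simulation can show what controlling the Type I error at 5% means under ideal conditions. This is only a sketch with assumptions of my own: a two-sided z-test with known standard deviation 1, independent normal data, and arbitrary choices for the sample size and number of trials.

```python
import random
from statistics import NormalDist, fmean

random.seed(0)  # fixed seed so the run is repeatable

ALPHA = 0.05
N = 30          # observations per simulated study (arbitrary choice)
TRIALS = 5000   # number of simulated studies (arbitrary choice)

# Two-sided critical value for a z-test, about 1.96 when ALPHA = 0.05.
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)

rejections = 0
for _trial in range(TRIALS):
    # The null hypothesis (mean 0, known standard deviation 1) is true here.
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]
    z = fmean(sample) * N ** 0.5   # z = mean / (sigma / sqrt(N)) with sigma = 1
    if abs(z) > z_crit:
        rejections += 1            # a false positive: H0 is true but rejected

type_i_rate = rejections / TRIALS
print(f"Estimated Type I error rate: {type_i_rate:.3f}")
```

The estimated rate should land close to 0.05, because every assumption of the test holds in this setup.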

Non-normal data is one issue that can break a statistical test. If the dataset is small, normality is very important for some statistical procedures, such as computing confidence intervals or p-values. But if the dataset is large enough, departures from normality usually have little impact, because the sampling distribution of the mean becomes approximately normal (the central limit theorem).

If the variables in the dataset are correlated with each other, that may result in poor statistical inference. Look at this picture below:

In this graph, two variables seem to have a strong correlation. Similarly, if a series of data is observed as a sequence, each value may be correlated with its neighbors, and there may be some clustering or autocorrelation in the data. This kind of behavior in the dataset can adversely impact statistical tests.
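How much autocorrelation can hurt a test can be sketched with a simulation. The setup is my own assumption, not something taken from the graph: data follow an AR(1) process with coefficient `phi`, the null hypothesis (mean 0) is true, and a z-test is applied as if the observations were independent. The function name `rejection_rate` and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def rejection_rate(phi, n=50, trials=4000, alpha=0.05, seed=1):
    """Share of simulated studies in which a z-test rejects a true H0 (mean 0).

    Data follow an AR(1) process x[t] = phi * x[t-1] + noise, scaled so that
    every observation has variance 1. The test wrongly assumes independence.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    noise_sd = (1 - phi ** 2) ** 0.5   # keeps the marginal variance at 1
    rejections = 0
    for _trial in range(trials):
        x = rng.gauss(0.0, 1.0)        # start in the stationary distribution
        sample = [x]
        for _step in range(n - 1):
            x = phi * x + rng.gauss(0.0, noise_sd)
            sample.append(x)
        z = fmean(sample) * n ** 0.5   # valid only if observations are independent
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


independent_rate = rejection_rate(phi=0.0)
autocorrelated_rate = rejection_rate(phi=0.7)
print(f"independent: {independent_rate:.3f}, autocorrelated: {autocorrelated_rate:.3f}")
```

With `phi=0.0` the false positive rate stays near the nominal 5%. With `phi=0.7` it is several times higher, even though the null hypothesis is true in both cases.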

This is especially important when interpreting the result of a statistical test. “Correlation does not mean causation”. Here is an example. Suppose you have study data showing that people without a college education are more likely to believe that women should be paid less than men in the workplace. You may have conducted a sound hypothesis test that supports this. But care must be taken over what conclusion is drawn from it. There is probably a correlation between college education and the belief that ‘women should get paid less’. But it is not fair to say that not having a college degree is the cause of such a belief. This is a correlation, not a direct cause-and-effect relationship.

A clearer example comes from medical data. Studies have shown that people with fewer cavities are less likely to get heart disease. You may have enough data to demonstrate this statistically, but you cannot conclude that dental cavities cause heart disease. There is no medical theory to support that.
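One way such a correlation can arise without causation is a shared third factor. The sketch below illustrates that with simulated data. The third factor `z` and the way `x` and `y` depend on it are purely my assumptions for illustration and are not drawn from any study.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(42)
n = 5000

# Hypothetical shared factor z; x and y each depend on z but not on each other.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

r_xy = pearson(x, y)

# Subtracting z's contribution leaves only the independent noise terms.
r_resid = pearson([xi - zi for xi, zi in zip(x, z)],
                  [yi - zi for yi, zi in zip(y, z)])

print(f"corr(x, y) = {r_xy:.2f}, after removing z = {r_resid:.2f}")
```

Here `x` and `y` are strongly correlated although neither influences the other, and the correlation essentially vanishes once the shared factor is removed.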

This video tutorial provides a basic introduction to statistics. It explains how to find the mean, median, mode, and range of a data set. It also explains how to find the interquartile range, quartiles, and percentiles, as well as any outliers. The full version of this video, which can be found on my Patreon page, also covers how to construct box and whisker plots, histograms, frequency tables, frequency distribution tables, dot plots, and stem and leaf plots. It also covers relative frequency and cumulative relative frequency, as well as how to use them to determine the value that corresponds to a certain percentile. Finally, this video discusses skewness - it explains which distribution is symmetric, which is skewed to the right (positive skew), and which is skewed to the left (negative skew).
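As a quick companion to the topics listed above, here is a minimal sketch of the same summary statistics using Python's `statistics` module. The data values are made up, and the 1.5 × IQR rule for outliers is a common convention that I am assuming here; quartile conventions also vary, so other tools may give slightly different cut points.

```python
import statistics

data = [3, 5, 7, 7, 8, 10, 12, 13, 43]   # made-up values for illustration

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
data_range = max(data) - min(data)

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
# Its default "exclusive" method is one of several quartile conventions.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Common 1.5 * IQR rule for flagging outliers (an assumed convention).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(mean, median, mode, data_range, q1, q3, iqr, outliers)
```

The single large value pulls the mean above the median, which is the pattern of a right-skewed (positively skewed) data set.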