1594320480

**Definition of an Outlier:**

An outlier is an observation that deviates markedly from, or lies far away from, the other observations. Outlier detection is important in data analysis. The purpose of this study is to investigate outliers in small samples or non-normal data sets, where their characteristics are problematic. We therefore bring the data closer to normality by deleting the outliers.

**Statistical Techniques and tools**

**1.1** *Grubbs’ Test*

**1.2** *Inter-Quartile Range(IQR)*

**1.3** *Dixon’s Test*

**1.4** *Boxplot*

*1.1 Grubbs’ Test:*

The Grubbs (1969) test detects a single outlier in a univariate data set. It assumes that the data set follows an approximately normal distribution and that the sample size is less than 30. Grubbs’ test is defined by the following two hypotheses.

*H0*: There is no outlier in the data set.

*H1*: There is one outlier in the data set.

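There are several statistics for the Grubbs test. For an ordered data sample, they test whether the minimum or the maximum value is an outlier. The two-sided form is

$$G = \frac{\max_i \lvert x_i - \bar{X} \rvert}{s}$$

where $x_i$ is an element of the data set, $\bar{X}$ is the sample mean, and $s$ is the sample standard deviation.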

The calculated value of the statistic **G** is compared with the critical value for Grubbs’ test. When the calculated value exceeds the critical value at the chosen significance level, the tested observation can be accepted as an outlier.
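As a rough illustration of the comparison described above, the sketch below computes the two-sided statistic G = max|x_i - mean| / s with Python's standard library. The function name `grubbs_statistic` and the sample values are made up for this example. The critical value is not computed here; it still has to be taken from a Grubbs table for the chosen significance level and sample size.

```python
import statistics


def grubbs_statistic(values):
    """Return (G, suspect), where suspect is the value farthest from the mean.

    Assumes the two-sided form G = max|x_i - mean| / s, with s the sample
    standard deviation (n - 1 in the denominator).
    """
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    suspect = max(values, key=lambda x: abs(x - mean))
    return abs(suspect - mean) / s, suspect


# Hypothetical small sample with one suspiciously large value.
sample = [10, 11, 12, 11, 10, 25]
g, suspect = grubbs_statistic(sample)
print(f"suspect value: {suspect}, G = {g:.3f}")
```

If G exceeds the tabulated critical value, the suspect value is flagged; otherwise the null hypothesis of no outlier is retained.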

*Criteria:*

*1.2 Inter-Quartile Range(IQR)*

This is a quantile-based method for detecting outliers in univariate data sets. It does not require statistical tables. The method uses the following steps.

i) First, find ***Q1*** and ***Q3***, i.e. the first and third quartiles.

ii) Then compute their difference, i.e. *H = Q3 − Q1*.

*Criteria:*

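A value lower than ***Q1 − 1.5H*** or higher than ***Q3 + 1.5H*** is considered to be a mild outlier. A value lower than ***Q1 − 3H*** or higher than ***Q3 + 3H*** is conventionally considered to be an extreme outlier.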
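The two steps and the criteria can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: quartiles are computed with `statistics.quantiles` using the "inclusive" method (other quartile conventions give slightly different fences), the 3H fence for extreme outliers is the common Tukey convention, and the function name `classify_outliers` and the data are hypothetical.

```python
import statistics


def classify_outliers(values):
    """Label each value as 'none', 'mild' or 'extreme' using IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    h = q3 - q1  # inter-quartile range, H = Q3 - Q1
    labels = []
    for x in values:
        if x < q1 - 3 * h or x > q3 + 3 * h:
            labels.append((x, "extreme"))  # assumed convention: 3H fence
        elif x < q1 - 1.5 * h or x > q3 + 1.5 * h:
            labels.append((x, "mild"))
        else:
            labels.append((x, "none"))
    return labels


# Hypothetical data set.
data = [2, 4, 5, 6, 7, 8, 9, 16, 30]
for value, label in classify_outliers(data):
    if label != "none":
        print(value, label)
```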

*1.3 Dixon’s Test :*

This test was developed by W. Dixon and is appropriate for small sample sizes. The test is limited to *n ≤ 30*.

Dixon defined a test statistic for detecting an outlier.
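The statistic itself is not reproduced in this text. As an assumption, the sketch below uses the simplest of Dixon's ratios, often written r10 or Q: the gap between the suspect value and its nearest neighbour, divided by the range of the sample. Dixon proposed other ratios for larger samples, so this form is usually recommended only for very small n. The function name `dixon_r10` and the data are hypothetical, and the resulting ratio still has to be compared with a tabulated critical value.

```python
def dixon_r10(values):
    """Return Dixon's r10 ratio for the lowest and the highest value.

    Assumed form: gap to the nearest neighbour divided by the sample range.
    """
    data = sorted(values)
    if len(data) < 3:
        raise ValueError("need at least three observations")
    spread = data[-1] - data[0]
    if spread == 0:
        raise ValueError("all observations are identical")
    low = (data[1] - data[0]) / spread     # suspect: smallest value
    high = (data[-1] - data[-2]) / spread  # suspect: largest value
    return low, high


# Hypothetical sample where the largest value looks suspicious.
low, high = dixon_r10([1, 2, 3, 4, 10])
print(f"r10 (low) = {low:.3f}, r10 (high) = {high:.3f}")
```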

*1.4 Boxplot :*

A boxplot is a graphical tool for detecting outliers. The plotting function accepts different arguments that control how outliers are detected. It draws a box for the given observations, and observations that fall outside the box and its whiskers are treated as outliers.

#machine-learning #clinical-trials #statistics #data-science #data-analysis #data analysis

1595080800

In my previous article, we looked at univariate outlier detection techniques. Let’s look further.

**Statistical Techniques and tools**

**2.1** Standardized Residuals

**2.2** Studentized Residuals

**2.3** Cook’s Distance

**2.4** Leverage

**2.5** DFBETAS

**2.6** DFFITS

*2.1 Standardized Residuals*

Since the approximate average variance of a residual is estimated by MSRes, a logical scaling for the residuals would be the standardized residuals. The standardized residuals have mean zero and approximately unit variance.
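In symbols, the scaling described above is usually written as follows, where the residual $e_i = y_i - \hat{y}_i$ and the index $i$ are notation assumed here rather than taken from the text:

$$d_i = \frac{e_i}{\sqrt{MS_{Res}}}, \qquad i = 1, 2, \dots, n$$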

**Criteria:**

A large standardized residual (|di| > 3) potentially indicates an outlier.

*2.2 Studentized Residuals*

A studentized residual (sometimes referred to as an “**externally studentized residual**” or a “**deleted t residual**”) is:

**Criteria :**

Studentized residuals are more effective for detecting outlying observations than standardized residuals. If an observation has a studentized residual larger than 3 in absolute value, we can call it an outlier.

*2.3 Cook’s Distance*

Its formula is given as,

**Criteria :**

We usually consider points for which *Di* > 1 to be influential, and we can call the *i*th observation an outlier.
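The three diagnostics discussed so far can be sketched together for a simple linear regression with one predictor. This is a minimal sketch, not a definitive implementation: the formulas are the standard textbook forms (they are assumptions here, since the formulas are not shown in the text), p = 2 counts the intercept and slope, and the function name `regression_diagnostics` and the data are hypothetical.

```python
import math


def regression_diagnostics(x, y):
    """Fit y = b0 + b1*x by least squares and return per-point diagnostics."""
    n, p = len(x), 2  # p: number of fitted parameters (intercept, slope)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    ms_res = sum(e * e for e in resid) / (n - p)

    rows = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / sxx   # leverage h_ii
        d = e / math.sqrt(ms_res)             # standardized residual
        r = e / math.sqrt(ms_res * (1 - h))   # internally studentized
        # Residual variance estimated with observation i deleted.
        s2_del = ((n - p) * ms_res - e * e / (1 - h)) / (n - p - 1)
        t = e / math.sqrt(s2_del * (1 - h))   # externally studentized
        cook = (r * r / p) * h / (1 - h)      # Cook's distance
        rows.append({"leverage": h, "standardized": d,
                     "studentized": t, "cooks_d": cook})
    return rows


# Hypothetical data: roughly y = 2x + 1, with the last point far off the line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1, 35.0]
for i, row in enumerate(regression_diagnostics(x, y), start=1):
    flagged = abs(row["studentized"]) > 3 or row["cooks_d"] > 1
    print(i, {k: round(v, 3) for k, v in row.items()}, "<-- check" if flagged else "")
```

Printing all diagnostics side by side makes it easy to see which rule flags which observation.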

#r #outliers #clinical-trials #machine-learning #data-science

1621628640

Python has a set of magic methods that can be used to enrich data classes; they are special in the way they are invoked. These methods are also called “dunder methods” because they start and end with double underscores. Dunder methods allow developers to emulate built-in methods, and it’s also how operator overloading is implemented in Python. For example, when we add two integers together, `4 + 2`, and when we add two strings together, `"machine" + "learning"`, the behaviour is different. The strings get concatenated while the integers are actually added together.
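A small sketch of how a user-defined class can take part in the `+` operator through `__add__()`. The `Vector` class is a hypothetical example, not something from the standard library.

```python
class Vector:
    """Hypothetical 2-D vector used to illustrate operator overloading."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        # Called for `self + other`; NotImplemented lets Python try other options.
        if not isinstance(other, Vector):
            return NotImplemented
        return Vector(self.x + other.x, self.y + other.y)

    def __eq__(self, other):
        return isinstance(other, Vector) and (self.x, self.y) == (other.x, other.y)

    def __repr__(self):
        return f"Vector({self.x}, {self.y})"


print(4 + 2)                         # int.__add__ performs arithmetic addition
print("machine" + "learning")        # str.__add__ performs concatenation
print(Vector(1, 2) + Vector(3, 4))   # Vector.__add__ adds component-wise
```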

If you have ever created a class of your own, you already know one of the dunder methods, `__init__()`. Although it’s often referred to as the constructor, it’s not the real constructor; the `__new__()` method is the constructor. The superclass’s `__new__()` method, `super().__new__(cls[, ...])`, is invoked, which creates an instance of the class, which is then passed to `__init__()` along with the other arguments. Why go through the ordeal of creating the `__new__()` method? You usually don’t need to; the `__new__()` method was created mainly to facilitate the creation of subclasses of immutable types (such as int, str, tuple) and metaclasses.
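To illustrate why `__new__()` matters for immutable types, here is a minimal sketch of a `str` subclass. The class name `UpperStr` is hypothetical. Because a string cannot be changed after it is created, the value has to be fixed inside `__new__()`; by the time `__init__()` runs it is too late.

```python
class UpperStr(str):
    """Hypothetical str subclass that always stores its text in upper case."""

    def __new__(cls, value):
        # The instance is created here, so this is the only place where
        # the content of an immutable str can be chosen.
        return super().__new__(cls, value.upper())

    def __init__(self, value):
        # Runs after __new__ with the same arguments; the string content
        # is already fixed, so only extra attributes can be set here.
        self.original = value


word = UpperStr("machine")
print(word)            # the upper-cased text chosen in __new__
print(word.original)   # the attribute set in __init__
```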

#developers corner #uncategorized #dunder methods #magic methods #operator overriding #python dunder methods #python magic methods

1595096220

Hypothesis testing is a procedure where researchers make a precise statement based on their findings or data. Then, they collect evidence to falsify that precise statement or claim. This precise statement or claim is called the null hypothesis. If the evidence is strong enough to falsify the null hypothesis, we can reject the null hypothesis and adopt the alternative hypothesis. This is the basic idea of hypothesis testing.

There are two distinct types of errors that can occur in formal hypothesis testing. They are:

Type I: Type I error occurs when the null hypothesis is true but the hypothesis testing results show the evidence to reject it. This is called a false positive.

Type II: Type II error occurs when the null hypothesis is not true but it is not rejected in hypothesis testing.
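A Type I error rate can be illustrated by simulation. The sketch below is an assumption-laden toy example: it draws samples from a standard normal distribution, so the null hypothesis "the mean is 0" is true by construction, runs a two-sided z-test with known standard deviation 1, and counts how often the null is wrongly rejected. The function name `type_i_error_rate` and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist, mean


def type_i_error_rate(trials=2000, n=30, alpha=0.05, seed=0):
    """Estimate how often a true null hypothesis is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # null is true
        z = mean(sample) * n ** 0.5                        # sigma known to be 1
        if abs(z) > z_crit:
            rejections += 1                                # a false positive
    return rejections / trials


print(f"estimated Type I error rate: {type_i_error_rate():.3f}")
```

Under these ideal conditions the estimated rate should land close to the chosen significance level.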

Most hypothesis testing procedures perform well in controlling Type I error (at 5%) under ideal conditions. That may give the false idea that there is only a 5% probability that the reported findings are wrong. But it’s not that simple. The probability can be much higher than 5%.

The normality of the data is an issue that can break down a statistical test. If the dataset is small, the normality of the data is very important for some statistical processes such as confidence interval or p-test. But if the data is large enough, normality does not have a significant impact.

If the variables in the dataset are correlated with each other, that may result in poor statistical inference. Look at this picture below:

In this graph, two variables seem to have a strong correlation. Or, if a series of data is observed as a sequence, values are correlated with their neighbors, and there may be some clustering or autocorrelation in the data. This kind of behavior in the dataset can adversely impact statistical tests.

This is especially important when interpreting the result of a statistical test. “Correlation does not mean causation”. Here is an example. Suppose you have study data showing that more people who do not have a college education believe that women should get paid less than men in the workplace. You may have conducted a sound hypothesis test and shown that. But care must be taken over what conclusion is drawn from it. Probably, there is a correlation between college education and the belief that ‘women should get paid less’. But it is not fair to say that not having a college degree is the cause of such a belief. This is a correlation, not a direct cause-and-effect relationship.

A clearer example can be provided from medical data. Studies have shown that people with fewer cavities are less likely to get heart disease. You may have enough data to demonstrate that statistically, but you cannot actually say that dental cavities cause heart disease. There is no medical theory like that.

#statistical-analysis #statistics #statistical-inference #math #data analysis

1598622960

If you can’t explain it to a six year old, you don’t understand it yourself.

The world still awaits the next Einstein, or perhaps it won’t ever get another *person of the century*. But I like to believe: in the universe and in a variation of the above quote…

The best way to learn is to teach.

The idea for this post came when I was once helping one of my juniors with an assignment on **outlier detection**. It wasn’t a very complicated one, just an application of IQR Method of Outlier Detection on a dataset. The tutorial took an exciting turn when he asked me:

*“Why 1.5 times IQR? Why not 1 or 2 or any other number?”*

Now this question won’t ring any bells for those who are not familiar with the IQR Method of Outlier Detection (explained below), but for those who know how simple this method is, I hope the question above makes you think about it. After all, isn’t that what good data scientists do? *Question everything, believe nothing.*

In the most general sense, an *outlier* is a data point which differs significantly from other observations. Now its meaning can be interpreted according to the statistical model under study, but for the sake of simplicity, and so as not to stray too far from the main agenda of this post, we’ll consider first-order statistics, and that too on a very simple dataset, without any loss of generality.

To explain IQR Method easily, let’s start with a box plot.

A box plot from source

A box plot tells us, more or less, about the distribution of the data. It gives a sense of how much the data is actually spread out, what its range is, and how skewed it is. As you might have noticed in the figure, a box plot lets us draw inferences about ordered data, i.e., it tells us about various metrics of data arranged in ascending order.

In the above figure,

- *minimum* is the minimum value in the dataset,
- and *maximum* is the maximum value in the dataset.

So the difference between the two tells us about the range of dataset.

- The *median* is the median (or centre point), also called the second quartile, of the data (resulting from the fact that the data is ordered).
- *Q1* is the first quartile of the data, i.e., 25% of the data lies between *minimum* and *Q1*.
- *Q3* is the third quartile of the data, i.e., 75% of the data lies between *minimum* and *Q3*.

The difference between *Q3* and *Q1* is called the **Inter-Quartile Range** or **IQR**.

```
IQR = Q3 - Q1
```

To detect outliers using this method, we define a new range, let’s call it the decision range, and any data point lying outside this range is considered an outlier and is dealt with accordingly. The range is given below:

```
Lower Bound: (Q1 - 1.5 * IQR)
Upper Bound: (Q3 + 1.5 * IQR)
```

Any data point less than the *Lower Bound* or more than the *Upper Bound* is considered an outlier.

But the question was: *Why only 1.5 times the IQR? Why not any other number?*
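One way to get a feel for the question is to vary the multiplier and watch which points get flagged. The sketch below is only an experiment, not an answer: the function name `iqr_outliers`, its parameter `k`, and the dataset are hypothetical, and the quartiles come from `statistics.quantiles` with the "inclusive" method (other quartile conventions shift the bounds slightly).

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    return [x for x in values if x < lower_bound or x > upper_bound]


# Hypothetical dataset with two largish values at the end.
data = [2, 4, 5, 6, 7, 8, 9, 14, 16]
for k in (1.0, 1.5, 2.0):
    print(f"k = {k}: outliers = {iqr_outliers(data, k)}")
```

A smaller multiplier narrows the decision range and flags more points, while a larger one widens it and flags fewer.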

#outlier-detection #data-science #data-preprocessing #statistics #data-preparation