Ace Statistics Step by Step for Data Science

Why do we need to learn statistics for machine learning?

Statistics helps us analyze data and draw inferences from it, which in turn helps us understand the data. For example, with the help of statistics, we can tell whether our data is skewed or normally distributed, or whether it contains outliers. It lets us compute the mean/median/mode of our data and see the range within which most data points lie. In short, it supports the EDA (exploratory data analysis) part of machine learning, which involves a lot of data cleaning, and it also helps in feature engineering.

Statistics can be divided into two parts:

a) **Descriptive Statistics:** This allows us to analyze and summarize the data with the help of different plots/graphs and tables.

Graphs:
· Box plot
· Histogram

Tabular representation:
· Central tendency (mean/median/mode)
· Standard deviation
· Variance
· Range of data

b) **Inferential Statistics:** Inferential statistics helps us draw a conclusion about the population from the sample data, after performing descriptive statistical analysis on the sample.

It helps us judge whether the sample correctly represents the whole population and, with the help of a **confidence interval**, how confident we can be in that claim. It is also useful when we have multiple samples from the same population and need to decide which one describes the population more accurately.

Several hypothesis testing concepts and methods help us draw such conclusions about a population from sample data:
· Null and alternate hypothesis
· Z-test
· T-test
· Chi-square test
· ANOVA and ANCOVA test
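To make the descriptive measures listed above concrete, here is a minimal sketch using only Python's standard `statistics` module. The height values and the `describe` helper are made up for illustration, and the 1.5 × IQR rule used to flag outliers is one common convention (the same one a box plot typically draws), not the only possible choice.

```python
import statistics


def describe(data):
    """Return basic descriptive statistics for a list of numbers.

    Hypothetical helper written for this example.
    """
    ordered = sorted(data)

    # Quartiles (default "exclusive" method) for the interquartile range.
    q1, _, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "variance": statistics.variance(ordered),  # sample variance (n - 1)
        "std_dev": statistics.stdev(ordered),      # sample standard deviation
        "range": ordered[-1] - ordered[0],
        # Points outside the 1.5 * IQR fences are flagged as outliers.
        "outliers": [x for x in ordered if x < lower_fence or x > upper_fence],
    }


# Made-up heights in cm; the last value is deliberately extreme.
heights = [150, 152, 155, 155, 158, 160, 162, 165, 170, 210]

summary = describe(heights)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this toy data the mean is pulled above the median by the single extreme value, which is the kind of right skew and outlier that descriptive statistics is meant to reveal during EDA.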

🎯 What is Population

Population: The population is the complete set of data points (entities) that we intend to analyze.

Ex: If we want to find out the average height of all the people of a country, then the heights of all the people in the country make up the population.

🎯 What is Sample

Sample: A sample is a small collection of data points picked from the population. A good sample is a close representation of the population. A sample always contains fewer data points than the population.

Ex: Suppose I choose 1000 people from a country, compute their average height, and then use that value to estimate the average height of all the people in the country.
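The height example can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the population is synthetic (normally distributed heights with a mean of 165 cm and a standard deviation of 10 cm), the helper name `sample_mean_with_ci` is hypothetical, and the 95% confidence interval uses the normal approximation with z = 1.96.

```python
import random
import statistics


def sample_mean_with_ci(population, sample_size, seed=0, z=1.96):
    """Draw a simple random sample and return its mean with an approximate CI.

    Hypothetical helper: the interval is mean +/- z * standard error,
    which assumes the sample is large enough for the normal approximation.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)  # sampling without replacement
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / sample_size ** 0.5
    return mean, (mean - z * std_err, mean + z * std_err)


# Synthetic "country": 100,000 heights in cm (not real data).
population_rng = random.Random(42)
population = [population_rng.gauss(165, 10) for _ in range(100_000)]

population_mean = statistics.mean(population)
sample_mean, (ci_low, ci_high) = sample_mean_with_ci(population, 1000)

print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
print(f"Approx. 95% CI:  ({ci_low:.2f}, {ci_high:.2f})")
```

With 1000 randomly chosen people, the sample mean should land close to the population mean even though only 1% of the population was measured.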

🎯 Why is Sampling Required:

A population usually contains a huge volume of data, and collecting all of it is often practically impossible. Even when it is possible, it is time-consuming. Sampling makes the work easier, faster, and feasible because we do not work with the whole population. Instead, we pick a reasonable number of elements from the population that can summarize it.

Note: The sample should be a close representation of the population.

🎯 How does sampling affect the analysis if it is not done properly or the right number of elements is not chosen from the population?

As we saw, we cannot analyze the whole country's data, so we choose a small group of people within the country that can more or less represent the country's overall population. But we need to be sure that the sample we have chosen is not biased and correctly represents the population; otherwise, the sample will produce an incorrect result. Sample size (the number of data points within the sample) also plays a vital role in how well the sample reflects the population.
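Both effects can be seen in a small simulation. This is a sketch under stated assumptions: the population is synthetic and made of two equally sized groups with different average heights (160 cm and 175 cm), the "biased" sample is drawn from only one group, and the helper `spread_of_sample_means` is a hypothetical name introduced for this example.

```python
import random
import statistics

rng = random.Random(7)

# Synthetic population made of two groups with different average heights (cm).
group_a = [rng.gauss(160, 7) for _ in range(50_000)]
group_b = [rng.gauss(175, 7) for _ in range(50_000)]
population = group_a + group_b
population_mean = statistics.mean(population)

# Effect of bias: a random sample vs. a sample taken from group_b only.
random_sample = rng.sample(population, 1000)
biased_sample = rng.sample(group_b, 1000)

random_error = abs(statistics.mean(random_sample) - population_mean)
biased_error = abs(statistics.mean(biased_sample) - population_mean)


def spread_of_sample_means(data, sample_size, repeats, generator):
    """Standard deviation of the sample mean over repeated random samples."""
    means = [
        statistics.mean(generator.sample(data, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)


# Effect of sample size: small samples give less stable estimates.
small_n_spread = spread_of_sample_means(population, 10, 200, rng)
large_n_spread = spread_of_sample_means(population, 1000, 200, rng)

print(f"Error of random sample mean: {random_error:.2f}")
print(f"Error of biased sample mean: {biased_error:.2f}")
print(f"Spread of sample means, n=10:   {small_n_spread:.2f}")
print(f"Spread of sample means, n=1000: {large_n_spread:.2f}")
```

The biased sample stays far from the population mean no matter how many people it contains, while a random sample gets more stable as its size grows. A larger sample reduces random variation, but it does not correct for bias.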

#data-analytics #data-science-interview #statistics #data-science

medium.com
