Statistics and Probability form the core of Machine Learning and Data Science. It is through statistical analysis, coupled with computing power and optimization, that Machine Learning achieves what it achieves today. From the basics of probability to descriptive and inferential statistics, these topics form the foundation of Machine Learning.
By the end of this tutorial, you will know the following:
Independent and Dependent events
Let’s consider two events, A and B. When the probability of event A occurring does not depend on whether event B occurs, A and B are independent events. For example, if you toss 2 fair coins, the probability of getting heads is 0.5 for each coin regardless of the other’s outcome, so the events are independent.
Now consider a box containing 5 balls — 2 black and 3 red. The probability of drawing a black ball first is 2/5. The probability of drawing a black ball again from the remaining 4 balls is 1/4. These two events are dependent, because the probability of drawing a black ball the second time depends on which ball was drawn on the first go.
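The two scenarios above can be sketched in a few lines of Python. Exact fractions make the difference between independent and dependent events easy to see:

```python
from fractions import Fraction

# Independent events: two fair coins. The probability of heads on one
# coin does not change based on the other coin's outcome, so the joint
# probability is just the product of the individual probabilities.
p_heads_a = Fraction(1, 2)
p_heads_b = Fraction(1, 2)
p_both_heads = p_heads_a * p_heads_b  # 1/2 * 1/2 = 1/4

# Dependent events: a box with 2 black and 3 red balls, drawn without
# replacement. The second draw's probability depends on the first.
p_black_first = Fraction(2, 5)
p_black_second_given_black = Fraction(1, 4)  # 1 black left among 4 balls
p_two_blacks = p_black_first * p_black_second_given_black  # 2/5 * 1/4 = 1/10

print(p_both_heads)  # 1/4
print(p_two_blacks)  # 1/10
```

Using `Fraction` keeps the arithmetic exact, so the dependence of the second draw on the first is visible directly in the numbers.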
It’s the probability of an event irrespective of the outcomes of other random variables, e.g. P(A) or P(B).
It’s the probability of two different events occurring at the same time, i.e., two (or more) simultaneous events, e.g. P(A and B) or P(A, B).
It’s the probability of one (or more) events given the occurrence of another event; in other words, it is the probability that event A occurs when a second event B is known to be true, e.g. P(A given B) or P(A | B).
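A small joint-probability table ties all three definitions together. The numbers below are purely illustrative, chosen only to demonstrate how marginal, joint and conditional probabilities relate:

```python
# Hypothetical joint distribution over two events A and B.
# Each entry is the probability of that combination of outcomes.
joint = {
    ("A", "B"): 0.2,
    ("A", "not B"): 0.3,
    ("not A", "B"): 0.1,
    ("not A", "not B"): 0.4,
}

# Marginal probability: P(A), summed over all outcomes of B.
p_a = sum(p for (a, _), p in joint.items() if a == "A")

# Joint probability: P(A and B), read directly from the table.
p_a_and_b = joint[("A", "B")]

# Conditional probability: P(A | B) = P(A and B) / P(B).
p_b = sum(p for (_, b), p in joint.items() if b == "B")
p_a_given_b = p_a_and_b / p_b

print(p_a)                      # 0.5
print(p_a_and_b)                # 0.2
print(round(p_a_given_b, 3))    # 0.667
```

Note how the conditional probability P(A | B) differs from the marginal P(A): knowing that B occurred changes what we believe about A.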
Probability Distributions depict how data points are distributed across a sample space. They show the probability of obtaining certain data points when sampling at random from the population. For example, if a population consists of the marks of students in a school, a plot with marks on the X-axis and the number of students with those marks on the Y-axis is called a Histogram; normalizing those counts into probabilities gives a Discrete Probability Distribution. The main types of Discrete Distribution are the Binomial Distribution, Poisson Distribution and Uniform Distribution.
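One way to see a discrete distribution emerge is by simulation. The sketch below builds an empirical Binomial(n=10, p=0.5) histogram — the count of heads in 10 fair coin flips, tallied over many repetitions (the trial count is an arbitrary choice):

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the simulation is reproducible

# Count heads in 10 fair coin flips, repeated many times, then tally
# how often each outcome (0..10 heads) occurred. The resulting counts
# approximate a Binomial(n=10, p=0.5) distribution.
trials = 10_000
counts = Counter(
    sum(random.random() < 0.5 for _ in range(10)) for _ in range(trials)
)

# The empirical histogram peaks near the expected value n * p = 5.
most_common_outcome, frequency = counts.most_common(1)[0]
print(most_common_outcome, frequency)
```

Printing `counts` sorted by outcome gives a text histogram of the distribution, with the bulk of the probability mass clustered around 5 heads.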
On the other hand, a Continuous Probability Distribution describes data with continuous values — in other words, a variable that can take any of an infinite set of values, like height, speed or temperature. Continuous Probability Distributions have tremendous use in Data Science and statistical analysis for checking feature importance, data distributions, statistical tests, etc.
The most well-known continuous distribution is Normal Distribution, which is also known as the Gaussian distribution or the “Bell Curve.”
Consider a normal distribution of the heights of people. Most heights cluster in the middle, where the curve is tallest, and the curve gradually falls towards the left and right extremes, which denote a lower probability of sampling those values at random.
This curve is centred at its mean, and it can be tall and slim or short and spread out. A slim curve indicates that the values are concentrated in a narrow range, while a more spread-out curve indicates a larger range of likely values. This spread is defined by the Standard Deviation.
The greater the Standard Deviation, the more spread out your data will be. Standard Deviation is the square root of another property called the Variance, which measures how much the data ‘varies’. And variance is what data is all about: variance is information — no variance, no information. The Normal Distribution also plays a crucial role in statistics through the Central Limit Theorem.
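The relationship between spread, standard deviation and variance can be checked empirically. This sketch samples two normal distributions with the same mean but different standard deviations (the parameter values are illustrative, loosely echoing the heights example):

```python
import random
import statistics

random.seed(42)  # fixed seed for reproducibility

# Two normal distributions with the same mean (170) but different
# standard deviations: one tall-and-slim, one short-and-spread-out.
tall_slim = [random.gauss(170, 5) for _ in range(10_000)]
spread_out = [random.gauss(170, 20) for _ in range(10_000)]

for sample in (tall_slim, spread_out):
    var = statistics.pvariance(sample)
    std = statistics.pstdev(sample)
    # Standard deviation is the square root of the variance.
    print(round(std, 1), round(var ** 0.5, 1))
```

With 10,000 samples, the estimated standard deviations land close to the true values of 5 and 20, and in each case the standard deviation matches the square root of the variance.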
Measures of Central Tendency are ways of summarizing a dataset with a single value. There are three main Measures of Central Tendency:
1. Mean: The mean is the arithmetic average of the values in the data/feature: the sum of all values divided by the number of values. The mean is usually the most common way to measure the centre of any data, but it can be misleading in some cases. For example, when there are many outliers, the mean shifts towards them and becomes a poor measure of the centre of your data.
2. Median: The median is the data point that lies exactly in the centre when the data is sorted in increasing or decreasing order. When the number of data points is odd, the median is simply the centremost point. When the number of data points is even, the median is the mean of the two centremost points.
3. Mode: The mode is the data point that appears most frequently in a dataset. The mode is the most robust to outliers, since it stays fixed at the most frequent point.
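All three measures, and the mean’s sensitivity to outliers, can be shown with Python’s standard `statistics` module on a small, made-up list of student marks:

```python
import statistics

# Hypothetical student marks, sorted for readability.
marks = [55, 60, 62, 65, 65, 70, 72]

print(round(statistics.mean(marks), 2))  # arithmetic average
print(statistics.median(marks))          # middle value of the sorted data
print(statistics.mode(marks))            # most frequent value: 65

# Add one extreme outlier: the mean shifts noticeably,
# while the median and mode barely move.
with_outlier = marks + [950]
print(statistics.mean(with_outlier))    # pulled far above the typical mark
print(statistics.median(with_outlier))  # still around 65
print(statistics.mode(with_outlier))    # still 65
```

A single outlier of 950 drags the mean from roughly 64 up past 170, while the median and mode stay at 65 — exactly the robustness difference described above.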