A Beginner's Guide to Logistic Regression in Python. Fundamentals of Logistic Regression, Confusion Metrics, AOC, and Solvers using scikit-learn

## Introduction

In statistics, a logistic model is used to predict a *binary dependent variable*. When we need to predict 1s and 0s in a data set, we usually rely on logistic regression or another classification algorithm. Strictly speaking, logistic regression predicts the probability of an event happening; we then compare that probability with a threshold to decide whether the event is going to happen or not. For example, if the probability of rain today is 0.73 and our threshold is 0.5, we predict that it is going to rain today, i.e. the output is 1, which means the event (in this case “it will rain today”) is TRUE.

*It is raining as I write this article down. What a coincidence!*
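The thresholding rule from the rain example can be sketched in a few lines of Python. This is only an illustration: the function name `classify` is hypothetical, and treating a probability exactly equal to the threshold as a 1 is an assumption, not something the rule itself dictates.

```python
def classify(probability, threshold=0.5):
    """Return 1 if the event is predicted to happen, otherwise 0.

    Assumption: a probability exactly equal to the threshold counts as 1.
    """
    return 1 if probability >= threshold else 0


rain_probability = 0.73  # the value used in the rain example
prediction = classify(rain_probability)
print(prediction)  # 1, i.e. "it will rain today" is TRUE
```

Raising the threshold makes the model more conservative: with `threshold=0.8`, the same probability of 0.73 would be classified as 0.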

Now the question is: what are these **probabilities** and **threshold values**? How do we model the data? Can we use linear regression instead? In a nutshell, when the dependent variable is 1 or 0, _linear regression is not an option, because it assumes that the dependent variable is continuous._ When we model data using linear regression, the predicted value of the dependent variable (Y) can be any real number, and it is hard to constrain that output to 0 and 1. That’s why in logistic regression we model the probability of an event Y given independent variables X1, X2, X3, and so on. **Confused?**
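To see why a straight line struggles with a 0/1 target, here is a minimal sketch in plain Python. The humidity values and rain labels are made-up toy data, and `fit_line` and `predict` are hypothetical helper names; the point is only that an ordinary least-squares line can produce values below 0 and above 1.

```python
# Toy, made-up data: humidity (%) and whether it rained (1) or not (0).
humidity = [10, 20, 30, 40, 50, 60, 70, 80]
rained = [0, 0, 0, 0, 1, 1, 1, 1]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(humidity, rained)


def predict(x):
    """Value of the fitted straight line at x (not constrained to [0, 1])."""
    return slope * x + intercept


print(predict(10))  # negative, so it cannot be read as a probability
print(predict(95))  # greater than 1, so it cannot be read as a probability
```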

Let’s figure this out using the same example as above. What is the **probability** that it will rain today (Y) given that it rained yesterday (X1), the temperature today is 20 degrees (X2), the month is October (X3), and the humidity is 20% (X4)? If we build a linear regression model, say Y = a\*X1 + b\*X2 + c\*X3 + d\*X4 + constant, the value of Y can range over [A, B], where A and B can be any real numbers. However, we know that *a probability must lie in [0, 1]*. This requires us to model the data so that the output always lies between 0 and 1, hence we use a sigmoid function as shown in the figure below.
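In code, the idea can be sketched as follows: compute the linear combination, then pass it through the sigmoid, 1 / (1 + e^(-z)), to squash it into the range 0 to 1. The coefficient values and the 0/1 encoding of "it rained yesterday" and "the month is October" are assumptions chosen purely for illustration; in practice the coefficients are learned from data.

```python
import math


def sigmoid(z):
    """Map any real number z to a value between 0 and 1.

    Mathematically the result is strictly inside (0, 1); in floating point,
    extreme inputs may round to 0.0 or 1.0. The two branches avoid overflow
    in math.exp for large negative z.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


# Hypothetical coefficients, for illustration only (not fitted to any data).
a, b, c, d, constant = 0.8, -0.05, 0.3, 0.02, 0.5

# Inputs from the rain example, with an assumed 0/1 encoding for X1 and X3.
x1 = 1   # it rained yesterday
x2 = 20  # temperature today, in degrees
x3 = 1   # the month is October
x4 = 20  # humidity, in percent

z = a * x1 + b * x2 + c * x3 + d * x4 + constant  # unbounded linear score
probability = sigmoid(z)                          # squashed into [0, 1]
print(round(probability, 2))  # about 0.73 with these made-up numbers
```

However large or small the linear score `z` becomes, `sigmoid(z)` stays between 0 and 1, which is what lets us interpret it as a probability.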
