Image by statistics authority of UK

The quote “Data is everywhere” couldn’t be more true. We tend to see only what is presented to us, and we often overlook the information hidden inside the data. This article shows how we can find links and correlations between the features of a given data set. Sometimes information that is vital for our analysis is hidden in a combination of individual features. Let’s jump into the task at hand and find out the power of statistics and Python.

Explore the Data

To illustrate my point, I have used the Advertising data from the textbook “An Introduction to Statistical Learning with Applications in R” (Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani). The data can be found on the authors’ website or at this link: http://faculty.marshall.usc.edu/gareth-james/ISL/data.html

import pandas as pd
import seaborn as sns
url = 'http://faculty.marshall.usc.edu/gareth-james/ISL/Advertising.csv'
ds = pd.read_csv(url, index_col=0)

Now, to visualize the relationships between the different features of our data set, we can use pairplot(), which draws a scatter plot for every pair of features and a histogram of each feature along the diagonal.

sns.pairplot(ds);

Pair wise plot for different features
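If the full grid feels like too much, one option is to ask pairplot() for only the panels you care about. The sketch below is a minimal example, not part of the original analysis: it uses a tiny made-up data frame with the same column names as the Advertising data (the numbers are invented purely so the snippet runs without a download) and draws only sales against each budget through the x_vars and y_vars arguments.

```python
import pandas as pd
import seaborn as sns

# Made-up numbers for illustration only; swap in the real `ds` loaded earlier.
ds = pd.DataFrame({
    "TV": [10, 20, 30, 40, 50],
    "radio": [5, 3, 8, 6, 9],
    "newspaper": [4, 6, 3, 5, 6],
    "sales": [3, 5, 6, 9, 10],
})

# One row of panels: sales on the y-axis, one advertising budget per column.
g = sns.pairplot(ds, x_vars=["TV", "radio", "newspaper"], y_vars=["sales"])
```

With the real data, the same call gives three scatter plots side by side, which is often easier to read than the full grid when sales is the only target of interest.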

Isn’t Python amazing? Seaborn is an amazing data visualization library, and its many functions let you use it to its full potential.

Understand the data:

The pairwise plot above can be overwhelming if you are new to this, but it is just a grid of plots of two features taken at a time, hence the name pair plot. In our data, the plot at the bottom left is a scatter plot of sales against the advertising budget for television. Now, to quickly get a glimpse of how our data is related, you can use the following command:

g = sns.pairplot(ds, kind="reg")

This fits a linear regression line to each of the pairwise scatter plots.

Linear Regression fit to data
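The regression lines are drawn for you, but as far as I know seaborn does not hand back the slope and intercept behind them. If you want those numbers for one pair of features, a simple least-squares fit with NumPy is one way to get them. The snippet below is a sketch under assumptions: the budgets and sales are made-up values, and np.polyfit is my choice for the fit, not necessarily what seaborn uses internally.

```python
import numpy as np

# Made-up TV budgets and sales, for illustration only (not the Advertising data).
tv = np.array([10.0, 50.0, 100.0, 150.0, 200.0])
sales = np.array([7.2, 9.8, 12.1, 14.9, 17.0])

# A degree-1 polynomial fit is an ordinary least-squares straight line.
slope, intercept = np.polyfit(tv, sales, deg=1)

# Sales predicted by the fitted line at each TV budget.
fitted = slope * tv + intercept
print(f"sales = {slope:.3f} * TV + {intercept:.3f}")
```

With the real data you would pass ds["TV"] and ds["sales"] instead of the two arrays.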

Another great feature is that we can see how the features correlate with each other in just two lines, with pandas computing the correlation matrix and seaborn drawing it as a heat map:

corr = ds.corr()
sns.heatmap(corr)

Heat map showing correlation

The heat map shows the extent to which these features are correlated with each other. Focus especially on the last row, which relates sales to each advertising budget. It is quite clear that sales are positively correlated with the TV advertising budget, while the correlation with the radio budget is weaker, and it is weaker still for the newspaper budget.
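Colors are handy for a first look, but sometimes you want the actual numbers. A minimal sketch of how you might pull out the sales column of the correlation matrix and rank the budgets by it is shown below. The data frame here is made up so the snippet runs on its own, which means its values say nothing about the real Advertising data.

```python
import pandas as pd

# Made-up numbers for illustration only; use the real `ds` in practice.
ds = pd.DataFrame({
    "TV": [10, 20, 30, 40, 50],
    "radio": [5, 3, 8, 6, 9],
    "newspaper": [4, 6, 3, 5, 6],
    "sales": [3, 5, 6, 9, 10],
})

# Pearson correlation of every feature with every other feature.
corr = ds.corr()

# Keep only the correlations with sales, drop sales itself, strongest first.
ranked = corr["sales"].drop("sales").sort_values(ascending=False)
print(ranked)
```

If you prefer to stay with the picture, sns.heatmap(corr, annot=True) writes each coefficient inside its cell.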

Data Beyond Perception