A hands-on guide to mastering the first baby steps in building Machine Learning applications.

[Cover image source: Pixabay]

Machine Learning is continuously evolving, and with that evolution comes a spike in demand and importance. Corporations and startups need Data Scientists and Machine Learning Engineers now more than ever to turn their troves of data into useful insight. There's probably no better time than now (except five years ago) to delve into Machine Learning, and there's no better tool for developing those applications than Python. Python has a vibrant and active community, and many of its developers came from the scientific community, giving Python a vast number of libraries for scientific computing.

In this article, we will discuss some of the features of Python’s key scientific libraries and also employ them in a proper Data Analysis and Machine Learning workflow.

What you will learn:

  • Understand what Pandas is and why it is integral to your workflow.
  • How to use Pandas to inspect your dataset
  • How to prepare the data and feature-engineer with Pandas
  • Understand why Data Visualization matters.
  • How to visualize data with Matplotlib and Seaborn.
  • How to build a statistical model with Statsmodel.
  • How to build an ML model with Scikit-Learn’s algorithms.
  • How to rank your model’s feature importances and perform feature selection.

If you’d like to go straight to code, it is here on GitHub.

Disclaimer: This article assumes that

Pandas

Pandas is an extraordinary tool for data analysis; it aims to be the most powerful and flexible open-source data analysis/manipulation tool available in any language. Let’s take a look at what Pandas is capable of:

Data Acquisition

import numpy as np
import pandas as pd
from sklearn import datasets

iris_data = datasets.load_iris()

iris_data.keys()
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

iris_data['target_names']
# array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

df_data = pd.DataFrame(iris_data['data'], columns=iris_data['feature_names'])
df_target = pd.DataFrame(iris_data['target'], columns=['species'])
df = pd.concat([df_data, df_target], axis=1)
df

[Output: the combined DataFrame of 150 rows × 5 columns]

In the cells above, you’ll notice that I imported the classic Iris dataset using Scikit-learn (we’ll explore Scikit-learn later on). I then passed the data into a Pandas DataFrame, including the column headers. I also created another DataFrame to hold the iris species, which are coded 0 for setosa, 1 for versicolor, and 2 for virginica. The final step was to concatenate the two DataFrames into a single DataFrame.
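As a side note, if you’d rather see the species names than the numeric codes, Pandas’ `Series.map` makes the conversion a one-liner. A small sketch using the same `iris_data` object as above (the `species_name` column is my own illustrative name, not part of the dataset):

```python
import pandas as pd
from sklearn import datasets

iris_data = datasets.load_iris()
df = pd.DataFrame(iris_data['data'], columns=iris_data['feature_names'])
df['species'] = iris_data['target']

# Build a lookup from integer code (0, 1, 2) to species name,
# then map every code in the 'species' column through it
code_to_name = dict(enumerate(iris_data['target_names']))
df['species_name'] = df['species'].map(code_to_name)

df['species_name'].unique()
# array(['setosa', 'versicolor', 'virginica'], dtype=object)
```

This keeps the numeric codes (which the models will need) while giving you human-readable labels for inspection and plotting.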

When working with data that fits on a single machine, Pandas is the ultimate tool. It’s like Excel, but on steroids. Just as in Excel, the basic units of operation are rows and columns: a single column of data is a Series, and a collection of Series is a DataFrame.
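To make that Series/DataFrame distinction concrete, here is a minimal sketch (the column names and values are illustrative, borrowed from the Iris schema):

```python
import pandas as pd

# A Series is a single labeled column of data
sepal_length = pd.Series([5.1, 4.9, 4.7], name='sepal length (cm)')

# A DataFrame is a collection of Series sharing the same row index
df = pd.DataFrame({
    'sepal length (cm)': [5.1, 4.9, 4.7],
    'sepal width (cm)':  [3.5, 3.0, 3.2],
})

# Selecting one column of a DataFrame gives you back a Series
type(df['sepal length (cm)'])  # <class 'pandas.core.series.Series'>

# Shape is (rows, columns), just like a spreadsheet
df.shape  # (3, 2)
```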

#machine-learning #data-science #data-analysis #python

Practical Guide into Data Analysis and Machine Learning using Python