Michael Bryan

1565853237

How To Prepare Your Dataset For Machine Learning In Python

In practice, the data we get is rarely ready to use. If it has not been processed properly, we need to prepare it before we start training a model. In this post, we will go step by step through transforming our initial data into training and test data. For this example, we use the Python libraries **scikit-learn**, **NumPy**, and **pandas**.

Content Overview

  • 1 Prepare Dataset For Machine Learning in Python
  • 2 Steps To Prepare The Data.
  • 3 1: Get The Dataset.
  • 4 2: Handle Missing Data.
  • 5 3: Encode Categorical data.
  • 6 4: Split the dataset into Training Set and Test Set.
  • 7 5: Feature Scaling

Prepare Dataset For Machine Learning in Python

We use the Python programming language to prepare a clean dataset. Preparing a dataset involves the following steps.

Steps To Prepare The Data.

  1. Get the dataset and import the libraries.
  2. Handle missing data.
  3. Encode categorical data.
  4. Split the dataset into the Training set and Test set.
  5. Feature scaling, if the columns are not on comparable scales.

So, we will go through all the steps on the dataset one by one and prepare the final dataset, on which we can apply regression and other algorithms.

1: Get The Dataset.

Okay, now we are going to use Indian Liver Patient data, and we will prepare the complete dataset from it. I am putting the link here to download the data. Remember, this is not the real dataset; it is a demo dataset that resembles the actual one. You can get the real dataset on this link.

Download File: patientData

Now, we need to create a project directory. Let us create it using the following command.

mkdir predata

Now go into the directory.

cd predata

We need to move the CSV file inside this folder.

Now, open the Anaconda Navigator software. If you are new to Anaconda, then please check out How To Get Started With Machine Learning In Python. After opening Navigator, you will see its home screen.

Now, launch the **Spyder** application and navigate to your project folder. Since we have already moved the **patientData.csv** file there, you can see it in the file list.

Okay, now we need to create one Python file called **datapre.py** and start importing the mathematical libraries.

Write the following code inside the datapre.py file, so that your file looks like this. Remember, we are using Python 3.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 25 18:52:15 2018

@author: your name
"""
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Now, select the three import statements and press **command + enter**. In the console at the bottom right, you can see that the code runs successfully.

That means we have successfully imported the libraries. If you get an error, then possibly the **numpy**, **pandas**, or **matplotlib** library is missing, and you need to install it.

2: Handle Missing Data.

In real-world data, missing values are common. If you work with a real dataset, such as patient records, some values will almost always be missing. To train the model correctly, we need to fill in those values somehow; otherwise, the model may make poor predictions. Luckily, libraries are already available for this, and we only need to use the right function. Our dataset has missing data, so we need to fill it either with mean values or by using some other technique. In this example, we use the mean to supply the values. So let us do that.

But first, let us divide the dataset into the features, X, and the target, Y.

Okay, now write the following code after importing the libraries.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 25 18:52:15 2018

@author: krunal
"""
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('patientData.csv')

Now, select the following line and press command + enter.

dataset = pd.read_csv('patientData.csv')

Okay, so we have loaded our initial dataset, and you can inspect it in Spyder.

Notice that where a value is empty, nan is displayed. We need to replace these entries with the mean values. So let us do that.
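Before filling anything, it can help to count how many values are missing in each column. The following is a minimal sketch on a tiny inline CSV; the column names and values are made up for illustration and are not taken from patientData.csv.

```python
import io

import pandas as pd

# Hypothetical sample in the assumed column order; empty fields become NaN.
csv_text = """Gender,Age,Albumin,Liver Disease
Female,65,3.3,Yes
Male,,3.2,No
Male,58,,Yes
"""

sample = pd.read_csv(io.StringIO(csv_text))

# isna() marks missing cells, and sum() counts them per column.
missing = sample.isna().sum()
print(missing)
```

The same two calls, isna() followed by sum(), can be run on the dataset variable loaded above.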

Write the following code.

X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values

So, for X, we have selected the first three columns and left out the fourth column, which will be our Y.

Remember, indexes start from 0, and -1 refers to the last column. So :-1 selects all the columns except the last one.

For Y, we have explicitly selected the fourth column, whose index is 3.
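To make the slicing rules concrete, here is a small plain-Python sketch on made-up rows. The values and the four-column layout are assumptions for illustration. List slicing follows the same conventions as iloc: :-1 keeps everything except the last item, and index 3 is the fourth item.

```python
# Hypothetical rows in the assumed order Gender, Age, Albumin, Liver Disease.
rows = [
    ["Female", 65, 3.3, "Yes"],
    ["Male", 62, 3.2, "No"],
    ["Male", 58, 3.4, "Yes"],
]

# Counterpart of dataset.iloc[:, :-1]: every column except the last one.
X_demo = [row[:-1] for row in rows]

# Counterpart of dataset.iloc[:, 3]: the fourth column (index 3).
Y_demo = [row[3] for row in rows]

print(X_demo[0])  # ['Female', 65, 3.3]
print(Y_demo)     # ['Yes', 'No', 'Yes']
```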

Okay, now we need to handle the missing data. We will use the scikit-learn library.

Write the following code.

...

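from sklearn.impute import SimpleImputer
# SimpleImputer replaces the older sklearn.preprocessing.Imputer,
# which was removed in scikit-learn 0.22. It always works column by column.
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])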

Here, we use the imputer with the 'mean' strategy to fill the missing values with the column means. Run the above lines and type X in the console to see the result. Columns 1 and 2 have missing values, but we write 1:3 because the upper bound of a slice is excluded, so 1:3 selects columns 1 and 2. The fit step computes the mean of each column, and the transform step replaces the NaN values with it.

Here, you can see that the missing values are filled with the mean of their column.
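As a rough illustration of what the 'mean' strategy computes for one column, here is a plain-Python sketch. It is not the scikit-learn implementation, and the helper name impute_column_mean and the sample values are made up for this example.

```python
import math
from statistics import mean


def impute_column_mean(column):
    """Return a copy of column with NaN entries replaced by the mean of the rest."""
    observed = [v for v in column if not math.isnan(v)]
    fill = mean(observed)
    return [fill if math.isnan(v) else v for v in column]


# Hypothetical Age column with one missing value.
ages = [65.0, float("nan"), 58.0, 72.0]
filled = impute_column_mean(ages)
print(filled)  # [65.0, 65.0, 58.0, 72.0]
```

The missing entry becomes 65.0 because that is the mean of the three observed values, 65, 58, and 72.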

So, we have handled the missing data. Now, head over to the next step.

3: Encode Categorical data.

In our dataset, there are two categorical columns.

  1. Gender
  2. Liver Disease

So, we need to encode these two columns of data.

# Encode Categorical Data

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# OneHotEncoder's categorical_features argument was removed in
# scikit-learn 0.22; ColumnTransformer now selects the column to encode.
ct = ColumnTransformer([('gender', OneHotEncoder(), [0])],
                       remainder = 'passthrough', sparse_threshold = 0)
X = ct.fit_transform(X).astype(float)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

Here, we have encoded the values of the first column, Gender. It has only two categories, **Female** and **Male**. One-hot encoding replaces that single column with two indicator columns: the first is 1 for Female and 0 for Male, and the second is the reverse. The Liver Disease column, our Y, is converted to integer labels.

Run the above lines and see the changes in the categorical data. Because one column has been replaced by two, X grows from 3 columns to 4 columns.
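The idea behind one-hot encoding can be sketched in a few lines of plain Python. This is only an illustration of the technique, not the scikit-learn implementation; the helper name one_hot and the sample values are assumptions.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 indicator per category."""
    categories = sorted(set(values))  # alphabetical order of the labels
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical Gender column.
genders = ["Female", "Male", "Male", "Female"]
categories, encoded = one_hot(genders)
print(categories)  # ['Female', 'Male']
print(encoded)     # [[1, 0], [0, 1], [0, 1], [1, 0]]
```

Each row has exactly one 1, in the position of its category, which is why a two-category column turns into two columns.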

4: Split the dataset into Training Set and Test Set.

A common choice is to split the data with a ratio of 70% training data to 30% test data. For our example, we use 80% for training and 20% for testing.

Write the following code in Spyder.

# Split the data between the Training Data and Test Data

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2
                                                    ,random_state = 0)

Run the code, and you get four more variables, for a total of seven.

Here, we have split X into X_train and X_test, and Y into Y_train and Y_test.

So, you have 80% of the data in X_train and Y_train and 20% of the data in X_test and Y_test.
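To show what an 80/20 split does to the rows, here is a plain-Python sketch under stated assumptions: the helper name split_rows and the toy data are made up, and the shuffle order will not match what train_test_split produces for the same seed.

```python
import math
import random


def split_rows(X, Y, test_size=0.2, seed=0):
    """Shuffle row indices once, then cut them into test and train parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(X))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    Y_train = [Y[i] for i in train_idx]
    Y_test = [Y[i] for i in test_idx]
    return X_train, X_test, Y_train, Y_test


# Toy data: 10 rows with one feature each and a 0/1 label.
X_toy = [[i] for i in range(10)]
Y_toy = [i % 2 for i in range(10)]
Xtr, Xte, Ytr, Yte = split_rows(X_toy, Y_toy)
print(len(Xtr), len(Xte))  # 8 2
```

The same indices are used for X and Y, so every feature row stays paired with its label after the split.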

5: Feature Scaling

Many machine learning algorithms are based on Euclidean distance, so a feature with a large range of values can dominate the others. Here, the **Albumin** and **Age** columns have entirely different ranges of values. So we need to convert those values and bring them onto a comparable scale. This is called feature scaling. So let us scale X_train and X_test.

# Feature Scaling

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

Here, we do not need to scale Y because it only holds the encoded class labels. Now run the above code and inspect the X_train variable.

Here, we can see that all the values are appropriately scaled. You can check the **X_test** variable as well.
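Standardization subtracts the column mean and divides by the column standard deviation. The plain-Python sketch below illustrates that formula and why the scaler is fitted on the training data only; the helper names and sample values are assumptions, and it uses the population standard deviation, as StandardScaler does.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return mean(column), pstdev(column)


def transform(column, mu, sigma):
    """Apply (value - mean) / std using previously learned statistics."""
    return [(v - mu) / sigma for v in column]


# Hypothetical Age values.
train_age = [20.0, 40.0, 60.0]
test_age = [40.0, 80.0]

mu, sigma = fit_scaler(train_age)            # statistics come from training data
scaled_train = transform(train_age, mu, sigma)
scaled_test = transform(test_age, mu, sigma)  # test data reuses the same statistics
print(scaled_train)
print(scaled_test)
```

After scaling, the training column has mean 0 and standard deviation 1. The test column is scaled with the training statistics, so its values are not guaranteed to have exactly that mean and deviation.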

So, we have successfully cleaned and prepared the data.

Here is the final code of our datapre.py.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 25 18:52:15 2018

@author: krunal
"""

# Importing Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing Dataset

dataset = pd.read_csv('patientData.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values

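# Handling Missing Data

from sklearn.impute import SimpleImputer
# SimpleImputer replaces the older sklearn.preprocessing.Imputer,
# which was removed in scikit-learn 0.22. It always works column by column.
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])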

# Encode Categorical Data

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# OneHotEncoder's categorical_features argument was removed in
# scikit-learn 0.22; ColumnTransformer now selects the column to encode.
ct = ColumnTransformer([('gender', OneHotEncoder(), [0])],
                       remainder = 'passthrough', sparse_threshold = 0)
X = ct.fit_transform(X).astype(float)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

# Split the data between the Training Data and Test Data

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2
                                                    ,random_state = 0)

# Feature Scaling

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

So, we have successfully prepared a dataset for machine learning in Python.

Machine learning involves complex computation, and the results depend heavily on how you obtain the data and what condition it is in. Based on the condition of the data, you preprocess it and then split it into training and test sets.

Finally, **Prepare Dataset For Machine Learning in Python** is over. Thanks for reading.

#python #machine-learning #data-science
