A Beginner’s Guide to Linear Regression in Python

This guide explains what linear regression is and how it can be implemented for both two variables and multiple variables using Scikit-Learn, one of the most popular machine learning libraries for Python. In algebra, the term “linearity” refers to a linear relationship between two or more variables. If we draw this relationship in a two-dimensional space (between two variables), we get a straight line.
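As a minimal sketch of both cases, the snippet below fits scikit-learn's `LinearRegression` to one feature and then to two features. The data is synthetic and made up purely for illustration (exact lines with no noise), so the fitted coefficients should recover the values used to generate it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two variables: synthetic data on the line y = 2x + 1 (illustrative only)
x = np.arange(10, dtype=float).reshape(-1, 1)  # one feature, shape (10, 1)
y = 2.0 * x.ravel() + 1.0

simple = LinearRegression().fit(x, y)
slope = simple.coef_[0]
intercept = simple.intercept_

# Multiple variables: synthetic data on the plane y = 3*x1 - 2*x2 + 5
rng = np.random.default_rng(0)
X_multi = rng.uniform(0, 10, size=(50, 2))
y_multi = 3.0 * X_multi[:, 0] - 2.0 * X_multi[:, 1] + 5.0

multi = LinearRegression().fit(X_multi, y_multi)

print("simple regression:", slope, intercept)
print("multiple regression:", multi.coef_, multi.intercept_)
```

With real, noisy data the coefficients would only approximate the underlying relationship.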

Basic programming concepts in any language will help but are not required to follow this tutorial.

Description

Become a Python programmer and learn one of employers' most requested skills of the 21st century!

This is the most comprehensive, yet straightforward, course for the Python programming language on Simpliv! Whether you have never programmed before, already know basic syntax, or want to learn about the advanced features of Python, this course is for you! In this course we will teach you Python 3. (Note: we also provide older Python 2 notes in case you need them.)

With over 40 lectures and more than 3 hours of video this comprehensive course leaves no stone unturned! This course includes tests, and homework assignments as well as 3 major projects to create a Python project portfolio!

This course will teach you Python in a practical manner: every lecture comes with a full coding screencast and a corresponding code notebook! Learn in whatever manner is best for you!

We will start by helping you get Python installed on your computer. Whatever your operating system, whether it's Linux, macOS, or Windows, we've got you covered!

We cover a wide variety of topics, including:

Command Line Basics

Installing Python

Running Python Code

Strings

Lists

Dictionaries

Tuples

Sets

Number Data Types

Print Formatting

Functions

Scope

Built-in Functions

Debugging and Error Handling

Modules

External Modules

Object Oriented Programming

Inheritance

Polymorphism

File I/O

Web scraping

Database Connection

Email sending

and much more!

Projects that we will complete:

Guess the number

Guess the word using speech recognition

Love Calculator

Google search in Python

Image download from a link

Click and save image using OpenCV

Ludo game dice simulator

Open Wikipedia from the command prompt

Password generator

QR code reader and generator

You will get lifetime access to over 40 lectures.

So what are you waiting for? Learn Python in a way that will advance your career and increase your knowledge, all in a fun and practical way!

Basic knowledge

Basic programming concepts in any language will help but are not required to follow this tutorial.

What will you learn

Learn to use Python professionally, learning both Python 2 and Python 3!

Create games with Python, like Tic Tac Toe and Blackjack!

Learn advanced Python features, like the collections module and how to work with timestamps!

Learn to use Object Oriented Programming with classes!

Understand complex topics, like decorators.

Understand how to use PyCharm and create .py files

Get an understanding of how to create GUIs in PyCharm!

Build a complete understanding of Python from the ground up!

Worried that you have no experience with Python? Don’t be! The Python course from Simpliv gets you writing Python programs with ease. Put object-oriented programming in a Python context and use Python to perform complicated text processing.

Description

A Note on the Python versions 2 and 3: The code-alongs in this class all use Python 2.7. Source code (with copious amounts of comments) is attached as a resource with all the code-alongs. The source code has been provided for both Python 2 and Python 3 wherever possible.

What's Covered:

Introductory Python: Functional language constructs; Python syntax; Lists, dictionaries, functions and function objects; Lambda functions; iterators, exceptions and file-handling

Database operations: Just as much database knowledge as you need to do data manipulation in Python

Auto-generating spreadsheets: Kill the drudgery of reporting tasks with xlsxwriter; automated reports that combine database operations with spreadsheet auto-generation

Text processing and NLP: Python’s powerful tools for text processing - nltk and others.

Website scraping using Beautiful Soup: Scrapers for the New York Times and Washington Post

Machine Learning: Use scikit-learn to apply machine learning techniques like KMeans clustering

Hundreds of lines of code with hundreds of lines of comments

Drill #1: Download a zip file from the National Stock Exchange of India; unzip and process to find the 3 most actively traded securities for the day

Drill #2: Store stock-exchange time-series data for 3 years in a database. On-demand, generate a report with a time-series for a given stock ticker

Drill #3: Scrape a news article URL and auto-summarize into 3 sentences

Drill #4: Scrape newspapers and a blog and apply several machine learning techniques - classification and clustering to these

Using discussion forums

Please use the discussion forums on this course to engage with other students and to help each other out. Unfortunately, much as we would like to, it is not possible for us at Loonycorn to respond to individual questions from students:-(

We're super small and self-funded with only 2 people developing technical video content. Our mission is to make high-quality courses available at super low prices.

The only way to keep our prices this low is to *NOT offer additional technical support over email or in-person*. The truth is, direct support is hugely expensive and just does not scale.

We understand that this is not ideal and that a lot of students might benefit from this additional support. Hiring resources for additional support would make our offering much more expensive, thus defeating our original purpose.

It is a hard trade-off.

Thank you for your patience and understanding!

Who is the target audience?

Yep! Folks with zero programming experience looking to learn a new skill

Machine Learning and Language Processing folks looking to apply concepts in a full-fledged programming language

Yep! Computer Science students or software engineers with no experience in Java, but experience in Python, C++ or even C#. You might need to skip over some bits, but in general the class will still have new learning to offer you :-)

Basic knowledge

No prior programming experience is needed :-)

The course will use a Python IDE (integrated development environment) called iPython from Anaconda. We will go through a step-by-step procedure on downloading and installing this IDE.

What will you learn

Pick up programming even if you have NO programming experience at all

Write Python programs of moderate complexity

Perform complicated text processing - splitting articles into sentences and words and doing things with them

Work with files, including creating Excel spreadsheets and working with zip files

Apply simple machine learning and natural language processing concepts such as classification, clustering and summarization

Understand Object-Oriented Programming in a Python context

scikit-learn is an open source Python library that implements a range of machine learning, pre-processing, cross-validation and visualization algorithms using a unified interface.

Simple and efficient tools for data mining and data analysis. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means, etc.

Accessible to everybody and reusable in various contexts.

Built on top of NumPy, SciPy, and matplotlib.

Open source, commercially usable – BSD license.

In this article, we are going to see how we can easily build a machine learning model using scikit-learn.

Scikit-learn requires NumPy and SciPy.

Before installing scikit-learn, ensure that you have NumPy and SciPy installed. Once you have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:

```
pip install -U scikit-learn
```

Let us get started with the modeling process now.

**Step 1: Load a dataset**

A dataset is nothing but a collection of data. A dataset generally has two main components:

Features: the input variables of the data, stored together as a feature matrix (commonly written X).

Response: the output variable to be predicted from the features, stored as a response vector (commonly written y).

**Loading exemplar dataset:** scikit-learn comes loaded with a few example datasets, like the iris and digits datasets for classification and the Boston house prices dataset for regression (the latter was removed in scikit-learn 1.2).

Given below is an example of how one can load an exemplar dataset:

```
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target
# store the feature and target names
feature_names = iris.feature_names
target_names = iris.target_names
# printing features and target names of our dataset
print("Feature names:", feature_names)
print("Target names:", target_names)
# X and y are numpy arrays
print("\nType of X is:", type(X))
# printing first 5 input rows
print("\nFirst 5 rows of X:\n", X[:5])
```

Output:

```
Feature names: ['sepal length (cm)','sepal width (cm)',
'petal length (cm)','petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
Type of X is: <class 'numpy.ndarray'>
First 5 rows of X:
[[ 5.1 3.5 1.4 0.2]
[ 4.9 3. 1.4 0.2]
[ 4.7 3.2 1.3 0.2]
[ 4.6 3.1 1.5 0.2]
[ 5. 3.6 1.4 0.2]]
```

**Loading external dataset:** Now, consider the case when we want to load an external dataset. For this purpose, we can use the **pandas library** to easily load and manipulate the dataset.

To install pandas, use the following pip command:

```
pip install pandas
```

In pandas, important data types are:

**Series**: Series is a one-dimensional labeled array capable of holding any data type.

**DataFrame**: It is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
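The two data types described above can be sketched as follows. The values and column names here are invented for illustration and are not taken from any dataset used in this article.

```python
import pandas as pd

# A Series: one-dimensional values with labels (the index)
temps = pd.Series([30.5, 22.0, 25.5], index=["mon", "tue", "wed"])

# A DataFrame: two-dimensional, built here from a dict of columns
# (hypothetical column names, chosen only for this sketch)
df = pd.DataFrame({
    "outlook": ["sunny", "rainy", "overcast"],
    "humidity": [85, 90, 78],
    "windy": [False, True, False],
})

print(temps["tue"])   # access a Series value by its label
print(df.shape)       # (rows, columns)
print(df.dtypes)      # each column keeps its own type

# Selecting a single column of a DataFrame yields a Series
humidity = df["humidity"]
print(type(humidity))
```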

Note: The example below reads a CSV file named weather.csv.

```
import pandas as pd
# reading csv file
data = pd.read_csv('weather.csv')
# shape of dataset
print("Shape:", data.shape)
# column names
print("\nFeatures:", data.columns)
# storing the feature matrix (X) and response vector (y)
X = data[data.columns[:-1]]
y = data[data.columns[-1]]
# printing first 5 rows of feature matrix
print("\nFeature matrix:\n", X.head())
# printing first 5 values of response vector
print("\nResponse vector:\n", y.head())
```

Output:

```
Shape: (14, 5)
Features: Index([u'Outlook', u'Temperature', u'Humidity',
u'Windy', u'Play'], dtype='object')
Feature matrix:
Outlook Temperature Humidity Windy
0 overcast hot high False
1 overcast cool normal True
2 overcast mild high True
3 overcast hot normal False
4 rainy mild high False
Response vector:
0 yes
1 yes
2 yes
3 yes
4 yes
Name: Play, dtype: object
```

**Step 2: Splitting the dataset**

One important aspect of any machine learning model is determining its accuracy. One way to do this is to train the model on the given dataset, predict the response values for that same dataset with the model, and compute the accuracy from those predictions.

But this method has a serious flaw: it measures how well the model fits data it has already seen, not how well it performs on new data. A model that simply memorizes (overfits) the training data scores highly on this measure while generalizing poorly.

A better option is to split our data into two parts: first one for training our machine learning model, and second one for testing our model.

**To summarize:**

Split the dataset into two pieces: a training set and a testing set.

Train the model on the training set.

Test the model on the testing set and evaluate how well it did.

**Advantages of train/test split:**

The model is tested on data that was not used to train it.

The response values of the testing set are known, so the predictions can be evaluated.

Testing accuracy is a better estimate of performance on out-of-sample data than training accuracy.
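One way to see why accuracy on the training data can mislead is to compare it with accuracy on held-out data. The sketch below is only an illustration: it uses a KNN classifier with `n_neighbors=1` (a choice made here, not one from this article), because each training point is then its own nearest neighbor.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

# With one neighbor, the model effectively memorizes the training set
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# score() returns the mean accuracy on the data it is given
train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)

print("accuracy on training data:", train_acc)
print("accuracy on held-out data:", test_acc)
```

The training figure says little about new data; the held-out figure is the one to report.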

Consider the example below:

```
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target
# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
# printing the shapes of the new X objects
print(X_train.shape)
print(X_test.shape)
# printing the shapes of the new y objects
print(y_train.shape)
print(y_test.shape)
```

Output:

```
(90, 4)
(60, 4)
(90,)
(60,)
```

The **train_test_split** function takes several arguments which are explained below:

X, y: the feature matrix and response vector to be split.

test_size: the proportion of the data to use for testing. Here, test_size=0.4 on 150 rows gives 150 × 0.4 = 60 test rows, matching the shapes printed above.

random_state: a seed for the random shuffling applied before the split. Passing the same value, as with random_state=1 here, produces the same split on every run.
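The effect of `test_size` and `random_state` can be checked directly. This is a small sketch on the iris data; the variable names are chosen here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 rows, 4 features

# test_size=0.4 holds out 40% of the rows: 150 * 0.4 = 60
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.4, random_state=1)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.4, random_state=1)
c_train, c_test, _, _ = train_test_split(X, y, test_size=0.4, random_state=2)

# Same seed: identical split. Different seed: a different shuffle.
same_seed_matches = np.array_equal(a_test, b_test)
different_seed_matches = np.array_equal(a_test, c_test)

print(a_train.shape, a_test.shape)
print("same seed, same split:", same_seed_matches)
print("different seed, same split:", different_seed_matches)
```

Fixing `random_state` is what makes results reproducible from one run to the next.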

**Step 3: Training the model**

Now, it's time to train a prediction model using our dataset. Scikit-learn provides a wide range of machine learning algorithms that share a unified, consistent interface for fitting, predicting, measuring accuracy, etc.

The example given below uses KNN (K nearest neighbors) classifier.

**Note**: We will not go into the details of how the algorithm works as we are interested in understanding its implementation only.

Now, consider the example below:

```
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target
# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
# training the model on training set
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# making predictions on the testing set
y_pred = knn.predict(X_test)
# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("kNN model accuracy:", metrics.accuracy_score(y_test, y_pred))
# making prediction for out of sample data
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)
# saving the model
import joblib  # sklearn.externals.joblib was removed from scikit-learn
joblib.dump(knn, 'iris_knn.pkl')
```

Output:

```
kNN model accuracy: 0.983333333333
Predictions: ['versicolor', 'virginica']
```

Important points to note from the above code:

We create a KNN classifier object that uses 3 neighbors:

```
knn = KNeighborsClassifier(n_neighbors=3)
```

The classifier is trained on the training set by passing the feature matrix X_train and the corresponding response vector y_train to fit:

```
knn.fit(X_train, y_train)
```

Predictions for the testing set are made with predict, which returns the predicted response vector y_pred:

```
y_pred = knn.predict(X_test)
```

The accuracy of the model is found by comparing the actual response values (y_test) with the predicted response values (y_pred), using accuracy_score from the metrics module:

```
print(metrics.accuracy_score(y_test, y_pred))
```

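To make predictions for out-of-sample data, pass the samples to predict in the same way as any feature matrix:

```
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)
```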

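The trained classifier can be saved to a file with joblib, so that it does not have to be trained again:

```
joblib.dump(knn, 'iris_knn.pkl')
```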

A saved classifier can be loaded again with:

```
knn = joblib.load('iris_knn.pkl')
```
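As a quick check that saving and loading preserves the model, the sketch below writes a classifier to a temporary directory and compares predictions before and after reloading. It assumes the standalone `joblib` package is available (it is installed alongside scikit-learn); the temporary path is used only so the example cleans up after itself.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Save to a temporary directory, then load the model back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_knn.pkl")
    joblib.dump(knn, path)
    restored = joblib.load(path)

# The restored model should predict exactly what the original does
round_trip_ok = np.array_equal(knn.predict(X), restored.predict(X))
print("same predictions after reload:", round_trip_ok)
```

Saved models are best loaded with the same scikit-learn version that created them.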

As we approach the end of this article, it is worth restating the main benefit of scikit-learn seen throughout: a consistent interface for fitting models and making predictions across its algorithms.
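The consistent interface can be illustrated by running several classifiers through the same fit and predict calls. Logistic regression and a decision tree are added here purely for comparison; they are not covered elsewhere in this article, and their settings are assumptions chosen for the sketch.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

# Three different algorithms, one shared interface
models = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)        # same call for every estimator
    y_pred = model.predict(X_test)     # same call for every estimator
    scores[name] = accuracy_score(y_test, y_pred)
    print(name, scores[name])
```

Swapping one algorithm for another only changes the line that constructs the model.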

