scikit-learn is an open source Python library that implements a range of machine learning, pre-processing, cross-validation and visualization algorithms using a unified interface.
• Simple and efficient tools for data mining and data analysis. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means, etc.
• Accessible to everybody and reusable in various contexts.
• Built on top of NumPy, SciPy, and matplotlib.
• Open source, commercially usable – BSD license.
In this article, we are going to see how we can easily build a machine learning model using scikit-learn.
Scikit-learn requires NumPy and SciPy as its dependencies.
Before installing scikit-learn, ensure that you have NumPy and SciPy installed. Once you have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:
pip install -U scikit-learn
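To confirm that the installation worked, a quick check like the one below can help. It only imports the package and prints its version; the version string you see depends on your environment.

```python
import sklearn

# If this import succeeds, scikit-learn and its dependencies are available.
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```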
Let us get started with the modeling process now.
Step 1: Load a dataset
A dataset is simply a collection of data. A dataset generally has two main components:
• Features (also known as predictors, inputs, or attributes): the variables that describe each sample. They are stored in a feature matrix, conventionally named X, and their names are listed in feature_names.
• Response (also known as the target, label, or output): the variable we want to predict from the features. It is stored in a response vector, conventionally named y, and its possible values are listed in target_names.
**Loading exemplar dataset:** scikit-learn comes loaded with a few example datasets, like the iris and digits datasets for classification and the diabetes dataset for regression. (The Boston house prices dataset used in older tutorials was removed in scikit-learn 1.2.)
Given below is an example of how one can load an exemplar dataset:
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target
# store the feature and target names
feature_names = iris.feature_names
target_names = iris.target_names
# printing features and target names of our dataset
print("Feature names:", feature_names)
print("Target names:", target_names)
# X and y are numpy arrays
print("\nType of X is:", type(X))
# printing first 5 input rows
print("\nFirst 5 rows of X:\n", X[:5])
Output:
Feature names: ['sepal length (cm)','sepal width (cm)',
'petal length (cm)','petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
Type of X is: <class 'numpy.ndarray'>
First 5 rows of X:
[[ 5.1 3.5 1.4 0.2]
[ 4.9 3. 1.4 0.2]
[ 4.7 3.2 1.3 0.2]
[ 4.6 3.1 1.5 0.2]
[ 5. 3.6 1.4 0.2]]
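Beyond printing the first rows, it is often useful to check the dimensions of the feature matrix and how the samples are distributed across classes. The following is a minimal sketch using NumPy's bincount; the variable name class_counts is only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# rows are samples, columns are features
print("Shape of X:", X.shape)
# one response value per sample
print("Shape of y:", y.shape)

# number of samples for each class label (0, 1, 2)
class_counts = np.bincount(y)
for name, count in zip(iris.target_names, class_counts):
    print(name, count)
```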
**Loading external dataset:** Now, consider the case when we want to load an external dataset. For this purpose, we can use the pandas library, which makes it easy to load and manipulate datasets.
To install pandas, use the following pip command:
pip install pandas
In pandas, important data types are:
Series: Series is a one-dimensional labeled array capable of holding any data type.
DataFrame: It is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
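The two data types can be sketched as follows. All values, labels and column names here are made up purely for illustration.

```python
import pandas as pd

# A Series: one-dimensional, with a label (index) for every value
temperatures = pd.Series([30.5, 22.0, 18.3], index=["mon", "tue", "wed"])

# A DataFrame: two-dimensional, and columns may have different types.
# Here it is built from a dict that maps column names to lists.
frame = pd.DataFrame({
    "city": ["Oslo", "Lima", "Pune"],
    "rainy": [True, False, True],
    "temp_c": [4.0, 19.5, 27.1],
})

print(temperatures["tue"])     # access a Series value by its label
print(frame.dtypes)            # each column keeps its own dtype
print(type(frame["temp_c"]))   # a single column is itself a Series
```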
Note: The CSV file used in example below can be downloaded from here: weather.csv
import pandas as pd
# reading csv file
data = pd.read_csv('weather.csv')
# shape of dataset
print("Shape:", data.shape)
# column names
print("\nFeatures:", data.columns)
# storing the feature matrix (X) and response vector (y)
X = data[data.columns[:-1]]
y = data[data.columns[-1]]
# printing first 5 rows of feature matrix
print("\nFeature matrix:\n", X.head())
# printing first 5 values of response vector
print("\nResponse vector:\n", y.head())
Output:
Shape: (14, 5)
Features: Index(['Outlook', 'Temperature', 'Humidity',
'Windy', 'Play'], dtype='object')
Feature matrix:
Outlook Temperature Humidity Windy
0 overcast hot high False
1 overcast cool normal True
2 overcast mild high True
3 overcast hot normal False
4 rainy mild high False
Response vector:
0 yes
1 yes
2 yes
3 yes
4 yes
Name: Play, dtype: object
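If you do not have weather.csv at hand, the same steps can be tried on an in-memory CSV. The sketch below assumes a stand-in that contains only the five rows printed above, so its shape differs from the full file; csv_text is a hypothetical name.

```python
import io
import pandas as pd

# Hypothetical stand-in for weather.csv: only the five rows shown in the output.
csv_text = """Outlook,Temperature,Humidity,Windy,Play
overcast,hot,high,False,yes
overcast,cool,normal,True,yes
overcast,mild,high,True,yes
overcast,hot,normal,False,yes
rainy,mild,high,False,yes
"""

# read_csv accepts any file-like object, not just a file path
data = pd.read_csv(io.StringIO(csv_text))

# every column except the last is a feature; the last one is the response
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

print(X.shape, y.shape)
print(y.name)
```

Selecting by position with iloc is equivalent to the data.columns[:-1] indexing used above; either form works as long as the response is the last column.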
Step 2: Splitting the dataset
An important part of building any machine learning model is measuring its accuracy. One way to do so is to train the model on the given dataset, predict the response values for that same dataset, and compare the predictions with the known values.
But this method has several flaws:
• The goal is to estimate how well the model is likely to perform on out-of-sample data, and accuracy measured on the training data does not tell us that.
• Maximizing training accuracy rewards overly complex models that memorize the training data (overfitting) and may not generalize to new data.
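The problem with evaluating on the training data can be seen in a small experiment. The sketch below assumes a kNN classifier with n_neighbors=1, chosen because it effectively memorizes its training set; the resulting score is likely to be at or very near 1.0, which says little about performance on new data.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# With n_neighbors=1, the nearest neighbor of a training sample
# is the sample itself (or an identical copy of it).
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)

# Predict on the very same data the model was trained on
y_pred = knn.predict(X)
training_accuracy = metrics.accuracy_score(y, y_pred)
print("Training accuracy:", training_accuracy)
```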
A better option is to split our data into two parts: one for training our machine learning model, and another for testing it.
To summarize:
• Split the dataset into two pieces: a training set and a testing set.
• Train the model on the training set.
• Test the model on the testing set and evaluate how well it did.
Advantages of train/test split:
• The model is tested on data it did not see during training.
• The response values of the test set are known, so the predictions can be evaluated.
• Testing accuracy is a better estimate of out-of-sample performance than training accuracy.
Consider the example below:
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target
# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
# printing the shapes of the new X objects
print(X_train.shape)
print(X_test.shape)
# printing the shapes of the new y objects
print(y_train.shape)
print(y_test.shape)
Output:
(90, 4)
(60, 4)
(90,)
(60,)
The train_test_split function takes several arguments which are explained below:
• X, y: the feature matrix and response vector that need to be split.
• test_size: the proportion of the data to place in the test set. For example, with 150 rows in X, test_size=0.4 produces a test set of 150 x 0.4 = 60 rows.
• random_state: the seed for the random shuffling applied before the split. Passing the same integer produces the same split every time, which is useful for reproducible results.
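The effect of these arguments can be checked directly. The following sketch reuses the iris data; the names X_test_again and X_test_30 are only illustrative. It also shows that test_size accepts an absolute number of samples instead of a fraction.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size as a fraction: 40% of the 150 samples go to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)
print(X_train.shape, X_test.shape)

# the same random_state reproduces exactly the same split
_, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.4, random_state=1)
same_split = np.array_equal(X_test, X_test_again)
print("Same split with the same seed:", same_split)

# test_size can also be an absolute number of samples
_, X_test_30, _, _ = train_test_split(X, y, test_size=30, random_state=1)
print(X_test_30.shape)
```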
Step 3: Training the model
Now, it's time to train a prediction model using our dataset. Scikit-learn provides a wide range of machine learning algorithms that share a consistent interface for fitting, predicting, measuring accuracy, etc.
The example given below uses the KNN (k-nearest neighbors) classifier.
Note: We will not go into the details of how the algorithm works as we are interested in understanding its implementation only.
Now, consider the example below:
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target
# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
# training the model on training set
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# making predictions on the testing set
y_pred = knn.predict(X_test)
# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("kNN model accuracy:", metrics.accuracy_score(y_test, y_pred))
# making prediction for out of sample data
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)
# saving the model (sklearn.externals.joblib was removed; import joblib directly)
import joblib
joblib.dump(knn, 'iris_knn.pkl')
Output:
kNN model accuracy: 0.983333333333
Predictions: ['versicolor', 'virginica']
Important points to note from the above code:
• We create a kNN classifier object using:
knn = KNeighborsClassifier(n_neighbors=3)
• The classifier is trained on the X_train data. Training requires the correct response values for that data, which are stored in y_train:
knn.fit(X_train, y_train)
• Next, we test the classifier on the X_test data. The knn.predict method returns the predicted responses:
y_pred = knn.predict(X_test)
• We measure the accuracy of the model by comparing the actual response values (y_test) with the predicted ones (y_pred), using metrics.accuracy_score:
print(metrics.accuracy_score(y_test, y_pred))
• To make predictions on out-of-sample data, pass the new samples to the same predict method. Each sample must have the same number of features as the training data:
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)
• To use the classifier again without retraining it, save it to a file with joblib:
joblib.dump(knn, 'iris_knn.pkl')
• To use the saved classifier later, load it back from the file:
knn = joblib.load('iris_knn.pkl')
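A save-and-load round trip can be verified with a short, self-contained sketch. It assumes a temporary directory so that no file is left behind; model_path and restored are illustrative names, and the model is trained on the full iris data only to keep the example short.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# write the model into a temporary directory and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "iris_knn.pkl")
    joblib.dump(knn, model_path)
    restored = joblib.load(model_path)

# the restored model should give the same predictions as the original
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
predictions_match = np.array_equal(knn.predict(sample), restored.predict(sample))
print("Predictions match:", predictions_match)
```

Files written by joblib are pickle-based, so they should only be loaded from sources you trust.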
As we approach the end of this article, here are some benefits of using scikit-learn that we saw along the way:
• A consistent interface across machine learning models: every estimator is trained with fit and used with predict.
• Built on NumPy and SciPy, so it works directly with NumPy arrays and alongside pandas.
• Open source and commercially usable under the BSD license.
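The consistent estimator interface used throughout this article can be sketched as follows. The three classifiers and their parameters are chosen only for illustration; the point is that the training and scoring calls stay the same whichever algorithm is plugged in.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1)

# Three different algorithms, picked only as examples
models = {
    "kNN": KNeighborsClassifier(n_neighbors=3),
    "decision tree": DecisionTreeClassifier(random_state=1),
    "logistic regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # same training call
    scores[name] = model.score(X_test, y_test)   # mean accuracy on the test set
    print(name, scores[name])
```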