Scikit-learn is one of the most widely used Python machine learning libraries. It provides a standardized, simple interface for data preprocessing and for model training, optimization, and evaluation.

The project began life as a Google Summer of Code project developed by David Cournapeau and had its first public release in 2010. Since its creation, the library has evolved into a rich ecosystem for the development of machine learning models.

Over time the project has developed many handy functions and capabilities that enhance its ease of use. In this article, I will cover 10 of the most useful features that you might not know about.

1. Scikit-learn has built-in data sets

The Scikit-learn API has a variety of both toy and real-world datasets built-in. These can be accessed with a single line of code and are extremely useful if you are either learning or just want to quickly try out a new bit of functionality.

You can also easily generate synthetic data sets using the generators make_regression() for regression, make_blobs() for clustering, and make_classification() for classification.

All the loading utilities provide the option to return the data already split into X (features) and y (target) so that they can be used directly to train a model.

# Toy regression data set loading
# (load_boston was removed in scikit-learn 1.2; load_diabetes is a built-in toy alternative)
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# Synthetic regression data set loading
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10000, noise=100, random_state=0)
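The classification and clustering generators mentioned above work the same way. A minimal sketch (the parameter values here are illustrative choices, not defaults):

```python
from sklearn.datasets import make_blobs, make_classification

# Synthetic classification data: 4 features (2 informative), 3 classes
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_classes=3,
                           n_clusters_per_class=1, random_state=0)

# Synthetic clustering data: 3 Gaussian blobs in 2 dimensions
X_blobs, y_blobs = make_blobs(n_samples=500, centers=3,
                              n_features=2, random_state=0)
```

Both return a feature matrix and a label vector, so the output can be fed straight into any estimator's fit() method.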

2. Third-party public data sets are also easily available

If you want to access a greater variety of publicly available data sets directly through Scikit-learn, the fetch_openml() function lets you import data straight from OpenML, a website that hosts over 21,000 varied data sets for use in machine learning projects.

from sklearn.datasets import fetch_openml

X, y = fetch_openml("wine", version=1, as_frame=True, return_X_y=True)

3. There are ready-made classifiers to train baseline models

When developing a machine learning model for a project, it is sensible to create a baseline model first. This should in essence be a ‘dummy’ model, such as one that always predicts the most frequently occurring class. It provides a benchmark for your ‘intelligent’ model, so that you can confirm, for example, that it performs better than random guessing.

Scikit-learn includes a **DummyClassifier()** for classification tasks and a **DummyRegressor()** for regression-based problems.
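A minimal sketch of a baseline classifier, using the built-in breast cancer data set as an illustrative example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline that always predicts the most frequent class in the training set
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
score = baseline.score(X_test, y_test)  # accuracy of the majority-class guess
```

Any real model you train should comfortably beat this accuracy; if it doesn't, something is wrong with the model or the features.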


10 Things You Didn’t Know About Scikit-Learn