Scikit-learn is one of the most widely used Python machine learning libraries. It provides a standardised and simple interface for preprocessing data and for model training, optimisation and evaluation.

The project began life as a Google Summer of Code project developed by David Cournapeau and had its first public release in 2010. Since its creation, the library has evolved into a rich ecosystem for the development of machine learning models.

Over time the project has developed many handy functions and capabilities that enhance its ease of use. In this article, I will cover 10 of the most useful features that you might not know about.


1. Scikit-learn has built-in data sets

The Scikit-learn API has a variety of both toy and real-world datasets built-in. These can be accessed with a single line of code and are extremely useful if you are either learning or just want to quickly try out a new bit of functionality.

You can also easily generate synthetic data sets using the built-in generators: make_regression() for regression, make_blobs() for clustering and make_classification() for classification.

All the loading utilities provide the option to return the data already split into X (features) and y (target) so that they can be used directly to train a model.

# Toy regression data set loading
# (load_boston was removed in scikit-learn 1.2; load_diabetes is another built-in toy regression set)
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# Synthetic regression data set loading
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10000, noise=100, random_state=0)
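The classification and clustering generators work the same way. As a quick sketch (the parameter values here are arbitrary illustrations, not recommendations):

```python
from sklearn.datasets import make_classification, make_blobs

# Synthetic binary classification data: 1,000 samples, 20 features,
# of which only 5 are informative
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# Synthetic clustering data: 3 Gaussian blobs in 2 dimensions
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=0)
```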

2. Third-party public data sets are also easily available

If you want to access a greater variety of publicly available data sets directly through Scikit-learn, there is a handy function that enables you to import data directly from the openml.org website. This website hosts over 21,000 varied data sets for use in machine learning projects.

from sklearn.datasets import fetch_openml

X, y = fetch_openml("wine", version=1, as_frame=True, return_X_y=True)

3. There are ready-made classifiers to train baseline models

When developing a machine learning model for a project, it is sensible to create a baseline model first. This should in essence be a ‘dummy’ model, such as one that always predicts the most frequently occurring class. It provides a baseline against which to benchmark your ‘intelligent’ model, so you can confirm it performs better than, for example, random or constant guessing.

Scikit-learn includes a [**DummyClassifier()**](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) for classification tasks and a **DummyRegressor()** for regression problems.
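A dummy estimator is fitted and scored like any other Scikit-learn estimator. A minimal sketch using the "most_frequent" strategy (the breast cancer data set here is just an illustrative choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline that always predicts the most frequent class in the training data
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

# Accuracy of the naive baseline; a real model should beat this
score = dummy.score(X_test, y_test)
```

Any model whose test accuracy does not clearly exceed this score is adding little over the naive baseline.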


10 Things You Didn’t Know About Scikit-Learn