Scikit-Learn provides clean datasets for you to use when building ML models. And when I say clean, I mean the type of clean that’s ready to be used to train a ML model. The best part? The datasets come with the Scikit-Learn package itself. You don’t need to download anything. Within just a few lines of code, you’ll be working with the data.

Having ready-made datasets is a huge asset because you can get straight to creating models, not having to spend time obtaining, cleaning, and transforming the data — something data scientists spend lots of their time on.

Even with all the ground work complete, you might find using the Scikit-Learn datasets a bit confusing at first. Not to worry, in few minutes you’re going to know exactly how to use the datasets and be well on your way to exploring the world of Artificial Intelligence. This article assumes you have python, scikit-learn, pandas, and Jupyter Notebook (or you may use Google Collab) installed. Let’s begin.


Intro to Scikit-Learn’s Datasets

Scikit-Learn provides seven datasets, which they call toy datasets. Don’t be fooled by the word “toy”. These datasets are powerful and serve as a strong starting point for learning ML. Here are few of the datasets and how ML can be used:

  • Boston House Prices — use ML to predict house prices based on attributes such as number of rooms, crime rate in that town
  • Breast Cancer Wisconsin (diagnostic) dataset — use ML to diagnose cancer scans as malignant (does not spread to the rest of the body) or benign (spreads to rest of the body)
  • Wine Recognition — use ML to identify the type of wine based on chemical features

In this article, we’ll be working with the “Breast Cancer Wisconsin” dataset. We will import the data and understand how to read it. As a bonus, we’ll build a simple ML model that is able to classify cancer scans either as malignant or benign. To read more about the datasets, click here for Scikit-Learn’s documentation.

#machine-learning #data-science #python #scikit-learn #ai

How to use Scikit-Learn Datasets for Machine Learning
1.40 GEEK