Kaggle is a wonderful host for data science and machine learning challenges. One of them is the Histopathologic Cancer Detection challenge, in which we are given a dataset of images and asked to create an algorithm that detects metastatic cancer in them. (It says algorithm and not explicitly a machine learning model, so if you are a genius with an alternative way to detect metastatic cancer in images, go for it!)

This article is a guide to preparing Kaggle’s dataset for this challenge. It covers the following four things:

  • How to download the dataset into your notebook from Kaggle
  • How to augment the dataset’s images
  • How to balance the target distribution and split the data into training, validation, and test sets
  • How to structure data for model training in Keras

Downloading the Dataset within a notebook

Install the kaggle package using the commands below. I ran them in Google Colab, but you should be able to run them from a Jupyter Notebook or your command line as well (in a terminal, drop the leading "!").

## Install the kaggle package to download the dataset

! pip install -q kaggle
! pip install --upgrade --force-reinstall --no-deps kaggle

To download the data through Kaggle’s API with your account, you need to do the following two things:

  • Go to your account settings, scroll to the API section, click **Expire API Token** to remove previous tokens (in case you have any), then click **Create New API Token**. This will download a ‘kaggle.json’ file.
  • Upload this file to your project directory.
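Before going further, it can help to confirm that the upload worked and the file is intact. Below is a minimal sketch of such a check. The function name check_kaggle_token is my own (hypothetical, not part of the kaggle package), and it assumes the token is a JSON object with "username" and "key" fields, which is what the downloaded file normally contains.

```python
import json
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return True if the file at `path` looks like a Kaggle API token.

    Hypothetical helper: it only checks the file's shape and never
    contacts Kaggle, so a True result does not prove the token is valid.
    """
    token_path = Path(path)
    if not token_path.is_file():
        return False
    try:
        token = json.loads(token_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return False
    # Assumption: a token is a JSON object with non-empty string
    # values for "username" and "key".
    return isinstance(token, dict) and all(
        isinstance(token.get(field), str) and bool(token[field])
        for field in ("username", "key")
    )


if __name__ == "__main__":
    print("Token looks OK" if check_kaggle_token() else "Token missing or malformed")
```

If the check fails, the usual cause is that the file was uploaded to a different directory or was only partially uploaded.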

Then run the code below, which uses the JSON file to grant access and then downloads the dataset.

! mkdir -p ~/.kaggle

! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle competitions download -c histopathologic-cancer-detection
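If you would rather do the token setup in Python than with shell commands, the first three steps above can be sketched as follows. This is only an illustration: install_kaggle_token and its config_dir parameter are names I made up, and the default location assumes the kaggle client reads its token from the .kaggle folder in your home directory, as the commands above do. The download itself is still done with the kaggle command.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(src="kaggle.json", config_dir=None):
    """Copy the API token into the Kaggle config directory.

    Hypothetical helper mirroring the mkdir / cp / chmod steps.
    `config_dir` defaults to ~/.kaggle (an assumption about where
    the kaggle client looks for the token).
    """
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    # Create the directory if needed; safe to run more than once.
    config_dir.mkdir(parents=True, exist_ok=True)
    dest = config_dir / "kaggle.json"
    shutil.copyfile(src, dest)
    # Owner read/write only, the same as `chmod 600`.
    os.chmod(dest, 0o600)
    return dest
```

Calling install_kaggle_token() from the directory that holds kaggle.json should leave you in the same state as the shell version, and rerunning it does not fail when the folder already exists.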


Data Preparation Guide for detecting Histopathologic Cancer Detection