Data Preparation Guide for Histopathologic Cancer Detection

Kaggle is a wonderful host for data science and machine learning challenges, and one of them is the Histopathologic Cancer Detection challenge. The challenge provides a dataset of images, and the task is to create an algorithm that detects metastatic cancer in them. (It says algorithm, not explicitly a machine learning model, so if you are a genius with an alternate way to detect metastatic cancer in images, go for it!)

This article is a guide to preparing Kaggle’s dataset. It covers the following four things:

How to download the dataset into your notebook from Kaggle
How to augment the dataset’s images
How to balance target distributions and split the data into training, validation, and test sets
How to structure data for model training in Keras

Downloading the Dataset within a notebook

Install the kaggle package using the commands below. I ran them in Google Colab, but they should work just as well in a Jupyter notebook or, without the leading !, from your command line.

## Install the kaggle package to download the dataset

! pip install -q kaggle
! pip install --upgrade --force-reinstall --no-deps kaggle
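After a forced reinstall it can be reassuring to confirm that the package is actually visible to the notebook's Python environment. The snippet below is a minimal sketch of such a check using only the standard library; the helper name `installed_version` is my own and not part of the kaggle package.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the kaggle version, or a hint that the install did not succeed.
version = installed_version("kaggle")
print(f"kaggle {version}" if version else "kaggle is not installed")
```

If the message says the package is not installed, rerunning the install cell (or restarting the runtime) is usually the first thing to try.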

To download the data through Kaggle’s API with your account, you need to do the following two things.

Go to your account settings, scroll to the API section, and click **Expire API Token** to remove any previous tokens. Then click **Create New API Token**, which downloads a ‘kaggle.json’ file.
Upload this file to your project directory.
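Before going further, it can save some confusion to check that the uploaded file really is a usable token. A kaggle.json file is a small JSON object with `username` and `key` fields, so a quick sanity check might look like the sketch below. The helper name `check_kaggle_token` is hypothetical, and the default path assumes the file sits in the current working directory.

```python
import json
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return the username stored in the token file, raising if the file looks wrong."""
    token_path = Path(path)
    if not token_path.is_file():
        raise FileNotFoundError(
            f"{token_path} not found; upload it to the project directory first"
        )
    creds = json.loads(token_path.read_text())
    # A valid token carries both the account name and the secret key.
    missing = {"username", "key"} - set(creds)
    if missing:
        raise ValueError(f"{token_path} is missing fields: {sorted(missing)}")
    return creds["username"]


# Usage (assumes kaggle.json is in the current directory):
# print("Token belongs to:", check_kaggle_token())
```

The function only returns the username, so the secret key never ends up printed in notebook output.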

Then run the code below, which copies the JSON file to the location the Kaggle API reads it from, restricts its permissions, and downloads the dataset.

! mkdir -p ~/.kaggle

! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle competitions download -c histopathologic-cancer-detection
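The download arrives as a zip archive, so it has to be extracted before the images can be used. Here is a minimal sketch of that step with Python's standard `zipfile` module. The helper name `extract_archive` and the `data` destination folder are my own choices, and the archive name in the usage comment is an assumption based on the competition slug, so check the actual filename in your working directory.

```python
import zipfile
from pathlib import Path


def extract_archive(archive, dest="data"):
    """Unzip `archive` into `dest` and return the sorted names of the top-level entries."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_path)
    return sorted(p.name for p in dest_path.iterdir())


# Usage (archive name is an assumption; adjust it to the file you actually downloaded):
# print(extract_archive("histopathologic-cancer-detection.zip"))
```

Printing the returned names is a quick way to see which folders and files the archive contained before moving on.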


towardsdatascience.com
