Software Engineering for Data Scientist — Test-Driven Development (Example)

This is the fourth article in the series. For the list of articles in the series, please check the section Previous Articles.

The previous article, Software Engineering for Data Scientist — Test-Driven Development, is available at https://medium.com/@jaganadhg/software-engineering-for-data-scientist-test-driven-development-65f1cdf52d58

Introduction

In the previous article, we discussed test-driven development in Data Science. That article introduced two specific test cases: checking the model against a dummy/guess machine and checking prediction consistency. The current article is a quick tutorial on the same topic.

In this tutorial, we build a binary classifier. The data used in this exercise is Pokemon data taken from Kaggle [1]. We will build a Random Forest classifier, compare it against a guessing machine (using the ROC AUC score), and check how consistent its predictions are.
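The two checks described above can be sketched as pytest-style tests. This is only a minimal illustration, not the tutorial's actual code: it uses synthetic data from `make_classification` as a stand-in for the Pokemon data, and the helper name `fit_models`, the class weights, and the `stratified` guessing strategy are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Pokemon data (assumption: imbalanced binary target).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)


def fit_models():
    """Fit the Random Forest and the guessing machine on the same training data."""
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    # "stratified" guesses labels according to the training class frequencies.
    guesser = DummyClassifier(strategy="stratified", random_state=42)
    guesser.fit(X_train, y_train)
    return model, guesser


def test_model_beats_guessing_machine():
    model, guesser = fit_models()
    model_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    guess_auc = roc_auc_score(y_test, guesser.predict_proba(X_test)[:, 1])
    assert model_auc > guess_auc


def test_prediction_is_consistent():
    model, _ = fit_models()
    # The same fitted model must return identical labels for identical input.
    first = model.predict(X_test)
    second = model.predict(X_test)
    assert np.array_equal(first, second)
```

Saved in a file such as `test_model.py` (a hypothetical name), these tests can be run with `pytest test_model.py`.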

Dataset

The data for this exercise is ‘Pokemon for Data Mining and Machine Learning’ [1]. The dataset includes Pokemon up to generation 6. There are 21 attributes in this data, including an identity attribute (‘Number’). We selected the following attributes for the exercise: ‘isLegendary’, ‘Generation’, ‘Type_1’, ‘Type_2’, ‘HP’, ‘Attack’, ‘Defense’, ‘Sp_Atk’, ‘Sp_Def’, ‘Speed’, ‘Color’, ‘Egg_Group_1’, ‘Height_m’, ‘Weight_kg’, ‘Body_Style’. The attribute ‘Generation’ was used for splitting the data and then dropped from the dataset. The attribute ‘isLegendary’ is the target. There are five categorical attributes: ‘Egg_Group_1’, ‘Body_Style’, ‘Color’, ‘Type_1’, and ‘Type_2’. We one-hot encoded these attributes before the train/validation/test split.
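As a rough sketch of this preparation step, the snippet below one-hot encodes the categorical columns and then splits on ‘Generation’ before dropping it. The tiny DataFrame holds made-up values, only a few of the selected columns are shown, and the helper `split_by_generation` and the choice of which generations go into which split are assumptions for illustration; the text above does not state the actual split.

```python
import pandas as pd

# Tiny hypothetical sample; the values are made up for illustration only.
pokemon = pd.DataFrame({
    "isLegendary": [False, False, True, False, True, False],
    "Generation": [1, 1, 2, 3, 5, 6],
    "Type_1": ["Grass", "Fire", "Psychic", "Water", "Dragon", "Fairy"],
    "Color": ["Green", "Red", "White", "Blue", "Black", "Pink"],
    "HP": [45, 39, 106, 50, 100, 78],
})

# One-hot encode before splitting so every split shares the same columns.
categorical = ["Type_1", "Color"]
encoded = pd.get_dummies(pokemon, columns=categorical)


def split_by_generation(frame, generations):
    """Select rows by generation, drop the split column, separate the target."""
    part = frame[frame["Generation"].isin(generations)]
    part = part.drop(columns=["Generation"])
    return part.drop(columns=["isLegendary"]), part["isLegendary"]


# Assumed split: generations 1-4 for training, 5 for validation, 6 for test.
X_train, y_train = split_by_generation(encoded, [1, 2, 3, 4])
X_valid, y_valid = split_by_generation(encoded, [5])
X_test, y_test = split_by_generation(encoded, [6])
```

Encoding before the split keeps the feature columns aligned across the three sets, even when a category appears in only one of them.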

Software

For this tutorial, we will use the following Python packages.

pandas 1.1.1

scikit-learn (sklearn) 0.23.2

pytest 6.1.0

ipytest 0.9.1

numpy 1.19.1

toolz 0.11.1
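To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch using only the standard library (`importlib.metadata`, Python 3.8+); the names `PINNED` and `installed_versions` are hypothetical, and note that sklearn is distributed on PyPI as `scikit-learn`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this article, keyed by distribution name.
PINNED = {
    "pandas": "1.1.1",
    "scikit-learn": "0.23.2",
    "pytest": "6.1.0",
    "ipytest": "0.9.1",
    "numpy": "1.19.1",
    "toolz": "0.11.1",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, have in installed_versions(PINNED).items():
    print(f"{name}: listed {PINNED[name]}, installed {have or 'not installed'}")
```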


towardsdatascience.com
