This is the fourth article in the series. For a list of the articles in the series, see the Previous Articles section.

The previous article, Software Engineering for Data Scientist — Test-Driven Development, is available at https://medium.com/@jaganadhg/software-engineering-for-data-scientist-test-driven-development-65f1cdf52d58

Introduction

In the previous article, we discussed test-driven development in Data Science. It introduced two specific test cases: checking a model against a dummy (guessing) machine, and checking prediction consistency. The current article is a quick tutorial on the same topic.

In this tutorial, we build a binary classifier. The data used in this exercise is Pokemon data taken from Kaggle [1]. We will build a Random Forest classifier, compare it against a guessing machine using the ROC AUC score, and check how consistent its predictions are.
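As a rough preview of the two checks, here is a minimal sketch written as pytest-style tests. It does not use the Pokemon data: the synthetic, imbalanced dataset from `make_classification`, the `most_frequent` strategy for the guessing machine, and the helper name `auc` are all assumptions made for illustration, and "consistency" is read here simply as "the same input gives the same prediction on repeated calls".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (an assumption for this sketch).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# The model under test and the "guessing machine" it has to beat.
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
guesser = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)


def auc(clf):
    """ROC AUC on the held-out split, using the positive-class probability."""
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])


def test_model_beats_guessing_machine():
    # A useful model should score higher than the guessing machine.
    assert auc(model) > auc(guesser)


def test_prediction_consistency():
    # Scoring the same rows twice with the same fitted model should agree.
    first = model.predict(X_test)
    second = model.predict(X_test)
    assert np.array_equal(first, second)
```

Saved in a file such as `test_model.py` (a hypothetical name), these tests can be run with `pytest`. A stricter consistency check could also refit the model with the same seed and compare predictions across fits.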

Dataset

The data for this exercise is 'Pokemon for Data Mining and Machine Learning' [1]. The dataset includes Pokemon up to generation 6. There are 21 attributes in this data, including an identity attribute ('Number'). We selected the following attributes for the exercise: 'isLegendary', 'Generation', 'Type_1', 'Type_2', 'HP', 'Attack', 'Defense', 'Sp_Atk', 'Sp_Def', 'Speed', 'Color', 'Egg_Group_1', 'Height_m', 'Weight_kg', 'Body_Style'. The attribute 'Generation' was used for splitting the data and then dropped from the dataset. The attribute 'isLegendary' is the target. There are five categorical attributes: 'Egg_Group_1', 'Body_Style', 'Color', 'Type_1', and 'Type_2'. We one-hot encoded these attributes before the train/validation/test split.
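The preparation steps described above could look roughly like the sketch below. The tiny DataFrame is made up and only mimics a few of the selected columns, and the split rule (generation 1 for training, generation 2 for testing) is a hypothetical choice, since the text does not say which generations go into which split.

```python
import pandas as pd

# Made-up rows that mimic a few of the selected columns (illustrative values only).
df = pd.DataFrame({
    "isLegendary": [False, False, True, False, True, False],
    "Generation": [1, 1, 1, 2, 2, 2],
    "Type_1": ["Grass", "Fire", "Psychic", "Water", "Fire", "Grass"],
    "Color": ["Green", "Red", "Purple", "Blue", "Red", "Green"],
    "HP": [45, 39, 106, 50, 115, 60],
    "Attack": [49, 52, 110, 65, 115, 62],
})

categorical = ["Type_1", "Color"]

# One-hot encode before splitting so that every split shares the same columns.
encoded = pd.get_dummies(df, columns=categorical)

# Hypothetical split rule: use 'Generation' to split, then drop it.
train = encoded[encoded["Generation"] == 1].drop(columns="Generation")
test = encoded[encoded["Generation"] == 2].drop(columns="Generation")

# Separate the target ('isLegendary') from the features.
X_train, y_train = train.drop(columns="isLegendary"), train["isLegendary"]
X_test, y_test = test.drop(columns="isLegendary"), test["isLegendary"]
```

Encoding before the split is what keeps `X_train` and `X_test` aligned: a category that appears in only one split still gets a column in both.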

Software

For this tutorial, we will use the following Python packages.

pandas 1.1.1

sklearn 0.23.2

pytest 6.1.0

ipytest 0.9.1

numpy 1.19.1

toolz 0.11.1
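To compare a local environment against this list, one option is a small check like the sketch below. The `EXPECTED` dictionary and the `installed_versions` helper are names invented for this example. Note that `sklearn` is installed under the distribution name `scikit-learn`, which is the name the lookup needs.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed above; sklearn is distributed as "scikit-learn".
EXPECTED = {
    "pandas": "1.1.1",
    "scikit-learn": "0.23.2",
    "pytest": "6.1.0",
    "ipytest": "0.9.1",
    "numpy": "1.19.1",
    "toolz": "0.11.1",
}


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, have in installed_versions(EXPECTED).items():
        print(f"{name}: expected {EXPECTED[name]}, installed {have}")
```

Newer versions of these packages will probably work as well, although behavior and defaults can differ between releases.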


Software Engineering for Data Scientist — Test-Driven Development (Example)