This is the fourth article in the series. For the full list of articles in the series, see the Previous Articles section.
The previous article, "Software Engineering for Data Scientist - Test-Driven Development", is available at https://medium.com/@jaganadhg/software-engineering-for-data-scientist-test-driven-development-65f1cdf52d58
In the previous article, we discussed test-driven development in data science. It introduced two specific test cases: checking a model against a dummy (guessing) machine, and checking prediction consistency. The current article is a quick tutorial that puts both tests into practice.
In this tutorial, we build a binary classifier using the Pokemon data from Kaggle [1]. We train a Random Forest classifier and test two things: whether it beats a guessing machine on ROC AUC score, and whether its predictions are consistent.
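As a rough illustration of the two tests described above, here is a minimal sketch. It makes several assumptions that are not from the article: it uses a synthetic imbalanced dataset from `make_classification` as a stand-in for the Pokemon data, it uses scikit-learn's `DummyClassifier` with the `prior` strategy as the guessing machine, and it interprets "prediction consistency" as "the same input gives the same output on repeated calls". The function names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (assumption, not the Pokemon set).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.9, 0.1], class_sep=1.5, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# The model under test and the guessing machine it must beat.
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
guesser = DummyClassifier(strategy="prior").fit(X_train, y_train)


def test_model_beats_guessing_machine():
    # Compare ROC AUC computed from the positive-class probabilities.
    model_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    guess_auc = roc_auc_score(y_test, guesser.predict_proba(X_test)[:, 1])
    assert model_auc > guess_auc


def test_prediction_consistency():
    # Repeated calls on the same input should return identical labels.
    first = model.predict(X_test)
    for _ in range(5):
        assert np.array_equal(first, model.predict(X_test))
```

Both functions follow the pytest naming convention, so pytest (or ipytest inside a notebook) should be able to collect them as they are.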
The data for this exercise is ‘Pokemon for Data Mining and Machine Learning’ [1]. The dataset includes Pokemon up to generation 6. There are 21 attributes in this data, including an identity attribute (‘Number’). We selected the following attributes for the exercise: ‘isLegendary’, ‘Generation’, ‘Type_1’, ‘Type_2’, ‘HP’, ‘Attack’, ‘Defense’, ‘Sp_Atk’, ‘Sp_Def’, ‘Speed’, ‘Color’, ‘Egg_Group_1’, ‘Height_m’, ‘Weight_kg’, ‘Body_Style’. The attribute ‘Generation’ was used for splitting the data and then dropped from the dataset. The attribute ‘isLegendary’ is the target. Five of the attributes are categorical: ‘Egg_Group_1’, ‘Body_Style’, ‘Color’, ‘Type_1’, and ‘Type_2’. We one-hot encoded these attributes before the train/validation/test split.
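The preparation steps above could look roughly like the sketch below. It is an illustration under stated assumptions: the rows are made-up toy values (only the column names come from the dataset), only a subset of the selected attributes is shown for brevity, and the assignment of generations to train, validation, and test is a guess, since the article does not say which generations go where. The helper `split_by_generation` is a hypothetical name.

```python
import pandas as pd

# Toy rows with made-up values; only the column names come from the article.
toy = pd.DataFrame({
    "isLegendary": [False, False, True, False, True, False],
    "Generation": [1, 1, 2, 3, 5, 6],
    "Type_1": ["Grass", "Fire", "Water", "Fire", "Dragon", "Grass"],
    "Type_2": ["Poison", None, None, "Flying", "Fire", None],
    "Color": ["Green", "Red", "Blue", "Red", "White", "Green"],
    "HP": [45, 39, 100, 60, 100, 56],
})

# Subset of the five categorical attributes, kept short for the example.
CATEGORICAL = ["Type_1", "Type_2", "Color"]

# One-hot encode before splitting so every split shares the same columns.
# Rows with a missing Type_2 simply get zeros in all Type_2_* columns.
encoded = pd.get_dummies(toy, columns=CATEGORICAL)


def split_by_generation(frame, generations):
    """Select rows by generation, then drop 'Generation' and separate the target."""
    part = frame[frame["Generation"].isin(generations)].drop(columns=["Generation"])
    return part.drop(columns=["isLegendary"]), part["isLegendary"].astype(int)


# Assumed split: generations 1-4 train, 5 validation, 6 test.
X_train, y_train = split_by_generation(encoded, [1, 2, 3, 4])
X_valid, y_valid = split_by_generation(encoded, [5])
X_test, y_test = split_by_generation(encoded, [6])
```

Encoding before the split is a deliberate choice here: a category that appears only in a later generation still gets a column in the training frame, so all three splits line up.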
For this tutorial, we will use the following Python packages.
pandas 1.1.1
scikit-learn (imported as sklearn) 0.23.2
pytest 6.1.0
ipytest 0.9.1
numpy 1.19.1
toolz 0.11.1