Software Engineering for Data Scientist — Test-Driven Development (Example)

This is the fourth article in the series. For the list of articles in the series, see the Previous Articles section.

The previous article, Software Engineering for Data Scientist — Test-Driven Development, is available at https://medium.com/@jaganadhg/software-engineering-for-data-scientist-test-driven-development-65f1cdf52d58

Introduction

In the previous article, we discussed test-driven development in data science. Two specific test cases introduced there were checking the model against a dummy (guessing) machine and checking prediction consistency. The current article is a quick tutorial on the same topic.

In this tutorial, we build a binary classifier. The data used in this exercise is Pokemon data taken from Kaggle [1]. We will build a Random Forest classifier, compare it against a guessing machine using the ROC AUC score, and check how consistent its predictions are.
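The two checks can be sketched as pytest-style test functions. This is a minimal sketch under stated assumptions, not the code of this tutorial: synthetic data from `make_classification` stands in for the Pokemon data, scikit-learn's `DummyClassifier` plays the guessing machine, and the helper name `fit_models` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Pokemon data (assumption, not the real dataset).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)


def fit_models():
    """Fit the Random Forest and the guessing machine on the same data."""
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    guesser = DummyClassifier(strategy="stratified", random_state=42)
    return model.fit(X_train, y_train), guesser.fit(X_train, y_train)


def test_model_beats_guessing_machine():
    """The model's ROC AUC should exceed that of the guessing machine."""
    model, guesser = fit_models()
    model_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    guess_auc = roc_auc_score(y_test, guesser.predict_proba(X_test)[:, 1])
    assert model_auc > guess_auc


def test_prediction_is_consistent():
    """Repeated predictions on the same input should be identical."""
    model, _ = fit_models()
    first = model.predict(X_test)
    for _ in range(5):
        assert np.array_equal(first, model.predict(X_test))
```

Saved in a file whose name starts with `test_`, these functions are collected by running `pytest`. The `stratified` strategy guesses labels at random according to the class frequencies seen in training, which is one reasonable reading of a guessing machine; other strategies would also work.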

Dataset

The data for this exercise is ‘Pokemon for Data Mining and Machine Learning’ [1]. The dataset includes Pokemon up to generation 6. There are 21 attributes in this data, including the identity attribute (‘Number’). We selected the following attributes for the exercise: ‘isLegendary’, ‘Generation’, ‘Type_1’, ‘Type_2’, ‘HP’, ‘Attack’, ‘Defense’, ‘Sp_Atk’, ‘Sp_Def’, ‘Speed’, ‘Color’, ‘Egg_Group_1’, ‘Height_m’, ‘Weight_kg’, ‘Body_Style’. The attribute ‘Generation’ was used for splitting the data and then dropped from the dataset. The attribute ‘isLegendary’ is the target. There are five categorical attributes: ‘Egg_Group_1’, ‘Body_Style’, ‘Color’, ‘Type_1’, and ‘Type_2’. We one-hot encoded these attributes before the train/validation/test split.
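The preparation steps described above can be sketched with pandas as follows. This is an illustration only: the rows are made up rather than taken from the Kaggle file, only a subset of the selected columns is shown, the helper `take` is hypothetical, and the mapping of generations to train/validation/test is an assumption, since the text does not state which generations go where.

```python
import pandas as pd

# Made-up rows using a subset of the selected column names (illustrative only).
data = pd.DataFrame({
    "isLegendary": [False, False, True, False, False, True],
    "Generation": [1, 2, 3, 4, 5, 6],
    "Type_1": ["Grass", "Fire", "Psychic", "Water", "Fire", "Dragon"],
    "Color": ["Green", "Red", "Blue", "Blue", "Red", "Black"],
    "HP": [45, 39, 106, 44, 65, 126],
    "Attack": [49, 52, 110, 48, 63, 131],
})

categorical = ["Type_1", "Color"]

# One-hot encode before splitting so every split has identical columns.
encoded = pd.get_dummies(data, columns=categorical)


def take(generations):
    """Select rows by generation, then drop the 'Generation' column."""
    part = encoded[encoded["Generation"].isin(generations)]
    return part.drop(columns=["Generation"])


# Assumed split rule: generations 1-4 train, 5 validation, 6 test.
train, valid, test = take([1, 2, 3, 4]), take([5]), take([6])

# Separate the features from the 'isLegendary' target.
X_train, y_train = train.drop(columns=["isLegendary"]), train["isLegendary"]
```

Encoding before the split means a category that appears in only one generation still gets a column in every split, so the model sees the same feature layout at training and prediction time.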

Software

For this tutorial, we will use the following Python packages.

pandas 1.1.1

scikit-learn 0.23.2 (imported as sklearn)

pytest 6.1.0

ipytest 0.9.1

numpy 1.19.1

toolz 0.11.1
