Before training a machine learning or deep learning model, it is important to clean, pre-process and analyse the dataset at hand. Handling missing values, converting text data into numbers and similar steps are all part of the pre-processing phase. More often than not, these steps are repetitive and monotonous. Although there are tools that automate them, they tend to behave like a black box and give little intuition about how they changed the data. To address this problem, a Python library called dabl (Data Analysis Baseline Library) was introduced. Dabl can be used to automate many of the repetitive tasks in the early stages of model development. The library is quite recent, and its latest version was released earlier this year. The number of available features is still small, but development is progressing at a good pace.

In this article, we will use this tool for data pre-processing, visualisation and analysis as well as model development. Let’s get started.

Data pre-processing

To use dabl for data analysis, we first need to install the package. You can install it with pip:

pip install dabl
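Before moving on, it can help to confirm that the packages used in this article are visible to your Python environment. The snippet below is a minimal sketch using only the standard library; the helper name is_installed is made up for this illustration and is not part of dabl.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the import system can locate a top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Packages imported later in this article
for name in ("dabl", "pandas", "numpy"):
    if is_installed(name):
        print(f"{name}: found")
    else:
        print(f"{name}: missing (try: pip install {name})")
```

If any package is reported as missing, install it with pip in the same environment and run the check again.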

Once the installation is done, let us pick a dataset. I have chosen the diabetes dataset, a sample dataset that you can download from Kaggle. It is small, which makes it easy to see how dabl works.

After downloading the dataset, let us import the required libraries and look at the data.

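import numpy as np
import pandas as pd
import dabl

# Load the dataset (assumes diabetes.csv is in the working directory)
db_data = pd.read_csv('diabetes.csv')
# Preview the first five rows
db_data.head()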


Usually, after a first look at the dataset, you would start cleaning it: finding rows with missing values, spotting erroneous entries and checking the data type of each column. dabl makes this easier by automating these steps.
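For comparison, the manual checks mentioned above might look roughly like this in plain pandas. This is only a sketch: the small DataFrame is made up to stand in for diabetes.csv, and its column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real dataset (illustrative values only)
df = pd.DataFrame({
    "Glucose": [148, 85, np.nan, 89],
    "BMI": [33.6, 26.6, 23.3, np.nan],
    "Outcome": [1, 0, 1, 0],
})

# Count missing values in each column
missing_per_column = df.isna().sum()

# Select the rows that have at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]

# Inspect the data type pandas inferred for each column
column_types = df.dtypes

print(missing_per_column)
print(column_types)
```

Writing these checks by hand for every new dataset is the repetitive part that dabl aims to take over.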

db_clean = dabl.clean(db_data, verbose=1)


The output lists the feature types detected in the dataset. These types mean the following.

Continuous: The number of columns that contain continuous values, including columns with high cardinality.

Dirty_float: The number of float columns in which some entries are strings.
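To make the idea of a dirty float column concrete, here is a small pandas sketch of a column that is mostly numeric but contains a few string entries. It is not how dabl detects the type internally; the helper dirty_float_report and the sample values are invented for this illustration.

```python
import pandas as pd


def dirty_float_report(series: pd.Series) -> dict:
    """Separate entries that parse as numbers from entries that do not."""
    # Values that cannot be parsed become NaN instead of raising an error
    parsed = pd.to_numeric(series, errors="coerce")
    # Entries that were present in the data but are not numeric
    non_numeric = series[parsed.isna() & series.notna()]
    return {
        "numeric_fraction": float(parsed.notna().mean()),
        "non_numeric_values": sorted(non_numeric.unique()),
    }


# A made-up column: mostly floats, with two string placeholders mixed in
readings = pd.Series(["7.1", "6.8", "missing", "5.9", "n/a", "6.2"])
report = dirty_float_report(readings)
print(report)
```

A column like this is read by pandas as text, even though most of its entries are valid floats, which is why it deserves a category of its own.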

