A Powerful Framework to Configure your Data Science Projects

Motivation

It is fun to experiment with different feature engineering methods and machine learning models, but you will most likely need to adjust your feature engineering methods and tune your models before getting a good result.

For example, in the speed dating data, you might want to drop the iid, id, idg, wave, and career columns because they do not seem to be important features. But after doing more research on the data, you realize that career could be an important feature for predicting whether two people will go on another date. So you decide not to drop the career column.

If you hard-code these columns, that is, embed the data directly in the source code of a script, like below

columns = ['iid', 'id', 'idg', 'wave', 'career']
df.drop(columns, axis=1, inplace=True)

and your file is long, it might take a while for you to find the code that specifies which columns to drop. Wouldn’t it be great if you could instead change the columns in a simple text file that contains only information about the data, with no other Python code, like this?

variables:
  drop_features: ['iid', 'id', 'idg', 'wave', 'position', 'positin1', 'pid', 'field', 'from', 'career']

  ## categorical variables to transform to numerical variables
  numerical_vars_from_numerical: ['income', 'mn_sat', 'tuition']

  ## categorical variables to encode
  categorical_vars: ['undergra', 'zipcode']
  categorical_label_extraction: ['zipcode']
  categorical_onehot: ['undergra']

This is when you need a configuration file.
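
To make the idea concrete, here is a minimal sketch of how a script might read the column list from such a file instead of hard-coding it. It assumes PyYAML and pandas are installed. The config text is inlined and shortened so the example runs on its own, and the two-row DataFrame is made-up illustration data, not the real speed dating dataset.

```python
import pandas as pd
import yaml  # PyYAML; assumed to be installed

# Shortened copy of the config, inlined so the sketch is self-contained.
# In a real project this text would live in a file such as config.yaml.
CONFIG_TEXT = """
variables:
  drop_features: ['iid', 'id', 'idg', 'wave', 'career']
"""

# Parse the YAML text into plain Python dicts and lists
config = yaml.safe_load(CONFIG_TEXT)
columns = config["variables"]["drop_features"]

# Made-up rows, for illustration only
df = pd.DataFrame({
    "iid": [1, 2],
    "id": [1, 2],
    "idg": [1, 2],
    "wave": [1, 1],
    "career": ["lawyer", "teacher"],
    "income": [50000, 42000],
})

# The script no longer names any column itself; it drops whatever the config lists
df = df.drop(columns=columns)
print(list(df.columns))  # ['income']
```

With this layout, keeping career means deleting one entry from drop_features in the config; the Python code stays untouched.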

towardsdatascience.com

A Powerful Framework to Configure your Data Science Projects