It is fun to play with different feature engineering methods and machine learning models, but you will most likely need to adjust your feature engineering methods and tune your machine learning models before getting a good result.
For example, in the speed dating data below, you might want to drop iid, id, idg, wave, and career, considering that they are not important features. But after doing more research about the data, you realize that career would be an important feature for predicting whether two people will go on another date. So you decide not to drop the career column.
If you are hard coding, that is, embedding data directly in the source code of a script, like below
columns = ['iid', 'id', 'idg', 'wave', 'career']
df.drop(columns, axis=1, inplace=True)
and your file is long, it might take a while for you to find the code that specifies which columns to drop. Wouldn’t it be great if you could change the columns in a simple text file that contains only information about the data, with no other Python code, like this instead?
variables:
  drop_features: ['iid', 'id', 'idg', 'wave', 'position', 'positin1', 'pid', 'field', 'from', 'career']
  ## categorical variables to transform to numerical variables
  numerical_vars_from_numerical: ['income', 'mn_sat', 'tuition']
  ## categorical variables to encode
  categorical_vars: ['undergra', 'zipcode']
  categorical_label_extraction: ['zipcode']
  categorical_onehot: ['undergra']
This is when you need a configuration file.
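To make the idea concrete, here is a minimal sketch of how a script might read the list of columns to drop from YAML instead of hard coding it. It assumes the PyYAML package is installed and that the keys are nested under `variables`. The helper names `load_config` and `drop_configured_columns`, the trimmed config text, and the toy DataFrame are all made up for illustration; they are not from the speed dating project itself.

```python
import pandas as pd
import yaml  # PyYAML, a third-party package (pip install pyyaml)

# Trimmed copy of the config, inlined so the sketch runs on its own.
# In practice this text would live in a separate file such as config.yaml.
CONFIG_TEXT = """
variables:
  drop_features: ['iid', 'id', 'idg', 'wave']
  categorical_onehot: ['undergra']
"""


def load_config(text):
    """Parse YAML text into a plain Python dict (hypothetical helper)."""
    return yaml.safe_load(text)


def drop_configured_columns(df, config):
    """Return a copy of df without the columns listed in the config."""
    to_drop = config["variables"]["drop_features"]
    # errors="ignore" skips names that are not present in this DataFrame
    return df.drop(columns=to_drop, errors="ignore")


# Toy data, invented for this example only
df = pd.DataFrame({
    "iid": [1, 2],
    "id": [1, 2],
    "idg": [1, 2],
    "wave": [1, 1],
    "career": ["lawyer", "teacher"],
})

config = load_config(CONFIG_TEXT)
cleaned = drop_configured_columns(df, config)
print(list(cleaned.columns))
```

With this setup, keeping career means editing one line of the config (as the trimmed `drop_features` list above already does) rather than searching through the script. To read from an actual file, you could pass the file's contents to `load_config`, for example `load_config(open("config.yaml").read())`.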