Merge Train and Test

When performing feature engineering, it is usually recommended to work on the whole DataFrame if your data comes in two files (train and test), so that transformations are computed consistently on both.

df = pd.concat([train[col], test[col]], axis=0)
# If you concatenate the full frames instead, the label column will be NaN for the test rows
# FEATURE ENGINEERING HERE
train[col] = df.iloc[:len(train)].values
test[col] = df.iloc[len(train):].values
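As a minimal sketch (frame and column names are illustrative), here the transformation applied on the merged data is a frequency encoding:

```python
import pandas as pd

train = pd.DataFrame({'city': ['Paris', 'Lyon'], 'label': [0, 1]})
test = pd.DataFrame({'city': ['Lyon', 'Nice']})

col = 'city'
df = pd.concat([train[col], test[col]], axis=0)

# Example transformation computed on train + test together:
# replace each value by its frequency in the combined data
df = df.map(df.value_counts())

# Split back, using positional slicing to avoid index-alignment surprises
train[col] = df.iloc[:len(train)].values
test[col] = df.iloc[len(train):].values
```

Because 'Lyon' appears in both files, its count (2) is only correct when computed on the merged column.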

Memory reduction

Sometimes the dtype of a column is not the best choice: for example, a column containing only values from 0 to 10 does not need a 64-bit integer type. One of the most popular utility functions reduces memory usage by converting each column to the smallest type that can safely hold its values.

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # note: float16 keeps only ~3 significant decimal digits;
                # drop this branch if you need full precision
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df
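For a single numeric column, pandas' built-in `pd.to_numeric` with its `downcast` parameter achieves a similar effect without a hand-rolled loop (a lighter alternative, not the function above):

```python
import pandas as pd

df = pd.DataFrame({'small_ints': [0, 5, 10]})  # stored as int64 by default

# Downcast to the smallest signed integer type that fits the values
df['small_ints'] = pd.to_numeric(df['small_ints'], downcast='integer')
print(df['small_ints'].dtype)  # int8 is enough for values in 0..10
```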

Remove Outlier Values

A common way to remove outliers is to use the Z-score.

If you want to drop every row where at least one column contains an outlier (defined here as a Z-score above 3 in absolute value), you can use the following code:

import numpy as np
from scipy import stats

df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
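Note that `stats.zscore` expects purely numeric data with no NaN. The same filter can be written with pandas alone; a sketch with made-up numbers:

```python
import pandas as pd

# 19 ordinary rows plus one extreme value of x
df = pd.DataFrame({'x': [1] * 19 + [100], 'y': list(range(20))})

# Keep rows where every column is within 3 standard deviations of its mean
z = (df - df.mean()) / df.std()
clean = df[z.abs().lt(3).all(axis=1)]
```

The row with x = 100 is dropped; all other rows survive.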

NaN trick

Some tree-based algorithms can handle NaN values, but they will add a split between NaN and non-NaN values, which sometimes makes no sense. A common trick is simply to fill all NaN values with a value lower than the lowest value of the column considered (for example -9999).

df[col] = df[col].fillna(-9999)

Categorical Features

You can treat categorical features with a label encoding to deal with them as numeric. You can also decide to treat them as a category. I recommend trying both and keeping whichever improves your cross-validation score, using this line of code (after label encoding):

df[col] = df[col].astype('category')
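The label-encoding step mentioned above can be sketched with `pd.factorize` (column and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'brand': ['apple', 'samsung', 'apple', 'nokia']})

# Label encoding: map each distinct string to an integer code
codes, uniques = pd.factorize(df['brand'])
df['brand'] = codes

# Optionally mark the column as categorical for the model
df['brand'] = df['brand'].astype('category')
```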

Combining / Splitting

Sometimes a string variable contains multiple pieces of information in one value, for example FRANCE_Paris. You will need to split it with a regex or a split method, for example:

new = df['localisation'].str.split('_', n=1, expand=True)
df['country'] = new[0]
df['city'] = new[1]

Conversely, two (string or numeric) columns can be combined into one. For example, a French department code (75, for Paris) and a district code (001) can become a zip code (75001):

df['zipcode'] = (df['department_code'].astype(str)
                 + df['district_code'].astype(str))

Linear combinations

A common feature engineering step is to apply simple mathematical operations to create new features. For example, if we have the width and the height of a rectangle, we can compute the area:

df['area'] = df['width'] * df['height']

Count column

Creating a column from the popular value_counts method is a powerful technique for tree-based algorithms: it lets the model learn whether a value is rare or common.

counts = df[col].value_counts().to_dict()
df[col+'_counts'] = df[col].map(counts)
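For example, with a toy column (names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'brand': ['apple', 'apple', 'nokia']})
col = 'brand'

# Map every value to the number of times it appears in the column
counts = df[col].value_counts().to_dict()
df[col + '_counts'] = df[col].map(counts)
```

Rows with 'apple' get a count of 2, the single 'nokia' row gets 1.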

Deal with Date

Dealing with dates and parsing each element of a date is crucial in order to analyze events.

First we need to convert our date column (often read as a string column by pandas). The most important skill is knowing how to use the format parameter; I strongly recommend bookmarking a strftime directive cheat sheet! :)

For example, to convert a date column in the following format, 30 Sep 2019, we will use this piece of code:

df['date'] =  pd.to_datetime(df[col], format='%d %b %Y')

Once the column is converted to datetime, we may need to extract date components into new columns:

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
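A self-contained sketch of the full round trip (sample dates are made up):

```python
import pandas as pd

df = pd.DataFrame({'date': ['30 Sep 2019', '01 Jan 2020']})

# Parse strings like '30 Sep 2019' with strftime directives
df['date'] = pd.to_datetime(df['date'], format='%d %b %Y')

# Extract components via the .dt accessor
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
```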

Aggregations / Group Statistics

To continue detecting rare and common values, which really matters for machine-learning predictions, we can check whether a value is rare or common inside a subgroup using a group statistic. For example, here we would like to know which smartphone brand's users make the longest calls, by computing the mean call duration for each subgroup.

temp = (df.groupby('smartphone_brand')['call_duration']
          .agg(['mean'])
          .rename({'mean': 'call_duration_mean'}, axis=1))
df = pd.merge(df, temp, on='smartphone_brand', how='left')

With this method, a ML algorithm can tell which calls have an uncommon call_duration for their smartphone brand.
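The same statistic can be added in one step with `groupby(...).transform`, which avoids the merge entirely (an equivalent alternative with made-up data):

```python
import pandas as pd

df = pd.DataFrame({
    'smartphone_brand': ['apple', 'apple', 'nokia'],
    'call_duration': [10.0, 20.0, 5.0],
})

# Broadcast the per-brand mean back onto every row
df['call_duration_mean'] = (
    df.groupby('smartphone_brand')['call_duration'].transform('mean')
)
```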

Normalize / Standardize

Normalization can sometimes be really useful.

To normalize a column against itself (Z-score standardization):

df[col] = (df[col] - df[col].mean()) / df[col].std()

Or you can normalize one column against another. For example, suppose you create a group statistic (described above) giving the mean call_duration for each week. You can then remove the time dependence with:

df['call_duration_remove_time'] = (df['call_duration']
                                   - df['call_duration_week_mean'])

The new variable call_duration_remove_time no longer increases as we advance in time, because we have normalized it against the effects of time.

Final feature engineering tip

Each added column increases computing time, both for your preprocessing and for your model training. I strongly recommend testing every new feature and checking whether it improves (or not…) your evaluation metric. If it does not, just remove the created or modified feature.

Originally published by Anis A at towardsdatascience.com

#python #data-science #machine-learning
