This article demonstrates exploratory data analysis (EDA), feature engineering, and splitting strategies for unbalanced data using the seismic bumps dataset from the UCI Machine Learning Repository.

Photo by Dominik Vanyi on Unsplash

Introduction:

The seismic bumps dataset is a lesser-known binary classification dataset. It captures geological conditions in longwall coal mines, recorded by seismic and seismo-acoustic systems, and is used to assess whether the mines are prone to rockbursts that cause seismic hazards.

Link to the dataset: https://archive.ics.uci.edu/ml/datasets/seismic-bumps
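As a rough sketch of getting the data into Python, the snippet below parses a small ARFF document using only the standard library. It assumes the download is a dense ARFF file with unquoted attribute names; the `parse_arff` helper and the sample rows are illustrative and not taken from the original code.

```python
import csv
import io


def parse_arff(text):
    """Parse a simple, dense ARFF document into (column_names, rows)."""
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and ARFF comments
            continue
        lower = line.lower()
        if in_data:
            # Data rows are plain comma-separated values (kept as strings).
            rows.append(next(csv.reader(io.StringIO(line))))
        elif lower.startswith("@attribute"):
            # "@attribute <name> <type>": keep only the name.
            names.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
    return names, rows


# Tiny illustrative sample in the same layout (not the real file).
sample = """\
@relation seismic-bumps
@attribute seismic {a,b,c,d}
@attribute shift {W,N}
@attribute genergy real
@attribute class {1,0}
@data
a,N,15180,0
b,W,26360,1
"""

names, rows = parse_arff(sample)
records = [dict(zip(names, row)) for row in rows]
print(names)
print(records[0])
```

For the real file, a more robust option is likely `scipy.io.arff.loadarff`, which also handles quoting and type conversion.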

It is a good dataset for getting practical exposure to unbalanced data, experimenting with different kinds of data splits, and assessing a classifier’s performance metrics, including the accuracy paradox (a model can score high accuracy simply by always predicting the majority class).
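To make the accuracy paradox concrete, here is a minimal sketch using only the standard library. The label counts (930 vs. 70) are hypothetical, and `stratified_split`, `accuracy`, and `recall` are illustrative helpers rather than the code from the repository. The split keeps the class ratio the same in both parts, and the "classifier" simply predicts the majority class every time.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly the same class ratio in each part."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that the predictions actually catch."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical imbalanced labels: 930 non-hazardous (0), 70 hazardous (1).
labels = [0] * 930 + [1] * 70
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)
y_test = [labels[i] for i in test_idx]

# A "classifier" that always predicts the majority class.
y_pred = [0] * len(y_test)

print(sum(y_test), len(y_test))  # 14 200
print(accuracy(y_test, y_pred))  # 0.93
print(recall(y_test, y_pred))    # 0.0
```

The baseline reaches 93% accuracy while missing every hazardous case, which is why recall (or a similar metric) matters more than accuracy here. In practice, scikit-learn's `train_test_split` with its `stratify` argument offers the same kind of split.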

The dataset also has both **categorical** and **numerical** features, which makes it a good playground for wrangling data and trying out different **feature transformation** methods.
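As an illustration of what such transformations can look like, the sketch below applies three common ones: ordinal encoding for an ordered category, one-hot encoding for an unordered one, and a log transform for a numeric value. The helper names, the choice of transformation per column, and the sample row are assumptions made for this example and are not necessarily the ones used in the full code.

```python
import math

HAZARD_LEVELS = ["a", "b", "c", "d"]  # ordered from lack of hazard to danger state
SHIFT_TYPES = ["W", "N"]              # coal-getting, preparation


def ordinal_encode(value, ordered_levels):
    """Map an ordered category to its rank (0, 1, 2, ...)."""
    return ordered_levels.index(value)


def one_hot_encode(value, categories):
    """Map an unordered category to a 0/1 indicator vector."""
    return [1 if value == c else 0 for c in categories]


def log_transform(x):
    """Compress a non-negative value with log(1 + x)."""
    return math.log1p(x)


# Illustrative row, not taken from the dataset.
row = {"seismic": "b", "shift": "N", "genergy": 15180}
features = (
    [ordinal_encode(row["seismic"], HAZARD_LEVELS)]
    + one_hot_encode(row["shift"], SHIFT_TYPES)
    + [log_transform(row["genergy"])]
)
print(features)
```

The log transform assumes the column is non-negative and right-skewed; columns that can be negative would need a different treatment.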

This article is not code-heavy; it focuses on the intuition behind what went right and what went wrong. The code can be found in my GitHub repository.


Note: I haven’t elaborated on every feature in the EDA and feature engineering steps, since the steps are repetitive. This post shows only samples; the full code is available on GitHub.


Exploratory Data Analysis

Photo by Andrew Neel on Unsplash

Data facts:

This dataset has 2584 instances and 19 columns: 4 categorical features, 8 discrete features, and 6 numeric features. The last column is the label, which contains 1 for hazardous and 0 for non-hazardous seismic bumps. For ease of use, I categorized and saved the feature names as follows:

col_list_categorical = ['seismic', 'seismoacoustic', 'shift', 'ghazard']
col_list_numerical = ['genergy', 'gpuls', 'gdenergy', 'gdpuls', 'energy', 'maxenergy']
col_list_discrete = ['nbumps', 'nbumps2', 'nbumps3', 'nbumps4', 'nbumps5', 'nbumps6', 'nbumps7', 'nbumps89']
label = 'class'
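A quick sanity check on these lists can catch a mistyped or duplicated column name early. The sketch below repeats the lists so it runs on its own, confirms that they add up to 18 features plus the label, and uses a hypothetical `split_by_type` helper to cast one record (a dict of strings, as it would come from a raw file) into typed groups. The sample row is illustrative.

```python
col_list_categorical = ['seismic', 'seismoacoustic', 'shift', 'ghazard']
col_list_numerical = ['genergy', 'gpuls', 'gdenergy', 'gdpuls', 'energy', 'maxenergy']
col_list_discrete = ['nbumps', 'nbumps2', 'nbumps3', 'nbumps4', 'nbumps5',
                     'nbumps6', 'nbumps7', 'nbumps89']
label = 'class'

feature_cols = col_list_categorical + col_list_numerical + col_list_discrete
all_cols = feature_cols + [label]

# No column should appear twice, and the totals should match the data facts.
assert len(set(all_cols)) == len(all_cols)
assert len(feature_cols) == 18
assert len(all_cols) == 19


def split_by_type(row):
    """Split one record (a dict keyed by column name) into typed groups."""
    return {
        "categorical": {c: row[c] for c in col_list_categorical},
        "numerical": {c: float(row[c]) for c in col_list_numerical},
        "discrete": {c: int(row[c]) for c in col_list_discrete},
        "label": int(row[label]),
    }


# Illustrative record with string values.
row = {c: "a" for c in col_list_categorical}
row["shift"] = "N"
row.update({c: "0" for c in col_list_numerical + col_list_discrete})
row["genergy"] = "15180"
row[label] = "0"

groups = split_by_type(row)
print(groups["numerical"]["genergy"], groups["label"])
```

With a pandas DataFrame `df`, the same lists can be used directly for column selection, for example `df[col_list_numerical]`.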

Attribute information [Source]:

1. seismic: the result of shift seismic hazard assessment in the mine working obtained by the seismic method (a — lack of hazard, b — low hazard, c — high hazard, d — danger state);

2. seismoacoustic: the result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method;

3. shift: information about the type of a shift (W - coal-getting, N - preparation shift);

4. genergy: seismic energy recorded within the previous shift by the most active geophone (GMax) out of geophones monitoring the longwall;

5. gpuls: a number of pulses recorded within the previous shift by GMax;

6. gdenergy: a deviation of energy recorded within the previous shift by GMax from average energy recorded during eight previous shifts;

7. gdpuls: a deviation of a number of pulses recorded within the previous shift by GMax from the average number of pulses recorded during eight previous shifts;

8. ghazard: the result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method based on registration coming from GMax only;

9. nbumps: the number of seismic bumps recorded within the previous shift;

10. nbumps2: the number of seismic bumps (in energy range [10², 10³)) registered within the previous shift;

11. nbumps3: the number of seismic bumps (in energy range [10³, 10⁴)) registered within the previous shift;

12. nbumps4: the number of seismic bumps (in energy range [10⁴, 10⁵)) registered within the previous shift;

13. nbumps5: the number of seismic bumps (in energy range [10⁵, 10⁶)) registered within the last shift;

14. nbumps6: the number of seismic bumps (in energy range [10⁶, 10⁷)) registered within the previous shift;

15. nbumps7: the number of seismic bumps (in energy range [10⁷, 10⁸)) registered within the previous shift;

16. nbumps89: the number of seismic bumps (in energy range [10⁸, 10¹⁰)) registered within the previous shift;

17. energy: the total energy of seismic bumps registered within the previous shift;

18. maxenergy: the maximum energy of the seismic bumps registered within the previous shift;

19. class: the decision attribute — ‘1’ means that high energy seismic bump occurred in the next shift (‘hazardous state’), ‘0’ means that no high energy seismic bumps occurred in the next shift (‘non-hazardous state’).
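The energy ranges behind the nbumps2 to nbumps89 columns are half-open intervals by order of magnitude. The sketch below shows how a single bump energy would map to one of those columns and how per-shift counts could be built from a list of energies. The `bump_column` and `count_bumps` helpers and the sample energies are hypothetical; the dataset already ships with these counts precomputed.

```python
# (column, lower bound, upper bound): each range is [lower, upper).
ENERGY_BINS = [
    ("nbumps2", 10**2, 10**3),
    ("nbumps3", 10**3, 10**4),
    ("nbumps4", 10**4, 10**5),
    ("nbumps5", 10**5, 10**6),
    ("nbumps6", 10**6, 10**7),
    ("nbumps7", 10**7, 10**8),
    ("nbumps89", 10**8, 10**10),
]


def bump_column(energy):
    """Return the nbumps* column whose energy range contains `energy`, or None."""
    for name, low, high in ENERGY_BINS:
        if low <= energy < high:
            return name
    return None


def count_bumps(energies):
    """Count bumps per energy-range column for one shift."""
    counts = {name: 0 for name, _, _ in ENERGY_BINS}
    for e in energies:
        name = bump_column(e)
        if name is not None:
            counts[name] += 1
    return counts


# Hypothetical bump energies for a single shift.
energies = [450, 1_000, 3_200, 80_000, 2 * 10**8]
counts = count_bumps(energies)
print(counts)
```

Note that a value exactly on a boundary, such as 1,000, falls into the higher range because the lower bound is inclusive and the upper bound is exclusive.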

Predicting Hazardous Seismic Bumps Part I : EDA, Feature Engineering