Machine learning algorithms typically assume that the classes contain roughly similar numbers of examples. In real-life scenarios, however, the data distribution is usually skewed, and some classes appear much more frequently than others. When facing such disproportions, we must design a system that is able to overcome this bias.

Here, we will work with a multi-class problem using data taken from the UCI Machine Learning Repository, loaded as shown below.

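import pandas as pd

# Glass identification data hosted by the UCI Machine Learning Repository
url = ("https://archive.ics.uci.edu/ml/machine-learning-"
       "databases/glass/glass.data")
df = pd.read_csv(url, header=None)  # the file has no header row
df.columns = ['Id', 'RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'type']
df.set_index('Id', inplace=True)  # 'Id' is only a row identifier
print('Data loading:')
df.head()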


Here, the features describe the chemical composition of each glass sample, and the target column, type, gives the kind of glass, which makes this a multi-class problem. The objective is to determine the use of the glass from its chemical composition.
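
The integer codes in the type column can be translated into readable names. The snippet below is a minimal sketch rather than part of the original analysis: GLASS_TYPES and describe_type are hypothetical names, and the code-to-name mapping is assumed from the dataset's documentation, where code 4 is listed but has no samples in this file.

```python
# Assumed mapping from the integer 'type' code to the intended use of the glass
GLASS_TYPES = {
    1: "building_windows_float_processed",
    2: "building_windows_non_float_processed",
    3: "vehicle_windows_float_processed",
    4: "vehicle_windows_non_float_processed",  # documented, but absent from the data
    5: "containers",
    6: "tableware",
    7: "headlamps",
}


def describe_type(code):
    """Return the readable name for a type code, or 'unknown' if it is not mapped."""
    return GLASS_TYPES.get(int(code), "unknown")


print(describe_type(6))
```

With the DataFrame loaded, the same dictionary could be applied to the whole column through df['type'].map(GLASS_TYPES).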


Exploratory analysis

Class visualization

import matplotlib.pyplot as plt
import seaborn as sns

figure, ax = plt.subplots(1, 1, figsize=(10, 5))
sns.countplot(x='type', data=df, ax=ax)  # one bar per type code
ax.set_xticklabels(('building_windows_float_processed', 'building_windows_non_float_processed', 'vehicle_windows_float_processed', 'containers', 'tableware', 'headlamps'), rotation=90)
plt.show()

from collections import Counter

# summarize the class distribution
counter = Counter(df['type'])
for k, v in counter.items():
    per = v / len(df) * 100  # share of all samples, in percent
    print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))


We can observe that the data is somewhat imbalanced: the class counts range from 76 for the most populous class down to 9 for the least populous. Summary statistics computed over the whole dataset can therefore be dominated by the most populous classes, and there is no reason to expect members of the other classes to have similar attribute values. Such sharply differing behavior can be a good thing for distinguishing classes from one another, but it also means that a method for making predictions has to be able to trace a fairly complicated boundary between the different classes.
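
To put a number on the imbalance, one option is the ratio between the largest and smallest class counts, which for the counts quoted above is 76 / 9, or roughly 8.4. The sketch below also derives per-class weights of the form n_samples / (n_classes * class_count), the same heuristic that scikit-learn applies for class_weight="balanced". It is an illustration under stated assumptions, not the method used in this post: imbalance_summary is a hypothetical helper, and the labels are toy values rather than the glass data.

```python
from collections import Counter


def imbalance_summary(labels):
    """Return (imbalance_ratio, weights) for an iterable of class labels.

    imbalance_ratio = largest class count / smallest class count
    weights[c]      = n_samples / (n_classes * count of class c)
    """
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    ratio = max(counts.values()) / min(counts.values())
    weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}
    return ratio, weights


# Toy labels for illustration only (not the glass data)
toy_labels = [1] * 6 + [2] * 3 + [3] * 1
ratio, weights = imbalance_summary(toy_labels)
print(ratio)
print(weights)
```

Rare classes receive larger weights, so a model that accepts per-class weights would penalize mistakes on them more heavily. On the loaded data, the helper could be called as imbalance_summary(df['type']).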
