1621047675

# Day 60 - Random Forest Implementation

This is a video series on learning data science in 100 days. In this video, I cover the implementation of the Random Forest algorithm in Python. It also includes a few reference materials for further reading.

#data-science

1624985580

## Introduction to Random Forest Algorithm: Functions, Applications & Benefits

Random Forest is a popular machine learning algorithm that belongs to the supervised learning family. It can be used for both classification and regression problems in ML. It is based on ensemble learning, the process of combining multiple classifiers to solve a complex problem and improve the model's performance.

As the name suggests, “Random Forest is a classifier that contains a number of decision trees built on various subsets of the given dataset and takes the average to improve the predictive accuracy on that dataset.”

Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, produces the final output. A greater number of trees in the forest generally leads to higher accuracy and helps prevent overfitting.
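
The voting step described above can be sketched in a few lines of plain Python. This is only an illustration: the `majority_vote` helper and the list of per-tree predictions are hypothetical, not part of any library.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class label predicted by the most trees."""
    # most_common(1) returns [(label, count)] for the top label;
    # ties go to the label that was seen first.
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical outputs of five trees for a single sample.
votes = ["spam", "ham", "spam", "spam", "ham"]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```

For regression, the forest would average the trees' numeric predictions instead of counting votes.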

### **Assumptions for Random Forest**

Since the random forest combines multiple trees to predict the class of the dataset, some decision trees may predict the correct output while others may not. Taken together, though, the trees are more likely to predict the correct output. Below are two assumptions for a better random forest classifier:

• The feature variables of the dataset should contain some actual values so that the classifier can predict accurate results rather than guessed ones.
• The predictions from each tree must have low correlations.

#artificial intelligence #random forest #introduction to random forest algorithm #random forest algorithm #algorithm

1625103060

## Random Forest Algorithm in Python from Scratch

### Coding the powerful algorithm in Python using (mainly) arrays and loops

This article aims to demystify the popular random forest (here and throughout the text, **RF**) algorithm and show its principles by using graphs, code snippets and code outputs.

The full implementation of the RF algorithm written by me in Python can be accessed via: https://github.com/Eligijus112/decision-tree-python

I highly encourage anyone who stumbled upon this article to dive deep into the code, because understanding the code will make any future documentation reading about **RF** much more straightforward and less stressful.

Any suggestions about optimizations are highly encouraged and are welcomed via a pull request on GitHub.

The building blocks of RF are simple decision trees. This article will be much easier to read if the reader is familiar with the concept of a classification decision tree. It is highly recommended to go through the following article before going any further:

#coding #machine-learning #random-forest #python #python from scratch #random forest algorithm

1625013180

## Generate Random Numbers in Python

There are two types of random number generators: pseudo-random number generators and true random number generators.

Pseudo-random numbers are produced by deterministic computer algorithms. They are not truly random because they are predictable: given the same seed, such as one set with NumPy's random seed, the generator produces the same sequence.

Truly random numbers, by contrast, are generated by measuring random physical phenomena, so they cannot be reproduced from a seed.

General-purpose pseudo-random numbers are not safe to use in cryptography because attackers can guess them.

In Python, the built-in random module generates pseudo-random numbers. In this tutorial, we will discuss both types. So let’s get started.
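
As a minimal sketch of the predictability point, the snippet below seeds the built-in `random` module twice and compares the results. The seed value and ranges are arbitrary. It also shows the standard library's `secrets` module, which draws on the operating system's randomness source and is intended for security-sensitive values; whether that source counts as "truly random" depends on the operating system and hardware.

```python
import random
import secrets

# Seeding a pseudo-random generator makes its output reproducible.
random.seed(42)
first_run = [random.randint(1, 100) for _ in range(5)]

random.seed(42)
second_run = [random.randint(1, 100) for _ in range(5)]

print(first_run == second_run)  # same seed, same sequence

# For security-sensitive values, secrets uses the OS randomness source.
token = secrets.token_hex(16)  # 16 random bytes as 32 hex characters
print(len(token))
```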

#python #random #generate random numbers #random numbers #generate random numbers in python

1676976904

## Sklearn-compatible Random Bits Forest

Scikit-learn compatible wrapper of the Random Bits Forest program written by Wang et al., 2016, available as a binary on Sourceforge. All credits belong to the authors. This is just a quick-and-dirty wrapper with some testing code.

The authors present "...a classification and regression algorithm called Random Bits Forest (RBF). RBF integrates neural network (for depth), boosting (for wideness) and random forest (for accuracy). It first generates and selects ~10,000 small three-layer threshold random neural networks as basis by gradient boosting scheme. These binary basis are then feed into a modified random forest algorithm to obtain predictions. In conclusion, RBF is a novel framework that performs strongly especially on data with large size."

Note: the executable supplied by the authors has been compiled for Linux, and for CPUs supporting SSE instructions.

Usage

Usage example of the Random Bits Forest:

```python
from __future__ import print_function  # no-op on Python 3, keeps print() portable

from uci_loader import *
from randombitsforest import RandomBitsForest

import numpy as np
# sklearn.ensemble.forest was a private path removed in newer scikit-learn releases
from sklearn.ensemble import RandomForestClassifier

X, y = getdataset('diabetes')
half = len(y) // 2  # integer division: first half trains, second half tests

classifier = RandomBitsForest()
classifier.fit(X[:half], y[:half])
p = classifier.predict(X[half:])
print("Random Bits Forest Accuracy:", np.mean(p == y[half:]))

classifier = RandomForestClassifier(n_estimators=20)
classifier.fit(X[:half], y[:half])
print("Random Forest Accuracy:", np.mean(classifier.predict(X[half:]) == y[half:]))
```

Usage example for the UCI comparison:

```python
from uci_comparison import compare_estimators
# sklearn.ensemble.forest was a private path removed in newer scikit-learn releases
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from randombitsforest import RandomBitsForest

estimators = {
    'RandomForest': RandomForestClassifier(n_estimators=200),
    'ExtraTrees': ExtraTreesClassifier(n_estimators=200),
    'RandomBitsForest': RandomBitsForest(number_of_trees=200)
}

# optionally, pass a list of UCI dataset identifiers as the datasets parameter, e.g. datasets=['iris', 'diabetes']
# optionally, pass a dict of scoring functions as the metric parameter, e.g. metrics={'F1-score': f1_score}
compare_estimators(estimators)

"""
                          ExtraTrees F1score  RandomBitsForest F1score  RandomForest F1score
============================================================================================
breastcancer (n=683)        0.960 (SE=0.003)          0.954 (SE=0.003)     *0.963 (SE=0.003)
breastw (n=699)            *0.956 (SE=0.003)          0.951 (SE=0.003)      0.953 (SE=0.005)
creditg (n=1000)           *0.372 (SE=0.005)          0.121 (SE=0.003)      0.371 (SE=0.005)
haberman (n=306)            0.317 (SE=0.015)         *0.346 (SE=0.020)      0.305 (SE=0.016)
heart (n=270)               0.852 (SE=0.004)         *0.854 (SE=0.004)      0.852 (SE=0.006)
ionosphere (n=351)          0.740 (SE=0.037)         *0.741 (SE=0.037)      0.736 (SE=0.037)
labor (n=57)                0.246 (SE=0.016)          0.128 (SE=0.014)     *0.361 (SE=0.018)
liverdisorders (n=345)      0.707 (SE=0.013)         *0.723 (SE=0.013)      0.713 (SE=0.012)
tictactoe (n=958)           0.030 (SE=0.007)         *0.336 (SE=0.040)      0.030 (SE=0.007)
vote (n=435)               *0.658 (SE=0.012)          0.228 (SE=0.017)     *0.658 (SE=0.012)
"""
```

1623055162

## Random Forest Algorithm: When to Use & How to Use? [With Pros & Cons]

Data Science encompasses a wide range of algorithms capable of solving classification problems. Random forest is often ranked among the strongest of them. Other algorithms include Support Vector Machines, the Naive Bayes classifier, and Decision Trees.

Before learning about the Random forest algorithm, let’s first understand the basic working of Decision trees and how they can be combined to form a Random Forest.

### Decision Trees

The Decision Tree algorithm falls under the category of supervised learning algorithms. The goal of a decision tree is to predict the class or the value of the target variable based on the rules developed during the training process. Beginning from the root of the tree, we compare the value of the root attribute with the data point we wish to classify and, on the basis of that comparison, move to the next node.
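
The root-to-leaf walk can be sketched with a hand-written toy tree. Everything here is hypothetical and for illustration only: the nested-dict layout, the feature names, and the thresholds are made up rather than learned from data.

```python
# A hand-written toy tree; feature names and thresholds are made up.
tree = {
    "feature": "age",
    "threshold": 30,
    "left": {"label": "no"},  # taken when age <= 30
    "right": {
        "feature": "income",
        "threshold": 50000,
        "left": {"label": "no"},
        "right": {"label": "yes"},
    },
}

def predict(node, sample):
    """Walk from the root to a leaf and return the leaf's label."""
    while "label" not in node:
        # Compare the sample's value for this node's feature with the threshold.
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]

prediction = predict(tree, {"age": 42, "income": 65000})
print(prediction)
```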

Moving on, let’s discuss some of the important terms and their significance in dealing with decision trees.

1. Root Node: It is the topmost node of the tree, from where the division takes place to form more homogeneous nodes.
2. Splitting of Data Points: Data points are split in a manner that reduces impurity after the split: entropy or Gini impurity for classification, and the standard deviation (or variance) of the target for regression.
3. Information Gain: Information gain is the reduction in impurity we wish to achieve after the split. In classification it is the drop in entropy; the regression analogue is standard deviation reduction. A larger reduction means more homogeneous nodes.
4. Entropy: Entropy measures the impurity (disorder) of the class labels in a node. More homogeneity in the node means less entropy.
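
For classification trees, information gain is commonly computed from Shannon entropy. The sketch below assumes a binary split and uses made-up labels; the helper names `entropy` and `information_gain` are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Made-up labels: a perfectly mixed parent split into two purer children.
parent = ["yes"] * 5 + ["no"] * 5
left, right = ["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4
gain = information_gain(parent, left, right)
print(round(entropy(parent), 3), round(gain, 3))
```

A split that leaves both children as mixed as the parent would have a gain of zero, which is why the tree prefers splits with higher gain.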

### Need for Random forest algorithm

The Decision Tree algorithm is prone to overfitting, i.e., high accuracy on training data and poor performance on test data. Two popular methods of preventing overfitting are pruning and random forest. Pruning refers to reducing the size of the tree without significantly affecting its overall accuracy.

Now let’s discuss the Random forest algorithm.

One major advantage of random forest is its ability to be used both in classification as well as in regression problems.

As its name suggests, a forest is formed by combining several trees. Similarly, a random forest algorithm combines several machine learning models (decision trees) to obtain better accuracy. This is also called ensemble learning. Here, low correlation between the models helps the ensemble achieve better accuracy than any of the individual predictions. Even if some trees generate false predictions, the overall accuracy of the model increases as long as a majority of them produce true predictions.
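
The benefit of low correlation can be illustrated with a small calculation. It makes an idealized assumption that real forests only approximate: every tree is correct independently with the same probability. The function name and the 0.6 accuracy figure are hypothetical.

```python
from math import comb

def majority_accuracy(n_trees, p_correct):
    """Probability that more than half of n independent voters are right."""
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

# Odd tree counts avoid ties in a two-class vote.
for n in (1, 11, 101):
    print(n, round(majority_accuracy(n, 0.6), 3))
```

Under this assumption the majority vote becomes more reliable as trees are added. Correlated trees gain less, which is why the forest randomizes both the rows and the features each tree sees.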

Random forest algorithms can be implemented in both python and R like other machine learning algorithms.

### When to use Random Forest and when to use the other models?

First of all, we need to decide whether the problem is linear or nonlinear. If the problem is linear, we should use Simple Linear Regression when only a single feature is present, and Multiple Linear Regression when we have multiple features. However, if the problem is non-linear, we should use Polynomial Regression, SVR, Decision Tree, or Random Forest. Then, using techniques such as k-Fold Cross-Validation and Grid Search to evaluate and tune the candidates, we can settle on the model that best solves our problem. XGBoost, sometimes listed alongside these techniques, is itself a gradient-boosting model rather than an evaluation method, so it is better treated as one more candidate to compare.
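
A minimal sketch of comparing candidate models with k-fold cross-validation, assuming scikit-learn and NumPy are installed. The synthetic sine-shaped data, the two candidates, and the fold count are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic non-linear data: y = sin(x) plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean R^2 over 5 folds for each candidate model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
print(scores)
```

On data like this, the non-linear model should score noticeably higher, which is the kind of evidence the model choice should rest on.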

### How do I know how many trees I should use?

For any beginner, I would advise determining the number of trees by experimenting: try several values of the hyperparameter and compare the results. This usually takes less time than setting up formal tuning techniques. Nevertheless, techniques like k-Fold Cross-Validation and Grid Search are powerful methods for determining the optimal value of a hyperparameter such as the number of trees.
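
As a rough sketch of the Grid Search approach, assuming scikit-learn is installed: the synthetic dataset and the candidate tree counts below are arbitrary and would be replaced by real data and a grid suited to it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate tree counts are arbitrary choices for illustration.
param_grid = {"n_estimators": [10, 50, 100]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation for every candidate
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```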

### Can p-value be used for Random forest?

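The p-value is not meaningful for Random forest in the way it is for linear models: a forest is a non-linear model with no per-feature coefficients to test, so feature importance scores are used to judge features instead.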

### Bagging

Decision trees are highly sensitive to the data they are trained on and are therefore prone to overfitting. Random forest takes advantage of this sensitivity: it allows each tree to randomly sample from the dataset, which yields different tree structures. This process is known as bagging.

Bagging does not mean handing each tree a smaller subset of the training data. Each tree is still fed a training set of size N, but instead of the original data it receives a sample of N data points drawn with replacement, so some points appear more than once and others not at all.
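
Sampling with replacement can be sketched with the standard library alone. The list of row indices below is a stand-in for a real training set, and the seed is arbitrary.

```python
import random

random.seed(0)

N = 1000
original = list(range(N))  # stand-in for N training rows

# Draw N rows with replacement: same size as the original,
# but some rows repeat and others are left out.
bootstrap = random.choices(original, k=N)

unique_fraction = len(set(bootstrap)) / N
print(len(bootstrap), round(unique_fraction, 3))
```

For large N, roughly 63% of the original rows are expected to appear in each bootstrap sample, so every tree sees a somewhat different view of the data.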

### Feature Importance

Random forest algorithms allow us to determine the importance of a given feature and its impact on the prediction. After training, the algorithm computes a score for each feature and scales the scores so that they sum to one. This gives us an idea of which features to drop because they contribute little to the prediction. With fewer features, the model is less likely to fall prey to overfitting.
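
A minimal sketch of reading these scores, assuming scikit-learn is installed. Its forests expose the scores as `feature_importances_`. The synthetic dataset is built so that only the first two features carry signal.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 5 features, of which only 2 are informative (shuffle=False keeps them first).
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = forest.feature_importances_
for index, score in enumerate(importances):
    print(f"feature {index}: {score:.3f}")
print("sum:", round(importances.sum(), 6))
```

The three noise features should receive much lower scores than the two informative ones, which makes them candidates for removal.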

### Hyperparameters

Hyperparameters are tuned either to increase the predictive capability of the model or to make the model faster.

To begin with, the n_estimators parameter (as it is spelled in scikit-learn) is the number of trees the algorithm builds before taking the majority vote or the average of their predictions. A higher value generally improves and stabilizes predictive performance. However, a higher value also increases the computation time of the model.

Another hyperparameter is max_features, which is the maximum number of features the model considers when splitting a node.

Further, min_samples_leaf is the minimum number of samples required to be at a leaf node.

Lastly, random_state makes the output reproducible: the model produces the same results whenever it is given the same random_state value, hyperparameters, and training data.
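
The hyperparameters above map onto scikit-learn's `RandomForestClassifier` as sketched below, assuming scikit-learn is installed. Note the library's spellings `n_estimators` and `min_samples_leaf`. The values are arbitrary illustrations rather than recommendations, and `build_forest` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

def build_forest():
    # Values below are arbitrary illustrations, not recommendations.
    return RandomForestClassifier(
        n_estimators=100,     # number of trees
        max_features="sqrt",  # features considered at each split
        min_samples_leaf=2,   # minimum samples required in a leaf
        random_state=42,      # fixes the randomness for reproducibility
    )

first = build_forest().fit(X, y).predict_proba(X)
second = build_forest().fit(X, y).predict_proba(X)

# Same data, same hyperparameters, same random_state -> same output.
print((first == second).all())
```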