If you are a junior data scientist who sort of understands how neural nets work, or a machine learning enthusiast who only knows a little about deep learning, this is an article you cannot miss. Here is how you can build a neural net from scratch using NumPy in 9 steps, from data pre-processing to back-propagation: a must-do practice.
A basic understanding of machine learning, artificial neural networks, Python syntax, and programming logic is preferred (but not necessary, as you can learn on the go).
The code is available on GitHub.
Step one. Import NumPy. Seriously.
import numpy as np
np.random.seed(42) # for reproducibility
Deep learning is data-hungry. Although there are many clean datasets available online, we will generate our own for simplicity: for inputs a and b, the outputs are a+b, a-b, and |a-b|. 10,000 data points are generated.
X_num_row, X_num_col = [2, 10000] # rows are features, columns are data points
X_raw = np.random.rand(X_num_row,X_num_col) * 100
X_raw
Output:
array([[37.45401188, 95.07143064, 73.19939418, ..., 94.6707915 ,
39.74879924, 21.7140404 ],
[37.36408185, 33.29120962, 17.61539125, ..., 30.36984691,
44.33200065, 17.22648144]])
y_raw = np.concatenate(([(X_raw[0,:] + X_raw[1,:])], [(X_raw[0,:] - X_raw[1,:])], np.abs([(X_raw[0,:] - X_raw[1,:])])))
# for inputs a and b, the outputs are a+b, a-b, and |a-b|
y_num_row, y_num_col = y_raw.shape
y_raw.shape
Output:
(3, 10000)
Our dataset is split into a training set (70%) and a testing set (30%). Only the training set is used for tuning the neural network; the testing set is used solely for performance evaluation once training is complete.
train_ratio = 0.7
num_train_datum = int(train_ratio*X_num_col)
X_raw_train = X_raw[:,0:num_train_datum]
X_raw_test = X_raw[:,num_train_datum:]
y_raw_train = y_raw[:,0:num_train_datum]
y_raw_test = y_raw[:,num_train_datum:]
Data in the training set is standardized so that each feature has zero mean and unit variance. The scalers generated by this procedure can then be applied to the testing set.
class scaler:
    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

def get_scaler(row):
    mean = np.mean(row)
    std = np.std(row)
    return scaler(mean, std)

def standardize(data, scaler):
    return (data - scaler.mean) / scaler.std

def unstandardize(data, scaler):
    return (data * scaler.std) + scaler.mean
X_scalers = [get_scaler(X_raw_train[row,:]) for row in range(X_num_row)]
X_train = np.array([standardize(X_raw_train[row,:], X_scalers[row]) for row in range(X_num_row)])
y_scalers = [get_scaler(y_raw_train[row,:]) for row in range(y_num_row)]
y_train = np.array([standardize(y_raw_train[row,:], y_scalers[row]) for row in range(y_num_row)])
X_test = np.array([standardize(X_raw_test[row,:], X_scalers[row]) for row in range(X_num_row)])
y_test = np.array([standardize(y_raw_test[row,:], y_scalers[row]) for row in range(y_num_row)])
print([X_train[row,:].mean() for row in range(X_num_row)]) # should be close to zero
print([X_train[row,:].std() for row in range(X_num_row)]) # should be close to one
print([y_train[row,:].mean() for row in range(y_num_row)]) # should be close to zero
print([y_train[row,:].std() for row in range(y_num_row)]) # should be close to one
[2.740664837931815e-16, 5.227564413092166e-17]
[0.9999999999999999, 1.0]
[-4.608377171929793e-16, 1.0658141036401503e-17, 6.471014200672341e-18]
[0.9999999999999999, 1.0, 1.0]
The scalers therefore do not contain any information from our testing set. We do not want our neural net to gain any information about the testing set before network tuning.
We have now completed the data pre-processing procedure in 4 steps.
We represent a ‘layer’ as a class in Python. Every layer (except the input layer) has a weight matrix W, a bias vector b, and an activation function. Each layer is appended to a list called neural_net. That list then represents your fully connected neural network.
class layer:
    def __init__(self, layer_index, is_output, input_dim, output_dim, activation):
        self.layer_index = layer_index # zero indicates input layer
        self.is_output = is_output # true indicates output layer, false otherwise
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.activation = activation

        # He-style initialization; the multiplication constant is somewhat arbitrary
        if layer_index != 0:
            self.W = np.random.randn(output_dim, input_dim) * np.sqrt(2/input_dim)
            self.b = np.random.randn(output_dim, 1) * np.sqrt(2/input_dim)
layers_dim = [X_num_row, 4, 4, y_num_row] # input layer --- hidden layers --- output layer
neural_net = []

# construct the net layer by layer
for layer_index in range(len(layers_dim)):
    if layer_index == 0: # if input layer
        neural_net.append(layer(layer_index, False, 0, layers_dim[layer_index], 'irrelevant'))
    elif layer_index+1 == len(layers_dim): # if output layer
        neural_net.append(layer(layer_index, True, layers_dim[layer_index-1], layers_dim[layer_index], activation='linear'))
    else: # if hidden layer
        neural_net.append(layer(layer_index, False, layers_dim[layer_index-1], layers_dim[layer_index], activation='relu'))
pred_n_param = sum([(layers_dim[layer_index]+1)*layers_dim[layer_index+1] for layer_index in range(len(layers_dim)-1)])
act_n_param = sum([neural_net[layer_index].W.size + neural_net[layer_index].b.size for layer_index in range(1,len(layers_dim))])

print(f'Predicted number of parameters: {pred_n_param}')
print(f'Actual number of parameters: {act_n_param}')
print(f'Number of data points: {X_num_col}')

if act_n_param >= X_num_col:
    raise Exception('It will overfit.')
Predicted number of parameters: 47
Actual number of parameters: 47
Number of data points: 10000
Finally, we do a sanity check on the number of parameters using the following formula, and by counting. The number of data points available should exceed the number of parameters, otherwise the model will almost certainly overfit.

number of parameters = Σ_{l=1..L} (N^(l-1) + 1) × N^l

where N^l is the number of neurons at the l-th layer (N^0 is the input dimension) and L is the number of layers (excluding the input layer). Each layer contributes N^(l-1) × N^l weights plus N^l biases.
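With our architecture layers_dim = [2, 4, 4, 3], this works out to (2+1)×4 + (4+1)×4 + (4+1)×3 = 12 + 20 + 15 = 47, matching the count printed above.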
We define a function for forward propagation given a certain set of weights and biases. The connection between layers is defined in matrix form as:

Z^l = W^l · A^(l-1) + b^l
A^l = σ(Z^l)

where σ is an element-wise activation function, A^0 is the input matrix, and superscript T (used in back-propagation below) means the transpose of a matrix.
Activation functions are defined one by one. ReLU is implemented as a → max(a, 0), whereas the sigmoid function should return a → 1/(1 + e^(-a)); its implementation is left as an exercise to the reader, with one possible answer sketched after the code below.
def activation(input_, act_func):
    if act_func == 'relu':
        return np.maximum(input_, np.zeros(input_.shape))
    elif act_func == 'linear':
        return input_
    else:
        raise Exception('Activation function is not defined.')
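For readers who want to check their answer to the exercise, here is one possible sigmoid implementation; it is not part of the original code, just a sketch:

def sigmoid(input_):
    # element-wise a -> 1/(1 + e^(-a)); np.exp broadcasts over arrays
    return 1 / (1 + np.exp(-input_))

To wire it in, add an elif act_func == 'sigmoid': return sigmoid(input_) branch to activation(). And since get_dactivation() below receives the activated output A rather than Z, the matching derivative branch would simply be A * (1 - A).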
def forward_prop(input_vec, layers_dim=layers_dim, neural_net=neural_net):
    neural_net[0].A = input_vec # define A in the input layer for for-loop convenience
    for layer_index in range(1, len(layers_dim)): # W, b, Z, A are undefined in the input layer
        neural_net[layer_index].Z = np.add(np.dot(neural_net[layer_index].W, neural_net[layer_index-1].A), neural_net[layer_index].b)
        neural_net[layer_index].A = activation(neural_net[layer_index].Z, neural_net[layer_index].activation)
    return neural_net[layer_index].A
# test run
forward_prop(X_train).shape == y_train.shape # should be True
Output:
True
This is the trickiest part, the one that many of us simply do not understand. Once we have defined a loss metric e for evaluating performance, we would like to know how the loss metric changes when we perturb each weight or bias.
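Concretely, with a mean-squared-error loss, the quantities computed in the code below follow the standard back-propagation recursions (m is the number of training data points, ⊙ denotes element-wise multiplication, and superscript T the matrix transpose):

dZ^L = Ŷ - Y
dZ^l = ((W^(l+1))^T · dZ^(l+1)) ⊙ σ′(Z^l)
dW^l = dZ^l · (A^(l-1))^T / m
db^l = row-wise sum of dZ^l / m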
def get_loss(y, y_hat, metric='mse'):
    if metric == 'mse':
        individual_loss = 0.5 * (y_hat - y) ** 2
        return np.mean([np.linalg.norm(individual_loss[:,col], 2) for col in range(individual_loss.shape[1])])
    else:
        raise Exception('Loss metric is not defined.')

def get_dZ_from_loss(y, y_hat, metric):
    if metric == 'mse':
        return y_hat - y
    else:
        raise Exception('Loss metric is not defined.')

def get_dactivation(A, act_func):
    if act_func == 'relu':
        return np.maximum(np.sign(A), np.zeros(A.shape)) # 1 if A > 0, 0 otherwise (element-wise ReLU derivative)
    elif act_func == 'linear':
        return np.ones(A.shape)
    else:
        raise Exception('Activation function is not defined.')
def backward_prop(y, y_hat, metric='mse', layers_dim=layers_dim, neural_net=neural_net, num_train_datum=num_train_datum):
    for layer_index in range(len(layers_dim)-1, 0, -1):
        if layer_index+1 == len(layers_dim): # if output layer
            dZ = get_dZ_from_loss(y, y_hat, metric)
        else: # if hidden layer
            dZ = np.multiply(np.dot(neural_net[layer_index+1].W.T, dZ),
                             get_dactivation(neural_net[layer_index].A, neural_net[layer_index].activation))
        dW = np.dot(dZ, neural_net[layer_index-1].A.T) / num_train_datum
        db = np.sum(dZ, axis=1, keepdims=True) / num_train_datum

        neural_net[layer_index].dW = dW
        neural_net[layer_index].db = db
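If you want to convince yourself that these analytic gradients are correct, a finite-difference check is a handy (if slow) tool: nudge one weight by a small ε, see how the loss moves, and compare against the dW stored above. The sketch below is not part of the original article; the helper names mse_loss and grad_check are made up here, and mse_loss is the plain averaged squared-error loss whose exact gradient backward_prop computes (the article's get_loss differs slightly).

def mse_loss(y, y_hat):
    # plain 0.5 * squared error, summed over outputs and averaged over data points
    return 0.5 * np.sum((y_hat - y) ** 2) / y.shape[1]

def grad_check(layer_index=1, i=0, j=0, eps=1e-5):
    # central-difference estimate of d(loss)/d(W[i, j])
    W_orig = neural_net[layer_index].W[i, j]
    neural_net[layer_index].W[i, j] = W_orig + eps
    loss_plus = mse_loss(y_train, forward_prop(X_train))
    neural_net[layer_index].W[i, j] = W_orig - eps
    loss_minus = mse_loss(y_train, forward_prop(X_train))
    neural_net[layer_index].W[i, j] = W_orig # restore the original weight
    return (loss_plus - loss_minus) / (2 * eps)

backward_prop(y_train, forward_prop(X_train)) # populate dW with the analytic gradient
print(grad_check(), neural_net[1].dW[0, 0]) # the two numbers should nearly agree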
We now have every building block for training a neural network.
Once we know the sensitivities of weights and biases, we try to minimize (hence the minus sign) the loss metric iteratively by gradient descent, using the following update rule:

W = W - learning_rate × dW
b = b - learning_rate × db

where dW = ∂e/∂W and db = ∂e/∂b are the gradients computed by backward_prop.
learning_rate = 0.01
max_epoch = 1000000

for epoch in range(1, max_epoch+1):
    y_hat_train = forward_prop(X_train) # update y_hat
    backward_prop(y_train, y_hat_train) # update (dW, db)

    for layer_index in range(1, len(layers_dim)): # update (W, b)
        neural_net[layer_index].W = neural_net[layer_index].W - learning_rate * neural_net[layer_index].dW
        neural_net[layer_index].b = neural_net[layer_index].b - learning_rate * neural_net[layer_index].db

    if epoch % 100000 == 0:
        print(f'{get_loss(y_train, y_hat_train):.4f}')
0.3502
0.3450
0.3450
0.3450
0.3450
0.3450
0.3450
0.3450
0.3450
0.3450
The training loss goes down as training iterates, quickly plateauing in this run.
The model generalizes well if the testing loss is not much higher than the training loss. We also make some test cases to see how the model performs.
# test loss
get_loss(y_test, forward_prop(X_test))
Output:
0.3194238845483295
def predict(X_raw_any):
    X_any = np.array([standardize(X_raw_any[row,:], X_scalers[row]) for row in range(X_num_row)])
    y_hat = forward_prop(X_any)
    y_hat_any = np.array([unstandardize(y_hat[row,:], y_scalers[row]) for row in range(y_num_row)])
    return y_hat_any
predict(np.array([[30,70],[70,30],[3,5],[888,122]]).T)
Output:
array([[ 98.56378114, 102.63221611, 5.56349031, 562.15335048],
[-26.25088759, 14.80155747, 21.3245024 , 221.60597885],
[ 48.08520947, 25.18205121, 15.71597463, -64.01423955]])
This is how you can build a neural net from scratch using NumPy in 9 steps. Some of you might have already built neural nets using high-level frameworks such as TensorFlow, PyTorch, or Keras. However, building a neural net using only low-level libraries enables us to truly understand the mathematics behind the mystery.
My implementation is by no means the most efficient way to build and train a neural net. There is so much room for improvement, but that is a story for another day. The code is available on GitHub. Happy coding!
Source: https://github.com/edenau/ML-in-NumPy/blob/master/neural-net.ipynb
#data-science #machine-learning #python #numpy