# Neural Networks (Multilayer Perceptron - MLP)¶

## 1. Background¶

In this tutorial I describe the basic idea behind a simple neural network, and provide a couple of simple examples using the scikit-learn toolbox.

Neural networks, or multilayer perceptrons (MLPs), are a biologically inspired technique for classification and regression. A neuron, or cell unit, is modelled as a logistic regression model; the idea is then to stack many of these neurons together in order to model complex functions. In fact, it can be shown that, given enough hidden units, a neural network can approximate any continuous function.

The simplest neural network takes input features X, and a target y, and uses a hidden layer to learn a non-linear function approximator for either classification or regression.

A one-hidden-layer MLP learns the function $f(x) = W_2 g(W_1^T x + b_1) + b_2$, where (for a single hidden unit and scalar output) $W_1 \in \mathbf{R}^m$ and $W_2, b_1, b_2 \in \mathbf{R}$ are model parameters. $W_1$ and $W_2$ represent the weights of the input layer and hidden layer, and $b_1$ and $b_2$ represent the biases added to the hidden layer and the output layer, respectively. $g(\cdot) : \mathbf{R} \rightarrow \mathbf{R}$ is the non-linear activation function.

In such a network, the leftmost layer, known as the input layer, consists of a set of neurons $\{x_i \mid i = 1, 2, \ldots, m\}$ representing the input features. Each neuron in the hidden layer transforms the values from the previous layer with a weighted linear summation $w_1x_1 + w_2x_2 + \ldots + w_mx_m$, followed by a non-linear activation function $g(\cdot): \mathbf{R} \rightarrow \mathbf{R}$, such as the hyperbolic tangent or a rectified linear unit (ReLU). The output layer receives the values from the last hidden layer and transforms them into output values.
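The forward pass described above can be sketched in NumPy. The layer sizes and the randomly drawn parameters here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

m, h = 4, 3                    # m input features, h hidden units (illustrative sizes)
x = rng.normal(size=m)         # a made-up input sample

# hypothetical randomly-initialized parameters
W1 = rng.normal(size=(m, h))   # input-to-hidden weights
b1 = rng.normal(size=h)        # hidden-layer bias
W2 = rng.normal(size=h)        # hidden-to-output weights
b2 = rng.normal()              # output bias

g = np.tanh                    # non-linear activation

# forward pass: f(x) = W2 . g(W1^T x + b1) + b2
hidden = g(W1.T @ x + b1)      # weighted linear summation, then activation
f_x = W2 @ hidden + b2         # output layer combines the hidden values
print(f_x)
```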

For binary classification, $f(x)$ passes through the logistic function $g(z)=1/(1+e^{-z})$ to obtain output values between zero and one. A threshold, typically set to 0.5, assigns samples with outputs greater than or equal to 0.5 to the positive class, and the rest to the negative class.
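A minimal sketch of this thresholding step, using made-up values of $f(x)$:

```python
import numpy as np

def logistic(z):
    # g(z) = 1 / (1 + e^{-z}), squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-2.0, 0.0, 3.0])   # hypothetical values of f(x)
probs = logistic(scores)
labels = (probs >= 0.5).astype(int)   # threshold at 0.5
print(probs, labels)                   # labels: [0, 1, 1]
```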

If there are more than two classes, $f(x)$ itself is a vector of size $K$, the number of classes. Instead of passing through the logistic function, it passes through the softmax function, written as,

$$\text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{l=1}^K\exp(z_l)}$$

where $z_i$ represents the $i$-th element of the input to softmax, which corresponds to class $i$, and $K$ is the number of classes. The result is a vector containing the probabilities that sample $x$ belongs to each class. The output is the class with the highest probability.
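The softmax step can be sketched as follows. The scores are hypothetical; subtracting the maximum before exponentiating is a standard numerical-stability trick that leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the ratio is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])    # hypothetical scores f(x) for K = 3 classes
p = softmax(z)                    # probabilities, one per class
predicted = int(np.argmax(p))     # output: the class with the highest probability
print(p, predicted)
```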

In regression, the output remains as $f(x)$; therefore, the output activation function is just the identity function.

MLP uses different loss functions depending on the problem type. The loss function for classification is Cross-Entropy, which in the binary case is given as,

$$Loss(\hat{y},y,W) = -y \ln {\hat{y}} - (1-y) \ln{(1-\hat{y})} + \alpha ||W||_2^2$$

where $\alpha ||W||_2^2$ is an L2-regularization term (aka penalty) that penalizes complex models, and $\alpha > 0$ is a hyperparameter that controls the magnitude of the penalty.
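A direct transcription of this loss, with made-up values for $\hat{y}$ and the weights, shows how confidently wrong predictions are punished far more than confidently right ones:

```python
import numpy as np

def bce_loss(y_hat, y, W, alpha):
    # -y ln(y_hat) - (1 - y) ln(1 - y_hat) + alpha * ||W||_2^2
    data_term = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
    penalty = alpha * np.sum(W ** 2)   # L2 penalty on the weights
    return data_term + penalty

W = np.array([0.5, -0.25])                          # hypothetical weight vector
confident_right = bce_loss(0.9, 1, W, alpha=0.0)    # approx 0.105
confident_wrong = bce_loss(0.9, 0, W, alpha=0.0)    # approx 2.303
print(confident_right, confident_wrong)
```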

For regression, MLP uses the Squared Error loss function, written as,

$$Loss(\hat{y},y,W) = \frac{1}{2}||\hat{y} - y ||_2^2 + \frac{\alpha}{2} ||W||_2^2$$
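In code, with made-up predictions, targets, and weights:

```python
import numpy as np

def squared_error_loss(y_hat, y, W, alpha):
    # (1/2) ||y_hat - y||_2^2 + (alpha/2) ||W||_2^2
    return 0.5 * np.sum((y_hat - y) ** 2) + 0.5 * alpha * np.sum(W ** 2)

y_hat = np.array([1.5, 2.0])   # hypothetical predictions
y = np.array([1.0, 2.5])       # hypothetical targets
W = np.array([0.5, -0.5])      # hypothetical weights
loss = squared_error_loss(y_hat, y, W, alpha=0.1)
print(loss)                     # 0.25 (data term) + 0.025 (penalty) = 0.275
```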

Starting from initial random weights, the MLP minimizes the loss function by repeatedly updating these weights. The update step is accomplished through **backpropagation**: after computing the loss, a backward pass propagates it from the output layer back through the previous layers, providing each weight parameter with an update value meant to decrease the loss.
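This loop can be sketched for a tiny one-hidden-layer regression network. The input, target, and learning rate below are arbitrary, and the hand-coded gradients simply apply the chain rule layer by layer:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)           # one arbitrary training sample
y = 1.0                          # arbitrary regression target
lr = 0.1                         # learning rate

# randomly initialized parameters of a one-hidden-layer MLP, 2 hidden units
W1 = rng.normal(size=(3, 2)); b1 = np.zeros(2)
W2 = rng.normal(size=2);      b2 = 0.0

losses = []
for step in range(20):
    # forward pass
    h = np.tanh(W1.T @ x + b1)
    y_hat = W2 @ h + b2
    losses.append(0.5 * (y_hat - y) ** 2)
    # backward pass: propagate the error from the output back to each weight
    d_out = y_hat - y                    # dLoss/dy_hat
    dW2, db2 = d_out * h, d_out
    d_pre = d_out * W2 * (1 - h ** 2)    # through tanh: tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = np.outer(x, d_pre), d_pre
    # gradient-descent update meant to decrease the loss
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(losses[0], losses[-1])
```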

A detailed explanation of neural networks can be found in the video series by Nando de Freitas, and also in the deep learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.

```
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier

# fetch_mldata was removed from scikit-learn; fetch_openml is its replacement
mnist = fetch_openml("mnist_784", as_frame=False)
# rescale the data, use the traditional train/test split
X, y = mnist.data / 255., mnist.target
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

# a larger network trained for longer would score higher, e.g.:
# mlp = MLPClassifier(hidden_layer_sizes=(100, 100), max_iter=400, alpha=1e-4,
#                     solver='sgd', verbose=10, tol=1e-4, random_state=1)
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=10, alpha=1e-4,
                    solver='sgd', verbose=10, tol=1e-4, random_state=1,
                    learning_rate_init=.1)
mlp.fit(X_train, y_train)

print("Training set score: %f" % mlp.score(X_train, y_train))
print("Test set score: %f" % mlp.score(X_test, y_test))

# visualize the input-to-hidden weights of the first 16 hidden units as images
fig, axes = plt.subplots(4, 4)
# use global min / max to ensure all weights are shown on the same scale
vmin, vmax = mlp.coefs_[0].min(), mlp.coefs_[0].max()
for coef, ax in zip(mlp.coefs_[0].T, axes.ravel()):
    ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray, vmin=.5 * vmin,
               vmax=.5 * vmax)
    ax.set_xticks(())
    ax.set_yticks(())
plt.show()
```
