Neural nets/Deep learning

Reading: Elements of Statistical Learning, Sections 11.3-11.8

Review: The brain

Wikipedia's illustration

Idea:

Neural networks

Neural networks are made up of units that are supposed to mimic neurons in the brain:

Activation functions:

Any non-linear activation function allows the net to go beyond linear functions of the input

Activation functions should be smooth for fitting purposes (gradient descent)
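As a concrete illustration, here is a minimal sketch (Python/NumPy; function names are mine, not from the text) of the sigmoid, tanh, and ReLU activations together with the derivatives that gradient descent needs. Note that ReLU is not differentiable at zero, which is one reason smooth activations like the sigmoid are convenient for the derivations below.

```python
# A minimal sketch of common activation functions and the derivatives
# gradient descent needs (names are illustrative).
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_prime(v):
    s = sigmoid(v)
    return s * (1.0 - s)

def tanh_prime(v):
    return 1.0 - np.tanh(v) ** 2

def relu(v):
    return np.maximum(0.0, v)

def relu_prime(v):
    # ReLU is not differentiable at 0; the usual convention is to use 0 there.
    return (v > 0).astype(float)
```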

Neural net structures: putting the units together

Multiple hidden layers vs. one hidden layer

Special cases:

Neural nets for regression

Notice that the net is just a fancy function of the inputs, parameterized by the weights. Therefore, we can choose the weights so that the net predicts a response, just like in standard linear regression.
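To make that concrete, here is a minimal sketch of a one-hidden-layer regression net in the notation used below: hidden units \(z_m = \sigma(\alpha_m^T x)\) and output \(f(x) = g(\beta^T z)\). Intercept terms are omitted, and the shapes and names are illustrative assumptions rather than anything fixed by the text.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def predict(x, alpha, beta, g=lambda v: v):
    """One-hidden-layer net: x is (p,), alpha is (M, p), beta is (M,).
    For regression, the output activation g is usually the identity."""
    z = sigmoid(alpha @ x)   # hidden units z_m = sigma(alpha_m^T x)
    return g(beta @ z)       # output f(x) = g(beta^T z)
```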

Backpropagation derivation

Simple case:

Derivative for the weights connecting the hidden layer to the output layer: \[ \frac{\partial R_i}{\partial \beta_{m}} = -2(y_i - f(x_i)) g'(\beta^T z_i) z_{mi} \]

Derivative for the weights connecting the input layer to the hidden layer: \[ \frac{\partial R_i}{\partial \alpha_{ml}} = -2(y_i - f(x_i)) g'(\beta^T z_i) \beta_m\sigma'(\alpha_m^T x_i) x_{il} \]

The gradient descent update is then: \[ \begin{align*} \beta_m^{(r+1)} &= \beta_{m}^{(r)} - \gamma_r \sum_{i=1}^N \frac{\partial R_i}{\partial \beta_{m}^{(r)}}\\ \alpha_{ml}^{(r+1)} &= \alpha_{ml}^{(r)} - \gamma_r \sum_{i=1}^N \frac{\partial R_i}{\partial \alpha_{ml}^{(r)}} \end{align*} \]

\(\gamma_r\) is referred to as the "learning rate"; we've seen it before as the step size in gradient descent.
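Here is a minimal sketch of these gradients and one gradient-descent update for the squared-error loss \(R_i = (y_i - f(x_i))^2\), assuming the same one-hidden-layer setup as above; shapes and names are illustrative. For regression, \(g\) is typically the identity, so \(g' = 1\).

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_prime(v):
    s = sigmoid(v)
    return s * (1.0 - s)

def gradient_step(X, y, alpha, beta, gamma, g=lambda v: v, g_prime=lambda v: 1.0):
    """One gradient-descent update; X is (N, p), y is (N,),
    alpha is (M, p), beta is (M,), gamma is the learning rate."""
    grad_alpha = np.zeros_like(alpha)
    grad_beta = np.zeros_like(beta)
    for x_i, y_i in zip(X, y):
        z = sigmoid(alpha @ x_i)                      # hidden units z_i
        resid = -2.0 * (y_i - g(beta @ z)) * g_prime(beta @ z)
        grad_beta += resid * z                        # dR_i / d beta_m
        grad_alpha += np.outer(resid * beta * sigmoid_prime(alpha @ x_i), x_i)  # dR_i / d alpha_ml
    return alpha - gamma * grad_alpha, beta - gamma * grad_beta
```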

Back-propagation equations, a.k.a. "what order do we do the computations in?"

Write \[ \begin{align*} \frac{\partial R_i}{\partial \beta_{m}} &= \delta_{i} z_{mi} \\ \frac{\partial R_i}{\partial \alpha_{ml}} &= s_{mi} x_{il} \end{align*} \] where \[ \begin{align*} \delta_i &= -2(y_i - f(x_i))g'(\beta^T z_i) \\ s_{mi} &= -2(y_i - f(x_i)) g'(\beta^T z_i) \beta_m \sigma'(\alpha_m^T x_i) \end{align*} \] and so \[ s_{mi} = \sigma'(\alpha_m^T x_i) \beta_m \delta_i \]

Interpretation: \(\delta_i\) and \(s_{mi}\) are the "errors" from the current model on the output layer and the hidden layers, respectively.

Finally, the backpropagation algorithm to compute the gradients:

Forward pass: using the current weights, compute the hidden-layer values \(z_{mi} = \sigma(\alpha_m^T x_i)\) and the fitted values \(f(x_i) = g(\beta^T z_i)\).

Backward pass: compute the output-layer errors \(\delta_i\), then back-propagate them to get the hidden-layer errors \(s_{mi} = \sigma'(\alpha_m^T x_i) \beta_m \delta_i\); the gradients follow as \(\delta_i z_{mi}\) and \(s_{mi} x_{il}\).
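A minimal sketch of the two passes for a single observation, reorganizing the same gradient computation as above (same illustrative one-hidden-layer setup and names): the forward pass computes \(z_{mi}\) and \(f(x_i)\), and the backward pass computes \(\delta_i\) and then \(s_{mi}\), from which both gradients follow.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_prime(v):
    s = sigmoid(v)
    return s * (1.0 - s)

def backprop_gradients(x_i, y_i, alpha, beta, g=lambda v: v, g_prime=lambda v: 1.0):
    # Forward pass: compute the hidden units and the prediction
    # with the current weights.
    a = alpha @ x_i                  # alpha_m^T x_i
    z = sigmoid(a)                   # z_mi
    f = g(beta @ z)                  # f(x_i)
    # Backward pass: output-layer error, then hidden-layer errors.
    delta = -2.0 * (y_i - f) * g_prime(beta @ z)   # delta_i
    s = sigmoid_prime(a) * beta * delta            # s_mi = sigma'(alpha_m^T x_i) beta_m delta_i
    # The gradients are assembled from the errors.
    grad_beta = delta * z            # dR_i / d beta_m  = delta_i z_mi
    grad_alpha = np.outer(s, x_i)    # dR_i / d alpha_ml = s_mi x_il
    return grad_alpha, grad_beta
```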

Notes:

Issues with fitting:

Example: zip code data

Goal: Given images representing digits, classify them correctly.

Input data, \(x_i\), are \(16 \times 16\) grayscale images, represented as vectors in \(\mathbb R^{256}\)

Responses \(y_i\) give the digit in the image.

Encode this as a classification problem and fit neural nets with several different architectures.
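One standard way to do the encoding (a sketch, not necessarily the exact setup used in the example) is to turn each digit label into a vector of 10 zero/one targets, one per output unit:

```python
import numpy as np

def one_hot(y, n_classes=10):
    """y: array of digit labels in {0, ..., 9}."""
    Y = np.zeros((len(y), n_classes))
    Y[np.arange(len(y)), y] = 1.0
    return Y

one_hot(np.array([3, 0, 7]))  # rows with a single 1 in columns 3, 0, and 7
```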

Some net architectures

All cases have 10 output units, corresponding to the 10 possible digits, and in all cases the output units are sigmoidal.

Idea behind weight constraints: Each unit applies the same weights to a different part of the previous layer, so the units extract the same features from different parts of the image. A net with this sort of weight sharing is referred to as a convolutional network.
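A minimal sketch of the weight-sharing idea, with an illustrative filter size rather than the exact architectures from the example: the same small weight patch is applied at every location of the image, so every hidden unit in the layer extracts the same feature from a different part of the image.

```python
import numpy as np

def shared_weight_layer(image, w, activation=np.tanh):
    """image: (16, 16) array; w: (k, k) weights shared across all locations."""
    k = w.shape[0]
    out = np.zeros((image.shape[0] - k + 1, image.shape[1] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + k, j:j + k]
            out[i, j] = activation(np.sum(w * patch))  # same w at every location
    return out

# Example: a 3x3 filter applied to a 16x16 image gives a 14x14 layer of
# hidden units, all sharing the same 9 weights.
```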

Summing up