Neural nets/Deep learning

Review: The brain

Wikipedia’s illustration

Idea: networks of many simple, connected units (like the brain’s neurons) can represent very complicated functions of their inputs

Neural networks

Neural networks are made up of units that are supposed to mimic neurons in the brain:

Activation functions:

Any non-linear activation function allows the net to go beyond linear functions of the input

Activation functions should be smooth for fitting purposes (gradient descent needs their derivatives)
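As a quick illustration (a sketch of mine, not a figure from the original notes), here are three commonly used activation functions in base R. Note that the ReLU has a kink at 0 and so is not differentiable there, but it is widely used in practice anyway (the keras example later in these notes uses it):

sigmoid = function(v) 1 / (1 + exp(-v))   ## sigmoid: smooth, values in (0, 1)
relu = function(v) pmax(0, v)             ## ReLU: piecewise linear, kink at 0
v = seq(-4, 4, length.out = 200)
plot(v, sigmoid(v), type = "l", ylim = c(-1, 2), ylab = "activation")
lines(v, tanh(v), lty = 2)                ## tanh: smooth, values in (-1, 1)
lines(v, relu(v), lty = 3)
legend("topleft", legend = c("sigmoid", "tanh", "ReLU"), lty = 1:3)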

Neural net structures: putting the units together

Multiple hidden layers vs. one hidden layer

Special cases:

Neural nets for regression

Notice that the net is just a fancy function of the inputs, parameterized by the weights. Therefore, we can choose the weights so that the net predicts a response, just like in linear regression.

Backpropagation derivation

Simple case:
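A single hidden layer with \(M\) units, a single quantitative output, and squared-error loss. To fix the notation used below (these are the definitions the derivatives that follow rely on; \(\sigma\) is the hidden-unit activation function and \(g\) is the output function, often the identity for regression): \[ \begin{align*} z_{im} &= \sigma(\alpha_{m0} + \alpha_m^T x_i), \quad m = 1, \ldots, M, \\ f(x_i) &= g(\beta_0 + \beta^T z_i), \\ R_i &= (y_i - f(x_i))^2 \end{align*} \]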

Derivative for the weights connecting the hidden layer to the output layer for one sample: \[ \frac{\partial R_i}{\partial \beta_{m}} = \begin{cases} -2(y_i - f(x_i)) g'(\beta_0 + \beta^T z_i) z_{im} & m = 1,\ldots, M \\ -2(y_i - f(x_i)) g'(\beta_0 + \beta^T z_i) & m = 0 \end{cases} \]

Derivative for the weights connecting the input layer to the hidden layer for one sample: \[ \frac{\partial R_i}{\partial \alpha_{ml}} = \begin{cases} -2(y_i - f(x_i)) g'(\beta_0 + \beta^T z_i) \beta_m\sigma'(\alpha_{m0} + \alpha_m^T x_i) x_{il} & l = 1, \ldots, p \\ -2(y_i - f(x_i)) g'(\beta_0 + \beta^T z_i) \beta_m\sigma'(\alpha_{m0} + \alpha_m^T x_i) & l = 0 \end{cases} \]

Gradient descent update is then: \[ \begin{align*} \beta_m^{(r+1)} = \beta_{m}^{(r)} - \gamma_r \sum_{i=1}^N \frac{\partial R_i}{\partial \beta_{m}^{(r)}}\\ \alpha_{ml}^{(r+1)} = \alpha_{ml}^{(r)} - \gamma_r \sum_{i=1}^N \frac{\partial R_i}{\partial \alpha_{ml}^{(r)}} \end{align*} \]

\(\gamma_r\) is referred to as the “learning rate”; we have seen it before as the step size in gradient descent.

Back-propagation equations, aka “what order do we do the computations in”?

Write \[ \begin{align*} \frac{\partial R_i}{\partial \beta_{m}} &= \delta_{i} z_{im} \\ \frac{\partial R_i}{\partial \alpha_{ml}} &= s_{im} x_{il} \end{align*} \] Matching with the derivatives above, \[ \begin{align*} \delta_i &= -2(y_i - f(x_i))g'(\beta_0 + \beta^T z_i) \\ s_{im} &= -2(y_i - f(x_i)) g'(\beta_0 + \beta^T z_i) \beta_m \sigma'(\alpha_{m0} + \alpha_m^T x_i) \end{align*} \] and therefore \[ s_{im} = \sigma'(\alpha_{m0} + \alpha_m^T x_i) \beta_m \delta_i \]

Back-propagation: first a forward pass computes the \(z_{im}\) and the fitted values \(f(x_i)\); then a backward pass computes the output-layer errors \(\delta_i\) and passes them back through \(s_{im} = \sigma'(\alpha_{m0} + \alpha_m^T x_i) \beta_m \delta_i\) to get the hidden-layer errors.

Interpretation: \(\delta_i\) and \(s_{im}\) are the “errors” from the current model on the output layer and the hidden layers, respectively.
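To make the order of the computations concrete, here is a minimal R sketch of one full gradient-descent update for this simple network, assuming sigmoid hidden units (so \(\sigma'(v) = \sigma(v)(1-\sigma(v))\)) and an identity output (so \(g'(v) = 1\)); the function and variable layout are my own choices, not code from the notes:

## One gradient-descent update (one pass over all N samples) for a
## single-hidden-layer regression net.
## x: N x p matrix, y: length-N response,
## alpha: M x (p+1) hidden-layer weights (bias in column 1),
## beta: length M+1 output weights (bias first), gamma: learning rate.
sigmoid = function(v) 1 / (1 + exp(-v))
backprop_step = function(x, y, alpha, beta, gamma) {
    N = nrow(x); M = nrow(alpha)
    ## forward pass: hidden activations and fitted values
    z = sigmoid(cbind(1, x) %*% t(alpha))          # N x M
    f = drop(cbind(1, z) %*% beta)                 # length N (identity output)
    ## backward pass: output-layer and hidden-layer "errors"
    delta = -2 * (y - f)                           # length N (g' = 1)
    s = (delta * z * (1 - z)) * matrix(beta[-1], N, M, byrow = TRUE)  # N x M
    ## gradients, summed over samples
    grad_beta = colSums(delta * cbind(1, z))       # length M+1
    grad_alpha = t(s) %*% cbind(1, x)              # M x (p+1)
    ## gradient-descent update
    list(alpha = alpha - gamma * grad_alpha, beta = beta - gamma * grad_beta)
}

Iterating this (with a decreasing learning rate \(\gamma_r\)) is exactly the gradient descent scheme written out above.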

Notes:

Issues with fitting:

Example: zip code data

Goal: Given images representing digits, classify them correctly.

Input data, \(x_i\), are \(16 \times 16\) grayscale images, represented as vectors in \(\mathbb R^{256}\)

Responses \(y_i\) give the digit in the image.

Encode this as a classification problem and fit neural nets with different architectures (the output encoding is sketched below)

If you want to play with this in R
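Concretely, the encoding used in the keras example that follows is: one-hot encode the digit \(y_i\) as 10 indicator responses, give the net 10 output units, and turn the last layer’s linear combinations \(T_k\) into class probabilities with the softmax function (the exact form of \(T_k\) depends on the architecture; the class with the largest probability is the prediction): \[ f_k(x) = \frac{e^{T_k}}{\sum_{l=1}^{10} e^{T_l}}, \qquad k = 1, \ldots, 10 \]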

Example: the same task, with the MNIST digit data in keras

## if you want to do this you'll have to install the python version of keras first, which requires you to have TensorFlow, CNTK, or Theano installed as well
library(keras)
mnist = dataset_mnist()
x_train = mnist$train$x
y_train = mnist$train$y
y_train_matrix = to_categorical(y_train, num_classes = 10)
x_test = mnist$test$x
y_test = mnist$test$y

Let’s look at some of the images:

## flip/transpose the image matrix so that image() displays it in the usual orientation
flip_image = function(x) {
    n = nrow(x)
    return(t(x[n:1,]))
}
par(mfrow = c(3,3))
for(i in 1:9) {
    image(flip_image(x_train[i,,]), col = topo.colors(100), axes = FALSE,
          main = y_train[i])
}

model = keras_model_sequential()
model %>%
  layer_flatten(input_shape = c(28, 28)) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dense(units = 10, activation = 'softmax')
model %>% compile(
    optimizer = 'adam', 
    loss = loss_categorical_crossentropy,
    metrics = 'accuracy'
)
model
## Model
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## flatten_1 (Flatten)                 (None, 784)                     0           
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 128)                     100480      
## ________________________________________________________________________________
## dense_3 (Dense)                     (None, 10)                      1290        
## ================================================================================
## Total params: 101,770
## Trainable params: 101,770
## Non-trainable params: 0
## ________________________________________________________________________________
## number of parameters for the first layer: each hidden unit has a weight associated with each of the 784 predictor units, plus a bias term
(784 + 1) * 128
## [1] 100480
## number of parameters for the second layer: each output unit has a weight associated with each of the 128 hidden units, plus a bias term
(128 + 1)* 10
## [1] 1290

Fit the model, look at the predictions:

model %>% fit(x = x_train, y = y_train_matrix, epochs = 15)
test_predictions = model %>% predict_classes(x_test)
par(mfrow = c(3,3))
for(i in 1:9) {
    image(flip_image(x_test[i,,]), col = topo.colors(100), axes = FALSE,
          main = sprintf("True digit: %i, Prediction: %i", y_test[i], test_predictions[i]))
}
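To summarize performance with a single number, one can also compute the overall test-set loss and accuracy; a minimal sketch (this output is not shown in the notes):

## overall loss and accuracy on the held-out test images
model %>% evaluate(x_test, to_categorical(y_test, num_classes = 10))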

More elaborate architectures do much better, for example the convolutional models described below.

Some net architectures

All cases: 10 output units, corresponding to the 10 possible digits.

Idea behind weight constraints: Each unit computes the same functional of the previous layer, so they are extracting the same features from different parts of the image. A net with this sort of weight sharing is referred to as a convolutional network.
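As a rough sketch of what such a convolutional architecture could look like in keras (this particular layer configuration is an illustration of mine, not one of the architectures fit in these notes):

## A small convolutional net for the 28 x 28 digit images.
## Each convolutional filter applies the same weights across the whole image
## (the weight sharing described above).
conv_model = keras_model_sequential()
conv_model %>%
  layer_reshape(target_shape = c(28, 28, 1), input_shape = c(28, 28)) %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dense(units = 10, activation = 'softmax')
conv_model %>% compile(
    optimizer = 'adam',
    loss = loss_categorical_crossentropy,
    metrics = 'accuracy'
)
## then fit as before: conv_model %>% fit(x = x_train, y = y_train_matrix, epochs = 15)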

Summing up