A Recurrent Neural Network (RNN) is a feedforward network (FFN) with a loop that allows the network to retain information from a previous timestep in sequential data. Both RNN and FFN models are able to “remember” information, but they differ in how, and to what extent, that happens. Say we’re training an image classifier for handwritten digits. A feedforward network learns what each image looks like in its entirety, whereas a recurrent neural network learns the order in which the columns of the image occur. Both networks then use that learned knowledge to classify new samples in production.
Figure 1: Input for feedforward network (FFN) and recurrent neural network (RNN)
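To make the difference in Figure 1 concrete, here is a minimal NumPy sketch of how the same digit image might be presented to each kind of network. The 28×28 image size and the names `image`, `ffn_input` and `rnn_input` are assumptions made for illustration only.

```python
import numpy as np

# Stand-in for one handwritten-digit image (assumed 28x28, values are dummy data).
image = np.arange(28 * 28, dtype=np.float32).reshape(28, 28)

# Feedforward network: the whole image is flattened into a single input vector.
ffn_input = image.reshape(-1)   # shape (784,)

# Recurrent network: the image is read as a sequence of columns,
# so rnn_input[t] is column t of the image, fed at time step t.
rnn_input = image.T             # shape (28, 28) = (time steps, features)
```

The feedforward input has no notion of order, while the recurrent input is consumed one column per time step.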
An RNN’s main appeal is that, once it has been presented with the entire sequence, it can produce a feature vector that summarizes the data. This is possible because at every time step the cell receives its own previous hidden state, so each step takes the preceding one into account. This allows RNNs to better model data with temporal structure, and thus to learn sequences. The rolled and unrolled RNN architecture can be seen in Fig. 2.
2. Forward Propagation
During forward propagation, information in the RNN is propagated through the network one time step at a time. This can be mathematically defined as in Equation (1):
$$h_t = g(W x_t + U h_{t-1}), \qquad \hat{y}_t = g(V h_t) \tag{1}$$
where $x_t$ is the input, $\hat{y}_t$ is the output, $h_{t-1}$ is the previous hidden state, and W, V and U are their respective trainable weight matrices. The activation function is represented by g(.), and we’ll be considering softmax (σ) for this article.
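The forward pass can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes that W multiplies the input, U the previous hidden state and V the hidden state that produces the output, i.e. h_t = g(W x_t + U h_{t-1}) and ŷ_t = g(V h_t), with softmax as g. The function names, layer sizes and random data are all hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, W, U, V, h0, g=softmax):
    """Run the RNN over a sequence, one time step at a time."""
    h = h0
    hidden_states, outputs = [], []
    for x_t in xs:
        h = g(W @ x_t + U @ h)   # new hidden state from input and previous state
        y_hat_t = g(V @ h)       # output at this time step
        hidden_states.append(h)
        outputs.append(y_hat_t)
    return np.stack(hidden_states), np.stack(outputs)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 4, 3, 4, 5
W = rng.normal(size=(n_hidden, n_in))
U = rng.normal(size=(n_hidden, n_hidden))
V = rng.normal(size=(n_out, n_hidden))
xs = np.eye(n_in)[rng.integers(0, n_in, size=T)]   # T one-hot input vectors

hs, y_hats = rnn_forward(xs, W, U, V, h0=np.zeros(n_hidden))
```

Each row of `y_hats` is a probability distribution over the output classes, and the last row of `hs` is the hidden state after the whole sequence has been seen.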
3. Backpropagation Through Time (BPTT)
Recurrent Neural Networks are trained with Backpropagation Through Time (BPTT), a modified backpropagation algorithm applied to the unrolled RNN. The goal of backpropagation is to iteratively minimize the error of the network by updating its weights so that the network is able to output values closer to the intended target.
Calculating the loss
After the forward pass in Equation (1), the algorithm calculates the loss between the predicted and correct words so that the backward pass can be performed. Considering the standard logistic regression loss, or cross-entropy loss, we have:
$$E_t(y_t, \hat{y}_t) = -y_t \log \hat{y}_t, \qquad E(y, \hat{y}) = \sum_t E_t(y_t, \hat{y}_t) = -\sum_t y_t \log \hat{y}_t \tag{2}$$
where $y_t$ and $\hat{y}_t$ represent the correct and predicted word at time step t, $E_t$ is the loss for the word at time step t and E is the total loss for a full sentence.
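As a small worked example of this loss, the sketch below assumes one-hot targets y_t and predicted probability vectors ŷ_t, so that the loss at step t is −Σ y_t log ŷ_t and the sentence loss is the sum over time steps. The function names, the `eps` safeguard against log(0) and the toy numbers are assumptions added for illustration.

```python
import numpy as np

def step_loss(y_t, y_hat_t, eps=1e-12):
    """Cross-entropy loss for one time step (y_t is a one-hot target)."""
    return -np.sum(y_t * np.log(y_hat_t + eps))

def total_loss(ys, y_hats):
    """Total loss for a sequence: the sum of the per-step losses."""
    return sum(step_loss(y_t, y_hat_t) for y_t, y_hat_t in zip(ys, y_hats))

# Toy two-word "sentence" over a three-word vocabulary.
ys = np.array([[0, 1, 0],
               [1, 0, 0]])
y_hats = np.array([[0.2, 0.5, 0.3],
                   [0.7, 0.2, 0.1]])

E = total_loss(ys, y_hats)   # equals -(log 0.5 + log 0.7) up to eps
```

Only the probability assigned to the correct word contributes at each step, which is why the loss shrinks as that probability approaches 1.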
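The goal of BPTT is to allow the network to learn appropriate U, V and W by Stochastic Gradient Descent. These weights are updated as in Eq. (3):
$$U \leftarrow U - \eta \frac{\partial E}{\partial U}, \qquad V \leftarrow V - \eta \frac{\partial E}{\partial V}, \qquad W \leftarrow W - \eta \frac{\partial E}{\partial W} \tag{3}$$
where η is the learning rate. We therefore need to calculate the gradient of the error with respect to each of these weights. For simplicity, since the total loss is a sum of each word’s loss, we choose to calculate the gradients with respect to only a single time step’s loss, $E_t$. Additionally, biases will not be considered.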
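The gradient descent update itself is simple once the gradients are known. The sketch below assumes the weights and their gradients are stored in dictionaries keyed by name; `sgd_update`, the 2×2 shapes and the constant placeholder gradients are hypothetical and stand in for the values BPTT would actually compute.

```python
import numpy as np

def sgd_update(weights, grads, eta=0.1):
    """Return new weights after one step: weight - eta * dE/dweight."""
    return {name: weights[name] - eta * grads[name] for name in weights}

# Placeholder weights and gradients, for illustration only.
weights = {"U": np.ones((2, 2)), "V": np.ones((2, 2)), "W": np.ones((2, 2))}
grads = {name: np.full((2, 2), 0.5) for name in weights}

updated = sgd_update(weights, grads, eta=0.1)
```

Returning a new dictionary rather than modifying the weights in place keeps the old values available, which can be convenient when comparing the loss before and after a step.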
Figure 2: Architecture of a Recurrent Neural Network
Calculating gradient of error with respect to V
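$V$ is related to the loss only through the output $\hat{y}_t$. Using the chain rule, with $z_t = V h_t$ the input to the output activation and $h_t$ the hidden state:
$$\frac{\partial E_t}{\partial V} = \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial z_t} \frac{\partial z_t}{\partial V} \tag{4}$$
and, knowing that the derivative of the sigmoid activation function σ(.) is σ(.)(1 − σ(.)), which also gives the diagonal terms of the softmax derivative,
$$\frac{\partial \hat{y}_t}{\partial z_t} = \hat{y}_t (1 - \hat{y}_t) \tag{5}$$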
By substituting (4) and (5) in Eq. (3), we have: