Recurrent Neural Network (RNN)

1.   Introduction

A Recurrent Neural Network (RNN) is a feedforward network (FFN) with a loop that allows the network to retain information from previous timesteps in sequential data. Both RNN and FFN models are able to “remember” information but differ in how, and to what extent, that happens. Say we’re training an image classifier for handwritten digits. A feedforward network learns what each image, in its entirety, looks like, whereas a recurrent neural network learns the sequence in which the columns of each image occur. Both networks then use that learned knowledge to classify given samples in production.



Figure 1: Input for feedforward network (FFN) and recurrent neural network (RNN)

RNN’s main appeal is that it is able to generate a feature vector summarizing the data once presented with the entire sequence. This happens because RNNs take the previous timestep of a sequence into consideration by receiving the cell’s previous hidden state. By considering the previous hidden state at every step, RNNs allow for better modeling of data with temporal properties and thus sequence learning. The rolled and unrolled RNN architecture can be seen in Fig. 2.

2.   Forward Propagation

During forward propagation, information in the RNN is propagated throughout the network one step at a time. This can be mathematically defined as in Equation (1):


$$h_t = g(W x_t + U h_{t-1})$$
$$\hat{y}_t = \sigma(V h_t) \qquad (1)$$

where $x_t$ is the input, $\hat{y}_t$ is the output, $h_{t-1}$ is the previous hidden state, and W, V and U are their respective trainable weight matrices. The activation function is represented by $g(\cdot)$, and we’ll be considering softmax ($\sigma$) for this article.
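As a purely illustrative sketch of Eq. (1), the forward pass can be written in a few lines of NumPy. The dimensions, the random initialization, and the choice of sigmoid for g are assumptions for this example, not part of the article’s setup:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W, U, V, h0):
    """Eq. (1) over a sequence: h_t = g(W x_t + U h_{t-1}), y_t = softmax(V h_t)."""
    h = h0
    hs, ys = [], []
    for x in xs:
        h = sigmoid(W @ x + U @ h)   # hidden state carries information forward
        y = softmax(V @ h)           # output distribution at this timestep
        hs.append(h)
        ys.append(y)
    return hs, ys

rng = np.random.default_rng(0)
n_in, n_h, n_out, T = 4, 5, 3, 6
W = rng.normal(size=(n_h, n_in)) * 0.1   # input weights
U = rng.normal(size=(n_h, n_h)) * 0.1    # recurrent weights
V = rng.normal(size=(n_out, n_h)) * 0.1  # output weights
xs = [rng.normal(size=n_in) for _ in range(T)]
hs, ys = rnn_forward(xs, W, U, V, np.zeros(n_h))
```

Note how the same W, U and V are reused at every timestep; only the hidden state changes.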

3.   Backpropagation Through Time (BPTT)

Recurrent Neural Networks are trained through a modified backpropagation algorithm on the unrolled RNN called Backpropagation Through Time (BPTT). The goal of backpropagation is to iteratively minimize the error of the network by updating its weights so that the network is able to output values closer to the intended target.

Calculating the loss

After the forward pass in Equation 1, this algorithm calculates the loss between the predicted and correct words in order to perform the backward pass. Considering the standard logistic regression loss, or cross-entropy loss, we have:


$$E_t(y_t, \hat{y}_t) = -\,y_t \log \hat{y}_t$$
$$E(y, \hat{y}) = \sum_t E_t(y_t, \hat{y}_t) = -\sum_t y_t \log \hat{y}_t \qquad (2)$$

where $y_t$ and $\hat{y}_t$ represent the correct and predicted word at time step t, $E_t$ is the loss for the word at time step t, and E is the total loss for a full sentence.
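Equation (2) can be sketched directly, assuming one-hot target vectors; the toy vocabulary and probabilities below are made up for illustration:

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Per-word loss E_t = -sum(y_t * log(y_hat_t)); y is one-hot."""
    return -np.sum(y * np.log(y_hat))

def total_loss(ys, y_hats):
    """E: sum of the per-word losses over the whole sentence."""
    return sum(cross_entropy(y, y_hat) for y, y_hat in zip(ys, y_hats))

# Toy sentence of two words over a 3-word vocabulary.
targets = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
preds   = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
E = total_loss(targets, preds)
```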

Updating weights

The goal of BPTT is to allow the network to learn appropriate U, V and W by Stochastic Gradient Descent. The mentioned weights are updated as in Eq. (3):

$$U \leftarrow U - \eta \frac{\partial E}{\partial U} \qquad V \leftarrow V - \eta \frac{\partial E}{\partial V} \qquad W \leftarrow W - \eta \frac{\partial E}{\partial W} \qquad (3)$$

where $\eta$ is the learning rate. We therefore need to calculate the gradient of the error with respect to each of these weights. For simplicity, since the total loss is the sum of each word’s loss, we choose to calculate the gradients with respect to only $E_3$. Additionally, biases will not be considered.

Figure 2: Architecture of a Recurrent Neural Network

Calculating gradient of error with respect to V

Related with $\hat{y}_3$. Using the Chain Rule:

$$\frac{\partial E_3}{\partial V} = \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial V} = \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial z_3}\,\frac{\partial z_3}{\partial V} \qquad (4)$$

where $z_3 = V h_3$ is the input to the output activation, and, knowing that the derivative of the sigmoid or softmax activation function $\sigma(\cdot)$ is $\sigma(\cdot)(1 - \sigma(\cdot))$,

$$\frac{\partial E_3}{\partial V} = (\hat{y}_3 - y_3) \otimes h_3 \qquad (5)$$

where $\otimes$ denotes the outer product. By substituting (4) and (5) in Eq. (3), we have:

$$V \leftarrow V - \eta\,(\hat{y}_3 - y_3) \otimes h_3$$

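A minimal sketch of this update, assuming a one-hot target, an illustrative hidden state, and zero-initialized V:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

eta = 0.1                                  # learning rate
h3 = np.array([0.2, 0.5, 0.3])             # hidden state at t = 3 (made-up values)
y3 = np.array([0.0, 1.0, 0.0, 0.0])        # correct word, one-hot (vocab of 4)
V = np.zeros((4, 3))                       # output weights

y3_hat = softmax(V @ h3)                   # predicted word distribution
dE_dV = np.outer(y3_hat - y3, h3)          # Eq. (5): (y_hat - y) outer h
V_new = V - eta * dE_dV                    # Eq. (3): one SGD step
```

Because the softmax probabilities sum to one, the columns of the gradient each sum to zero.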

Information sharing through time

Consider a model with 4 states as illustrated in Fig. 2. Calculating the changes in W and U is tricky because they involve information being transferred from one unrolled cell to another, unlike V. That’s where BPTT differs from the conventional backpropagation algorithm, since in the latter there’s no sharing of parameters between neurons of the same layer.

In our example, $E_3$ with respect to U is influenced solely by $h_3$, but $h_3$ is influenced by $h_2$, which is influenced by $h_1$, which is influenced by $h_0$. In other words, when calculating the derivative of $E_3$ with respect to W or U, one can’t simply consider $h_3$ as a constant, since it is also influenced by the same variables in previous timesteps.

Calculating gradient of error with respect to U

Related with h. Take into consideration the example in Fig. 2 and calculate the derivatives of $h_1$, $h_2$ and $h_3$ with respect to U using the Chain Rule:

$$\frac{\partial E_3}{\partial U} = \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial h_3}\,\frac{\partial h_3}{\partial U} + \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial h_3}\,\frac{\partial h_3}{\partial h_2}\,\frac{\partial h_2}{\partial U} + \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial h_3}\,\frac{\partial h_3}{\partial h_2}\,\frac{\partial h_2}{\partial h_1}\,\frac{\partial h_1}{\partial U} \qquad (6)$$

With $\frac{\partial h_3}{\partial h_2}\frac{\partial h_2}{\partial h_1}$ and $\frac{\partial h_3}{\partial h_2}$ being simplified as:

$$\frac{\partial h_3}{\partial h_k} = \prod_{j=k+1}^{3} \frac{\partial h_j}{\partial h_{j-1}}$$

And generalized as:

$$\frac{\partial E_3}{\partial U} = \sum_{k=1}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial h_3}\,\frac{\partial h_3}{\partial h_k}\,\frac{\partial h_k}{\partial U} \qquad (7)$$

where $\frac{\partial E_3}{\partial \hat{y}_3}$ was calculated in Eq. (4) and $\frac{\partial \hat{y}_3}{\partial h_3}$ can be calculated similarly to Eq. (5):

$$\frac{\partial \hat{y}_3}{\partial h_3} = \hat{y}_3(1 - \hat{y}_3)\,V$$

which can be used for the simplification step in Eq. 7:

$$\frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial h_3} = (\hat{y}_3 - y_3)\,V$$

As for $\frac{\partial h_3}{\partial h_k}$ and $\frac{\partial h_k}{\partial U}$, and knowing that $h_t = g(W x_t + U h_{t-1})$ with g the sigmoid,

$$\frac{\partial h_j}{\partial h_{j-1}} = h_j(1 - h_j)\,U \qquad \frac{\partial h_k}{\partial U} = h_k(1 - h_k) \otimes h_{k-1}$$

which can be used for the simplification step in Eq. 6:

$$\frac{\partial E_3}{\partial U} = \sum_{k=1}^{3} (\hat{y}_3 - y_3)\,V \left(\prod_{j=k+1}^{3} h_j(1 - h_j)\,U\right) h_k(1 - h_k) \otimes h_{k-1}$$

Calculating gradient of error with respect to W

Related with x. Similarly to U, using the Chain Rule,

$$\frac{\partial E_3}{\partial W} = \sum_{k=1}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial h_3}\,\frac{\partial h_3}{\partial h_k}\,\frac{\partial h_k}{\partial W}$$

In our example,

$$\frac{\partial E_3}{\partial W} = \sum_{k=1}^{3} (\hat{y}_3 - y_3)\,V \left(\prod_{j=k+1}^{3} h_j(1 - h_j)\,U\right) h_k(1 - h_k) \otimes x_k$$

As expected, $E_3$ with respect to W depends on $x_1$, $x_2$ and $x_3$, all inputs up to timestep 3.
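The sums for V, U and W above can all be accumulated in a single backward loop over the timesteps. The sketch below (sigmoid hidden activation, loss only on the last word, and all shapes and names assumed for illustration) is one possible implementation, which can be checked against a numerical gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(xs, h0, W, U, V):
    hs = [h0]
    for x in xs:
        hs.append(sigmoid(W @ x + U @ hs[-1]))
    y_hat = softmax(V @ hs[-1])            # output only at the last step (E_3)
    return hs, y_hat

def bptt_grads(xs, y, h0, W, U, V):
    """Gradients of E_3 = -sum(y * log(y_hat)) with respect to V, U and W."""
    hs, y_hat = forward(xs, h0, W, U, V)
    delta_o = y_hat - y                    # (y_hat - y), as in Eq. (5)
    dV = np.outer(delta_o, hs[-1])
    dU = np.zeros_like(U)
    dW = np.zeros_like(W)
    dh = V.T @ delta_o                     # gradient flowing into h_3
    for k in range(len(xs), 0, -1):        # k = 3, 2, 1
        dz = dh * hs[k] * (1.0 - hs[k])    # sigmoid derivative h(1 - h)
        dU += np.outer(dz, hs[k - 1])      # dh_k/dU contribution
        dW += np.outer(dz, xs[k - 1])      # dh_k/dW contribution
        dh = U.T @ dz                      # propagate through dh_k/dh_{k-1}
    return dV, dU, dW

rng = np.random.default_rng(1)
n_in, n_h, n_out, T = 3, 4, 5, 3
W = rng.normal(size=(n_h, n_in)) * 0.5
U = rng.normal(size=(n_h, n_h)) * 0.5
V = rng.normal(size=(n_out, n_h)) * 0.5
xs = [rng.normal(size=n_in) for _ in range(T)]
y = np.eye(n_out)[2]                       # one-hot target for the last word
dV, dU, dW = bptt_grads(xs, y, np.zeros(n_h), W, U, V)
```

Each pass through the loop adds one term of the sums in the equations above, reusing the running product through the `dh` vector.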

4.   Limitations

RNNs are popular in Natural Language Processing and are currently applied to Machine Translation, Language Summarization, and Image Captioning. However, they also suffer from what is known as the vanishing and exploding gradient problem, caused by the gradient becoming vanishingly small or explosively large due to the many gradient multiplications during backpropagation. The consequence is that the network becomes incapable of learning long-term temporal dependencies, such as those in long audio clips or longer sentences.
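A toy scalar example makes the problem visible: with a sigmoid hidden activation, each factor $\partial h_j/\partial h_{j-1} = h_j(1 - h_j)\,U$ is at most $0.25\,U$, so the product over many timesteps either vanishes or explodes. The numbers below are illustrative only:

```python
def grad_through_time(u, steps, h=0.5):
    """Magnitude of prod_j h_j(1 - h_j) * u over `steps` timesteps (scalar toy case)."""
    g = 1.0
    for _ in range(steps):
        g *= h * (1.0 - h) * u      # sigmoid derivative h(1 - h) is at most 0.25
    return abs(g)

vanishing = grad_through_time(u=1.0, steps=50)    # shrinks toward 0
exploding = grad_through_time(u=10.0, steps=50)   # grows without bound
```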

5.   Variants

5.1.   Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is an RNN variant proposed to solve the vanishing and exploding gradient issue. It does so by the addition of output, input and forget gates and a memory component (cell state), allowing for a better control of the information flow. Eq. (8) shows the mathematical description of an LSTM cell:


$$f_t = \sigma(W_f x_t + U_f h_{t-1})$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1})$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1})$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1})$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t) \qquad (8)$$

where $f_t$, $i_t$ and $o_t$ are the forget, input and output gates respectively; $\tilde{c}_t$ is the candidate cell state considering the previous hidden state and the current input; $c_t$ is the updated cell state after considering what must be forgotten from the previous cell state and what must be kept from the current candidate cell; and $h_t$ is the cell’s hidden state.
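One step of Eq. (8) can be sketched as below. The parameter dictionary, shapes and initialization are assumptions for the example, and biases are omitted as in the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM cell update; p holds the input (W_*) and recurrent (U_*) matrices."""
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev)        # forget gate
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev)        # input gate
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev)        # output gate
    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev)  # candidate cell state
    c = f * c_prev + i * c_tilde                       # keep old info, add new info
    h = o * np.tanh(c)                                 # exposed hidden state
    return h, c

rng = np.random.default_rng(2)
n_in, n_h = 3, 4
p = {k: rng.normal(size=(n_h, n_in if k.startswith("W") else n_h)) * 0.1
     for k in ["Wf", "Wi", "Wo", "Wc", "Uf", "Ui", "Uo", "Uc"]}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), p)
```

The additive update of the cell state c is what gives the gradient a path through time that avoids the repeated multiplications of the plain RNN.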

Figure 3: LSTM cell, with its forget, input and output gates
5.2.   Gated Recurrent Unit (GRU)

Another variant is the Gated Recurrent Unit (GRU) (Cho et al., 2014), proposed with the same goal as LSTM. It has the additional advantage of performing at a similar level whilst being less computationally expensive due to a reduced number of gates (reset and update) and no memory cell. GRU’s mathematical description is shown in Eq. (9):


$$r_t = \sigma(W_r x_t + U_r h_{t-1})$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1})$$
$$\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \qquad (9)$$

where x is the input, $r_t$ and $z_t$ are the reset and update gates respectively, $\tilde{h}_t$ is called the candidate activation after considering the output from the reset gate, and $h_t$ is the hidden state or the final activation of the cell.
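Likewise, one step of Eq. (9) can be sketched with assumed names and shapes, and no biases:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU cell update; p holds the input (W_*) and recurrent (U_*) matrices."""
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)              # reset gate
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)              # update gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev))  # candidate activation
    return (1.0 - z) * h_prev + z * h_tilde                  # interpolate old and new

rng = np.random.default_rng(3)
n_in, n_h = 3, 4
p = {k: rng.normal(size=(n_h, n_in if k.startswith("W") else n_h)) * 0.1
     for k in ["Wr", "Wz", "Wh", "Ur", "Uz", "Uh"]}
h = gru_step(rng.normal(size=n_in), np.zeros(n_h), p)
```

With two gates and no separate cell state, the step needs fewer matrices than the LSTM sketch above, which is where the computational saving comes from.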

Although GRU is less computationally expensive than LSTM, researchers have demonstrated that, in terms of performance, the advantage of using GRU versus LSTM is not black and white and depends on the application in which it is used (Young et al., 2018).






6.   Code

View/download the PyTorch and TensorFlow code in the git repository here.