Vanishing Gradient Problem in RNNs

Say we are building a language model that aims to predict the next word in a sentence, given the previous words.

A simple RNN is dominated by local influence, i.e. the probability it assigns to the next word is shaped mostly by the most recent words, while words that occurred earlier in the sentence have little effect.

This is due to the vanishing gradient problem: as the error is backpropagated through many time steps, the gradients shrink rapidly, so the earliest words contribute almost nothing to the weight updates.
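A minimal sketch of why this happens, using NumPy on a toy vanilla RNN (the sizes, weight scale, and number of steps below are illustrative assumptions, not from the original text): the gradient with respect to an early hidden state is a product of many per-step Jacobians, each damped by the tanh derivative, so its norm shrinks quickly as we go back in time.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 16
T = 50  # number of time steps to backpropagate through

# Hypothetical recurrent weight matrix, initialized at a typical small scale.
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

grad = np.ones(hidden_size)        # dLoss/dh_T (arbitrary starting gradient)
h = rng.normal(size=hidden_size)   # some hidden state

for t in range(T):
    pre_act = W @ h                       # pre-activation: h_{t+1} = tanh(W h_t)
    d = 1.0 - np.tanh(pre_act) ** 2       # tanh derivative, always <= 1
    grad = W.T @ (d * grad)               # dL/dh_t = W^T diag(d) dL/dh_{t+1}
    h = np.tanh(pre_act)
    if t % 10 == 0:
        print(f"step {t:2d}: gradient norm = {np.linalg.norm(grad):.2e}")
```

Running this prints gradient norms that collapse toward zero within a few dozen steps, which is the vanishing gradient problem in miniature.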

To combat this issue, Gated Recurrent Units (GRUs) are used: their gates let the network carry information across many time steps, so earlier words can still influence the prediction.
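A minimal sketch of a GRU-based next-word model in PyTorch (the vocabulary size, embedding and hidden dimensions, and class name are illustrative assumptions, not from the original text):

```python
import torch
import torch.nn as nn

class GRULanguageModel(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq_len, embed_dim)
        h, _ = self.gru(x)          # gated recurrence keeps information over long spans
        return self.out(h)          # logits for the next word at each position

model = GRULanguageModel()
tokens = torch.randint(0, 10_000, (2, 20))  # dummy batch: 2 sequences of length 20
logits = model(tokens)                      # shape: (2, 20, 10_000)
```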

Note: The exploding gradient problem can also occur, i.e. gradients become too large and the weights blow up to NaN. To combat it, clip the gradients.
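A minimal, self-contained sketch of gradient clipping in PyTorch (the tiny GRU, placeholder loss, and `max_norm=1.0` are illustrative assumptions, not from the original text):

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(gru.parameters(), lr=0.1)

x = torch.randn(4, 30, 8)          # dummy batch: 4 sequences of length 30
output, _ = gru(x)
loss = output.pow(2).mean()        # placeholder loss, just to produce gradients

optimizer.zero_grad()
loss.backward()
# Rescale all gradients so their global norm is at most 1.0, preventing a
# single oversized update (or NaN weights) from an exploding gradient.
torch.nn.utils.clip_grad_norm_(gru.parameters(), max_norm=1.0)
optimizer.step()
```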
