RNNs suffer from short-term memory → if a sequence is very long, they have a hard time carrying information from earlier time steps to later ones
RNNs suffer from the vanishing gradient problem → the gradient shrinks as it backpropagates through time; once it gets very small, it no longer contributes much to learning
layers that get a small gradient update stop learning → usually the earlier layers; because of this, RNNs forget what happened early on in longer sequences, giving them short-term memory (see the toy example below)
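A minimal toy sketch of why the gradient vanishes (my own illustration, not from the source notes): backprop through time multiplies the gradient by the recurrent derivative once per step, so a factor with magnitude below 1 shrinks it exponentially. The scalar weight `w_rec` and length `T` are hypothetical.

```python
import numpy as np

# Toy illustration of the vanishing gradient in backprop through time.
# Each step multiplies the gradient by the recurrent factor; if |w_rec| < 1,
# the signal reaching early time steps decays exponentially.
T = 50            # sequence length (hypothetical)
w_rec = 0.5       # recurrent weight/derivative, |w_rec| < 1 (hypothetical)
grad = 1.0        # gradient at the final time step
for t in range(T):
    grad *= w_rec # one step of backpropagation through time
print(f"gradient after {T} steps: {grad:.2e}")  # ~8.9e-16, effectively zero
```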
LSTMs were developed as a solution to short-term memory and use mechanisms called gates to regulate the flow of information
gates learn which data in a sequence is important to keep or throw away → pass relevant information down long sequences to make predictions (see the sketch below)
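A minimal sketch of one LSTM step using the standard forget/input/output gate equations, written in plain numpy. The function name `lstm_step` and the parameter names `W`, `U`, `b`, and the sizes in the usage example are my own assumptions for illustration, not from the source.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the parameters for the
    forget (f), input (i), candidate (g), and output (o) gates."""
    z = W @ x_t + U @ h_prev + b      # all gate pre-activations at once
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f)                    # forget gate: what to throw away from memory
    i = sigmoid(i)                    # input gate: what new info to keep
    g = np.tanh(g)                    # candidate values for the cell state
    o = sigmoid(o)                    # output gate: what to expose as h_t
    c_t = f * c_prev + i * g          # update the cell state (the "memory")
    h_t = o * np.tanh(c_t)            # hidden state passed down the sequence
    return h_t, c_t

# toy usage with hypothetical sizes
input_size, hidden_size = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * hidden_size, input_size))
U = rng.normal(size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
h, c = lstm_step(rng.normal(size=input_size), h, c, W, U, b)
```

The gating is what keeps memory usable over long spans: the cell state `c_t` is updated additively (forget some of the old state, add some new candidate), so relevant information can flow through many steps without being repeatedly squashed.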
LSTMs and GRUs work for speech recognition, speech synthesis, text generation, and more
Status Quo: RNNs: