LSTM Networks: A Deep Dive with Code & Variants
April 8, 2026
TL;DR
LSTM (Long Short-Term Memory) networks are a type of recurrent neural network that solve the vanishing gradient problem through gated memory cells, enabling them to learn long-range dependencies in sequential data. This guide walks through the complete LSTM architecture with working Keras/TensorFlow code for time series forecasting, sentiment analysis, text generation, and speech recognition — plus GRU and Bidirectional LSTM variants with practical guidance on when to use each.
What You'll Learn
- How traditional RNNs work and why they struggle with long sequences
- The complete LSTM architecture: gates, cell state, and mathematical foundations
- Practical Keras/TensorFlow implementations for time series, NLP, and speech recognition
- LSTM variants including GRUs and Bidirectional LSTMs with a feature comparison
- Hyperparameter tuning strategies and troubleshooting common training issues
Introduction to Recurrent Neural Networks (RNNs)
Recurrent Neural Networks (RNNs) represent a class of neural networks specifically designed to handle sequential data. Unlike traditional feedforward networks, RNNs maintain an internal state (hidden state) that captures information about previous elements in the sequence. This makes them particularly effective for tasks where the order of inputs matters, such as time series analysis, natural language processing, and speech recognition.
The basic RNN structure processes sequences one element at a time, updating its hidden state based on both the current input and the previous hidden state. The mathematical representation is:
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
y_t = W_hy · h_t + b_y
Where:
- h_t is the hidden state at time t
- x_t is the input at time t
- W terms are weight matrices
- b terms are bias vectors
- y_t is the output at time t
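These update equations translate almost directly into NumPy. The sketch below uses arbitrary toy sizes (3 input features, 4 hidden units, 2 outputs), and the 0.1 scaling on the random weights is just to keep activations small:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: 3 input features, 4 hidden units, 2 outputs
input_size, hidden_size, output_size = 3, 4, 2

W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # hidden-to-hidden
W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1   # input-to-hidden
W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1  # hidden-to-output
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

def rnn_step(x_t, h_prev):
    """One application of the update equations above."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# Unroll over a random sequence of 5 time steps
h = np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):
    h, y = rnn_step(x_t, h)

print(h.shape, y.shape)  # (4,) (2,)
```

Note how the same weight matrices are reused at every step; only the hidden state carries information forward.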
While RNNs are theoretically capable of learning long-term dependencies, they face significant practical challenges, most notably the vanishing gradient problem.
The Vanishing Gradient Problem
The vanishing gradient problem occurs during backpropagation through time (BPTT), where gradients become exponentially small as they propagate backward through the network. This makes it difficult for the network to learn long-range dependencies because the weights in the earlier layers receive extremely small updates.
Consider a simple RNN with a single hidden layer. The gradient of the loss with respect to the hidden state at time step k is:
∂L/∂h_k = ∂L/∂h_T · ∏_{t=k}^{T-1} diag(tanh'(h_{t+1})) · W_hh^T
Where:
- tanh' is the derivative of the tanh activation function (equal to 1 - tanh²(x))
- W_hh is the recurrent weight matrix
- T is the sequence length
The product of these terms causes the gradient to shrink exponentially when the eigenvalues of W_hh are less than 1, or explode when they're greater than 1.
To visualize this, let's compute gradients for a simple RNN:
import numpy as np
import matplotlib.pyplot as plt
def vanishing_gradient_demo(sequence_length=50):
# Initialize weights
W = np.array([[0.5]]) # Recurrent weight
h0 = np.array([[1.0]]) # Initial hidden state (same shape as W @ h)
# Simulate forward pass
h = [h0]
for _ in range(sequence_length):
h_next = np.tanh(W * h[-1])
h.append(h_next)
# Compute gradients via chain rule (backpropagation through time)
gradients = []
for t in range(sequence_length, 0, -1):
grad = np.ones((1, 1))
for k in range(t, sequence_length):
grad *= (1 - h[k+1]**2) * W # dh_{k+1}/dh_k = tanh'(W·h_k) · W = (1 - h_{k+1}²) · W
gradients.append(grad[0, 0])
# Plot
plt.figure(figsize=(10, 6))
plt.plot(range(1, sequence_length+1), gradients[::-1])
plt.yscale('log')
plt.xlabel('Time Step')
plt.ylabel('Gradient Magnitude (log scale)')
plt.title('Vanishing Gradient Problem in RNNs')
plt.grid(True)
plt.show()
vanishing_gradient_demo()
This code demonstrates how gradients diminish exponentially as we backpropagate through time, making it difficult for the network to learn dependencies that span many time steps.
Understanding LSTM Networks
Long Short-Term Memory (LSTM) networks were introduced by Hochreiter and Schmidhuber in 1997 [1] to address the vanishing gradient problem [2]. The key innovation is the introduction of a memory cell that can maintain information over long periods and gating mechanisms that regulate the flow of information.
LSTM Cell Structure
An LSTM cell consists of three main gates and a cell state:
- Forget Gate: Decides what information to discard from the cell state
- Input Gate: Determines what new information to store in the cell state
- Output Gate: Controls what information to output based on the cell state
Here's the complete set of equations for an LSTM cell:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f) # Forget gate
i_t = σ(W_i · [h_{t-1}, x_t] + b_i) # Input gate
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C) # Candidate cell state
C_t = f_t * C_{t-1} + i_t * C̃_t # Update cell state
o_t = σ(W_o · [h_{t-1}, x_t] + b_o) # Output gate
h_t = o_t * tanh(C_t) # Hidden state output
Where:
- σ is the sigmoid activation function
- * denotes element-wise multiplication
- [h_{t-1}, x_t] is the concatenation of the previous hidden state and current input
- W terms are weight matrices
- b terms are bias vectors
Mathematical Foundations of LSTMs
The key to LSTMs' success lies in their additive cell state update and gating mechanisms:
1. Cell State (C_t): The memory of the LSTM, which can maintain information over long sequences. The additive update (C_t = f_t * C_{t-1} + i_t * C̃_t) allows gradients to flow more easily during backpropagation.
2. Gates: Each gate uses a sigmoid activation (σ) to produce values between 0 and 1, which are then multiplied element-wise with other vectors:
   - Forget gate (f_t): Controls how much of the previous cell state to remember
   - Input gate (i_t): Controls how much of the candidate state to add
   - Output gate (o_t): Controls how much of the cell state to expose
3. Gradient Flow: The derivative of the cell state with respect to the previous cell state is:
   ∂C_t/∂C_{t-1} = f_t + (∂f_t/∂C_{t-1} * C_{t-1} + ∂i_t/∂C_{t-1} * C̃_t + i_t * ∂C̃_t/∂C_{t-1})
   The additive nature means the gradient doesn't necessarily vanish, as it's not a product of many small numbers.
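The contrast between the multiplicative RNN path and the additive cell-state path can be made concrete with two back-of-the-envelope products over 50 time steps. The specific values below (0.65, 0.9, 0.99) are illustrative, not derived from any trained model:

```python
T = 50  # illustrative sequence length

# Vanilla RNN path: each backprop step multiplies by tanh'(a) * w,
# where tanh' is at most 1 and w here is an assumed recurrent weight of 0.9
tanh_deriv = 0.65  # a typical mid-range value of 1 - tanh²(a)
rnn_gradient = (tanh_deriv * 0.9) ** T

# LSTM cell-state path: each step multiplies by the forget gate f_t,
# which the network can drive close to 1 when memory should persist
forget_gate = 0.99
lstm_gradient = forget_gate ** T

print(f"RNN path after {T} steps:  {rnn_gradient:.2e}")   # vanishingly small
print(f"LSTM path after {T} steps: {lstm_gradient:.2e}")  # still usable
```

The RNN product collapses toward zero, while the LSTM's gradient survives as long as the forget gate stays near 1.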
Let's implement a basic LSTM cell from scratch to understand the forward-pass computations. Note that this is a pedagogical implementation — it covers inference only. A full training implementation would also require a backward pass that computes gradients for each gate and updates weights via BPTT:
import numpy as np
class LSTMCell:
def __init__(self, input_size, hidden_size):
# Glorot-style initialization (Keras defaults to the uniform variant, glorot_uniform)
scale = np.sqrt(2.0 / (hidden_size + input_size + hidden_size))
self.W_f = np.random.randn(hidden_size, hidden_size + input_size) * scale
self.W_i = np.random.randn(hidden_size, hidden_size + input_size) * scale
self.W_C = np.random.randn(hidden_size, hidden_size + input_size) * scale
self.W_o = np.random.randn(hidden_size, hidden_size + input_size) * scale
self.b_f = np.ones((hidden_size, 1)) # Initialize to 1 so forget gate defaults to "remember"
self.b_i = np.zeros((hidden_size, 1))
self.b_C = np.zeros((hidden_size, 1))
self.b_o = np.zeros((hidden_size, 1))
self.hidden_size = hidden_size
def forward(self, x, h_prev, C_prev):
# Concatenate h_prev and x
concat = np.vstack((h_prev, x))
# Forget gate
f = self._sigmoid(np.dot(self.W_f, concat) + self.b_f)
# Input gate
i = self._sigmoid(np.dot(self.W_i, concat) + self.b_i)
# Candidate cell state
C_tilde = np.tanh(np.dot(self.W_C, concat) + self.b_C)
# Update cell state
C = f * C_prev + i * C_tilde
# Output gate
o = self._sigmoid(np.dot(self.W_o, concat) + self.b_o)
# Update hidden state
h = o * np.tanh(C)
return h, C, (f, i, C_tilde, o)
def _sigmoid(self, x):
return 1 / (1 + np.exp(-x))
Implementing LSTMs with Keras/TensorFlow
Let's implement a practical LSTM model for time series forecasting using Keras. We'll use the Air Passengers dataset to predict monthly airline passenger numbers.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam
# Load and preprocess data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
df = pd.read_csv(url, parse_dates=['Month'], index_col='Month')
data = df['Passengers'].values.astype('float32').reshape(-1, 1)
# Normalize data
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(data)
# Create sequences
def create_sequences(data, seq_length):
X, y = [], []
for i in range(len(data) - seq_length):
X.append(data[i:i+seq_length])
y.append(data[i+seq_length])
return np.array(X), np.array(y)
seq_length = 12
X, y = create_sequences(scaled_data, seq_length)
# Split into train/test
train_size = int(len(X) * 0.8)
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]
# Build LSTM model
model = Sequential([
LSTM(50, input_shape=(seq_length, 1), return_sequences=True),
LSTM(50),
Dense(1)
])
model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')
# Train the model
history = model.fit(
X_train, y_train,
epochs=100,
batch_size=32,
validation_split=0.2,
verbose=1
)
# Make predictions
train_predict = model.predict(X_train)
test_predict = model.predict(X_test)
# Inverse transform predictions
train_predict = scaler.inverse_transform(train_predict)
y_train_inv = scaler.inverse_transform(y_train)
test_predict = scaler.inverse_transform(test_predict)
y_test_inv = scaler.inverse_transform(y_test)
# Plot results
plt.figure(figsize=(12, 6))
plt.plot(df.index[seq_length:seq_length+train_size], y_train_inv, label='Actual Train')
plt.plot(df.index[seq_length+train_size:], y_test_inv, label='Actual Test')
plt.plot(df.index[seq_length:seq_length+train_size], train_predict, label='Train Predictions')
plt.plot(df.index[seq_length+train_size:], test_predict, label='Test Predictions')
plt.legend()
plt.title('Air Passengers: Actual vs Predicted')
plt.show()
This implementation demonstrates:
- Data preprocessing and sequence creation
- Building a stacked LSTM model
- Training and evaluation
- Visualization of results
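A natural follow-up is to score the forecasts numerically. Below is a minimal RMSE helper; the four values are made-up stand-ins for the y_test_inv and test_predict arrays produced by the script above:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the original units after inverse_transform."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Synthetic stand-ins for y_test_inv and test_predict (illustrative values)
y_true = np.array([112.0, 118.0, 132.0, 129.0])
y_pred = np.array([110.0, 120.0, 130.0, 131.0])

print(f"Test RMSE: {rmse(y_true, y_pred):.2f} passengers")  # 2.00
```

Reporting error in the original units (passengers, not the 0-1 scaled range) makes the result interpretable, which is why the metric is computed after the inverse transform.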
LSTM Variants: GRUs and Bidirectional LSTMs
Gated Recurrent Units (GRUs)
GRUs, introduced by Cho et al. in 2014 [3], are a simplified variant of LSTMs that combine the forget and input gates into a single "update gate." They also merge the cell state and hidden state, resulting in fewer parameters and faster training.
The GRU equations are:
z_t = σ(W_z · [h_{t-1}, x_t]) # Update gate
r_t = σ(W_r · [h_{t-1}, x_t]) # Reset gate
h̃_t = tanh(W_h · [r_t * h_{t-1}, x_t]) # Candidate hidden state
h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t # Final hidden state
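These four equations translate almost line-for-line into NumPy. The sketch below omits biases for brevity and uses arbitrary toy sizes (3 inputs, 4 hidden units):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_step(x, h_prev, W_z, W_r, W_h):
    """One GRU update following the equations above (biases omitted for brevity)."""
    concat = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ concat)                                 # update gate
    r = sigmoid(W_r @ concat)                                 # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde                     # interpolate old/new

# Assumed toy sizes: 3 inputs, 4 hidden units
rng = np.random.default_rng(1)
input_size, hidden_size = 3, 4
W_z, W_r, W_h = (rng.standard_normal((hidden_size, hidden_size + input_size)) * 0.1
                 for _ in range(3))

h = gru_step(rng.standard_normal(input_size), np.zeros(hidden_size), W_z, W_r, W_h)
print(h.shape)  # (4,)
```

The last line of gru_step is the key simplification: a single update gate z interpolates between keeping the old state and writing the candidate, replacing the LSTM's separate forget and input gates.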
Key differences from LSTMs:
- No separate cell state — the hidden state serves both roles
- The main simplification is merging the forget and input gates into a single update gate, reducing from 4 weight matrices to 3
- Fewer parameters overall, leading to faster training
Bidirectional LSTMs
Bidirectional LSTMs [4] process sequences in both forward and backward directions, allowing the network to capture context from both past and future states. This is particularly useful for tasks where the full input is available upfront — named entity recognition, sentiment analysis, and speech recognition. (For machine translation, Transformer-based encoder-decoder architectures have been the standard since 2018.)
The output at each time step is typically the concatenation of the forward and backward hidden states:
h_t = [h_t_forward; h_t_backward]
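The two-pass-and-concatenate idea can be sketched with a plain tanh RNN standing in for each LSTM direction (toy sizes and random weights, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
input_size, hidden_size, T = 3, 4, 6
xs = rng.standard_normal((T, input_size))

def run_rnn(xs, W_hh, W_xh):
    """Simple tanh RNN pass; stands in for one LSTM direction."""
    h = np.zeros(hidden_size)
    hs = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x)
        hs.append(h)
    return np.stack(hs)

# Independent weights for the two directions
W_f = (rng.standard_normal((hidden_size, hidden_size)) * 0.1,
       rng.standard_normal((hidden_size, input_size)) * 0.1)
W_b = (rng.standard_normal((hidden_size, hidden_size)) * 0.1,
       rng.standard_normal((hidden_size, input_size)) * 0.1)

h_forward = run_rnn(xs, *W_f)               # left-to-right pass
h_backward = run_rnn(xs[::-1], *W_b)[::-1]  # right-to-left pass, re-aligned
h_bi = np.concatenate([h_forward, h_backward], axis=1)

print(h_bi.shape)  # (6, 8): hidden size doubles
```

This also explains why a Keras Bidirectional(LSTM(64)) layer outputs 128 features per time step: each direction contributes its own hidden vector.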
GRU vs. LSTM: A Detailed Comparison
| Feature | LSTM | GRU |
|---|---|---|
| Parameters | More (4 weight matrices: 3 gates + candidate) | Fewer (3 weight matrices: 2 gates + candidate) |
| Training Speed | Slower | Faster |
| Memory | Better for long sequences | Slightly worse for very long sequences |
| Performance | Better on larger datasets | Comparable on smaller datasets |
| Architecture Complexity | More complex | Simpler |
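The parameter gap in the first table row follows directly from the 4-vs-3 weight-matrix counts. Here is a quick textbook count, where each matrix maps the concatenation [h, x] to h and carries a bias vector. (Note this is an approximation: Keras' actual GRU count is slightly higher because its default reset_after=True keeps separate recurrent biases.)

```python
def rnn_gate_params(hidden, inputs, num_matrices):
    """Textbook count: each matrix maps [h, x] -> h and has a bias vector."""
    return num_matrices * (hidden * (hidden + inputs) + hidden)

# Assumed example sizes: 64 hidden units, 128-dimensional inputs
hidden, inputs = 64, 128
lstm = rnn_gate_params(hidden, inputs, 4)  # 3 gates + candidate
gru = rnn_gate_params(hidden, inputs, 3)   # 2 gates + candidate

print(f"LSTM: {lstm:,}  GRU: {gru:,}  ratio: {lstm / gru:.2f}")
# LSTM: 49,408  GRU: 37,056  ratio: 1.33
```

The 4:3 ratio holds regardless of layer size, so a GRU saves roughly 25% of the recurrent parameters at any scale.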
Implementing a Bidirectional LSTM for Sentiment Analysis
from tensorflow.keras.datasets import imdb
from tensorflow.keras.utils import pad_sequences
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.models import Sequential
# Load IMDB dataset
max_features = 10000 # Number of words to consider as features
maxlen = 500 # Cut texts after this number of words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
# Pad sequences
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
# Build Bidirectional LSTM model
model = Sequential([
Embedding(max_features, 128, input_length=maxlen),
Bidirectional(LSTM(64, return_sequences=True, dropout=0.3, recurrent_dropout=0.2)),
Bidirectional(LSTM(32, dropout=0.3, recurrent_dropout=0.2)),
Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
# Train the model
history = model.fit(x_train, y_train,
epochs=5,
batch_size=32,
validation_split=0.2)
# Evaluate
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'Test accuracy: {test_acc:.4f}')
Real-World Applications of LSTMs
Time Series Forecasting with LSTMs
LSTMs excel at time series forecasting due to their ability to capture temporal dependencies. Common applications include:
- Stock price prediction
- Energy demand forecasting
- Weather prediction
- Sales forecasting
Key considerations for time series with LSTMs:
- Proper sequence length selection
- Handling seasonality and trends
- Appropriate scaling of input features
- Using multiple time steps ahead for prediction
Natural Language Processing with LSTMs
LSTMs were foundational in advancing NLP, and remain useful for many sequence tasks:
- Sentiment Analysis: Classifying text as positive, negative, or neutral
- Machine Translation: Historically significant for LSTM-based seq2seq models (now largely superseded by Transformers)
- Text Generation: Creating coherent text based on a given prompt
- Named Entity Recognition: Identifying entities like names, locations, and organizations
Example of text generation with LSTMs:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical, pad_sequences
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.models import Sequential
# Sample text
text = """Deep learning is a subset of machine learning which is essentially a neural network with three or more layers.
These neural networks attempt to simulate the behavior of the human brain—albeit far from matching its ability—allowing it to 'learn' from large amounts of data."""
# Tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1
# Create input sequences
input_sequences = []
for line in text.split('\n'):
token_list = tokenizer.texts_to_sequences([line])[0]
for i in range(1, len(token_list)):
n_gram_sequence = token_list[:i+1]
input_sequences.append(n_gram_sequence)
# Pad sequences
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
# Create predictors and label
X = input_sequences[:, :-1]
y = input_sequences[:, -1]
y = to_categorical(y, num_classes=total_words)
# Build model
model = Sequential([
Embedding(total_words, 100, input_length=max_sequence_len-1),
LSTM(150, return_sequences=True),
LSTM(100),
Dense(total_words, activation='softmax')
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Note: We intentionally train without validation on this small corpus.
# For text generation, we WANT the model to memorize the training patterns.
# This is the opposite of typical ML practice — do not apply this to other tasks.
model.fit(X, y, epochs=100, verbose=1)
# Generate text
def generate_text(seed_text, next_words, max_sequence_len):
for _ in range(next_words):
token_list = tokenizer.texts_to_sequences([seed_text])[0]
token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
predicted = model.predict(token_list, verbose=0)
predicted_index = np.argmax(predicted[0])
predicted_word = tokenizer.index_word.get(predicted_index, "")
seed_text += " " + predicted_word
return seed_text
print(generate_text("Deep learning", 5, max_sequence_len))
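The np.argmax call in generate_text always picks the single most likely word, which tends to produce repetitive loops. A common alternative is temperature sampling; here is a minimal sketch, where the probability vector is a made-up stand-in for a model.predict output:

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample a token index from a probability vector after temperature scaling."""
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.log(np.asarray(probs) + 1e-9) / temperature  # rescale in log space
    scaled = np.exp(logits - logits.max())                   # stable softmax
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

probs = np.array([0.5, 0.3, 0.15, 0.05])  # stand-in for model.predict(...)[0]
idx = sample_with_temperature(probs, temperature=0.7)
print(idx)  # an index in 0..3; lower temperature favors index 0
```

Temperatures below 1 sharpen the distribution toward argmax-like behavior; temperatures above 1 flatten it and increase diversity at the cost of coherence.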
Speech Recognition with LSTMs
LSTMs have been widely used in speech recognition systems due to their ability to process variable-length audio sequences, and the architectural pattern remains relevant even as Transformer-based models have gained ground. Key components include:
- Feature Extraction: Converting raw audio to spectrograms or MFCCs
- Acoustic Modeling: Mapping audio features to phoneme probabilities per frame
- Decoder: Combining acoustic scores with a language model to produce word sequences (typically using CTC or attention-based decoding)
Example architecture for speech recognition:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
Conv1D, BatchNormalization, Bidirectional,
LSTM, Dense, TimeDistributed
)
def build_speech_model(input_shape, num_classes):
model = Sequential([
# Feature extraction layers
Conv1D(64, 11, strides=2, padding='same', activation='relu', input_shape=input_shape),
BatchNormalization(),
Conv1D(128, 11, padding='same', activation='relu'),
BatchNormalization(),
Conv1D(256, 11, padding='same', activation='relu'),
BatchNormalization(),
# Bidirectional LSTM layers
Bidirectional(LSTM(256, return_sequences=True)),
BatchNormalization(),
Bidirectional(LSTM(256, return_sequences=True)),
BatchNormalization(),
# Output layer
TimeDistributed(Dense(num_classes, activation='softmax'))
])
return model
# Note: Actual implementation would require CTC loss and decoding
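The CTC decoding step mentioned in the final comment can be illustrated with its greedy variant: take the per-frame argmax, merge consecutive repeats, then drop blanks. A minimal sketch (the frame ids below are made up for illustration):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse a per-frame argmax sequence: merge repeats, then drop blanks."""
    decoded = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank:
            decoded.append(t)
        prev = t
    return decoded

# Per-frame argmax over the softmax output (0 = CTC blank, others = phoneme ids)
frames = [0, 3, 3, 0, 0, 5, 5, 5, 0, 3]
print(ctc_greedy_decode(frames))  # [3, 5, 3]
```

The blank symbol is what lets CTC represent genuinely repeated labels: two identical phonemes in a row must be separated by a blank frame, otherwise they collapse into one.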
Hyperparameter Tuning and Best Practices
Key Hyperparameters
Greff et al. [5] found that the forget gate and output activation are the most critical LSTM components — removing either significantly hurts performance — while other architectural variations (peephole connections, coupled gates) have minimal impact. In practice, the most important hyperparameters to tune are:
- Number of Layers: Start with 1-3 layers. More layers can capture complex patterns but may overfit.
- Number of Units: Typically 32-512. Start with 128 and adjust based on performance.
- Learning Rate: Use learning rate scheduling (e.g., ReduceLROnPlateau).
- Batch Size: 16-128. Smaller batches often generalize better but train slower.
- Dropout: 0.2-0.5 for regularization.
- Sequence Length: Should capture relevant context (e.g., 30-100 for text, 12-24 for monthly data).
Best Practices
-
Data Preprocessing:
- Normalize/standardize input features
- Handle missing values appropriately
- Create proper sequences with correct time steps
-
Model Architecture:
- Start simple and gradually increase complexity
- Use return_sequences=True for stacked LSTMs
- Consider bidirectional layers for sequence-to-sequence tasks
-
Training:
- Use early stopping to prevent overfitting
- Monitor validation loss, not just training loss
- Use gradient clipping to prevent exploding gradients
-
Regularization:
- Dropout between LSTM layers
- Recurrent dropout for regularization within LSTM cells
- L2 regularization for weights
Example of hyperparameter tuning with Keras Tuner:
from tensorflow import keras
from tensorflow.keras import layers
from keras_tuner.tuners import RandomSearch
def build_model(hp):
model = keras.Sequential()
# Tune number of layers
num_layers = hp.Int('num_layers', 1, 3)
for i in range(num_layers):
kwargs = dict(
units=hp.Int(f'units_{i}', min_value=32, max_value=256, step=32),
return_sequences=(i < num_layers - 1),
dropout=hp.Float(f'dropout_{i}', 0, 0.5, step=0.1),
)
if i == 0:
kwargs['input_shape'] = (seq_length, 1) # Define seq_length before calling tuner.search()
model.add(layers.LSTM(**kwargs))
model.add(layers.Dense(1))
model.compile(
optimizer=keras.optimizers.Adam(
hp.Float('learning_rate', 1e-4, 1e-2, sampling='log')),
loss='mse',
metrics=['mae']
)
return model
tuner = RandomSearch(
build_model,
objective='val_loss',
max_trials=10,
executions_per_trial=2,
directory='tuner_results',
project_name='lstm_tuning'
)
tuner.search(X_train, y_train,
epochs=50,
validation_split=0.2,
callbacks=[keras.callbacks.EarlyStopping(patience=3)])
# Get best model
best_model = tuner.get_best_models(num_models=1)[0]
Common Pitfalls and Troubleshooting
1. Overfitting
Symptoms: Model performs well on training data but poorly on validation/test data.
Solutions:
- Increase dropout rate
- Add L2 regularization
- Reduce model complexity
- Use more training data
- Implement early stopping
2. Exploding Gradients
Symptoms: Loss becomes NaN during training.
Solutions:
- Use gradient clipping: optimizer = Adam(clipnorm=1.0) (prefer clipnorm over clipvalue, since it preserves gradient direction)
- Reduce learning rate
- Use smaller batch sizes
- Normalize input data
3. Vanishing Gradients
Symptoms: Model learns very slowly or not at all, especially for long sequences.
Solutions:
- Use LSTM or GRU instead of simple RNN
- Use skip connections
- Ensure proper weight initialization (e.g., Glorot for input weights, orthogonal for recurrent weights)
- Consider Transformer architectures for very long sequences
4. Underfitting
Symptoms: Poor performance on both training and validation data.
Solutions:
- Increase model capacity (more layers/units)
- Train for more epochs
- Reduce regularization
- Feature engineering
5. Memory Issues
Symptoms: Out of memory errors during training.
Solutions:
- Reduce batch size
- Use smaller sequences
- Use mixed-precision training
- Use gradient accumulation
6. Training Instability
Symptoms: Loss fluctuates wildly during training.
Solutions:
- Use learning rate scheduling
- Normalize input data
- Use batch normalization
- Try different optimizers (e.g., Adam, RMSprop)
Example of gradient clipping and learning rate scheduling:
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.optimizers import Adam
# Gradient clipping (clipnorm preserves gradient direction; clipvalue clips per-element)
optimizer = Adam(clipnorm=1.0)
# Learning rate schedule (halve every 10 epochs after warmup)
def lr_schedule(epoch, lr):
if epoch < 10:
return lr
elif epoch % 10 == 0:
return lr * 0.5
else:
return lr
# Apply to any Sequential LSTM model (replace with your model and data variables)
model.compile(optimizer=optimizer, loss='mse')
model.fit(X_train, y_train,
epochs=100,
callbacks=[LearningRateScheduler(lr_schedule)])
When to Use LSTMs vs. Transformers
LSTMs dominated sequence modeling from 1997 through approximately 2018, when Transformer architectures began outperforming them on many NLP benchmarks. The question practitioners face today is not whether Transformers are powerful — they clearly are — but whether LSTMs still have a role.
The answer is yes, in specific contexts. LSTMs remain a strong choice when working with limited training data (Transformers are data-hungry), when computational resources are constrained (LSTMs have far fewer parameters for equivalent sequence tasks), and when processing streaming or real-time sequential data where the input length is unknown at inference time. Time series forecasting, embedded systems, sensor data processing, and on-device applications are areas where LSTMs continue to see active production use.
For large-scale NLP tasks with abundant data and compute — machine translation, text summarization, question answering — Transformers have largely superseded LSTMs. The attention mechanism removes the sequential bottleneck that limits LSTM parallelization during training.
The practical recommendation: start with the simplest architecture that meets your requirements. For many sequence tasks, an LSTM trained in minutes will match a Transformer that takes hours. Scale up only when the simpler model demonstrably falls short. If you are new to deep learning fundamentals, our deep learning beginner's guide covers the foundational concepts, and neural networks from scratch walks through building the core components by hand.
The Bottom Line
LSTM networks solved one of deep learning's foundational problems — learning long-range dependencies in sequential data — and remain a practical, efficient tool for sequence modeling tasks. While Transformers have captured the spotlight for large-scale NLP, LSTMs continue to deliver strong results for time series forecasting, real-time sensor processing, and applications where computational efficiency matters more than raw scale.
The code examples in this guide cover the core patterns you will encounter in production: time series prediction with stacked LSTMs, sentiment analysis with bidirectional LSTMs, text generation, and speech recognition architectures. The hyperparameter tuning and troubleshooting sections address the problems that consume most debugging time in practice — overfitting, exploding gradients, and training instability.
Start with the simplest model that fits your problem, measure against a baseline, and scale up only when the data justifies it. That principle applies to LSTMs as much as any other architecture.
Footnotes
1. Hochreiter, S., & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735-1780.
2. Bengio, Y., Simard, P., & Frasconi, P. (1994). "Learning Long-Term Dependencies with Gradient Descent is Difficult." IEEE Transactions on Neural Networks, 5(2), 157-166.
3. Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." arXiv:1406.1078.
4. Schuster, M., & Paliwal, K. K. (1997). "Bidirectional Recurrent Neural Networks." IEEE Transactions on Signal Processing, 45(11), 2673-2681.
5. Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2017). "LSTM: A Search Space Odyssey." IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222-2232.