An **LSTM** is a type of RNN designed to mitigate the vanishing gradient problem.

The LSTM cell updates are governed by the following equations (a short code sketch of one full cell step follows the symbol definitions below):
1. **Forget Gate**: Decides what information to forget:

$$
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
$$

2. **Input Gate**: Decides which new information to store:

$$
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
$$

$$
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
$$

3. **Cell State Update**: Combines the old cell state (scaled by the forget gate) with the new candidate values (scaled by the input gate):

$$
C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t
$$

4. **Output Gate**: Decides the new hidden state:

$$
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
$$

$$
h_t = o_t \cdot \tanh(C_t)
$$

Where:

- **$f_t$, $i_t$, $o_t$** are the forget, input, and output gates.
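
The following is a minimal NumPy sketch of one LSTM cell step that mirrors the four updates above. The function name `lstm_step`, the weight shapes, and the toy dimensions are illustrative assumptions, not something defined in this text; only the gate equations themselves come from the formulas listed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM cell step; W_* have shape (hidden, hidden + input), b_* have shape (hidden,)."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    C_tilde = np.tanh(W_C @ z + b_C)         # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde       # cell state update
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    return h_t, C_t

# Toy usage: run a random 5-step sequence through the cell.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W_f, W_i, W_C, W_o = (rng.standard_normal((hidden_dim, hidden_dim + input_dim)) * 0.1
                      for _ in range(4))
b_f, b_i, b_C, b_o = (np.zeros(hidden_dim) for _ in range(4))
h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.standard_normal((5, input_dim)):
    h, C = lstm_step(x, h, C, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o)
print(h)
```

Note the carried state: `h` and `C` from one step feed the next, which is what lets the cell state act as long-term memory.
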
A **Gated Recurrent Unit (GRU)** is a simpler alternative to LSTMs. It has fewer parameters.

The GRU uses two gates to control the flow of information (a short code sketch of one GRU step follows the list):
1. **Update Gate**: Decides how much past information to keep:

$$
z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)
$$

2. **Reset Gate**: Decides how much of the previous hidden state to forget:

$$
r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)
$$

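
As a concrete illustration of the two gates, here is a minimal NumPy sketch of one GRU step. The update- and reset-gate lines follow the equations above; the candidate state and the final blend (`h_t = (1 - z_t) * h_prev + z_t * h_tilde`) use one common GRU formulation and should be read as an assumption of this sketch, as are the function name and toy dimensions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, b_z, W_r, b_r, W_h, b_h):
    """One GRU step; W_* have shape (hidden, hidden + input), b_* have shape (hidden,)."""
    z = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]

    z_t = sigmoid(W_z @ z + b_z)                      # update gate
    r_t = sigmoid(W_r @ z + b_r)                      # reset gate

    # Candidate state: the reset gate scales how much of h_{t-1} is used.
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)

    # Blend the old state and the candidate using the update gate.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Toy usage with random weights.
rng = np.random.default_rng(1)
input_dim, hidden_dim = 3, 4
W_z, W_r, W_h = (rng.standard_normal((hidden_dim, hidden_dim + input_dim)) * 0.1
                 for _ in range(3))
b_z, b_r, b_h = (np.zeros(hidden_dim) for _ in range(3))
h = np.zeros(hidden_dim)
for x in rng.standard_normal((5, input_dim)):
    h = gru_step(x, h, W_z, b_z, W_r, b_r, W_h, b_h)
print(h)
```
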
The new hidden state is calculated as:

$$