# RNN Tutorial¶

This tutorial describes how to implement recurrent neural network (RNN) on MinPy. RNN has different architecture, the backprop-through-time (BPTT) coupled with various gating mechanisms can make implementation challenging. MinPy focuses on imperative programming and simplifies reasoning logics. This tutorial explains how, with a simple toy data set and three RNNs (vanilla RNN, LSTM and GRU).

We do suggest you start with Complete solver and optimizer guide for MinPy’s conventional solver architecture.

## Toy Dataset: the Adding Problem¶

The adding problem is a toy dataset for RNN used by many researchers. The input
data of the dataset consists of two rows. The first row contains random float
numbers between 0 and 1; the second row are all zeros, expect two randomly
chosen locations being marked as 1. The corresponding output label is a float
number summing up two numbers in the first row of the input data where marked as
1 in the second row. The length of the row `T`

is the length of the input
sequence.

The paper [1] indicates that a less than 0.1767 Mean Square Error (MSE) proves the effectiveness of the network compared to the random guess. We set 0.1767 as the MSE baseline of our experiment.

We prepared an adding problem generator at `examples.utils.data_utils.adding_problem_generator`

(
here). We append its
implementation as follows:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 | ```
def adding_problem_generator(N, seq_len=6, high=1):
""" A data generator for adding problem.
The data definition strictly follows Quoc V. Le, Navdeep Jaitly, Geoffrey E.
Hintan's paper, A Simple Way to Initialize Recurrent Networks of Rectified
Linear Units.
The single datum entry is a 2D vector with two rows with same length.
The first row is a list of random data; the second row is a list of binary
mask with all ones, except two positions sampled by uniform distribution.
The corresponding label entry is the sum of the masked data. For
example:
input label
----- -----
1 4 5 3 -----> 9 (4 + 5)
0 1 1 0
:param N: the number of the entries.
:param seq_len: the length of a single sequence.
:param p: the probability of 1 in generated mask
:param high: the random data is sampled from a [0, high] uniform distribution.
:return: (X, Y), X the data, Y the label.
"""
X_num = np.random.uniform(low=0, high=high, size=(N, seq_len, 1))
X_mask = np.zeros((N, seq_len, 1))
Y = np.ones((N, 1))
for i in xrange(N):
# Default uniform distribution on position sampling
positions = np.random.choice(seq_len, size=2, replace=False)
X_mask[i, positions] = 1
Y[i, 0] = np.sum(X_num[i, positions])
X = np.append(X_num, X_mask, axis=2)
return X, Y
``` |

## Vanilla RNN¶

In Complete solver and optimizer guide, we introduced a simple model/solver architecture.
Implementing RNN in MinPy is very straightforward following the convention. The only
difference is the model part. The following MinPy code defines the vanilla RNN in
`RNNNet`

class. We also include solver code for completeness. (You can find it in
this folder.)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 | ```
import minpy.numpy as np
from minpy.nn import layers
from minpy.nn.model import ModelBase
from minpy.nn.solver import Solver
from minpy.nn.io import NDArrayIter
from examples.utils.data_utils import adding_problem_generator as data_gen
class RNNNet(ModelBase):
def __init__(self,
batch_size=100,
input_size=2, # input dimension
hidden_size=64,
num_classes=1):
super(RNNNet, self).__init__()
self.add_param(name='Wx', shape=(input_size, hidden_size))\
.add_param(name='Wh', shape=(hidden_size, hidden_size))\
.add_param(name='b', shape=(hidden_size,))\
.add_param(name='Wa', shape=(hidden_size, num_classes))\
.add_param(name='ba', shape=(num_classes,))
def forward(self, X, mode):
seq_len = X.shape[1]
batch_size = X.shape[0]
hidden_size = self.params['Wh'].shape[0]
h = np.zeros((batch_size, hidden_size))
for t in xrange(seq_len):
h = layers.rnn_step(X[:, t, :], h, self.params['Wx'],
self.params['Wh'], self.params['b'])
y = layers.affine(h, self.params['Wa'], self.params['ba'])
return y
def loss(self, predict, y):
return layers.l2_loss(predict, y)
def main():
model = RNNNet()
x_train, y_train = data_gen(10000)
x_test, y_test = data_gen(1000)
train_dataiter = NDArrayIter(x_train,
y_train,
batch_size=100,
shuffle=True)
test_dataiter = NDArrayIter(x_test,
y_test,
batch_size=100,
shuffle=False)
solver = Solver(model,
train_dataiter,
test_dataiter,
num_epochs=10,
init_rule='xavier',
update_rule='adam',
task_type='regression',
verbose=True,
print_every=20)
solver.init()
solver.train()
if __name__ == '__main__':
main()
``` |

Layers are defined in `minpy.nn.layers`

(here).
Note that `data_gen`

is simply `adding_problem_generator`

.

The key layer of vanilla RNN is also shown as follows:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | ```
def rnn_step(x, prev_h, Wx, Wh, b):
"""
Run the forward pass for a single timestep of a vanilla RNN that uses a tanh
activation function.
The input data has dimension D, the hidden state has dimension H, and we use
a minibatch size of N.
Inputs:
- x: Input data for this timestep, of shape (N, D).
- prev_h: Hidden state from previous timestep, of shape (N, H)
- Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
- Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
- b: Biases of shape (H,)
Returns a tuple of:
- next_h: Next hidden state, of shape (N, H)
"""
next_h = np.tanh(x.dot(Wx) + prev_h.dot(Wh) + b)
return next_h
``` |

We see building rnn through imperative programming is both convenient and intuitive.

## LSTM¶

LSTM was introduced by Hochreiter and Schmidhuber [2]. It adds more gates to control the process of forgetting and remembering. LSTM is also quite easy to implement in MinPy like vanilla RNN:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | ```
class LSTMNet(ModelBase):
def __init__(self,
batch_size=100,
input_size=2, # input dimension
hidden_size=100,
num_classes=1):
super(LSTMNet, self).__init__()
self.add_param(name='Wx', shape=(input_size, 4*hidden_size))\
.add_param(name='Wh', shape=(hidden_size, 4*hidden_size))\
.add_param(name='b', shape=(4*hidden_size,))\
.add_param(name='Wa', shape=(hidden_size, num_classes))\
.add_param(name='ba', shape=(num_classes,))
def forward(self, X, mode):
seq_len = X.shape[1]
batch_size = X.shape[0]
hidden_size = self.params['Wh'].shape[0]
h = np.zeros((batch_size, hidden_size))
c = np.zeros((batch_size, hidden_size))
for t in xrange(seq_len):
h, c = layers.lstm_step(X[:, t, :], h, c,
self.params['Wx'],
self.params['Wh'],
self.params['b'])
y = layers.affine(h, self.params['Wa'], self.params['ba'])
return y
def loss(self, predict, y):
return layers.l2_loss(predict, y)
``` |

Layers are defined in `minpy.nn.layers`

(here).

The key layer of LSTM is also shown as follows:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | ```
def lstm_step(x, prev_h, prev_c, Wx, Wh, b):
"""
Forward pass for a single timestep of an LSTM.
The input data has dimension D, the hidden state has dimension H, and we use
a minibatch size of N.
Inputs:
- x: Input data, of shape (N, D)
- prev_h: Previous hidden state, of shape (N, H)
- prev_c: previous cell state, of shape (N, H)
- Wx: Input-to-hidden weights, of shape (D, 4H)
- Wh: Hidden-to-hidden weights, of shape (H, 4H)
- b: Biases, of shape (4H,)
Returns a tuple of:
- next_h: Next hidden state, of shape (N, H)
- next_c: Next cell state, of shape (N, H)
"""
N, H = prev_c.shape
# 1. activation vector
a = np.dot(x, Wx) + np.dot(prev_h, Wh) + b
# 2. gate fuctions
i = sigmoid(a[:, 0:H])
f = sigmoid(a[:, H:2*H])
o = sigmoid(a[:, 2*H:3*H])
g = np.tanh(a[:, 3*H:4*H])
# 3. next cell state
next_c = f * prev_c + i * g
next_h = o * np.tanh(next_c)
return next_h, next_c
``` |

The implementation of `lstm_step`

is quite straightforward in MinPy.

## GRU¶

GRU was proposed by Cho et al. [3]. It simplifies LSTM by using fewer gates and states. MinPy can also model GRU in an intuitive way:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | ```
class GRUNet(ModelBase):
def __init__(self,
batch_size=100,
input_size=2, # input dimension
hidden_size=100,
num_classes=1):
super(GRUNet, self).__init__()
self.add_param(name='Wx', shape=(input_size, 2*hidden_size))\
.add_param(name='Wh', shape=(hidden_size, 2*hidden_size))\
.add_param(name='b', shape=(2*hidden_size,))\
.add_param(name='Wxh', shape=(input_size, hidden_size))\
.add_param(name='Whh', shape=(hidden_size, hidden_size))\
.add_param(name='bh', shape=(hidden_size,))\
.add_param(name='Wa', shape=(hidden_size, num_classes))\
.add_param(name='ba', shape=(num_classes,))
def forward(self, X, mode):
seq_len = X.shape[1]
batch_size = X.shape[0]
hidden_size = self.params['Wh'].shape[0]
h = np.zeros((batch_size, hidden_size))
for t in xrange(seq_len):
h = layers.gru_step(X[:, t, :], h, self.params['Wx'],
self.params['Wh'], self.params['b'],
self.params['Wxh'], self.params['Whh'],
self.params['bh'])
y = layers.affine(h, self.params['Wa'], self.params['ba'])
return y
def loss(self, predict, y):
return layers.l2_loss(predict, y)
``` |

Layers are defined in `minpy.nn.layers`

(here).

The key layer of GRU is also shown as follows:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | ```
def gru_step(x, prev_h, Wx, Wh, b, Wxh, Whh, bh):
"""
Forward pass for a single timestep of an GRU.
The input data has dimentsion D, the hidden state has dimension H, and we
use a minibatch size of N.
Parameters
----------
x : Input data, of shape (N, D)
prev_h : Previous hidden state, of shape (N, H)
prev_c : Previous hidden state, of shape (N, H)
Wx : Input-to-hidden weights for r and z gates, of shape (D, 2H)
Wh : Hidden-to-hidden weights for r and z gates, of shape (H, 2H)
b : Biases for r an z gates, of shape (2H,)
Wxh : Input-to-hidden weights for h', of shape (D, H)
Whh : Hidden-to-hidden weights for h', of shape (H, H)
bh : Biases, of shape (H,)
Returns
-------
next_h : Next hidden state, of shape (N, H)
Notes
-----
Implementation follows
http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf
"""
N, H = prev_h.shape
a = sigmoid(np.dot(x, Wx) + np.dot(prev_h, Wh) + b)
r = a[:, 0:H]
z = a[:, H:2 * H]
h_m = np.tanh(np.dot(x, Wxh) + np.dot(r * prev_h, Whh) + bh)
next_h = z * prev_h + (1 - z) * h_m
return next_h
``` |

`gru_step`

stays close with `lstm_step`

as expected.

## Training Result¶

We trained recurrent networks shown early in this tutorial and use `solver.loss_history`

to retrieve the MSE
history in the training process. The result is as follows.

rm e We observe that LSTM and GRU are more effective than vanilla RNN due to LSTM and GRU’s memory gates. In this particular case, GRU converges much faster than LSTM, probabuly due to fewer parameters.

Reference

[1] | (1, 2) Q. V. Le, N. Jaitly, and G. E. Hinton, “A Simple Way to Initialize Recurrent Networks
of Rectified Linear Units,” arXiv.org, vol. cs.NE. 04-Apr-2015. |

[2] | S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997. |

[3] | K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” arXiv.org, vol. cs.CL. 04-Jun-2014. |