Complete solver and optimizer guide

This tutorial explains the “pipeline” of a typical research project. The idea is to build a quick prototype with the most flexible and familiar package (NumPy), and then move the codebase to a more efficient paradigm (MXNet). Typically, one needs to go back and forth iteratively to refine the model. The trade-off between performance and flexibility depends on the project stage and is best left to the user to decide. Importantly, we have made switching between the two as straightforward as possible. For example:

  • There is only one codebase to work with. NumPy and MXNet programming idioms mingle together rather easily.
  • Neither style requires the user to explicitly write the tedious and often error-prone backpropagation pass.
  • Switching between GPU and CPU is straightforward: the same code runs in either environment with only one line of change, as shown in the sketch below.
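
For example, the one-line change is typically the global context selection. A minimal sketch, assuming MinPy's minpy.context module:

from minpy.context import set_context, gpu, cpu

# Run everything below on the first GPU; switch back with cpu().
set_context(gpu(0))
# set_context(cpu())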

We will begin with a simple neural network written in MinPy/NumPy using its Solver architecture, gradually morph it into a fully MXNet implementation, and then add NumPy statements back in as we see fit.

We do suggest you start with the simpler logistic regression example here.

Stage 0: Setup

All the code covered in this tutorial can be found in this folder. The examples are self-contained and ready to run. Before running, please make sure that you:

  • Correctly install MXNet and MinPy. For guidance, refer to the installation guide.
  • Follow the instructions in the README.md to download the data.
  • Run the example you want.

Stage 1: Pure MinPy

(This section is also available as an IPython Notebook here.)

In general, we advocate a common coding style with the following modular partition:

  • Model: your main job!
  • Layers: the building blocks of your model.
  • Solver and optimizer: take your model and the training and testing data, and train it.

The following MinPy code should be self-explanatory; it is a simple two-layer feed-forward network. The model is defined in the TwoLayerNet class, whose __init__, forward and loss functions specify, respectively, the parameters to be learned, how the network computes all the way up to the loss, and the loss computation itself. The crucial thing to note is the absence of back-propagation code: MinPy derives it automatically for you.

""" Simple multi-layer perception neural network using Minpy """
import minpy
import minpy.numpy as np
from minpy.nn import layers
from minpy.nn.model import ModelBase
from minpy.nn.solver import Solver
from minpy.nn.io import NDArrayIter
from examples.utils.data_utils import get_CIFAR10_data

batch_size=128
input_size=(3, 32, 32)
flattened_input_size=3 * 32 * 32
hidden_size=512
num_classes=10

class TwoLayerNet(ModelBase):
    def __init__(self):
        super(TwoLayerNet, self).__init__()
        # Define model parameters.
        self.add_param(name='w1', shape=(flattened_input_size, hidden_size)) \
            .add_param(name='b1', shape=(hidden_size,)) \
            .add_param(name='w2', shape=(hidden_size, num_classes)) \
            .add_param(name='b2', shape=(num_classes,))

    def forward(self, X, mode):
        # Flatten the input data to matrix.
        X = np.reshape(X, (batch_size, 3 * 32 * 32))
        # First affine layer (fully-connected layer).
        y1 = layers.affine(X, self.params['w1'], self.params['b1'])
        # ReLU activation.
        y2 = layers.relu(y1)
        # Second affine layer.
        y3 = layers.affine(y2, self.params['w2'], self.params['b2'])
        return y3

    def loss(self, predict, y):
        # Compute softmax loss between the output and the label.
        return layers.softmax_loss(predict, y)

def main(args):
    # Create model.
    model = TwoLayerNet()
    # Create data iterators for training and testing sets.
    data = get_CIFAR10_data(args.data_dir)
    train_dataiter = NDArrayIter(data=data['X_train'],
                                 label=data['y_train'],
                                 batch_size=batch_size,
                                 shuffle=True)
    test_dataiter = NDArrayIter(data=data['X_test'],
                                label=data['y_test'],
                                batch_size=batch_size,
                                shuffle=False)
    # Create solver.
    solver = Solver(model,
                    train_dataiter,
                    test_dataiter,
                    num_epochs=10,
                    init_rule='gaussian',
                    init_config={
                        'stdvar': 0.001
                    },
                    update_rule='sgd_momentum',
                    optim_config={
                        'learning_rate': 1e-4,
                        'momentum': 0.9
                    },
                    verbose=True,
                    print_every=20)
    # Initialize model parameters.
    solver.init()
    # Train!
    solver.train()
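
The main(args) entry point above expects an args.data_dir attribute; the ready-to-run script in the examples folder supplies it from the command line. A hedged sketch of such a wrapper (the flag name is an assumption; check the actual example file):

import argparse

if __name__ == '__main__':
    # Hypothetical command-line wrapper; the real example may name the flag differently.
    parser = argparse.ArgumentParser(description='Two-layer net on CIFAR-10')
    parser.add_argument('--data_dir', type=str, required=True,
                        help='Directory containing the CIFAR-10 data.')
    main(parser.parse_args())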

This simple network takes several common layers from the layers file. The same file contains a few other useful layers, such as batch normalization and dropout. Here is how a new model incorporates them.

batch_size=128
input_size=(3, 32, 32)
flattened_input_size=3 * 32 * 32
hidden_size=512
num_classes=10

class TwoLayerNet(ModelBase):
    def __init__(self):
        super(TwoLayerNet, self).__init__()
        # Define model parameters.
        self.add_param(name='w1', shape=(flattened_input_size, hidden_size)) \
            .add_param(name='b1', shape=(hidden_size,)) \
            .add_param(name='w2', shape=(hidden_size, num_classes)) \
            .add_param(name='b2', shape=(num_classes,)) \
            .add_param(name='gamma', shape=(hidden_size,),
                       init_rule='constant', init_config={'value': 1.0}) \
            .add_param(name='beta', shape=(hidden_size,), init_rule='constant') \
            .add_aux_param(name='running_mean', value=None) \
            .add_aux_param(name='running_var', value=None)

    def forward(self, X, mode):
        # Flatten the input data to matrix.
        X = np.reshape(X, (batch_size, 3 * 32 * 32))
        # First affine layer (fully-connected layer).
        y1 = layers.affine(X, self.params['w1'], self.params['b1'])
        # ReLU activation.
        y2 = layers.relu(y1)
        # Batch normalization
        y3, self.aux_params['running_mean'], self.aux_params['running_var'] = layers.batchnorm(
            y2, self.params['gamma'], self.params['beta'],
            running_mean=self.aux_params['running_mean'],
            running_var=self.aux_params['running_var'])
        # Second affine layer.
        y4 = layers.affine(y3, self.params['w2'], self.params['b2'])
        # Dropout
        y5 = layers.dropout(y4, 0.5, mode=mode)
        return y5

    def loss(self, predict, y):
        # ... Same as above

Note that running_mean and running_var are defined as auxiliary parameters (aux_param). These parameters will not be updated by backpropagation.

The above code looks like pure NumPy, and yet it can run on a GPU and needs no explicit backprop code. At this point, what happens under the hood might feel a little mysterious. For advanced readers, here are the essential bits:

  • The solver file takes the training and test datasets and fits the model.
  • At the end of the _step function, the loss function is auto-differentiated to derive the gradients used to update the parameters (see the sketch below).
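
Conceptually, the solver simply wraps the model's forward and loss computation with MinPy's autograd. Below is a minimal, standalone sketch of the same mechanism, using minpy.core.grad_and_loss on a toy squared-error loss rather than the actual Solver internals:

import minpy.numpy as np
import minpy.numpy.random as random
from minpy.core import grad_and_loss

def loss_func(w, x):
    # A toy scalar loss: mean squared norm of the affine output.
    return np.sum(np.dot(x, w) ** 2) / x.shape[0]

# grad_and_loss returns a function computing (gradient w.r.t. argument 0, loss value).
grad_func = grad_and_loss(loss_func)

w = random.rand(4, 3)
x = random.rand(10, 4)
dw, loss = grad_func(w, x)
# A vanilla SGD step, analogous to what the solver's update rule does.
w = w - 1e-2 * dw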

Stage 2: MinPy + MXNet

While these features are great, they are by no means complete. For example, it is possible to write nested loops to perform convolution in NumPy, and the code will not break. However, much better implementations exist, especially when running on a GPU.

MinPy leverages and integrates seamlessly with MXNet’s symbolic programming (see the MXNet Python Symbolic API). In a nutshell, MXNet’s symbolic programming interface allows one to write a sub-DAG with symbolic expressions. MXNet’s convolution kernel runs on both CPU and GPU, and its GPU version is highly optimized.
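
To make the sub-DAG idea concrete, here is a tiny standalone MXNet symbol together with shape inference; this uses only the plain MXNet symbolic API, independent of MinPy:

import mxnet as mx

# Compose a small symbolic sub-graph: convolution followed by ReLU.
data = mx.sym.Variable(name='X')
conv = mx.sym.Convolution(data=data, name='conv', kernel=(7, 7), num_filter=32)
act = mx.sym.Activation(data=conv, act_type='relu')

# Given the input shape, MXNet infers the shapes of all arguments and outputs.
arg_shapes, out_shapes, aux_shapes = act.infer_shape(X=(128, 3, 32, 32))
print(out_shapes)   # [(128, 32, 26, 26)] for a 7x7 kernel with no padding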

The following code shows how we add one convolutional layer and one pooling layer using MXNet. Only the model is shown; you can get ready-to-run code for the convolutional network.

import mxnet as mx

batch_size=128
input_size=(3, 32, 32)
flattened_input_size=3 * 32 * 32
hidden_size=512
num_classes=10

class ConvolutionNet(ModelBase):
    def __init__(self):
        super(ConvolutionNet, self).__init__()
        # Define a symbolic sub-graph that uses convolution and max pooling to
        # extract better features from the input image.
        net = mx.sym.Variable(name='X')
        net = mx.sym.Convolution(
                data=net, name='conv', kernel=(7, 7), num_filter=32)
        net = mx.sym.Activation(
                data=net, act_type='relu')
        net = mx.sym.Pooling(
                data=net, name='pool', pool_type='max', kernel=(2, 2),
                stride=(2, 2))
        net = mx.sym.Flatten(data=net)
        # Create forward function and add parameters to this model.
        self.conv = Function(
                net, input_shapes={'X': (batch_size,) + input_size},
                name='conv')
        self.add_params(self.conv.get_params())
        # Define ndarray parameters used for classification part.
        output_shape = self.conv.get_one_output_shape()
        conv_out_size = output_shape[1]
        self.add_param(name='w1', shape=(conv_out_size, hidden_size)) \
            .add_param(name='b1', shape=(hidden_size,)) \
            .add_param(name='w2', shape=(hidden_size, num_classes)) \
            .add_param(name='b2', shape=(num_classes,))

    def forward(self, X, mode):
        out = self.conv(X=X, **self.params)
        out = layers.affine(out, self.params['w1'], self.params['b1'])
        out = layers.relu(out)
        out = layers.affine(out, self.params['w2'], self.params['b2'])
        return out

    def loss(self, predict, y):
        return layers.softmax_loss(predict, y)

Stage 3: Pure MXNet

Of course, in this example we can program the whole model in the fully symbolic MXNet way. You can get the full file with only MXNet symbols. The model is as follows.

class ConvolutionNet(ModelBase):
    def __init__(self):
        super(ConvolutionNet, self).__init__()
        # Define a symbolic sub-graph that uses convolution and max pooling to
        # extract better features from the input image.
        net = mx.sym.Variable(name='X')
        net = mx.sym.Convolution(
                data=net, name='conv', kernel=(7, 7), num_filter=32)
        net = mx.sym.Activation(
                data=net, act_type='relu')
        net = mx.sym.Pooling(
                data=net, name='pool', pool_type='max', kernel=(2, 2),
                stride=(2, 2))
        net = mx.sym.Flatten(data=net)
        net = mx.sym.FullyConnected(
                data=net, name='fc1', num_hidden=hidden_size)
        net = mx.sym.Activation(
                data=net, act_type='relu')
        net = mx.sym.FullyConnected(
                data=net, name='fc2', num_hidden=num_classes)
        net = mx.sym.SoftmaxOutput(
                data=net, name='output')
        # Create forward function and add parameters to this model.
        self.cnn = Function(
                net, input_shapes={'X': (batch_size,) + input_size},
                name='cnn')
        self.add_params(self.cnn.get_params())

    def forward(self, X, mode):
        out = self.cnn(X=X, **self.params)
        return out

    def loss(self, predict, y):
        return layers.softmax_cross_entropy(predict, y)

Stage 4: MXNet + MinPy

However, the advantage of MinPy is that it brings in additional flexibility when needed; this is especially useful for quick prototyping to validate new ideas. Say we want to add a regularization term to the loss; this is done as follows. Note that we only changed the loss function. The full code with regularization is available.

weight_decay = 0.001

class ConvolutionNet(ModelBase):
    def __init__(self):
        # ... Same as above.

    def forward(self, X, mode):
        # ... Same as above.

    def loss(self, predict, y):
        # Add L2 regularization for all the weights.
        reg_loss = 0.0
        for name, weight in self.params.items():
            reg_loss += np.sum(weight ** 2) * 0.5
        return layers.softmax_cross_entropy(predict, y) + weight_decay * reg_loss
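
Note that the loop above decays every entry of self.params, including the biases and the convolution parameters pulled in from MXNet. If you prefer the common convention of decaying only the weight matrices, one possible variant is sketched below; the name-based filter is purely illustrative and depends on how your parameters are named:

    def loss(self, predict, y):
        # Decay only parameters whose names mark them as weights (illustrative filter).
        reg_loss = sum(np.sum(w ** 2) * 0.5
                       for name, w in self.params.items()
                       if name.startswith('w') or 'weight' in name)
        return layers.softmax_cross_entropy(predict, y) + weight_decay * reg_loss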