Simple MNIST Neural Network

Last Updated: March 2019

Code for this project is available here.

This is a tutorial I created that walks through using the MNIST database to train a simple handwritten digit classifier.

MNIST is a dataset of handwritten digits, stored as 28x28 grayscale images. The version provided through TensorFlow's API contains 55,000 training images and an additional 10,000 test examples.

Importing Libraries

The first step is to import all the Python libraries we are going to be using. For this project we will use:

  • tensorflow: to build the neural network

  • matplotlib: to visualize the images

  • numpy: for data manipulation

In addition to these packages, we also need to import the MNIST data. For this particular project, our data is already provided in an easy-to-use format through TensorFlow's API. Note the use of the one_hot flag in the read_data_sets call below, which ensures the class labels are provided in a one-hot format. This means that instead of a single number specifying the class, we get an array of 0s with a 1 at the index of the class label.

Ex. [3] -> [0,0,0,1,0,0,0,0,0,0]
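
Putting this together, a sketch of the imports and the data download (this uses the TensorFlow 1.x tutorial API; the "MNIST_data/" directory name is just a local download path):

    import tensorflow as tf
    import matplotlib.pyplot as plt
    import numpy as np

    # Download the MNIST data with one-hot encoded labels (TF 1.x API)
    from tensorflow.examples.tutorials.mnist import input_data
    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)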

Extract Data

Next we are going to extract the image and label arrays from the mnist object. To do this we run the following code:
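
    # Pull the image and label arrays out of the mnist object
    x_train = mnist.train.images
    y_train = mnist.train.labels
    x_test = mnist.test.images
    y_test = mnist.test.labels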

And then to check that we imported everything correctly, let’s check the shape of those arrays.
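
    # Print the shape of each array
    print("x_train", x_train.shape)
    print("y_train", y_train.shape)
    print("x_test", x_test.shape)
    print("y_test", y_test.shape)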

Output:

x_train (55000, 784)

y_train (55000, 10)

x_test (10000, 784)

y_test (10000, 10)

Randomize Data

It is also generally a good idea to shuffle your data to make sure that it isn't sorted. In this case it isn't strictly necessary, because the data from MNIST is already randomly ordered, but we can do it anyway for practice. When shuffling the data we want to make sure that we shuffle the images and labels together, so that the index of each label still matches the index of its image. The following code segment shows how this can be done.
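
A sketch of this, using a single numpy permutation applied to both arrays:

    # Shuffle images and labels with the same permutation so pairs stay aligned
    shuffle_idx = np.random.permutation(x_train.shape[0])
    x_train = x_train[shuffle_idx]
    y_train = y_train[shuffle_idx]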

Visualize Data

Another important part of most machine learning projects is to get a better understanding of the data. In this case, we are going to do that by visualizing some of the images in our dataset and their corresponding labels. Run the following code and change the value of index to view other images.
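
A sketch of that visualization, reshaping the flat 784-pixel vector back into a 28x28 image and recovering the digit from the one-hot label (the starting index is arbitrary):

    index = 0  # change this value to view other images
    plt.imshow(x_train[index].reshape(28, 28), cmap="gray")
    plt.show()
    print(np.argmax(y_train[index]))  # recover the digit from the one-hot label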

Output:

3

We have now visualized individual images from our dataset and their labels, but it might also be good to know what the collection of all images with a particular label looks like. To do this we can take the average of all images with that class label. The code to do so is as follows.
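
A sketch of that averaging (the variable name digit is an assumption):

    digit = 3  # change this value to view other classes
    # Select every training image whose one-hot label matches the chosen digit
    mask = np.argmax(y_train, axis=1) == digit
    average_image = x_train[mask].mean(axis=0)
    plt.imshow(average_image.reshape(28, 28), cmap="gray")
    plt.show()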

By changing the value of digit we can get the average image for each class.

Defining our network

Now that we have imported and visualized our training data, we are ready to create our neural network. The model we are going to create will look similar to the diagram below, except that the input layer will have 784 nodes to represent the pixel values, and we will use hidden layers of size 128 and 32, with an output layer of size 10 (to match the 10 digit classes).

In practice, the way this network works is by taking the input pixels as a vector and multiplying it by a weight matrix (which is represented by the edges in the above diagram). If the weight matrix has shape 784 x 128, then the output will be a vector of 128 elements. Next a bias term is added to each of these elements, and then they are passed through a non-linear function. In this case, we are going to use ReLU (Rectified Linear Unit), a simple but effective activation function defined as f(x) = max(0, x).

The ReLU Function
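
A sketch of this network in TensorFlow 1.x (the variable names and the random-normal weight initialization here are assumptions; the layer sizes follow the description above):

    # Placeholders let us feed data into the graph at run time
    X = tf.placeholder(tf.float32, [None, 784])
    labels = tf.placeholder(tf.float32, [None, 10])

    # Hidden layer 1: 784 -> 128, with ReLU activation
    W1 = tf.Variable(tf.random_normal([784, 128], stddev=0.1))
    b1 = tf.Variable(tf.zeros([128]))
    h1 = tf.nn.relu(tf.matmul(X, W1) + b1)

    # Hidden layer 2: 128 -> 32, with ReLU activation
    W2 = tf.Variable(tf.random_normal([128, 32], stddev=0.1))
    b2 = tf.Variable(tf.zeros([32]))
    h2 = tf.nn.relu(tf.matmul(h1, W2) + b2)

    # Output layer: 32 -> 10, no activation (raw logits)
    W3 = tf.Variable(tf.random_normal([32, 10], stddev=0.1))
    b3 = tf.Variable(tf.zeros([10]))
    logits = tf.matmul(h2, W3) + b3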

Notice that we first define X and labels as placeholders for our data. This TensorFlow feature allows us to dynamically feed in our data later on.

This process is then repeated until we reach the output layer. For the output layer we won't apply the ReLU activation; instead we will use its raw output to make predictions. Because our training labels are one-hot encoded, we are training our network to output a vector of 10 values, with the largest value at the index of the correct digit class.

Training the model

To train the model we first have to define a loss function. A loss function tells your program how far off from the expected outputs it is when it makes predictions using the current weights. We can then improve the model by shifting the weights in the opposite direction of the loss function's gradient, so that they produce a slightly smaller loss the next time they make a prediction. By repeatedly taking these steps our model will eventually learn good weights. This process is called gradient descent.

Diagram of Gradient Descent

TensorFlow makes all of this very easy by automatically computing the gradients and updating the weights accordingly. To do so we define our loss function and optimization step.
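
A sketch of both (softmax cross-entropy is a standard choice for one-hot labels; the learning rate of 0.5 is an assumption):

    # Softmax cross-entropy between the network's logits and the one-hot labels
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits))

    # One gradient descent step: shift the weights against the loss gradient
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)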

Note: This doesn’t make our model start taking steps, but only defines the step operation.

Now we can define some training constants and initialize our TensorFlow session. The TensorFlow session is the engine that runs all of our operations. It is also important to initialize all of our weights before we start training.
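
A sketch of that setup (the epoch count and batch size are assumed values):

    num_epochs = 40   # assumed value
    batch_size = 100  # assumed value

    # The session runs operations on the graph we defined above
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())  # initialize all weights and biases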

Finally, we code our training loop. We feed in our training examples in batches, which allows us to process and take steps on a group of examples at the same time. In addition, every 5 epochs (an epoch is a full pass through the data), we make predictions and compute the loss on the test set.
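
A sketch of the loop, using the mnist object's built-in next_batch helper:

    num_batches = mnist.train.num_examples // batch_size
    for epoch in range(num_epochs):
        for _ in range(num_batches):
            # Fetch the next batch and take one training step on it
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            sess.run(train_step, feed_dict={X: batch_x, labels: batch_y})
        if epoch % 5 == 0:
            # Every 5 epochs, compute the loss on the test set
            test_loss = sess.run(loss, feed_dict={X: x_test, labels: y_test})
            print("epoch", epoch, "test loss", test_loss)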

Making Predictions

Now that our model is trained we can use it to make predictions on new data. Remember that we only took training steps on our training data, so our model has never seen the images in our test set. Now we can see how well our model generalizes to these new examples. Already we can see that our test set accuracy is around 95%, which is very good considering how simple our model is.
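
A sketch of computing the predictions and test accuracy (the prediction for each image is the index of the largest value in the output vector):

    # Predicted class = index of the largest output value
    predictions = sess.run(tf.argmax(logits, axis=1), feed_dict={X: x_test})
    true_labels = np.argmax(y_test, axis=1)

    # Accuracy = fraction of test examples classified correctly
    accuracy = np.mean(predictions == true_labels)
    print("test accuracy:", accuracy)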

And now we can use these predictions to compare our model's output to the images themselves. By running the following code and changing the value of index, you can see the model's prediction, the true label, and the image for various test examples.
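
    # Compare the prediction and true label for a single test image
    index = 0  # change this value to view other test examples
    print("prediction:", predictions[index])
    print("true label:", true_labels[index])
    plt.imshow(x_test[index].reshape(28, 28), cmap="gray")
    plt.show()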