MNIST Digit Classification

SKIP most for PBDS Meetup 07/18/2019

[Show MNIST digit set image]

In [1]:
# Using TensorFlow 1.13.1
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
import os
import matplotlib.pyplot as plt
from datetime import datetime

# Standalone Keras imports (model building, utilities, MNIST dataset)
from keras import models
from keras import layers
from keras.utils import to_categorical
from keras.datasets import mnist

print("TensorFlow:", tf.__version__, " Keras:", keras.__version__)
TensorFlow: 1.13.1  Keras: 2.2.4-tf
Using TensorFlow backend.

MNIST Digits Dataset

In [2]:
# Read Keras MNIST dataset
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

print("train_images.shape:", train_images.shape, "train_labels len:", len(train_labels))
train_images.shape: (60000, 28, 28) train_labels len: 60000
In [3]:
# First four labels
print("MNIST First four labels:\n", train_labels[:4])
# First digit
print("MNIST First digit (partial crop):\n", train_images[0,4:26,6:24])
MNIST First four labels:
 [5 0 4 1]
MNIST First digit (partial crop):
 [[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   3  18  18  18 126 136 175  26 166 255 247 127]
 [  0   0  30  36  94 154 170 253 253 253 253 253 225 172 253 242 195  64]
 [  0  49 238 253 253 253 253 253 253 253 253 251  93  82  82  56  39   0]
 [  0  18 219 253 253 253 253 253 198 182 247 241   0   0   0   0   0   0]
 [  0   0  80 156 107 253 253 205  11   0  43 154   0   0   0   0   0   0]
 [  0   0   0  14   1 154 253  90   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0 139 253 190   2   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0  11 190 253  70   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0  35 241 225 160 108   1   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0  81 240 253 253 119  25   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0  45 186 253 253 150  27   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0  16  93 252 253 187   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0 249 253 249  64   0   0   0]
 [  0   0   0   0   0   0   0   0  46 130 183 253 253 207   2   0   0   0]
 [  0   0   0   0   0   0  39 148 229 253 253 253 250 182   0   0   0   0]
 [  0   0   0   0  24 114 221 253 253 253 253 201  78   0   0   0   0   0]
 [  0   0  23  66 213 253 253 253 253 198  81   2   0   0   0   0   0   0]
 [ 18 171 219 253 253 253 253 195  80   9   0   0   0   0   0   0   0   0]
 [226 253 253 253 253 244 133  11   0   0   0   0   0   0   0   0   0   0]
 [253 253 212 135 132  16   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]]

Preparing MNIST Digits for Input to MLP

Convert MNIST images for perceptron input

  • from 2-D arrays of 28 x 28 pixels,
  • to 1-D vector of 784 (28*28) numbers, scaled to range [0, 1].
In [4]:
# MNIST images as normalized (_, 28*28) tensors for MLP
train_images_mlp = train_images.reshape((60000, 28 * 28))
train_images_mlp = train_images_mlp.astype('float32') / 255

test_images_mlp = test_images.reshape((10000, 28 * 28))
test_images_mlp = test_images_mlp.astype('float32') / 255
In [5]:
# First digit
(shape0, mean0) = (train_images_mlp[0].shape[0], np.mean(train_images_mlp[0]))
(min0, max0) = (np.min(train_images_mlp[0]), np.max(train_images_mlp[0]))
print("MNIST First digit shape:", shape0, "mean:", mean0, "- min, max:", min0, max0)
MNIST First digit shape: 784 mean: 0.13768007 - min, max: 0.0 1.0
In [6]:
# Convert MNIST labels [0-9] to categorical
train_labels_cat = to_categorical(train_labels)
test_labels_cat = to_categorical(test_labels)
In [7]:
# First four labels
print("MNIST First four labels:\n", train_labels[:4])
print("First four labels categorical:\n", train_labels_cat[:4])
MNIST First four labels:
 [5 0 4 1]
First four labels categorical:
 [[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]

Multi-Layer Perceptron MNIST Classifier (MLP)

  • MLP single-layer (~98% with one hidden layer, 256 nodes; total trainable params: 203,530)

Homework

  • MLP two-layer (~98% with two hidden layers) - skip
  • Deep MLP with regularization (dropout, data augmentation) - skip (see the sketch after this list)
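A minimal sketch of the two-hidden-layer homework variant with dropout, assuming the same flattened inputs (train_images_mlp, train_labels_cat) prepared above; the layer sizes and dropout rate are illustrative choices, not the meetup's exact solution.

# Hypothetical two-hidden-layer MLP with dropout regularization (homework sketch)
mlp2 = models.Sequential()
mlp2.add(layers.Dense(256, activation='relu', input_shape=(28 * 28,)))
mlp2.add(layers.Dropout(0.3))
mlp2.add(layers.Dense(128, activation='relu'))
mlp2.add(layers.Dropout(0.3))
mlp2.add(layers.Dense(10, activation='softmax'))
mlp2.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
# mlp2.fit(train_images_mlp, train_labels_cat, epochs=10, batch_size=128, validation_split=0.2)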
In [9]:
# Define MLP: one hidden layer of 256 nodes - 10-class softmax output
mlp = models.Sequential()
mlp.add(layers.Dense(256, activation='relu', input_shape=(28 * 28,)))
mlp.add(layers.Dense(10, activation='softmax'))

mlp.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
In [10]:
# Summarize MLP
mlp.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_3 (Dense)              (None, 256)               200960    
_________________________________________________________________
dense_4 (Dense)              (None, 10)                2570      
=================================================================
Total params: 203,530
Trainable params: 203,530
Non-trainable params: 0
_________________________________________________________________

Train Multi-Layer Perceptron on MNIST

  • Use Keras mlp.fit(train_inputs, train_outputs)
In [11]:
# Train MLP; using validation_split = 0.2
history = mlp.fit(train_images_mlp, train_labels_cat, 
                  epochs=10, batch_size=128, validation_split=0.2)
WARNING:tensorflow:From /Users/nelson/dev/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 48000 samples, validate on 12000 samples
Epoch 1/10
48000/48000 [==============================] - 1s 25us/step - loss: 0.3247 - acc: 0.9085 - val_loss: 0.1709 - val_acc: 0.9499
Epoch 2/10
48000/48000 [==============================] - 1s 21us/step - loss: 0.1432 - acc: 0.9588 - val_loss: 0.1235 - val_acc: 0.9658
Epoch 3/10
48000/48000 [==============================] - 1s 21us/step - loss: 0.0976 - acc: 0.9716 - val_loss: 0.1127 - val_acc: 0.9668
Epoch 4/10
48000/48000 [==============================] - 1s 22us/step - loss: 0.0731 - acc: 0.9789 - val_loss: 0.0939 - val_acc: 0.9723
Epoch 5/10
48000/48000 [==============================] - 1s 21us/step - loss: 0.0568 - acc: 0.9836 - val_loss: 0.0887 - val_acc: 0.9739
Epoch 6/10
48000/48000 [==============================] - 1s 22us/step - loss: 0.0457 - acc: 0.9870 - val_loss: 0.0842 - val_acc: 0.9762
Epoch 7/10
48000/48000 [==============================] - 1s 21us/step - loss: 0.0367 - acc: 0.9897 - val_loss: 0.0789 - val_acc: 0.9772
Epoch 8/10
48000/48000 [==============================] - 1s 22us/step - loss: 0.0298 - acc: 0.9916 - val_loss: 0.0787 - val_acc: 0.9775
Epoch 9/10
48000/48000 [==============================] - 1s 22us/step - loss: 0.0237 - acc: 0.9936 - val_loss: 0.0784 - val_acc: 0.9785
Epoch 10/10
48000/48000 [==============================] - 1s 22us/step - loss: 0.0193 - acc: 0.9949 - val_loss: 0.0826 - val_acc: 0.9777
In [12]:
# Visualize MLP accuracy, loss on MNIST training and validation data

# Define acc, loss plot function
def plot_acc_loss(history):
    acc = history.history['acc']
    val_acc = history.history['val_acc']
    loss = history.history['loss']
    val_loss = history.history['val_loss']

    epochs = range(len(acc))
    accuracy_template = "Epochs: {}, Best training acc: {}, Best val_acc: {}"
    print(accuracy_template.format(len(epochs), np.max(acc), np.max(val_acc)))

    plt.figure(figsize=(18,5))
    plt.subplot(121)
    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'b', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.subplot(122)
    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()
    return None

plot_acc_loss(history)
Epochs: 10, Best training acc: 0.9949375, Best val_acc: 0.9785000001589457

Evaluate Multi-Layer Perceptron on MNIST

  • Use Keras mlp.evaluate(test_inputs, test_outputs)
In [13]:
# Evaluate MLP classifier
test_loss, test_acc = mlp.evaluate(test_images_mlp, test_labels_cat)
print('MLP Test accuracy:', test_acc, '- MLP Test loss:', test_loss)
10000/10000 [==============================] - 0s 19us/step
MLP Test accuracy: 0.98 - MLP Test loss: 0.07149975089433429

Convolutional Neural Networks (CNNs)

  • LeCun et al., 1989.

ZIP code digit recognizer

CNN Elements: Filters, Layers, Strides and Padding

[Show CNN filters, strides, and padding figure]
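As a rough illustration (not from the meetup notebook) of how filter size, stride, and padding determine a convolutional or pooling layer's spatial output size, a small hypothetical helper:

# Output size along one spatial dimension for a conv/pool layer
def conv_output_size(input_size, kernel_size, stride=1, padding='valid'):
    if padding == 'same':
        return -(-input_size // stride)  # 'same' pads so output ~= ceil(input / stride)
    return (input_size - kernel_size) // stride + 1  # 'valid' uses no padding

print(conv_output_size(28, 3))            # 26: 3x3 filters, stride 1 on a 28x28 image
print(conv_output_size(26, 2, stride=2))  # 13: 2x2 max pooling, stride 2

These numbers match the 26x26 and 13x13 output shapes in the CNN summary further below.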

Convolutional Neural Network MNIST Classifier (CNN)

  • CNN (~98% with one convolutional layer) - skip
  • CNN (~99.1% with three convolutional layers; total trainable params: 130,890)
  • CNN Test accuracy: 0.9914 (evaluated below)

Contrast the CNN's trainable parameters and accuracy (~99.1%, 130,890 trainable params) with the single-layer MLP (~98% with one hidden layer of 256 nodes; 203,530 trainable params).

In [14]:
# Keras CNN MNIST classifier - Three-Layer Convolutional Neural Network
# CNN input (28, 28, 1) images - CNN output (3, 3, 64) tensor
cnn = models.Sequential()
cnn.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
cnn.add(layers.MaxPooling2D((2, 2)))
cnn.add(layers.Conv2D(64, (3, 3), activation='relu'))
cnn.add(layers.MaxPooling2D((2, 2)))
cnn.add(layers.Conv2D(64, (3, 3), activation='relu'))

# Fully-connected (MLP) output classifier - 10-class softmax output
cnn.add(layers.Flatten())
cnn.add(layers.Dense(128, activation='relu'))
cnn.add(layers.Dense(10, activation='softmax'))

cnn.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
In [15]:
# Model summary
cnn.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 3, 3, 64)          36928     
_________________________________________________________________
flatten_1 (Flatten)          (None, 576)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 128)               73856     
_________________________________________________________________
dense_6 (Dense)              (None, 10)                1290      
=================================================================
Total params: 130,890
Trainable params: 130,890
Non-trainable params: 0
_________________________________________________________________

Preparing MNIST Digits for Input to CNN

Convert MNIST images for CNN input

  • from 2-D array of 28 x 28 pixels,
  • to 3-D tensor of shape (28, 28, 1), scaled to range [0, 1].
In [16]:
# MNIST images as (_, 28, 28, 1) tensors for CNN input
train_images_cnn = train_images.reshape((60000, 28, 28, 1))
train_images_cnn = train_images_cnn.astype('float32') / 255

test_images_cnn = test_images.reshape((10000, 28, 28, 1))
test_images_cnn = test_images_cnn.astype('float32') / 255
print("train_images_cnn.shape:", train_images_cnn.shape)
print("train_labels_cat len:", len(train_labels_cat))
train_images_cnn.shape: (60000, 28, 28, 1)
train_labels_cat len: 60000

Train CNN with MLP output classifier

In [17]:
# Train CNN with MLP output classifier
history = cnn.fit(train_images_cnn, train_labels_cat, 
                    epochs=10, batch_size=64, validation_split=0.1)
Train on 54000 samples, validate on 6000 samples
Epoch 1/10
54000/54000 [==============================] - 21s 395us/step - loss: 0.1697 - acc: 0.9481 - val_loss: 0.0450 - val_acc: 0.9862
Epoch 2/10
54000/54000 [==============================] - 22s 406us/step - loss: 0.0478 - acc: 0.9850 - val_loss: 0.0375 - val_acc: 0.9875
Epoch 3/10
54000/54000 [==============================] - 23s 420us/step - loss: 0.0317 - acc: 0.9900 - val_loss: 0.0356 - val_acc: 0.9895
Epoch 4/10
54000/54000 [==============================] - 22s 416us/step - loss: 0.0244 - acc: 0.9923 - val_loss: 0.0329 - val_acc: 0.9913
Epoch 5/10
54000/54000 [==============================] - 22s 414us/step - loss: 0.0198 - acc: 0.9939 - val_loss: 0.0356 - val_acc: 0.9908
Epoch 6/10
54000/54000 [==============================] - 22s 409us/step - loss: 0.0161 - acc: 0.9952 - val_loss: 0.0537 - val_acc: 0.9880
Epoch 7/10
54000/54000 [==============================] - 22s 409us/step - loss: 0.0136 - acc: 0.9959 - val_loss: 0.0443 - val_acc: 0.9908
Epoch 8/10
54000/54000 [==============================] - 22s 411us/step - loss: 0.0103 - acc: 0.9968 - val_loss: 0.0333 - val_acc: 0.9942
Epoch 9/10
54000/54000 [==============================] - 22s 411us/step - loss: 0.0097 - acc: 0.9975 - val_loss: 0.0387 - val_acc: 0.9930
Epoch 10/10
54000/54000 [==============================] - 22s 417us/step - loss: 0.0079 - acc: 0.9976 - val_loss: 0.0552 - val_acc: 0.9908
In [18]:
# Plot acc, loss
plot_acc_loss(history)
Epochs: 10, Best training acc: 0.9975740740740741, Best val_acc: 0.9941666666666666

Evaluate CNN classifier

In [19]:
# Evaluate CNN classifier
test_loss, test_acc = cnn.evaluate(test_images_cnn, test_labels_cat)
print('CNN Test accuracy:', test_acc, 'CNN Test loss:', test_loss)
10000/10000 [==============================] - 1s 104us/step
CNN Test accuracy: 0.9914 CNN Test loss: 0.040052228675535165

MNIST with Traditional Machine Learning

Multinomial MNIST Classifiers

Homework: Naive Bayes (NB), Stochastic Gradient Descent (SGD)

GaussianNB, SGDClassifier from Scikit-Learn

  • e.g., sklearn.naive_bayes.GaussianNB, sklearn.linear_model.SGDClassifier (MNIST accuracy of ~85%; see the sketch below)
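A minimal sketch, assuming the flattened, scaled inputs (train_images_mlp, test_images_mlp) and integer labels prepared above; accuracies will vary, and the ~85% figure above is only a ballpark.

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier

# Gaussian Naive Bayes on the flattened 784-pixel vectors
nb = GaussianNB()
nb.fit(train_images_mlp, train_labels)
print("GaussianNB test accuracy:", nb.score(test_images_mlp, test_labels))

# Linear classifier trained with stochastic gradient descent
sgd = SGDClassifier(max_iter=20, tol=1e-3, random_state=0)
sgd.fit(train_images_mlp, train_labels)
print("SGDClassifier test accuracy:", sgd.score(test_images_mlp, test_labels))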

Feature engineering

Feature engineering is important, especially for shallow MLPs and traditional statistical classifiers.

  • Input scaling (e.g., sklearn.preprocessing.StandardScaler on MNIST improves accuracy to ~90%; see the pipeline sketch after this list)
  • Deep networks beat hand-engineered features by learning features automatically, e.g., the CNN filters above
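A minimal input-scaling sketch, again assuming the flattened inputs from above; chaining StandardScaler and SGDClassifier in a Pipeline keeps the scaling fitted only on the training data.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

# Standardize each pixel feature, then fit the linear SGD classifier
scaled_sgd = make_pipeline(StandardScaler(), SGDClassifier(max_iter=20, tol=1e-3, random_state=0))
scaled_sgd.fit(train_images_mlp, train_labels)
print("Scaled SGDClassifier test accuracy:", scaled_sgd.score(test_images_mlp, test_labels))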

Capsules Neural Network MNIST Classifier (CapsNet)

  • Capsule Networks (CapsNet; Sabour, Frosst, Hinton, 2017; Zhao et al., 2019) - skip

MNIST State-of-the-Art (SOTA)

MNIST Test accuracy stands at 99.83% (in 2019)

Record (03/2019)

  • Zhao et al., 2019, report absolute MNIST error rate reduction from previous best 0.21% (99.79% accuracy) to 0.17% (99.83% accuracy).
  • Capsule Networks with Max-Min Normalization, Zhen Zhao, Ashley Kleinhans, Gursharan Sandhu, Ishan Patel, K. P. Unnikrishnan, 2019, https://arxiv.org/abs/1903.09662.

Others (2013 - 2017)

  • Regularization of neural networks using dropconnect, Li Wan, Matthew D Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus, ICML 2013. 99.79% accuracy with ensembling (0.21% error rate; 0.39% single-model error rate)

  • Dynamic Routing Between Capsules, Sara Sabour, Nicholas Frosst, Geoffrey E Hinton, 2017 (https://arxiv.org/abs/1710.09829) 99.75% accuracy (0.25% error rate)

  • Simple 3-layer CNN above (total params: 130,890) without any training regularization: test accuracy ~99.1%

CapsNet with Max-Min: MNIST Test set errors

Misclassified MNIST images from a 3-model majority vote of CapsNets trained with Max-Min normalization

Max-Min MNIST Test errors

  • Total of 17 digit errors in the 10,000 digit test set (99.83% test accuracy)
  • MNIST digit index 6576 - Ground Truth: "7"
  • Current deep networks recognize it as "1"; no human would make this error (see the sketch after the source note below)

Source: Capsule Networks with Max-Min Normalization, Zhao et al., 2019
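To inspect this digit with the CNN trained above, a sketch (test index 6576 is taken from Zhao et al., 2019; whether this notebook's CNN also misreads it depends on the run):

# Show MNIST test digit 6576 (ground truth "7") and this notebook's CNN prediction
idx = 6576
plt.imshow(test_images[idx], cmap='gray')
plt.title("Ground truth: {}".format(test_labels[idx]))
plt.show()

probs = cnn.predict(test_images_cnn[idx:idx + 1])
print("CNN predicted digit:", np.argmax(probs, axis=1)[0])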

After 30 years, MNIST Test must be considered a validation set

... it is no longer a test set.

Three decades of public availability and intense benchmarking on the MNIST task since 1989 have effectively turned the 10,000-digit MNIST "Test" set into a validation set: it has guided model and hyperparameter choices so often that it no longer measures generalization to unseen data.

  • New test sets for MNIST, different from the original 1990 one, are needed!
  • A new digit recognition benchmark is needed.

MNIST improvements are near the end (only about 20 digits left to improve)

The MNIST dataset has 60,000 training digits and 10,000 test digits.

Zhao et al., 2019, report single classifiers that misclassify only 20 to 23 of the 10,000 MNIST test digits.
