Deep Learning Architectures for Image Classification
and Object Detection¶

Nelson Correa, Ph.D.
Data Science, Machine Learning and Natural Language Processing
https://linkedin.com/in/ncorrea

Palm Beach Data Science Meetup West Palm Beach, FL

Outline¶

Introduction: Computer vision and deep neural networks
Computer vision tasks and datasets
Deep neural network advances since 2011
MLP and CNN: Deep neural networks (MNIST)
Image classification with VGG16 (ImageNet classifier)
Object detection with YOLOv3 (MS-COCO object detector)
New developments and conclusion

Aiming for 45 minutes of presentation and 10 - 15 minutes of Q&A.

1. Introduction: A brief chronology¶

Computer vision and deep neural networks (1950 - 2000)¶

1950s - R. Rosenblatt, W. McCoullogh, W. Pitts (perceptron); Hubel, Weisel (cats)

1960s - MIT AI: Minsky, Waltz, Winston, Marr; AI, Blocks World, Robotics

1970s - MIT AI & computer vision; E. Harth, Alopex; G. Hinton, connectionist networks

1980s - Symbolic AI & computer vision; Connectionism

Rumelhart et al., Parallel Distributed Processing (PDP)
G. Hinton & others, MLP, Backpropagation
Y. LeCun & others, Convolutional neural networks (CNN)

1990s - Pattern recognition (PAMI), machine learning

Computer vision and deep neural networks (2000 - present)¶

2000s - Deep learning, OpenCV

Bengio; neural language models
Hinton, LeCun, Bengio & others; training deep networks
G. Bradsky, OpenCV
Stanford, Toronto, MILO, Oxford; Datasets and benchmarks (Pascal VOC, CIFAR)

2010-2019 - Rapid progress in deep learning

Google, Microsoft, Facebook; Benchmarks, datasets (ILSVCR, MS-COCO)
AlexNet on ImageNet/ILSVRC, 2012; Bengio, ReLU
Deep network modules, architectures, and toolkits
Hinton, LeCun, Bengio (ACM Turing Award, 2019)

2. Computer vision tasks and datasets¶

"The classical problem in computer vision, image processing, and machine vision is that of determining whether or not the image data contains some specific object, feature, or activity. "
(https://en.wikipedia.org/wiki/Computer_vision#Recognition)

The terminology varies and includes the terms recognition, classification, identification and detection when used to describe computer vision tasks.
A finite set of labels or classes (e.g., cat, dog, person, cup, lawn) is assumed to be associated with images, objects and other features of images.
The denotations of the labels may be exclusive (disjoint), or they may overlap and define a hierarchy (e.g., of concepts).

NOTES:¶

Instead of finite, symbolic labels, we may define image or class attributes (in a discrete or continuous space).
The classical problem is assumed to be a supervised learning problem - correct label or labels for an image given by example (an <image, labels> training set).

2.1 Image Classification (logistic regression)¶

Given an image and a set of labels or classes, determine what labels should be assigned to the image.
The problem may be multi-class (multinomial classification; more than two classes)
The problem may be multi-label (each image may assigned multiple labels)

Pascal VOC (20 classes) - MS COCO (80 classes):¶

Person: person
Animal: bird, cat, cow, dog, horse, sheep
Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

MS COCO Labels

2.2 Object Detection (logistic+linear regression)¶

Given an image and a set of object labels,
- (i) determine what labels should be assigned to the image (objects in image)
- (ii) indicate the bounding box (location) of each object in the image
- (iii) provide a confidence score (objectness) for each label selected
Bounding box represented by four coordinates (x, y, width, height), relative to image.

Object Bounding Boxes¶

MS COCO Labels

NOTE: Other CV and multi-modal CV tasks (captioning, Visual Question Answering)

2.5 Computer vision datasets¶

Datasets¶

MNIST (60,000 28x28x1 handwritten digits 0-9; AT&T Bell Labs; 1989-1998)
Pascal VOC (12,500 images; 20 classes; U. Oxford; 2005-2012)
CIFAR (60,000 32x32x3 images; 10 or 100 classes; U. Toronto; 2009)
MS-COCO (200,000 images; 80 classes; Microsoft; 2014-2019)
ImageNet (14M images; 21,000 classes; Google/Stanford; 2010-2019)

... other

Image types¶

Iconic objects and scenes
Non-iconics object and scenes (common objects in context)

imagenet_logo

"ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images."

ImageNet (ILSVRC): http://image-net.org
WordNet: http://wordnet.princeton.edu/
ImageNet Large Scale Visual Recognition Challenge, Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei, International Journal of Computer Vision, 2015. https://arxiv.org/abs/1409.0575

ImageNet Summary and Statistics (updated on April 30, 2010)¶

Total number of non-empty synsets: 21,841
Total number of images: 14,197,122
Number of images with bounding box annotations: 1,034,908
Number of synsets with SIFT features: 1000
Number of images with SIFT features: 1.2 million

Source: http://image-net.org/about-stats

3. Deep neural network advances since 2011¶

3.1 Deep neural networks pre-2011¶

Perceptron (trainable; direct solutions or gradient descent). Too simple a learning function (cannot learn XOR; Minsky and Papert, 1969)
Multi-Layer Perceptron (MLP) and deep networks (backpropagation ~ chain rule)
- ... but deep MLP not practically trainable (vanishing gradients, complexity)
- simple architecture: not hierarchical; activation function; feed-forward
  - no feedback; no "attention" mechanisms
Convolutional neural networks (CNNs) with a few layers (LeCun, 1989)
- CNNs provide hierarchy, an advance over fully-connected layers (MLP)
  - spatial and visual feature/concept organization
- But deep CNNs were not yet practical (i.e., trainable) in 1989
Parallel Distributed Processing
- CV as combined symbolic AI and pattern recognition
- IEEE PAMI, OpenCV

3.2 Deep neural network advances since 2011¶

Improved deep network training (2011 ILSVRC visual recognition challenge)
- More effective neuron activation functions
  - (from sigmoid to tanh, reLU, leaky reLU)
- New training methods (SGD, batch/mini-batch SGD; scaling; optimizers)
- Regularization (network weight rescaling; dropout; data augmentation)
- Reusable transfer learning
Improved deep network models and modules (2011 - present)
- Various new neural network modules (Inception, Xception, Capsules)
- Residual neural networks (ResNet)
- Deep, large network architectures for computer vision (LeNet, GoogLeNet/Inception, VGG, R-CNN, YOLO)
- Attention, feedback (Feedback-CNN) and capsules (CapsNet)
Large image datasets, CV tasks
- Image tasks and datasets: CIFAR, Pascal VOC, MS COCO, ImageNet
- Transfer learning on deep networks trained on the large datasets

4. MLP and CNN: Deep neural networks (MNIST)¶

MNIST dataset

The MNIST database of handwritten digits, Yann LeCun, Corinna Cortes, and Christopher JC Burges, 1998
LeCun site

MNIST MLP

Input (28, 28) - Dense-256-relu - Dense-10-softmax
Input (28, 28) - Dense-128-relu - Dense-128-relu - Dense-10-softmax

MNIST CNN

Feature extractor: Input (28, 28), Conv2D-32, Conv2D-64-relu, Conv2D-64-relu
- Output classifier: Flatten, Dense-128-relu, Dense-10-softmax

See notebook: MNIST_Digit_Classification.ipynb - html

4.1 MNIST State-of-the-Art (SOTA)¶

MNIST Test accuracy stands at 99.83% (in 2019)¶

Record (03/2019)¶

Zhao et al., 2019, report absolute MNIST error rate reduction from previous best 0.21% (99.79% accuracy) to 0.17% (99.83% accuracy).
Capsule Networks with Max-Min Normalization, Zhen Zhao, Ashley Kleinhans, Gursharan Sandhu, Ishan Patel, K. P. Unnikrishnan, 2019, https://arxiv.org/abs/1903.09662.

Others (2017 - 2018)¶

Regularization of neural networks using dropconnect, Li Wan, Matthew D Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus, ICML 2013.
99.79% accuracy (0.39% error rate; 0.21% with ensembling)
Dynamic Routing Between Capsules, Sara Sabour, Nicholas Frosst, Geoffrey E Hinton, 2017 (https://arxiv.org/abs/1710.09829) 99.75% accuracy (0.25% error rate)
Simple 3-layer CNN above (total params, 130,890)
without any training regularization: acc: 99.15%

5. VGG16 Image classification (ImageNet)¶

VGG16 architecture

5.1 Image classification with VGG16¶

See notebook: VGG16_Image_Classification.ipynb - html

Image classification examples and issues with VGG16 in Keras¶

Image classification task

Multiclass; Multilabel

VGG16 image classification architecture

Deep CNN with output classifier
16 trainable layers (13 convolutional; 3 FC classifier); 23 total layers

Pretrained and custom VGG16 image classifiers

Pretrained VGG16: 1000 ImageNet classes
custom trained output classifier: Dogs vs. cats

5.2 VGG16 Example - African elephant¶

African elephant (Flickr)¶

African elephant

Top-5 Predictions¶

n02504458 - African_elephant 69.72%
n01871265 - tusker       19.20%
n02504013 - Indian_elephant 6.60%
n02410509 - bison        3.56%
n02437312 - Arabian_camel 0.43%

5.2 VGG16 Example - People¶

Young man and woman (MS-COCO)¶

Young man and woman

Top-5 Predictions¶

n10148035 - groom        39.19%
n02883205 - bow_tie      27.01%
n04350905 - suit         11.94%
n03450230 - gown         8.47%
n03770439 - miniskirt    2.00%

5.3 Other image classification architectures¶

AlexNet (Krishevzky et al., 2012, ILSVRC)¶

AlexNet_architecture_01.png AlexNet architecture

Input Conv2D Max-Pooling Dense

5.4 Other image classification architectures¶

Inception V3 (2015)¶

InceptionV3 architecture

https://cloud.google.com/tpu/docs/inception-v3-advanced

5.5 Deep learning architecture improvements¶

InceptionV1 (2014), ResNet, ResNeXt (2016)¶

InceptionV1 module ResNet architecture

6. Object Detection¶

An extension of image classification.

Image classification¶

input image -> class probabilities (multinomial logistic regression)

Object localization and detection¶

(single object): input image ->

class probabilities (multinomial logistic regression),
bounding box (x, y, width, height)

(multiple objects at different scales): input image -> plural image grid cells

class scores/probabilities (multinomial logistic regression),
objecness score for cell,
plural bounding boxes (x, y, width, height)

6.1 YOLOv1 Object detection¶

YOLOv1 (Pascal VOC, 20 classes)
YOLOv3 (MS-COCO, 80 classes) - Total yolov3.layers: 252 (Convolutional: 106)

YOLOv1 Object detection

See notebook: YOLOv3_Object_Detection.ipynb - html

6.2 YOLOv3 Object detection (on MS-COCO)¶

VGG16 architecture

Source: Ayoosh Kathuria, YOLOv3 YOLO: https://pjreddie.com/yolo/

6.3 YOLOv3 Example - West Palm Beach¶

Clematis Street¶

Clematis Street

Evening on Clematis Street

6.3 YOLOv3 - Evening on Clematis Street¶

Clematis Street

Predictions on image (total objects: 11)¶

bus: 99.7612% person: 99.4076% person: 95.5676%
bicycle: 99.6953% traffic light: 87.7151% traffic light: 84.3330%
bus: 63.5284% car: 68.7173% car: 97.9441%
car: 91.7072% motorbike: 67.3111%

6.5 Benchmarks and Performance Numbers¶

Image classification performance¶

ImageNet 2011 - 2017, human performance (95%) vs. super-human performance (98%) (Source: EFF AI Metrics)

CV-2017-ImageNet Image Recognition - EFF

7. New developments: New neural models¶

Feedback neural networks (inspired by the mamalian visual system)¶

Role of feedback in mammalian vision: a new hypothesis and a computational model P.S. Sastry, Shesha Shah, S. Singh, K.P. Unnikrishnan, Vision Research 39 (1999) 131–148, Elsevier.

Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks, Cao, Chunshui et al., IEEE ICCV, 2015. pdf github IEEE PAMI

Capsule networks¶

Dynamic Routing Between Capsules, Sara Sabour, Nicholas Frosst, Geoffrey E Hinton, 2017 (https://arxiv.org/abs/1710.09829) (https://github.com/XifengGuo/CapsNet-Keras)

Absolute MNIST error rate reduction from 0.21% (99.79% accuracy) to 0.17% (99.83% accuracy), Zhao et al., 2019.

Capsule Networks with Max-Min Normalization, Zhen Zhao, Ashley Kleinhans, Gursharan Sandhu, Ishan Patel, K. P. Unnikrishnan, 2019, (https://arxiv.org/abs/1903.09662)

NOTE: After 30 years, MNIST Test can no longer be considered a proper test set (it is instead a validation set).

CapsNet with Max-Min: MNIST Test set errors¶

Misclassified MNIST images using 3-model majority vote from CapsNets trained using Max-Min normalization

Max-Min MNIST Test errors

Total of 17 digit errors in 10,000 digit test set (99,83% test accuracy)
MNIST digit index 6576 - Ground Truth: "7"
current deep networks recognize as "1"; no human makes such error

Source: Capsule Networks with Max-Min Normalization, Zhao et al., 2019

Conclusion¶

In this talk we have presented:

computer vision tasks of image classification and object detection
current image benchmark datasets (MNIST, Pascal VOC, ImageNet, MS-COCO)
the MLP and recent deep learning architectures
pre-trained, custom trained and used several deep learning architectures
- MLP and basic CNNs on MNIST
- VGG16 image classification on ImageNet
- YOLOv3 object detection on MS-COCO
Noted new developments (MNIST SOTA, capsules and feedback networks)

... But, there are other considerations for computer vision and AI¶

Ethics of AI: Other considerations for computer vision and AI¶

There are many impacts that computer vision and AI already have in society:

The future of work (impact of CV and AI on jobs, leisure, income)
- What will the world economy be like when "AIs" running on wind and solar power do most of the work?
Surveillance and privacy (what are appropriate uses of CV and AI)
User manipulation and monetization
- e.g., election interference, tracking, advertisement, user behavior
Laws of robotics (I. Asimov, Robo-Cops, Drones, War machines)

Ethics of AI Initiatives¶

Acknowledgements¶

We would to thank the following for useful discussions and comments on this work:

Johann Beukes and Brian Beam, Levatas
K.P. Unnikrishnan, eNeuroLearn

Slides prepared with

References¶

Deep Learning with Python, François Chollet, 2017, Manning Publications, (Chapter 5)
https://www.amazon.com/Deep-Learning-Python-Francois-Chollet/dp/1617294438

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, Aurélien Géron, 2017, 1st ed. (second edition in October)
https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291

Deep Learning, LeCun, Y., Bengio, Y. and Hinton, G. E., Nature, Vol. 521, 2015 (pdf)

Keras¶

References, cont'd¶

VGG16¶

Very Deep Convolutional Networks for Large-Scale Image Recognition, Karen Simonyan & Andrew Zisserman, Visual Geometry Group, Department of Engineering Science, University of Oxford, ICLR 2015. (https://arxiv.org/abs/1409.1556)

YOLO¶

You Only Look Once: Unified, Real-Time Object Detection, Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi; University of Washington, Allen Institute for AI, Facebook AI Research, 2015 (https://pjreddie.com/yolo/)

https://arxiv.org/abs/1506.2640 (YOLO)
https://arxiv.org/abs/1612.08242 (YOLO-9000, aka YOLOv2)
https://arxiv.org/abs/1804.02767 (YOLOv3)

Computer vision and object detection¶

Robust Real-time Object Detection, Paul Viola and Michael Jones, IJCV 2001

Learning OpenCV, Gary Bradsky and Adrian Kaehler, O'Reilly, 2008

Thank you!¶

Dama Sentada

Dama Sentada, 1937 (Object detection by YOLOv3)

Deep Learning Architectures for Image Classification and Object Detection¶

Outline¶

1. Introduction: A brief chronology¶

Computer vision and deep neural networks (1950 - 2000)¶

Computer vision and deep neural networks (2000 - present)¶

2. Computer vision tasks and datasets¶

NOTES:¶

2.1 Image Classification (logistic regression)¶

Pascal VOC (20 classes) - MS COCO (80 classes):¶

2.2 Object Detection (logistic+linear regression)¶

Object Bounding Boxes¶

2.5 Computer vision datasets¶

Datasets¶

Image types¶

ImageNet Summary and Statistics (updated on April 30, 2010)¶

3. Deep neural network advances since 2011¶

3.1 Deep neural networks pre-2011¶

3.2 Deep neural network advances since 2011¶

4. MLP and CNN: Deep neural networks (MNIST)¶

4.1 MNIST State-of-the-Art (SOTA)¶

MNIST Test accuracy stands at 99.83% (in 2019)¶

Record (03/2019)¶

Others (2017 - 2018)¶

5. VGG16 Image classification (ImageNet)¶

5.1 Image classification with VGG16¶

Image classification examples and issues with VGG16 in Keras¶

5.2 VGG16 Example - African elephant¶

African elephant (Flickr)¶

Top-5 Predictions¶

5.2 VGG16 Example - People¶

Young man and woman (MS-COCO)¶

Top-5 Predictions¶

5.3 Other image classification architectures¶

AlexNet (Krishevzky et al., 2012, ILSVRC)¶

5.4 Other image classification architectures¶

Inception V3 (2015)¶

5.5 Deep learning architecture improvements¶

InceptionV1 (2014), ResNet, ResNeXt (2016)¶

6. Object Detection¶

Image classification¶

Object localization and detection¶

6.1 YOLOv1 Object detection¶

6.2 YOLOv3 Object detection (on MS-COCO)¶

6.3 YOLOv3 Example - West Palm Beach¶

Clematis Street¶

6.3 YOLOv3 - Evening on Clematis Street¶

Predictions on image (total objects: 11)¶

6.5 Benchmarks and Performance Numbers¶

Image classification performance¶

7. New developments: New neural models¶

Feedback neural networks (inspired by the mamalian visual system)¶

Capsule networks¶

CapsNet with Max-Min: MNIST Test set errors¶

Conclusion¶

... But, there are other considerations for computer vision and AI¶

Ethics of AI: Other considerations for computer vision and AI¶

Ethics of AI Initiatives¶

Acknowledgements¶

References¶

Keras¶

References, cont'd¶

VGG16¶

YOLO¶

Computer vision and object detection¶

Thank you!¶

Deep Learning Architectures for Image Classification
and Object Detection¶