scrapbook
  • "Unorganized" Notes
  • The Best Public Datasets for Machine Learning and Data Science
  • Practice Coding
  • plaid-API project
  • Biotech
    • Machine Learning vs. Deep Learning
  • Machine Learning for Computer Graphics
  • Books (on GitHub)
  • Ideas/Thoughts
  • Ziva for feature animation: Stylized simulation and machine learning-ready workflows
  • Tools
  • 🪶math
    • Papers
    • Math for ML (coursera)
      • Linear Algebra
        • Wk1
        • Wk2
        • Wk3
        • Wk4
        • Wk5
      • Multivariate Calculus
    • Improving your Algorithms & Data Structure Skills
    • Algorithms
    • Algorithms (MIT)
      • Lecture 1: Algorithmic Thinking, Peak Finding
    • Algorithms (khan academy)
      • Binary Search
      • Asymptotic notation
      • Sorting
      • Insertion sort
      • Recursion
      • Solve Hanoi recursively
      • Merge Sort
      • Representing graphs
      • The breadth-first search algorithm
      • Breadth First Search in JavaScript
      • Breadth-first vs Depth-first Tree Traversal in Javascript
    • Algorithms (udacity)
      • Social Network
    • Udacity
      • Linear Algebra Refresher /w Python
    • math-notes
      • functions
      • differential calculus
      • derivative
      • extras
      • Exponentials & logarithms
      • Trigonometry
    • Probability (MIT)
      • Unit 1
        • Probability Models and Axioms
        • Mathematical background: Sets; sequences, limits, and series; (un)countable sets.
    • Statistics and probability (khan academy)
      • Analyzing categorical data
      • Describing and comparing distributions
      • Outliers Definition
      • Mean Absolute Deviation (MAD)
      • Modeling data distribution
      • Exploring bivariate numerical data
      • Study Design
      • Probability
      • Counting, permutations, and combinations
      • Binomial variables
        • Binomial Distribution
        • Binomial mean and standard deviation formulas
        • Geometric random variable
      • Central Limit Theorem
      • Significance Tests (hypothesis testing)
    • Statistics (hackerrank)
      • Mean, Medium, Mode
      • Weighted Mean
      • Quartiles
      • Standard Deviation
      • Basic Probability
      • Conditional Probability
      • Permutations & Combinations
      • Binomial Distribution
      • Negative Binomial
      • Poisson Distribution
      • Normal Distribution
      • Central Limit Theorem
      • Important Concepts in Bayesian Statistics
  • 📽️PRODUCT
    • Product Strategy
    • Product Design
    • Product Development
    • Product Launch
  • 👨‍💻coding
    • of any interest
    • Maya API
      • Python API
    • Python
      • Understanding Class Inheritance in Python 3
      • 100+ Python challenging programming exercises
      • coding
      • Iterables vs. Iterators vs. Generators
      • Generator Expression
      • Stacks (LIFO) / Queues (FIFO)
      • What does -1 mean in numpy reshape?
      • Fold Left and Right in Python
      • Flatten a nested list of lists
      • Flatten a nested dictionary
      • Traverse A Tree
      • How to Implement Breadth-First Search
      • Breadth First Search
        • Level Order Tree Traversal
        • Breadth First Search or BFS for a Graph
        • BFS for Disconnected Graph
      • Trees and Tree Algorithms
      • Graph and its representations
      • Graph Data Structure Interview Questions
      • Graphs in Python
      • GitHub Repo's
    • Python in CG Production
    • GLSL/HLSL Shading programming
    • Deep Learning Specialization
      • Neural Networks and Deep Learning
      • Untitled
      • Untitled
      • Untitled
    • TensorFlow for AI, ML, and DL
      • Google ML Crash Course
      • TensorFlow C++ API
      • TensorFlow - coursera
      • Notes
      • An Introduction to different Types of Convolutions in Deep Learning
      • One by One [ 1 x 1 ] Convolution - counter-intuitively useful
      • SqueezeNet
      • Deep Compression
      • An Overview of ResNet and its Variants
      • Introducing capsule networks
      • What is a CapsNet or Capsule Network?
      • Xception
      • TensorFlow Eager
    • GitHub
      • Project README
    • Agile - User Stories
    • The Open-Source Data Science Masters
    • Coding Challenge Websites
    • Coding Interview
      • leetcode python
      • Data Structures
        • Arrays
        • Linked List
        • Hash Tables
        • Trees: Basic
        • Heaps, Stacks, Queues
        • Graphs
          • Shortest Path
      • Sorting & Searching
        • Depth-First Search & Breadth-First Search
        • Backtracking
        • Sorting
      • Dynamic Programming
        • Dynamic Programming: Basic
        • Dynamic Programming: Advanced
    • spaCy
    • Pandas
    • Python Packages
    • Julia
      • jupyter
    • macos
    • CPP
      • Debugging
      • Overview of memory management problems
      • What are lvalues and rvalues?
      • The Rule of Five
      • Concurrency
      • Avoiding Data Races
      • Mutex
      • The Monitor Object Pattern
      • Lambdas
      • Maya C++ API Programming Tips
      • How can I read and parse CSV files in C++?
      • Cpp NumPy
    • Advanced Machine Learning
      • Wk 1
      • Untitled
      • Untitled
      • Untitled
      • Untitled
  • data science
    • Resources
    • Tensorflow C++
    • Computerphile
      • Big Data
    • Google ML Crash Course
    • Kaggle
      • Data Versioning
      • The Basics of Rest APIs
      • How to Make an API
      • How to deploying your API
    • Jupiter Notebook Tips & Tricks
      • Jupyter
    • Image Datasets Notes
    • DS Cheatsheets
      • Websites & Blogs
      • Q&A
      • Strata
      • Data Visualisation
      • Matplotlib etc
      • Keras
      • Spark
      • Probability
      • Machine Learning
        • Fast Computation of AUC-ROC score
    • Data Visualisation
    • fast.ai
      • deep learning
      • How to work with Jupyter Notebook on a remote machine (Linux)
      • Up and Running With Fast.ai and Docker
      • AWS
    • Data Scientist
    • ML for Beginners (Video)
    • ML Mastery
      • Machine Learning Algorithms
      • Deep Learning With Python
    • Linear algebra cheat sheet for deep learning
    • DL_ML_Resources
    • Awesome Machine Learning
    • web scraping
    • SQL Style Guide
    • SQL - Tips & Tricks
  • 💡Ideas & Thoughts
    • Outdoors
    • Blog
      • markdown
      • How to survive your first day as an On-set VFX Supervisor
    • Book Recommendations by Demi Lee
  • career
    • Skills
    • learn.co
      • SQL
      • Distribution
      • Hypothesis Testing Glossary
      • Hypothesis Tests
      • Hypothesis & AB Testing
      • Combinatorics Continued and Maximum Likelihood Estimation
      • Bayesian Classification
      • Resampling and Monte Carlo Simulation
      • Extensions To Linear Models
      • Time Series
      • Distance Metrics
      • Graph Theory
      • Logistic Regression
      • MLE (Maximum Likelihood Estimation)
      • Gradient Descent
      • Decision Trees
      • Ensemble Methods
      • Spark
      • Machine Learning
      • Deep Learning
        • Backpropagation - math notation
        • PRACTICE DATASETS
        • Big Data
      • Deep Learning Resources
      • DL Datasets
      • DL Tutorials
      • Keras
      • Word2Vec
        • Word2Vec Tutorial Part 1 - The Skip-Gram Model
        • Word2Vec Tutorial Part 2 - Negative Sampling
        • An Intuitive Explanation of Convolutional Neural Networks
      • Mod 4 Project
        • Presentation
      • Mod 5 Project
      • Capstone Project Notes
        • Streaming large training and test files into Tensorflow's DNNClassifier
    • Carrier Prep
      • The Job Search
        • Building a Strong Job Search Foundation
        • Key Traits of Successful Job Seekers
        • Your Job Search Mindset
        • Confidence
        • Job Search Action Plan
        • CSC Weekly Activity
        • Managing Your Job Search
      • Your Online Presence
        • GitHub
      • Building Your Resume
        • Writing Your Resume Summary
        • Technical Experience
      • Effective Networking
        • 30 Second Elevator Pitch
        • Leveraging Your Network
        • Building an Online Network
        • Linkedin For Research And Networking
        • Building An In-Person Network
        • Opening The Line Of Communication
      • Applying to Jobs
        • Applying To Jobs Online
        • Cover Letters
      • Interviewing
        • Networking Coffees vs Formal Interviews
        • The Coffee Meeting/ Informational Interview
        • Communicating With Recruiters And HR Professional
        • Research Before an Interview
        • Preparing Questions for Interviews
        • Phone And Video/Virtual Interviews
        • Cultural/HR Interview Questions
        • The Salary Question
        • Talking About Apps/Projects You Built
        • Sending Thank You's After an Interview
      • Technical Interviewing
        • Technical Interviewing Formats
        • Code Challenge Best Practices
        • Technical Interviewing Resources
      • Communication
        • Following Up
        • When You Haven't Heard From an Employer
      • Job Offers
        • Approaching Salary Negotiations
      • Staying Current in the Tech Industry
      • Module 6 Post Work
      • Interview Prep
  • projects
    • Text Classification
    • TERRA-REF
    • saildrone
  • Computer Graphics
  • AI/ML
  • 3deeplearning
    • Fast and Deep Deformation Approximations
    • Compress and Denoise MoCap with Autoencoders
    • ‘Fast and Deep Deformation Approximations’ Implementation
    • Running a NeuralNet live in Maya in a Python DG Node
    • Implement a Substance like Normal Map Generator with a Convolutional Network
    • Deploying Neural Nets to the Maya C++ API
  • Tools/Plugins
  • AR/VR
  • Game Engine
  • Rigging
    • Deformer Ideas
    • Research
    • brave rabbit
    • Useful Rigging Links
  • Maya
    • Optimizing Node Graph for Parallel Evaluation
  • Houdini
    • Stuff
    • Popular Built-in VEX Attributes (Global Variables)
Powered by GitBook
On this page
  • So, what are CapsNets exactly?
  • Figure 6. Routing by agreement, step 2—update the routing weights. Image by Aurélien Géron.
  1. coding
  2. TensorFlow for AI, ML, and DL

Introducing capsule networks

PreviousAn Overview of ResNet and its VariantsNextWhat is a CapsNet or Capsule Network?

Last updated 6 years ago

How CapsNets can overcome some shortcomings of CNNs, including requiring less training data, preserving image details, and handling ambiguity.

Capsule networks (CapsNets) are a hot new neural net architecture that may well have a profound impact on deep learning, in particular for computer vision. Wait, isn't computer vision pretty much solved already? Haven't we all seen fabulous examples of convolutional neural networks (CNNs) reaching super-human level in various computer vision tasks, such as classification, localization, object detection, semantic segmentation or instance segmentation (see Figure 1)?Figure 1. Some of the main computer vision tasks. Today, each of these tasks requires a very different CNN architecture, for example ResNet for classification, YOLO for object detection, Mask R-CNN for instance segmentation, and so on. Image by Aurélien Géron.

main computer vision tasks

Well, yes, we’ve seen fabulous CNNs, but:

  • They were trained on huge numbers of images (or they reused parts of neural nets that had). CapsNets can generalize well using much less training data.

  • CNNs don’t handle ambiguity very well. CapsNets do, so they can perform well even on crowded scenes (although, they still struggle with backgrounds right now).

  • CNNs lose plenty of information in the pooling layers. These layers reduce the spatial resolution (see Figure 2), so their outputs are invariant to small changes in the inputs. This is a problem when detailed information must be preserved throughout the network, such as in semantic segmentation. Today, this issue is addressed by building complex architectures around CNNs to recover some of the lost information. With CapsNets, detailed pose information (such as precise object position, rotation, thickness, skew, size, and so on) is preserved throughout the network, rather than lost and later recovered. Small changes to the inputs result in small changes to the outputs—information is preserved. This is called "equivariance." As a result, CapsNets can use the same simple and consistent architecture across different vision tasks.

  • Finally, CNNs require extra components to automatically identify which object a part belongs to (e.g., this leg belongs to this sheep). CapsNets give you the hierarchy of parts for free.

So, what are CapsNets exactly?

Much like a regular neural network, a CapsNet is organized in multiple layers (see Figure 4). The capsules in the lowest layer are called primary capsules: each of them receives a small region of the image as input (called its receptive field), and it tries to detect the presence and pose of a particular pattern, for example a rectangle. Capsules in higher layers, called routing capsules, detect larger and more complex objects, such as boats.Figure 4. A two-layer CapsNet. In this example, the primary capsule layer has two maps of 5x5 capsules, while the second capsule layer has two maps of 3x3 capsules. Each capsule outputs a vector. Each arrow represents the output of a different capsule. Blue arrows represent the output of a capsule that tries to detect triangles, black arrows represent the output of a capsule that tries to detect rectangles, and so on. Image by Aurélien Géron.

The primary capsule layer is implemented using a few regular convolutional layers. For example, in the paper, they use two convolutional layers that output 256 6x6 features maps containing scalars. They reshape this output to get 32 6x6 maps containing 8-dimensional vectors. Finally, they use a novel squashing function to ensure these vectors have a length between 0 and 1 (to represent a probability). And that's it: this gives the output of the primary capsules.

The capsules in the next layers also try to detect objects and their pose, but they work very differently, using an algorithm called routing by agreement. This is where most of the magic of CapsNets lies. Let's look at an example.

Suppose there are just two primary capsules: one rectangle capsule and one triangle capsule, and suppose they both detected what they were looking for. Both the rectangle and the triangle could be part of either a house or a boat (see Figure 5). Given the pose of the rectangle, which is slightly rotated to the right, the house and the boat would have to be slightly rotated to the right as well. Given the pose of the triangle, the house would have to be almost upside down, whereas the boat would be slightly rotated to the right. Note that both the shapes and the whole/part relationships are learned during training. Now notice that the rectangle and the triangle agree on the pose of the boat, while they strongly disagree on the pose of the house. So, it is very likely that the rectangle and triangle are part of the same boat, and there is no house.Figure 5. Routing by agreement, step 1—predict the presence and pose of objects based on the presence and pose of object parts, then look for agreement between the predictions. Image by Aurélien Géron.

Since we are now confident that the rectangle and triangle are part of the boat, it makes sense to send the outputs of the rectangle and triangle capsules more to the boat capsule, and less to the house capsule: this way, the boat capsule will receive more useful input signal, and the house capsule will receive less noise. For each connection, the routing-by-agreement algorithm maintains a routing weight (see Figure 6): it increases routing weight when there is agreement, and decreases it in case of disagreement.

Figure 6. Routing by agreement, step 2—update the routing weights. Image by Aurélien Géron.

The routing-by-agreement algorithm involves a few iterations of agreement-detection + routing-update (note that this happens for each prediction, not just once, and not just at training time). This is especially useful in crowded scenes: for example, in Figure 7, the scene is ambiguous because you could see an upside-down house in the middle, but this would leave the bottom rectangle and top triangle unexplained, so the routing-by-agreement algorithm will most likely converge to a better explanation: a boat at the bottom, and a house at the top. The ambiguity is said to be "explained away": the lower rectangle is best explained by the presence of a boat, which also explains the lower triangle, and once these two parts are explained away, the remaining parts are easily explained as a house.Figure 7. Routing by agreement can parse crowded scenes, such as this ambiguous image, which could be misinterpreted as an upside-down house plus some unexplained parts. Instead, the lower rectangle will be routed to the boat, and this will also pull the lower triangle into the boat as well. Once that boat is “explained away,” it’s easy to interpret the top part as a house. Image by Aurélien Géron.

Figure 2. The DeepLab2 pipeline for image segmentation, by Liang-Chieh Chen, et al.: notice that the output of the CNN (top right) is very coarse, making it necessary to add extra steps to recover some of the lost details. From the DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, figure reproduced with the kind permission of the authors. See this by S. Chilamkurthy to see how diverse and complex the architectures for semantic segmentation can get.

DeepLab2 pipeline for image segmentation

CapsNets were first introduced in 2011 by Geoffrey Hinton, et al., in a paper called , but it was only a few months ago, in November 2017, that Sara Sabour, Nicholas Frosst, and Geoffrey Hinton published a paper called , where they introduced a CapsNet architecture that reached state-of-the-art performance on MNIST (the famous data set of handwritten digit images), and got considerably better results than CNNs on MultiMNIST (a variant with overlapping pairs of different digits). See Figure 3.

Figure 3. MultiMNIST images (white) and their reconstructions by a CapsNet (red+green). “R” = reconstructions; “L” = labels. For example, the predictions for the first example (top left) are correct, and so are the reconstructions. But in the fifth example, the prediction is wrong: (5,7) instead of (5,0). Therefore, the 5 is correctly reconstructed, but not the 0. From the : Dynamic routing between capsules, figure reproduced with the kind permission of the authors.

MultiMNIST images and their reconstructions

Despite all their qualities, CapsNets are still far from perfect. Firstly, for now they don't perform as well as CNNs on larger images such as CIFAR10 or ImageNet. Moreover, they are computationally intensive, and they cannot detect two objects of the same type when they are too close to each other (this is called the "crowding problem," and it has been shown that ). But the key ideas are extremely promising, and it seems likely that they just need a few tweaks to reach their full potential. After all, modern CNNs were invented in 1998, yet they only beat the state of the art on ImageNet in 2012, after a few tweaks.

In short, a CapsNet is composed of capsules rather than neurons. A capsule is a small group of neurons that learns to detect a particular object (e.g., a rectangle) within a given region of the image, and it outputs a vector (e.g., an 8-dimensional vector) whose length represents the estimated probability that the object is present, and whose orientation (e.g., in 8D space) encodes the object's pose parameters (e.g., precise position, rotation, etc.). If the object is changed slightly (e.g., shifted, rotated, resized, etc.) then the capsule will output a vector of the same length, but oriented slightly differently. Thus, capsules are equivariant.

two-layer CapsNet
Routing-by-agreement
Routing-by-agreement, step 2 – Update the routing weights
Routing by agreement can parse crowded scenes

And that’s it—you know the key ideas behind CapsNets! If you want more details, check out my two (one on the architecture and another on the implementation) and my (Jupyter Notebook). Please don’t hesitate to comment on the videos, file issues on GitHub if you see any, or . I hope you found this post useful!

This is the original architecture proposed in the paper Dynamic routing with capsules, by S. Sabour, N. Frosst, and G. Hinton, but since then, they proposed a more general architecture where the object’s presence probability and pose parameters are encoded differently in the output vector. The ideas remain the same, however.

👨‍💻
paper
great post
Transforming Autoencoders
Dynamic Routing between Capsules
Learn more
paper
humans have it, too
[1]
videos on CapsNets
commented TensorFlow implementation
contact me on Twitter @aureliengeron
[1]