squash function | TheAILearner

In the last blog we have seen that what is a capsule network and how it can overcome the problems associated with convolutional neural network. In this blog we will implement a capsule network in keras.

You can find full code here.

Here, we will use handwritten digit dataset(MNIST) and train the capsule network to classify the digits. MNIST digit dataset consists of grayscale images of size 28*28.

Capsule Network architecture is somewhat similar to convolutional neural network except capsule layers. We can break the implementation of capsule network into following steps:

Initial convolutional layer
Primary capsule layer
Digit capsule layer
Decoder network
Loss Functions
Training and testing of model

Initial Convolution Layer:

Initially we will use a convolution layer to detect low level features of an image. It will use 256 filters each of size 9*9 with stride 1 and activation function is relu. Input size of image is 28*28, after applying this layer output size will be 20*20*256.

input_shape = Input(shape=(28,28,1))  # size of input image is 28*28

# a convolution layer output shape = 20*20*256
conv1 = Conv2D(256, (9,9), activation = 'relu', padding = 'valid')(input_shape)

input_shape = Input(shape=(28,28,1)) # size of input image is 28*28

# a convolution layer output shape = 20*20*256

conv1 = Conv2D(256, (9,9), activation = 'relu', padding = 'valid')(input_shape)

Primary Capsule Layer:

The output from the previous layer is being passed to 256 filters each of size 9*9 with a stride of 2 which will produce an output of size 6*6*256. This output is then reshaped into 8-dimensional vector. So shape will be 6*6*32 capsules each of which will be 8-dimensional. Then it will pass through a non-linear function(squash) so that length of output vector can be maintained between 0 and 1.

# convolution layer with stride 2 and 256 filters of size 9*9
conv2 = Conv2D(256, (9,9), strides = 2, padding = 'valid')(conv1)

# reshape into 1152 capsules of 8 dimensional vectors
reshaped = Reshape((6*6*32,8))(conv2)

# squash the reshaped output to make length of vector b/w 0 and 1
squashed_output = Lambda(squash)(reshaped)

def squash(inputs):
    # take norm of input vectors
    squared_norm = K.sum(K.square(inputs), axis = -1, keepdims = True)

    # use the formula for non-linear function to return squashed output
    return ((squared_norm/(1+squared_norm))/(K.sqrt(squared_norm+K.epsilon())))*inputs

# convolution layer with stride 2 and 256 filters of size 9*9

conv2 = Conv2D(256, (9,9), strides = 2, padding = 'valid')(conv1)

# reshape into 1152 capsules of 8 dimensional vectors

reshaped = Reshape((6*6*32,8))(conv2)

# squash the reshaped output to make length of vector b/w 0 and 1

squashed_output = Lambda(squash)(reshaped)

def squash(inputs):

# take norm of input vectors

squared_norm = K.sum(K.square(inputs), axis = -1, keepdims = True)

# use the formula for non-linear function to return squashed output

return ((squared_norm/(1+squared_norm))/(K.sqrt(squared_norm+K.epsilon())))*inputs

Digit Capsule Layer:

Logic and algorithm used for this layer is explained in the previous blog. Here we will see what we need to do in code to implement it. We need to write a custom layer in keras. It will take 1152*8 as its input and produces output of size 10*16, where 10 capsules each represents an output class with 16 dimensional vector. Then each of these 10 capsules are converted into single value to predict the output class using a lambda layer.

class DigitCapsuleLayer(Layer):
    # creating a layer class in keras
    def __init__(self, **kwargs):
        super(DigitCapsuleLayer, self).__init__(**kwargs)
        self.kernel_initializer = initializers.get('glorot_uniform')
    
    def build(self, input_shape): 
        # initialize weight matrix for each capsule in lower layer
        self.W = self.add_weight(shape = [10, 6*6*32, 16, 8], initializer = self.kernel_initializer, name = 'weights')
        self.built = True
    
    def call(self, inputs):
        inputs = K.expand_dims(inputs, 1)
        inputs = K.tile(inputs, [1, 10, 1, 1])
        # matrix multiplication b/w previous layer output and weight matrix
        inputs = K.map_fn(lambda x: K.batch_dot(x, self.W, [2, 3]), elems=inputs)
        b = tf.zeros(shape = [K.shape(inputs)[0], 10, 6*6*32])
        
# routing algorithm with updating coupling coefficient c, using scalar product b/w input capsule and output capsule
        for i in range(3-1):
            c = tf.nn.softmax(b, dim=1)
            s = K.batch_dot(c, inputs, [2, 2])
            v = squash(s)
            b = b + K.batch_dot(v, inputs, [2,3])
            
        return v 
    def compute_output_shape(self, input_shape):
        return tuple([None, 10, 16])

def output_layer(inputs):
    return K.sqrt(K.sum(K.square(inputs), -1) + K.epsilon())

digit_caps = DigitCapsuleLayer()(squashed_output)
outputs = Lambda(output_layer)(digit_caps)

class DigitCapsuleLayer(Layer):

# creating a layer class in keras

def __init__(self, **kwargs):

super(DigitCapsuleLayer, self).__init__(**kwargs)

self.kernel_initializer = initializers.get('glorot_uniform')

def build(self, input_shape):

# initialize weight matrix for each capsule in lower layer

self.W = self.add_weight(shape = [10, 6*6*32, 16, 8], initializer = self.kernel_initializer, name = 'weights')

self.built = True

def call(self, inputs):

inputs = K.expand_dims(inputs, 1)

inputs = K.tile(inputs, [1, 10, 1, 1])

# matrix multiplication b/w previous layer output and weight matrix

inputs = K.map_fn(lambda x: K.batch_dot(x, self.W, [2, 3]), elems=inputs)

b = tf.zeros(shape = [K.shape(inputs)[0], 10, 6*6*32])

# routing algorithm with updating coupling coefficient c, using scalar product b/w input capsule and output capsule

for i in range(3-1):

c = tf.nn.softmax(b, dim=1)

s = K.batch_dot(c, inputs, [2, 2])

v = squash(s)

b = b + K.batch_dot(v, inputs, [2,3])

return v

def compute_output_shape(self, input_shape):

return tuple([None, 10, 16])

def output_layer(inputs):

return K.sqrt(K.sum(K.square(inputs), -1) + K.epsilon())

digit_caps = DigitCapsuleLayer()(squashed_output)

outputs = Lambda(output_layer)(digit_caps)

Decoder Network:

To further boost the pose parameters learned by the digit capsule layer, we can add decoder network to reconstruct the input image. In this part, decoder network will be fed with an input of size 10*16 (digit capsule layer output) and will reconstruct back the original image of size 28*28. Decoder will consist of 3 dense layer having 512, 1024 and 784 nodes.

During training time input to the decoder is the output from digit capsule layer which is masked with original labels. It means that other vectors except the vector corresponding to correct label will be multiplied with zero. So that decoder can only be trained with correct digit capsule. In test time input to decoder will be the same output from digit capsule layer but masked with highest length vector in that layer. Lets see the code.

def mask(outputs):

    if type(outputs) != list:  # mask at test time
        norm_outputs = K.sqrt(K.sum(K.square(outputs), -1) + K.epsilon())
        y  = K.one_hot(indices=K.argmax(norm_outputs, 1), num_classes = 10)
        y = Reshape((10,1))(y)
        return Flatten()(y*outputs)

    else:    # mask at train time
        y = Reshape((10,1))(outputs[1])
        masked_output = y*outputs[0]
        return Flatten()(masked_output)

inputs = Input(shape = (10,))
masked = Lambda(mask)([digit_caps, inputs])
masked_for_test = Lambda(mask)(digit_caps)

decoded_inputs = Input(shape = (16*10,))
dense1 = Dense(512, activation = 'relu')(decoded_inputs)
dense2 = Dense(1024, activation = 'relu')(dense1)
decoded_outputs = Dense(784, activation = 'sigmoid')(dense2)
decoded_outputs = Reshape((28,28,1))(decoded_outputs)

def mask(outputs):

if type(outputs) != list: # mask at test time

norm_outputs = K.sqrt(K.sum(K.square(outputs), -1) + K.epsilon())

y = K.one_hot(indices=K.argmax(norm_outputs, 1), num_classes = 10)

y = Reshape((10,1))(y)

return Flatten()(y*outputs)

else: # mask at train time

y = Reshape((10,1))(outputs[1])

masked_output = y*outputs[0]

return Flatten()(masked_output)

inputs = Input(shape = (10,))

masked = Lambda(mask)([digit_caps, inputs])

masked_for_test = Lambda(mask)(digit_caps)

decoded_inputs = Input(shape = (16*10,))

dense1 = Dense(512, activation = 'relu')(decoded_inputs)

dense2 = Dense(1024, activation = 'relu')(dense1)

decoded_outputs = Dense(784, activation = 'sigmoid')(dense2)

decoded_outputs = Reshape((28,28,1))(decoded_outputs)

Loss Functions:

It uses two loss function one is probabilistic loss function used for classifying digits image and another is reconstruction loss which is mean squared error. Lets see probabilistic loss which is simple to understand once you look at following code.

def loss_fn(y_true, y_pred):

    L = y_true * K.square(K.maximum(0., 0.9 - y_pred)) + 0.5 * (1 - y_true) * K.square(K.maximum(0., y_pred - 0.1))

    return K.mean(K.sum(L, 1))

def loss_fn(y_true, y_pred):

L = y_true * K.square(K.maximum(0., 0.9 - y_pred)) + 0.5 * (1 - y_true) * K.square(K.maximum(0., y_pred - 0.1))

return K.mean(K.sum(L, 1))

Training and Testing of model:

Now define our training and testing model and train it on MNIST digit dataset.

decoder = Model(decoded_inputs, decoded_outputs)
model = Model([input_shape,inputs],[outputs,decoder(masked)])
test_model = Model(input_shape,[outputs,decoder(masked_for_test)])

m = 128
epochs = 10
model.compile(optimizer=keras.optimizers.Adam(lr=0.001),loss=[loss_fn,'mse'],loss_weights = [1. ,0.0005],metrics=['accuracy'])
model.fit([x_train, y_train],[y_train,x_train], batch_size = m, epochs = epochs, validation_data = ([x_test, y_test],[y_test,x_test]))

decoder = Model(decoded_inputs, decoded_outputs)

model = Model([input_shape,inputs],[outputs,decoder(masked)])

test_model = Model(input_shape,[outputs,decoder(masked_for_test)])

m = 128

epochs = 10

model.compile(optimizer=keras.optimizers.Adam(lr=0.001),loss=[loss_fn,'mse'],loss_weights = [1. ,0.0005],metrics=['accuracy'])

model.fit([x_train, y_train],[y_train,x_train], batch_size = m, epochs = epochs, validation_data = ([x_test, y_test],[y_test,x_test]))

In test data set it was able to achieve 99.09% accuracy. Pretty good yeah! Also reconstructed images looks good. Here are the reconstructed images generated by decoder network.

Capsule Network comes with promising results and yet to be explored thoroughly. There are various bits and bytes where it can be explored. Research on a capsule network is still in an early stage but it has given clear indication that it is worth exploring.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Since 2012 with the introduction of AlexNet, convolutional neural networks(CNNs) are being used as sole resource for many wide range image problems. Convolutional neural networks are able to perform really well in the field of image classification, object detection, semantic segmentation and many more.

But are CNNs best solution to solve image problems? Does they translate all features present in the image to predict the output?

Problems with Convolutional Neural Networks:

CNNs uses pooling layers to reduce parameters so that it can speed up computation. In that process it looses some of its useful features information.
CNNs also requires huge amount of dataset to train otherwise it will not give high accuracy in the test dataset.
CNNs basically try to achieve “viewpoint invariance”. It means by changing input a little bit, output will not change. Also, CNNs do not store relative spatial relationship between features.

To solve these problems we need to find a better solution. That is where capsule network comes. A network which has given an early indication that it can solves problem associated with convolution neural networks. Recently, Geoffrey E. Hinton et. al. has published a paper named “Dynamic Routing Between Capsules”, in which they have introduced capsule network and dynamic routing algorithm.

What is a Capsule Network?

A capsule is a group of neurons which uses vectors to represent an object or object part. Length of a vector represents presence of an object and orientation of vector represents its pose(size, position, orientation, etc). Group of these capsules forms a capsule layer and then these layers lead to form a capsule network. It has some advantages over CNN.

Capsule network tries to achieve “equivariance”. It means by changing input a little bit, output will also change but length of vector will remain same which will predict the presence of same object.
Capsule Networks also requires less amount of data for training because it saves spatial relationship between features.
Capsule network do not uses pooling layers which removes the problem of loosing useful features information.

How a Capsule Network works?

Usually in CNNs we deal with layers i.e. one layer passes information to subsequent layer and so on. CapsNet follows same flow as shown below.

Diagram shown above, represents network architecture used in the paper for MNIST dataset. Initial layer uses convolution to get low level features from image and pass them to a primary capsule layer.

A primary capsule layer reshapes output from previous convolution layer into capsules containing vectors of equal dimension. Length of each of these vector represents the probability of presence of an object, that is why we also need to use a non linear function “squashing” to change length of every vector between 0 and 1.

Where S_j is the input vector ||S_j|| is the norm of vector and v_j is the output vector. And that will be the output of primary capsule layer. Capsules in the next layer are generated using dynamic routing algorithm. Which follows following algorithm.

Routing Algorithm:

The main feature of routing algorithm is the agreement between capsules. The lower level capsules will send values to higher level capsules if they agree to each other.

Let’s take an example of an image of a face. If there are four capsules in a lower layer each of which representing mouth, nose, left eye, and right eye respectively. And if all of these four agrees to same face position then it will send its values to the output layer capsule regarding there is a presence of a face.

To produce output for the routing capsules( capsules in the higher layer), firstly output from lower layer(u) is multiplied with weight matrix W and then it uses a coupling coefficient C. This C will determine which capsules form lower layer will send its output to which capsule in higher layer.

Coupling coefficient c is learned iteratively. The sum of all the c for a capsule ‘i’ in the lower layer is equal to 1. This maintains the probabilistic nature of vector that its length represents the probability of the presence of an object. C is determined by an applying softmax to weights b. Where initial values of b is taken to zero.

The routing agreement is determined by updating weights b by adding previous b to scalar product between current capsule in higher layer and capsule in lower layer( shown in line 7 in below algorithm)

Further to boost the capsule layer estimation, authors have added a decoder network to it. A decoder network tries to reconstruct the original image using an output of digit capsule layer. It is simply adding some fully connected layer to the output of 16-dimensional capsule layer.

Now we have seen basic concepts of a capsule network. To get more in depth knowledge about capsule network, the best way is to implement its code. Which you can see in the next blog.

The Next Blog : Implementing Capsule Network in Keras

Referenced Research Paper: Dynamic Routing Between Capsules

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

TheAILearner

Mastering Artificial Intelligence

Tag Archives: squash function

Implementing Capsule Network in Keras

Capsule Networks