Unleashing the Power of Gelu Activation: Exploring Its Uses, Applications, and Implementation in Python

Introduction

In the field of deep learning, activation functions play a crucial role in introducing non-linearity to neural networks, enabling them to model complex relationships. One such activation function that has gained popularity is the Gelu activation function. Gelu stands for Gaussian Error Linear Unit, and it offers a smooth and continuous non-linear transformation. In this blog post, we will dive into the world of Gelu activation, its applications, the formula behind it, and how to implement it in Python.

Understanding Gelu Activation

The Gelu activation function was introduced in 2016 by Dan Hendrycks and Kevin Gimpel as an alternative to other popular activation functions such as ReLU (Rectified Linear Unit). Gelu is known for its ability to capture a wide range of non-linearities while maintaining smoothness and differentiability.

Formula and Characteristics

The Gelu activation function is defined mathematically as follows:

Gelu(x) = x * Φ(x) = 0.5x * (1 + erf(x / sqrt(2)))

where Φ(x) is the cumulative distribution function (CDF) of the standard normal distribution. In practice it is often computed with the tanh approximation:

Gelu(x) ≈ 0.5x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))

The key characteristics of the Gelu activation function are as follows:

  1. Range: Gelu outputs values in the range of roughly [-0.17, +inf); unlike ReLU, it produces small negative values for negative inputs.
  2. Differentiability: Gelu is a smooth function and possesses derivatives at all points.
  3. Non-monotonicity: Gelu is not strictly monotonic; it dips slightly below zero for negative inputs before increasing. Its smoothness still makes it well suited to gradient-based optimization algorithms.
  4. Gaussian Weighting: Gelu weights its input by the cumulative distribution function (CDF) of a standard normal distribution: Gelu(x) = x * Φ(x).

Applications of Gelu Activation

Gelu activation has found applications in various domains, including:

  1. Natural Language Processing (NLP): Gelu has shown promising results in NLP tasks such as sentiment analysis, machine translation, and text generation.
  2. Computer Vision: Gelu activation can be used in convolutional neural networks (CNNs) for image classification, object detection, and semantic segmentation tasks.
  3. Recommendation Systems: Gelu activation can enhance the performance of recommendation models by introducing non-linearities and capturing complex user-item interactions.

Implementing Gelu Activation in Python

Let’s see how we can implement the Gelu activation function in Python:
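Here is a minimal NumPy sketch (function names are my own) implementing both the exact definition via the error function and the tanh approximation given above:

```python
import numpy as np
from math import erf, sqrt

def gelu_exact(x):
    """Exact Gelu: x * Phi(x), where Phi is the standard normal CDF."""
    return x * 0.5 * (1.0 + erf(x / sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of Gelu from the original paper."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```

The two versions agree to within about 1e-3 over typical input ranges, which is why many frameworks ship the cheaper tanh form as the default.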

Comparison of Gelu and ReLU Activation Functions

  1. Formula:
    • ReLU: ReLU(x) = max(0, x)
    • Gelu: Gelu(x) = 0.5x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
  2. Range:
    • ReLU: ReLU outputs values in the range [0, +inf).
    • Gelu: Gelu outputs values in the range of roughly [-0.17, +inf), allowing small negative outputs.
  3. Smoothness and Continuity:
    • ReLU: ReLU is a piecewise linear function and non-differentiable at x=0.
    • Gelu: Gelu is a smooth and continuous function, ensuring differentiability at all points.
  4. Monotonicity:
    • ReLU: ReLU is monotonically non-decreasing everywhere: flat for x ≤ 0 and increasing for x > 0.
    • Gelu: Gelu is not strictly monotonic; it dips slightly below zero for negative inputs, but its smooth gradient still works well with gradient-based optimization algorithms.
  5. Non-linearity:
    • ReLU: ReLU introduces non-linearity by mapping negative values to 0 and preserving positive values unchanged.
    • Gelu: Gelu introduces non-linearity through a combination of linear and non-linear transformations, capturing a wider range of non-linearities.
  6. Performance:
    • ReLU: ReLU has been widely used due to its simplicity and computational efficiency. However, it suffers from the “dying ReLU” problem where neurons can become inactive (outputting 0) and may not recover during training.
    • Gelu: Gelu has shown promising performance in various tasks, including NLP and computer vision, and it mitigates the “dying ReLU” problem because its gradient is non-zero for all finite inputs.
  7. Applicability:
    • ReLU: ReLU is commonly used in hidden layers of deep neural networks and has been successful in image classification and computer vision tasks.
    • Gelu: Gelu has gained popularity in natural language processing (NLP) tasks, such as sentiment analysis and text generation, where capturing complex non-linear relationships is crucial.
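As a quick sanity check of the contrasts above, this sketch (names are mine) evaluates both activations on a few points; ReLU clips negatives to exactly zero while Gelu lets small negative values through:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # Tanh approximation of Gelu
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(relu(xs))  # negatives clipped to exactly 0
print(gelu(xs))  # negatives map to small non-zero values
```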

Conclusion

Gelu activation offers a powerful tool for introducing non-linearity to neural networks, making them capable of modeling complex relationships. Its smoothness, differentiability, and wide range of applications make it an attractive choice in various domains, including NLP, computer vision, and recommendation systems. By implementing Gelu activation in Python, researchers and practitioners can leverage its potential and explore its benefits in their own deep learning projects. So go ahead, unleash the power of Gelu and take your models to the next level!

If you have any doubts or suggestions, please feel free to ask, and I will do my best to help or improve. Goodbye until next time.

Neural Arithmetic Logic Units

In this tutorial, you will learn about neural arithmetic logic units (NALU).

You can find the full TensorFlow implementation of neural arithmetic logic units in my GitHub Repository.

Today, neural networks have a wide range of applications, from simple classification problems to complex ones such as self-driving cars, and they perform very well in these fields. But would you believe that a neural network can’t count? Even animals as simple as bees can do that.

The problem is that a neural network cannot perform numerical extrapolation outside its training data. It is not even able to learn a scalar identity function outside its training range. Recently, DeepMind researchers released a paper proposing a module that tries to solve this problem.

Failure of Neural Networks at Learning a Scalar Identity Function

The problem of neural nets failing to learn identity relations is not new, but in the paper the authors demonstrate it with an example.

They used an autoencoder of 3 layers, each of 8 units, and tried to learn the identity relation: if the input is 4, the output should also be 4. They tried different non-linear functions in this network, such as sigmoid and tanh, but all of them failed to extrapolate the identity relation outside the training data set.

They also observed that highly linear functions such as PReLU are able to reduce the error; so even though neural networks contain functions capable of extrapolation, they fail to learn to use them.

To solve this problem they proposed two models:

  1. NAC (Neural Accumulator)
  2. NALU (Neural Arithmetic Logic Units)

NAC (Neural Accumulator)

The neural accumulator is able to solve the problems of addition and subtraction.

NAC is a special case of a linear layer whose transformation matrix W consists only of the values {-1, 0, 1}. This makes the output from W an addition or subtraction of rows of the input vector, rather than the arbitrary rescaling produced by non-linear functions. For example, if the input layer consists of x1 and x2, the output of the NAC will be a linear combination of the input vectors. This keeps numbers consistent throughout the model, no matter how many operations are applied.

Since W has the hard constraint that every element must be one of {-1, 0, 1}, learning is difficult: hard constraints make it hard to update the weights during back-propagation. To solve this, the authors proposed a continuous and differentiable parameterization of W.

W = tanh(w_hat) * σ(m_hat)

w_hat and m_hat are randomly initialized weights that are convenient to learn with gradient descent. This parameterization guarantees that every element of W lies in the range (-1, 1) and is biased towards -1, 0 and 1. Here " * " means element-wise multiplication.
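The parameterization above can be sketched in NumPy as follows (the class and variable names are my own; the official implementation referenced earlier is in TensorFlow):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class NAC:
    """Neural Accumulator: W = tanh(w_hat) * sigmoid(m_hat), element-wise."""
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_hat = rng.standard_normal((in_dim, out_dim))
        self.m_hat = rng.standard_normal((in_dim, out_dim))

    def weights(self):
        # Every entry lies in (-1, 1) and is pushed towards -1, 0 or 1.
        return np.tanh(self.w_hat) * sigmoid(self.m_hat)

    def forward(self, x):
        # Output is a (soft) signed sum of input columns.
        return x @ self.weights()
```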

NALU (Neural Arithmetic Logic Units)

NAC is able to solve addition and subtraction, but to also handle multiplication and division the authors came up with NALU, which consists of two NAC sub-cells: one capable of addition/subtraction and the other of multiplication/division.

It consists of these five equations:

W = tanh(w_hat) * σ(m_hat)
a = matmul(x, W)
m = exp(matmul(log(|x| + ϵ), W))
g = σ(matmul(x, G))
y = g * a + (1 - g) * m

Where,

  1. w_hat, m_hat and G are randomly initialized weights,
  2. ϵ is used to avoid the problem of log(0),
  3. x and y are the input and output layers respectively,
  4. g is the gate, which will be between 0 and 1.

Here the concept of a gate is added as the variable g: when g = 1 (on), the add/subtract sub-cell is used and the multiply/divide sub-cell is off, and vice versa.

For addition and subtraction, the path a = matmul(x, W) is identical to the original NAC, while the multiply/divide NAC operates in log space and is capable of learning to multiply and divide: m = exp(matmul(log(|x| + ϵ), W)).

So, this NALU is capable of both extrapolation and interpolation.
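Putting the five equations together, here is a self-contained NumPy sketch of the NALU forward pass (untrained and randomly initialized; names are my own, not the paper's reference code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class NALU:
    """NALU forward pass: a learned gate g blends an additive NAC path
    and a multiplicative path computed in log space."""
    def __init__(self, in_dim, out_dim, eps=1e-7, seed=0):
        rng = np.random.default_rng(seed)
        self.w_hat = rng.standard_normal((in_dim, out_dim))
        self.m_hat = rng.standard_normal((in_dim, out_dim))
        self.G = rng.standard_normal((in_dim, out_dim))
        self.eps = eps

    def forward(self, x):
        W = np.tanh(self.w_hat) * sigmoid(self.m_hat)   # NAC weights in (-1, 1)
        a = x @ W                                        # add/subtract sub-cell
        m = np.exp(np.log(np.abs(x) + self.eps) @ W)     # multiply/divide sub-cell
        g = sigmoid(x @ self.G)                          # gate in (0, 1)
        return g * a + (1.0 - g) * m
```

Note how the multiplicative path reuses the same W: summing logs and exponentiating turns the NAC's signed sums into products and quotients.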

Experiments performed with NAC and NALU models

In the paper they also applied these concepts to different tasks to test the abilities of NAC and NALU. They found NALU to be very useful in problems such as:

  1. Learning tasks using different arithmetic functions (x+y, x-y, x-y+x, x*y, etc.)
  2. A counting task using a recurrent network, in which images of different digits are fed to the model and the output should count the number of each type of digit.
  3. A language-to-number translation task, in which an expression like “five hundred fifteen” is fed to the network and the output should return “515”. Here NALU is applied with an LSTM model in the output layer.
  4. NALU is also used with reinforcement learning to track time in a grid-world environment.

Summary

We have seen that NAC and NALU can be applied to overcome the failure of numerical representations to generalize outside the range observed in the training data set. As this post shows, the NAC and NALU concepts are easy to grasp and apply. However, NALU will not be perfect for every task, so we have to see where it gives good results.

Referenced Research Paper : Neural Arithmetic Logic Units