Tag Archives: Deep Learning

Unleashing the Power of Gelu Activation: Exploring Its Uses, Applications, and Implementation in Python

Introduction

In the field of deep learning, activation functions play a crucial role in introducing non-linearity to neural networks, enabling them to model complex relationships. One such activation function that has gained popularity is the Gelu activation function. Gelu stands for Gaussian Error Linear Unit, and it offers a smooth and continuous non-linear transformation. In this blog post, we will dive into the world of Gelu activation, its applications, the formula behind it, and how to implement it in Python.

Understanding Gelu Activation

The Gelu activation function was introduced in 2016 by Dan Hendrycks and Kevin Gimpel as an alternative to other popular activation functions such as ReLU (Rectified Linear Unit). Gelu is known for its ability to capture a wide range of non-linearities while maintaining smoothness and differentiability.

Formula and Characteristics

The Gelu activation function is defined mathematically as follows:

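Gelu(x) = x * Φ(x) = 0.5x * (1 + erf(x / sqrt(2)))

where Φ is the cumulative distribution function (CDF) of the standard normal distribution. In practice it is often computed with the tanh approximation:

Gelu(x) ≈ 0.5x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))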

The key characteristics of the Gelu activation function are as follows:

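  1. Range: Gelu outputs values in the range of approximately [-0.17, +inf). Unlike ReLU, it produces small negative outputs for negative inputs.
  2. Differentiability: Gelu is a smooth function and possesses derivatives at all points.
  3. Monotonicity: Gelu is not monotonic. It is decreasing for x below about -0.75, where it reaches its minimum of about -0.17, and increasing for all larger x. In practice this does not hinder gradient-based optimization.
  4. Gaussian weighting: Gelu scales its input by the cumulative distribution function (CDF) of a standard normal distribution, i.e. Gelu(x) = x * Φ(x).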

Applications of Gelu Activation

Gelu activation has found applications in various domains, including:

  1. Natural Language Processing (NLP): Gelu is the activation used in Transformer models such as BERT and GPT, which are applied to NLP tasks such as sentiment analysis, machine translation, and text generation.
  2. Computer Vision: Gelu activation can be used in convolutional neural networks (CNNs) for image classification, object detection, and semantic segmentation tasks.
  3. Recommendation Systems: Gelu activation can enhance the performance of recommendation models by introducing non-linearities and capturing complex user-item interactions.

Implementing Gelu Activation in Python

Let’s see how we can implement the Gelu activation function in Python:
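
Here is a minimal sketch that uses only Python's standard math module. The function names gelu_exact and gelu_tanh are illustrative choices, not part of any library. The first follows the exact definition x * Φ(x) via the error function, and the second uses the tanh approximation.

```python
import math


def gelu_exact(x: float) -> float:
    """GELU from its definition: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Compare both versions on a few sample inputs.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.4f}  tanh={gelu_tanh(x):+.4f}")
```

The two versions should agree to about three decimal places. In a real project you would more likely call the built-in Gelu layer of your deep learning framework, which applies the same formula element-wise to tensors.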

Comparison of Gelu and ReLU Activation Functions

  1. Formula:
    • ReLU: ReLU(x) = max(0, x)
    • Gelu: Gelu(x) = x * Φ(x) ≈ 0.5x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
  2. Range:
    • ReLU: ReLU outputs values in the range [0, +inf).
    • Gelu: Gelu outputs values in the range of approximately [-0.17, +inf), because it produces small negative outputs for negative inputs.
  3. Smoothness and Continuity:
    • ReLU: ReLU is a piecewise linear function and non-differentiable at x=0.
    • Gelu: Gelu is a smooth and continuous function, ensuring differentiability at all points.
  4. Monotonicity:
    • ReLU: ReLU is monotonically non-decreasing: it is constant at 0 for x ≤ 0 and strictly increasing for x > 0.
    • Gelu: Gelu is not monotonic. It dips to a minimum of about -0.17 near x = -0.75 and is increasing for all larger x. This has not prevented it from working well with gradient-based optimization algorithms.
  5. Non-linearity:
    • ReLU: ReLU introduces non-linearity by mapping negative values to 0 and preserving positive values unchanged.
    • Gelu: Gelu introduces non-linearity through a combination of linear and non-linear transformations, capturing a wider range of non-linearities.
  6. Performance:
    • ReLU: ReLU has been widely used due to its simplicity and computational efficiency. However, it suffers from the “dying ReLU” problem where neurons can become inactive (outputting 0) and may not recover during training.
    • Gelu: Gelu has shown promising performance in various tasks, including NLP and computer vision, and it mitigates the “dying ReLU” problem because its gradient is non-zero for most negative inputs, although it becomes vanishingly small for large negative values.
  7. Applicability:
    • ReLU: ReLU is commonly used in hidden layers of deep neural networks and has been successful in image classification and computer vision tasks.
    • Gelu: Gelu has gained popularity in natural language processing (NLP) tasks, such as sentiment analysis and text generation, where capturing complex non-linear relationships is crucial.
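
The range and monotonicity of the two functions can be checked numerically. The sketch below is only an illustration: it scans a grid of negative inputs (the grid bounds and step are arbitrary choices) and reports where Gelu reaches its smallest value, while ReLU stays at 0 for every negative input.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Grid of inputs from -4.0 up to 0.0 in steps of 0.001.
grid = [-4.0 + 0.001 * i for i in range(4001)]

# Input on the grid where GELU is smallest, and the value there.
x_min = min(grid, key=gelu)
gelu_min = gelu(x_min)

print(f"relu(-0.75) = {relu(-0.75):.4f}")
print(f"GELU minimum is about {gelu_min:.4f} at x of about {x_min:.3f}")
print(f"gelu(-3.0) = {gelu(-3.0):.4f} (larger than the minimum, so GELU is not monotonic)")
```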

Conclusion

Gelu activation offers a powerful tool for introducing non-linearity to neural networks, making them capable of modeling complex relationships. Its smoothness, differentiability, and wide range of applications make it an attractive choice in various domains, including NLP, computer vision, and recommendation systems. By implementing Gelu activation in Python, researchers and practitioners can leverage its potential and explore its benefits in their own deep learning projects. So go ahead, unleash the power of Gelu and take your models to the next level!

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Machine Learning Quiz-5

Q1. The optimizer is an important part of training neural networks. Which of the following is not a purpose of using optimizers?

  1. Speed up algorithm convergence
  2. Reduce the difficulty of manual parameter setting
  3. Avoid overfitting
  4. Avoid local extremes

Answer: 3
Explanation: To avoid overfitting, we use regularization and not optimizers.

Q2. Which of the following is not a regularization technique used in machine learning?

  1. L1 regularization
  2. R-square
  3. L2 regularization
  4. Dropout

Answer: 2
Explanation: Of all the above mentioned, R-square is not a regularization technique. R-squared is a statistical measure of how close the data are to the fitted regression line.

Q3. Which of the following are hyperparameters in the context of deep learning?

  1. Learning Rate, α
  2. Momentum parameter, β1
  3. Number of units in a layer
  4. All of the above

Answer: 4
Explanation: According to Wikipedia, “In machine learning, a hyperparameter is a parameter whose value is used to control the learning process”. So, all of the above are hyperparameters.

Q4. Which of the following statements is not true with respect to batch normalization?

  1. Batch normalization helps in decreasing training time
  2. Batch normalization adds a slight regularization effect
  3. After using batch normalization, there is no need to use dropout
  4. Batch normalization helps in reducing the covariate shift

Answer: 3
Explanation: Although batch normalization has a slight regularization effect, that is not why we use it. It is used to make the neural network more robust (by reducing covariate shift) and easier to train, whereas dropout is used for regularization (reducing overfitting). So, the third option is incorrect.

Q5. In a machine learning project, modelling is an iterative process but deployment is not.

  1. True
  2. False

Answer: 2
Explanation: Deployment is an iterative process, where you should expect to make multiple adjustments (such as metrics monitored using dashboards or percentage of traffic served) to work towards optimizing the system.

Q6. Which of the following activation functions generally works better for hidden layers?

  1. Sigmoid
  2. Tanh

Answer: 2
Explanation: The Tanh activation function usually works better than sigmoid activation function for hidden units because the mean of its output is closer to zero, so it centers the data better for the next layer and the gradients are not restricted to move in a certain direction.

Q7. The softmax function is used to calculate the probability distribution over a discrete variable with n possible values.

  1. True
  2. False

Answer: 1
Explanation: The softmax function is used to calculate the probability distribution over a discrete variable with n possible values. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.
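
As a small illustration of this explanation, the sketch below implements softmax in plain Python and checks that, with two classes and logits [z, 0], it gives the same probability as the sigmoid of z. The function names and sample logits are chosen only for this example.

```python
import math


def softmax(logits):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# Two-class case: softmax over [z, 0] matches sigmoid(z).
print(softmax([1.5, 0.0])[0], sigmoid(1.5))
```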

Q8. Let’s say you want to use transfer learning from task A to task B. Which of the following scenarios would support using transfer learning?

  1. Task A and B have same input x
  2. You have a lot more data for task A than task B
  3. Low level features from task A could be helpful for learning B
  4. All of the above

Answer: 4
Explanation: All of the things mentioned above are prerequisites for performing transfer learning. Refer to this beautiful explanation by Andrew Ng to know more.

Machine Learning Quiz-4

Q1. Which of the following is an example of unstructured data?

  1. Audio
  2. Images
  3. Text
  4. All of the above

Answer: 4
Explanation: All of these are examples of unstructured data. Refer to this link to know more.

Q2. Which of the following is a model-centric AI development?

  1. Hold the data fixed and iteratively improve the code/model
  2. Hold the code/model fixed and iteratively improve the data

Answer: 1
Explanation: As is clear from the name, in model-centric AI development we hold the data fixed and iteratively improve the code/model.

Q3. What is Semi-Supervised Learning?

  1. where for each example we have the correct answer/label and we infer a mapping function from these examples
  2. where for each example we don’t have the correct answer/label and we try to find some sort of structure or pattern in the dataset
  3. where for some examples we have the correct answer/label while for others we don’t have correct answer/label

Answer: 3
Explanation: As is clear from the name, in semi-supervised learning we have the correct answer/label for some examples but not for others. Because we are now able to collect huge amounts of data, and labelling all of it takes enormous effort, the focus is shifting towards semi-supervised learning. A related but distinct approach is self-supervised learning, where the data is unlabelled but the data itself provides the context from which labels are derived. For instance, the CBOW model for creating word embeddings predicts a word from its surrounding words.

Q4. Which of the following is the reason to use non-linear activation functions in neural networks?

  1. If you use only linear activation functions, then no matter how many layers you use, it will be the same as not using any hidden layers
  2. A hidden layer with a linear activation function is of no use, as it does not add any non-linearity to the network, so the network will not be able to learn complex functions
  3. Adding any number of hidden layers with linear activation functions still yields just another linear function
  4. All of the above

Answer: 4
Explanation: All of the above are possible reasons. Refer to this beautiful explanation by Andrew Ng to know more.

Q5. Which of the following activation functions can be used in neural network?

  1. ReLU
  2. Tanh
  3. Sigmoid
  4. All of the above

Answer: 4
Explanation: All of the above activation functions can be used in neural networks. Refer to this beautiful explanation by Andrew Ng to know more.

Q6. RMSprop resolves a limitation of the AdaGrad optimizer.

  1. True
  2. False

Answer: 1
Explanation: RMSprop divides the learning rate by the square root of an exponentially decaying average of squared gradients, whereas AdaGrad divides it by the square root of the sum of all past squared gradients. Because that sum only grows, the effective learning rate in AdaGrad keeps shrinking and eventually becomes infinitesimally small, at which point the algorithm can no longer learn. Refer to this link to know more.
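
The difference can be seen with a toy calculation. The sketch below assumes a constant gradient of 1 and illustrative values for α, β and ε. It tracks both accumulators for 1000 steps and compares the resulting effective learning rates.

```python
import math

alpha, beta, eps = 0.1, 0.9, 1e-8
g = 1.0  # constant gradient, assumed purely for illustration

adagrad_acc = 0.0   # AdaGrad: running sum of squared gradients
rmsprop_avg = 0.0   # RMSprop: exponentially decaying average of squared gradients

for step in range(1000):
    adagrad_acc += g ** 2
    rmsprop_avg = beta * rmsprop_avg + (1 - beta) * g ** 2

adagrad_lr = alpha / math.sqrt(adagrad_acc + eps)
rmsprop_lr = alpha / math.sqrt(rmsprop_avg + eps)

print(f"AdaGrad effective learning rate: {adagrad_lr:.5f}")
print(f"RMSprop effective learning rate: {rmsprop_lr:.5f}")
```

In this toy setting the AdaGrad rate keeps falling as more steps are taken, while the RMSprop rate settles close to α.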

Q7. If you increase the value of lambda (regularization parameter), then the model will always perform better as it helps in reducing the overfitting of the model.

  1. True
  2. False

Answer: 2
Explanation: As we increase the regularization hyperparameter lambda, the weights start becoming smaller. This can be verified from the weight update equation of gradient descent with L2 regularization, which is w = w(1 - α*λ/m) - α*dLoss/dw. So, as you increase λ to a very high value, the weights get closer to 0. This leads to a model that is too simple and ends up underfitting the data, thus decreasing the performance of the model. Refer to this beautiful explanation by Andrew Ng to know more.
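
To see this shrinkage on a toy problem, the sketch below applies the update rule above to a single weight whose unregularized loss is 0.5 * (w - 3)^2, so the best unregularized value is w = 3. The loss, learning rate and number of steps are assumptions made only for this illustration.

```python
def train(lam, alpha=0.01, m=1, steps=2000):
    """Gradient descent with L2 regularization on a one-weight toy problem."""
    w = 0.0
    for _ in range(steps):
        grad = w - 3.0  # dLoss/dw for the data loss 0.5 * (w - 3)^2
        w = w * (1 - alpha * lam / m) - alpha * grad
    return w


for lam in (0, 1, 100):
    print(f"lambda={lam:>3}  learned w={train(lam):.4f}")
```

For this toy loss the weight settles at 3 / (1 + λ), so a very large λ pushes it close to 0 and far from the value that fits the data.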

Q8. What is multi-task learning in deep learning?

  1. Train n different neural networks to learn n tasks
  2. Train a single neural network to learn n tasks simultaneously

Answer: 2
Explanation: In multi-task learning, we train a single neural network to learn n tasks simultaneously. For instance, a self-driving car has to detect pedestrians, cars, traffic lights, etc.

Machine Learning Quiz-3

Q1. In neural networks, where do we apply batch normalization?

  1. Before applying activation function
  2. After applying activation function

Answer: 1
Explanation: We generally apply batch normalization before applying the activation function. Refer to this beautiful explanation by Andrew Ng to know more.

Q2. In Mini-batch gradient descent, if the mini-batch size is set equal to training set size it will become Stochastic gradient descent and if the mini-batch size is set equal to 1 training example it will become batch gradient descent?

  1. True
  2. False

Answer: 2
Explanation: It is actually the opposite. In mini-batch gradient descent, if the mini-batch size is set equal to the training set size it becomes batch gradient descent, and if the mini-batch size is set to 1 training example it becomes stochastic gradient descent.
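
The two extremes can be illustrated with a small batching helper. The sketch below is a simplified illustration (the name minibatches is made up for this example, and real training code would normally shuffle the data first). It splits 8 examples into mini-batches of different sizes.

```python
def minibatches(examples, batch_size):
    """Yield consecutive slices of `examples` of length `batch_size`."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


examples = list(range(8))  # stand-in for 8 training examples

batch_gd = list(minibatches(examples, len(examples)))  # one update per pass
sgd = list(minibatches(examples, 1))                   # one update per example
mini = list(minibatches(examples, 4))                  # something in between

print(len(batch_gd), len(sgd), len(mini))
```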

Q3. If we have enough computation power, it would be wiser to train multiple models in parallel and then choose the best one instead of babysitting a single model.

  1. True
  2. False

Answer: 1
Explanation: In deep learning, there is no general rule for finding the best set of hyperparameters for a task. So, one needs to follow the iterative process of Idea -> Code -> Experiment, and being able to try out different ideas quickly is better suited to this than babysitting a single model.

Q4. Vectorization allows you to compute forward propagation in an L-layer neural network without an explicit for-loop (or any other explicit iterative loop) over the layers l = 1, 2, …, L.

  1. True
  2. False

Answer: 2
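Explanation: Vectorization removes the loop over training examples, but we cannot avoid the for-loop over the layers, because each layer’s computation depends on the output of the previous layer. Refer to this beautiful explanation by Andrew Ng to know more.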
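
A minimal NumPy sketch of this idea is given below. The layer sizes, the ReLU/sigmoid choice and the random initialization are assumptions made for illustration. Each layer handles all 32 examples at once with one matrix product, yet the layers themselves are still visited one by one in a loop.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, params):
    """Forward pass: vectorized over examples, explicit loop over layers."""
    A = X
    num_layers = len(params)
    for l, (W, b) in enumerate(params, start=1):
        Z = W @ A + b  # all examples (columns of A) processed at once
        A = sigmoid(Z) if l == num_layers else np.maximum(0.0, Z)
    return A


rng = np.random.default_rng(0)
layer_sizes = [10, 5, 1]  # input, hidden, output
params = [
    (rng.standard_normal((n_out, n_in)) * 0.01, np.zeros((n_out, 1)))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

X = rng.standard_normal((10, 32))  # 32 examples with 10 features each
Y_hat = forward(X, params)
print(Y_hat.shape)
```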

Q5. Suppose you ran logistic regression twice, once with regularization parameter λ=0, and once with λ=1. One of the times, you got weight parameters w=[26.29 65.41], and the other time you got w=[2.75 1.32]. However, you forgot which value of λ corresponds to which value of w. Which one do you think corresponds to λ=1?

  1. w=[26.29 65.41]
  2. w=[2.75 1.32]

Answer: 2
Explanation: λ=0 means no regularization is used, whereas λ=1 means regularization is used. Since regularization shrinks the weights, you get larger weights without regularization than with it.

Q6. What is the value of Sigmoid activation function (let’s denote by g(z)) at an input value of z=0?

  1. 0
  2. 0.5
  3. -∞
  4. +∞

Answer: 2
Explanation: The sigmoid is given by g(z) = 1 / (1 + exp(-z)), so at an input value of z=0 it outputs 0.5. Refer to this beautiful explanation by Andrew Ng to know more.

Q7. Suppose you have built a neural network having 1 input, 1 hidden and 1 output layer. You decide to initialize the weights and biases to be zero. Which of the following statements is true?

  1. Each neuron in the hidden layer will perform the same computation in the first iteration. But after one iteration of gradient descent they will learn to compute different things because we have “broken symmetry”.
  2. The hidden layer’s neurons will perform different computations from each other even in the first iteration; their parameters will thus keep evolving in their own way.
  3. Each neuron in the hidden layer will perform the same computation. So even after multiple iterations of gradient descent each neuron in the layer will be computing the same thing as other neurons.

Answer: 3
Explanation: When the weights and biases are initialized to 0, each neuron in the hidden layer performs the same computation and receives the same gradient. So even after multiple iterations of gradient descent, each neuron in the layer will be computing the same thing as the other neurons. Refer to this beautiful explanation by Andrew Ng to know more.
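
The symmetry problem can be reproduced on a tiny network. The sketch below makes several assumptions for illustration: a 2-3-1 network with sigmoid units, a four-example OR dataset, and a learning rate of 0.5. It trains from all-zero parameters and then checks whether the hidden units ever become different from one another.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Toy dataset (logical OR): columns are examples.
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0.0, 1.0, 1.0, 1.0]])
m = X.shape[1]

# All weights and biases start at zero.
W1, b1 = np.zeros((3, 2)), np.zeros((3, 1))
W2, b2 = np.zeros((1, 3)), np.zeros((1, 1))
lr = 0.5

for _ in range(200):
    # Forward pass
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    # Backward pass for the cross-entropy loss
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    # Gradient descent update
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(W1)  # every row (hidden unit) holds the same weights
print(W2)
```

The weights do move away from zero, but all three hidden units stay identical, so the layer behaves like a single neuron.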

Q8. Suppose we have a neural network having 10 nodes in the input layer, 5 nodes in the hidden layer and 1 node in the output layer. What will be the dimension of b1 (first layer bias) and b2 (second layer bias)?

  1. b1:5×1, b2:1×1
  2. b1:1×10, b2:1×5
  3. b1:1×5, b2:5×10
  4. b1:5×10, b2:1×5

Answer: 1
Explanation: Generally, the bias for a layer has dimensions (number of nodes in that layer × 1), so the answer is b1:5×1, b2:1×1. Refer to this beautiful explanation by Andrew Ng to know more.

Machine Learning Quiz-2

Q1. Which of the following is a good choice for image related tasks such as Image classification or object detection?

  1. Multilayer Perceptron (MLP)
  2. Convolutional Neural Network (CNN)
  3. Recurrent Neural Network (RNN)
  4. All of the above

Answer: 2
Explanation: A Convolutional Neural Network (CNN) is a good choice for image-related tasks such as image classification or object detection. There are two main reasons for this. The first is parameter sharing, i.e. a feature detector that is useful in one part of an image is probably useful in another part of the same image, and because of this a CNN has fewer parameters. The second is sparsity of connections, i.e. in each layer, each output value depends only on a small number of inputs (equal to the filter size).

Q2. Which of the following statements is correct?

  1. RMSprop divides the learning rate by an exponentially decaying average of squared gradients
  2. RMSprop divides the learning rate by an exponentially increasing average of squared gradients
  3. RMSprop has a constant learning rate
  4. RMSprop decays the learning rate by a constant value

Answer: 1
Explanation: The weight update equation in RMSprop is given by w = w - α*dw/(Sdw + ε)^0.5, where Sdw is an exponentially weighted (decaying) average of the squared gradients and ε is a small constant for numerical stability. Thus, RMSprop divides the learning rate by the square root of an exponentially decaying average of squared gradients. Refer to this beautiful explanation by Andrew Ng to know more.
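
The update rule can be written in a few lines. The sketch below is an illustration under assumed settings (a single weight, the toy loss w^2, and the values chosen for α, β and ε) and is not meant as a reference implementation.

```python
import math


def rmsprop_step(w, grad, s, alpha=0.05, beta=0.9, eps=1e-8):
    """One RMSprop update for a single weight; returns the new (w, s)."""
    s = beta * s + (1 - beta) * grad ** 2      # decaying average of squared gradients
    w = w - alpha * grad / math.sqrt(s + eps)  # scaled gradient step
    return w, s


# Minimize the toy loss f(w) = w^2, whose gradient is 2w.
w, s = 5.0, 0.0
for _ in range(500):
    grad = 2.0 * w
    w, s = rmsprop_step(w, grad, s)

print(f"w after 500 steps: {w:.4f}")
```

Because the step is normalized by the recent gradient magnitude, the weight moves towards the minimum at a pace set mostly by α and then hovers near 0.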

Q3. _____ is a type of gradient descent which processes 1 training example per iteration?

  1. Stochastic Gradient Descent
  2. Batch Gradient Descent
  3. Mini-batch Gradient Descent
  4. None of the above.

Answer: 1
Explanation: Stochastic Gradient Descent processes 1 training example per iteration of gradient descent.

Q4. Let’s say you have trained a cat classifier on 10 million cat images and it is performing well in the live environment. Now, in the live environment, you have encountered a new cat species, and because of that your deployed model has started degrading. You have only 1000 images of the newly identified cat species. Which of the following steps should you take first?

  1. Put all 1000 images in the training set and start training asap
  2. Try data augmentation on these 1000 images to get more data
  3. Split the 1000 images into train/test set and start the training
  4. Use the data you have to define a new evaluation metric (using a new dev/test set) taking into account the new species, and use that to drive further progress with the model

Answer: 4
Explanation: Because we have very little data for the new cat species (1000 images) compared to the 10 million original images, putting these 1000 images into the training set or splitting them will not make much difference. Also, augmentation cannot grow the dataset to a comparable size (10 million). So the only option left is to build a new evaluation metric and penalize the model more for making false predictions on the new species.

Q5. Which of the following is an example of supervised learning?

  1. Given the data of house prices and house sizes, predict house price as a function of house size
  2. Given 50 spam and 50 non-spam emails, predict whether the new email is spam/non-spam
  3. Given the data consisting of 1000 images of cats and dogs each, we need to classify to which class the new image belongs
  4. All of the above

Answer: 4
Explanation: For each of the above options we have the correct answer/label, so all of these are examples of supervised learning.

Q6. Which of the following is True for Structured Data?

  1. Structured Data has clear, definable relationships between the data points, with a pre-defined model containing it
  2. Structured data is quantitative, highly organized, and each feature has a well-defined meaning
  3. Structured data is generally contained in relational databases (RDBMS)
  4. All of the above

Answer: 4
Explanation: All of the above are true for structured data. Refer to this link to know more.

Q7. You have built a network using the sigmoid activation for all the hidden units. You initialize the weights to relatively large values, using np.random.randn(..,..)*10000. What will happen?

  1. This will cause the inputs to the sigmoid to be very large, causing the units to be “highly activated” and thus speed up learning compared to if the weights had to start from small values
  2. It doesn’t matter. As long as you initialize the weights randomly, gradient descent is not affected by whether the weights are large or small
  3. This will cause the inputs to the sigmoid to be very large, thus causing gradients to also become large. You therefore have to set α to be very small to prevent divergence; this will slow down learning
  4. This will cause the inputs to the sigmoid to be very large, thus causing gradients to be close to zero and slowing down learning

Answer: 4
Explanation: When we initialize the weights to very large values, the input to the sigmoid function (calculated as z=w*x+b) also becomes very large in magnitude. For such inputs the sigmoid curve is almost flat, so the gradients are close to 0, which slows down gradient descent and hence learning.
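
The flatness of the sigmoid for large inputs is easy to check. The sketch below uses plain Python (the sample inputs are arbitrary) and evaluates the derivative g'(z) = g(z) * (1 - g(z)) at a few values of z.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid for positive and negative inputs."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_grad(z):
    g = sigmoid(z)
    return g * (1.0 - g)


for z in (0.0, 2.0, 10.0, 10000.0):
    print(f"z={z:>8}  sigmoid={sigmoid(z):.6f}  gradient={sigmoid_grad(z):.2e}")
```

The gradient is largest at z = 0 and is effectively zero once the input reaches the magnitudes produced by weights scaled by 10000.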

Q8. Let’s say you are working on a cat classifier and have been asked to work on three different metrics: 1. accuracy, 2. inference time and 3. memory size. What will you say about the following statement: “Having three evaluation metrics will make it easier for you to quickly choose between two different algorithms, and your team can work faster.”

  1. True
  2. False

Answer: 2
Explanation: It is always good to have a single real-number evaluation metric. If you have more than one evaluation metric, it becomes very difficult to assess performance. For instance, if in one case precision and recall are 60% and 40%, while in another they are 30% and 70%, it is tedious to judge which one is better. That’s why we have the F1 score, which combines precision and recall into one metric.
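
For the two cases in this explanation, the F1 score (the harmonic mean of precision and recall) can be computed as in the sketch below. The names model_a and model_b are only labels for the two cases.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


model_a = f1_score(0.60, 0.40)  # precision 60%, recall 40%
model_b = f1_score(0.30, 0.70)  # precision 30%, recall 70%

print(f"F1 for case A: {model_a:.2f}")
print(f"F1 for case B: {model_b:.2f}")
```

With a single number per case, the comparison becomes straightforward: the first case scores 0.48 and the second 0.42.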

Machine Learning Quiz-1

Q1. Let’s say you have a dataset of 10 million examples, and it would take 2 weeks to train your model. Which of the following statements do you most agree with?

  1. If you have already trained a model on a different dataset and it performs well, with 98% dev accuracy on that dataset, you just use that model instead of training on the current dataset for two weeks
  2. If 10 million examples are enough to build a good model, you might be better off training with just 1 million examples to gain a 10 times improvement in how quickly you can run experiments, even if each model performs a bit worse because it’s trained on less data
  3. You will go with the complete dataset and run the model for two weeks to see the first results
  4. All of the above

Answer: 2
Explanation: In machine learning, the best approach is to build an initial model quickly using a random subset of the data and then use bias/variance analysis and error analysis to prioritize the next steps.

Q2. In a Multi-layer Perceptron (MLP), each node is connected to all the previous layer nodes?

  1. True
  2. False

Answer: 1
Explanation: Since a Multi-Layer Perceptron (MLP) is a Fully Connected Network, each node in one layer connects with a certain weight to every node in the following layer.

Q3. Identify the following activation function: g(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z))

  1. Tanh activation function
  2. Sigmoid activation function
  3. ReLU activation function
  4. Leaky ReLU activation function

Answer: 1
Explanation: This refers to Tanh activation function. Similar to sigmoid, the tanh function is continuous and differentiable at all points, the only difference is that it is symmetric around the origin. Refer to this beautiful explanation by Andrew Ng to know more.

Q4. Suppose we have a neural network having 10 nodes in the input layer, 5 nodes in the hidden layer and 1 node in the output layer. What will be the dimension of W1 (first layer weights) and W2 (second layer weights)?

  1. W1:5×1, W2:1×1
  2. W1:1×10, W2:1×5
  3. W1:1×5, W2:5×10
  4. W1:5×10, W2:1×5

Answer: 4
Explanation: Generally, the weight matrix for a layer has dimensions (number of nodes in that layer × number of nodes in the previous layer), so the answer is W1:5×10, W2:1×5. Refer to this beautiful explanation by Andrew Ng to know more.

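Q5. In Dropout, what will happen if we increase the keep probability (keep_prob) from (say) 0.5 to 0.8?

  1. Reducing the regularization effect.
  2. Causing the neural network to end up with a lower training set error.
  3. Both of the above.
  4. None of the above.

Answer: 3
Explanation: A higher keep probability means fewer units are dropped, which weakens the regularization effect and typically lets the network fit the training set more closely.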

Q6. Finding good hyperparameter values is very time-consuming. So typically you should do it once at the start of the project, and try to find very good hyperparameters so that you don’t ever have to revisit tuning them again.

  1. True
  2. False

Answer: 2
Explanation: You can’t really know beforehand which set of hyperparameters will work best for your case. You need to follow the iterative process of Idea -> Code -> Experiment.

Q7. In a deep neural network, what is the general rule for the dimensions of weights and biases of layer l? Here n[l] is the number of units in layer l.

  1. w[l] : (n[l], n[l])
    b[l] : (n[l], 1)
  2. w[l] : (n[l+1], n[l])
    b[l] : (n[l-1], 1)
  3. w[l] : (n[l], n[l-1])
    b[l] : (n[l], 1)
  4. w[l] : (n[l], n[l-1])
    b[l] : (n[l-1], 1)

Answer: 3
Explanation: The dimensions of weights of layer l is given by (n[l], n[l-1]) and biases is given by (n[l], 1). Refer to this beautiful explanation by Andrew Ng to know more.
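
This rule can be checked in code. The sketch below initializes parameters for the 10-5-1 network used in the earlier questions and prints the resulting shapes. The helper name init_params and the small random initialization are assumptions for illustration.

```python
import numpy as np


def init_params(layer_sizes, seed=0):
    """Create W[l] with shape (n[l], n[l-1]) and b[l] with shape (n[l], 1)."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        n_curr, n_prev = layer_sizes[l], layer_sizes[l - 1]
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * 0.01
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params


params = init_params([10, 5, 1])  # 10 inputs, 5 hidden units, 1 output
for name, value in params.items():
    print(name, value.shape)
```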

Q8. Which of the following methods can be used for hyperparameter tuning?

  1. Random Search
  2. Grid Search
  3. Bayesian optimization
  4. All of the above.

Answer: 4
Explanation: All of the above methods can be used for hyperparameter tuning.

An Introduction to Generative Adversarial Networks (GANs)

Generative Adversarial Networks (GANs), as the name suggests, are deep learning models used to generate new data resembling a given dataset through an adversarial process. GANs were first introduced by Ian Goodfellow and colleagues at NIPS 2014. Yann LeCun described the idea as the most interesting in machine learning in the last 10 years. Generative models hold great promise because, in principle, they can learn to mimic any data distribution. They can be used to generate images, audio waveforms containing speech, music, etc.

Generative Adversarial Network Algorithm:

To create a GAN, we train two networks simultaneously in an adversarial manner. The two networks are the generator and the discriminator. They are adversaries: while the generator tries to generate data similar to the original data distribution, the discriminator tries to distinguish between data generated by the generator and the original data. The generator tries to fool the discriminator by improving itself, and the discriminator tries to tell real from fake. Training continues until the discriminator is fooled about half the time, meaning the generator is able to generate data similar to the original data distribution.

Let’s consider an example of generating new images using a GAN. The first network, the discriminator, is D(X), where X is an image (either real or fake). The second network, the generator, is G(Z), where Z is random noise. To train these networks, D is first fed real images and trained to produce values close to 1 (real), and then fed fake images (produced by the generator) and trained to produce values close to 0 (fake). Similarly, the generator is trained using the discriminator’s loss on each image the generator produces.

We train D to maximize the probability of assigning the correct label to both training examples and samples from G. We simultaneously train G to minimize log(1 − D(G(z))). Let’s take a look at the algorithm provided in the GAN paper.

We train this network for some number of iterations so that the generator produces images close to those in the training dataset.
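
The two objectives can be written as plain functions of the discriminator’s outputs. The sketch below is a simplified illustration, not the algorithm from the paper: it takes lists of probabilities D(x) for real images and D(G(z)) for generated images (the function names and sample values are made up) and computes the loss each network would try to minimize.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Negative of the quantity D maximizes: log D(x) + log(1 - D(G(z)))."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss(d_fake):
    """Quantity G minimizes: the mean of log(1 - D(G(z)))."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)


# A confident discriminator: real images score high, fakes score low.
print(discriminator_loss([0.9, 0.8], [0.1, 0.2]), generator_loss([0.1, 0.2]))

# A fully fooled discriminator: every image scores 0.5.
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]), generator_loss([0.5, 0.5]))
```

When the discriminator outputs 0.5 for everything, its loss equals 2 * log(2), which corresponds to the point where it is fooled half the time.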

Generative Adversarial Networks (GANs) Vs Variational Autoencoders (VAEs)

There are other generative models, such as variational autoencoders (VAEs), that can do a similar job to GANs. A VAE maps the input to a low-dimensional space and then creates a probability distribution from which a decoder function generates new outputs (to know more about VAEs you can follow this blog).

VAE Model

Vanilla GANs, by contrast, do not map the input to a latent space; instead, they use random noise to generate new data. GANs are usually difficult to train but generate finer, more detailed images, while VAEs are easier to train but produce blurrier images.

This was a brief introduction to generative adversarial networks. In the following posts, we will implement different GAN architectures, train GANs, and learn more about GAN improvements and variants (CycleGAN, InfoGAN, BigGAN, etc.).

Hope you enjoy reading.


Supervised And Unsupervised Learning

With the advancement of artificial intelligence, we are able to solve problems in many different fields, some of which you may encounter in your daily life. The two major categories of learning in this field are supervised and unsupervised learning.

Suppose you get a bunch of e-mails, each labelled with the category it falls into, either “spam” or “not spam”, and you then train a model to categorize a new e-mail. This type of learning is called supervised learning.

Now suppose you are invited to a party and meet total strangers. You group them without any prior knowledge, on the basis of gender, age group, dress, educational qualification, or whatever way you like. This is unsupervised learning, since you are exploring the data and finding groups through exploration.

In supervised learning, we teach the computer how to do something, while in unsupervised learning we let the computer figure it out by itself. Does it make sense? Let’s look into this using some examples.

Supervised Learning

Let’s say we need to predict whether an image is a “cat” or not.

To make computers learn this type of problem, we need to provide them with a dataset containing both the input images and their corresponding labels, i.e. whether each is a cat or not. So, if the dataset contains output labels, the problem can be classified as a supervised learning problem.

Supervised learning follows this pattern: input -> hypothesis -> output

Here the inputs are our training data (for example, images of cats), the hypothesis is the model learned by a machine learning algorithm such as an SVM or a decision tree, and the output is the corresponding label, for example “cat” or “not cat”.

Supervised learning can be further divided into classification and regression.

Classification: In classification problems we predict a discrete output, for example, predicting whether an email is “spam” or “not spam”.

Regression: In regression problems we predict a continuous output, for example, predicting house prices.
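
As a small illustration of regression, the sketch below fits a straight line to a handful of made-up house sizes and prices using ordinary least squares, and then predicts the price of an unseen house. The numbers are hypothetical and chosen so that the relationship is exactly linear.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical labelled data: house size -> house price.
sizes = [50, 80, 100, 120]
prices = [110, 170, 210, 250]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 90 + intercept  # price estimate for an unseen size of 90

print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted:.2f}")
```

The labels (prices) are what make this supervised: the line is chosen to match the known answers as closely as possible. A classifier would be trained the same way, but it would output a discrete label instead of a number.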

Unsupervised Learning

Let’s say we have a bunch of T-shirts.

We do not have labels telling us which class each T-shirt belongs to. In unsupervised learning, the model discovers information from this data on its own. Let’s say the model discovers T-shirt size as a feature and clusters the T-shirts by size into three categories: small, medium and large.

So in unsupervised learning problems, output labels are not provided, and the computer is left to find some hidden structure and group the data accordingly.
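The T-shirt example can be sketched with a tiny one-dimensional K-means in plain Python. The chest measurements and the starting centers below are made up for illustration, and K-means is only one of several clustering algorithms that could be used here.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm in one dimension, starting from the given centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest center
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its old center)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Made-up chest measurements in cm; note that no labels are given
chest_cm = [86, 88, 90, 100, 102, 104, 116, 118, 120]
centers, groups = kmeans_1d(chest_cm, centers=[86, 102, 120])
print(centers)  # [88.0, 102.0, 118.0]
print(groups)   # [[86, 88, 90], [100, 102, 104], [116, 118, 120]]
```

The three groups found here play the role of small, medium and large, even though the algorithm was never told those names.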

Unsupervised learning is further divided into Clustering and Association problems.

Clustering: In clustering, the algorithm forms groups within the data. For example, grouping news articles by headline, as Google News does.

Association: In association, the algorithm discovers interesting relationships in the data. For example, recommending a similar product to a user on an e-commerce website.

Summary

  • Supervised learning works on labeled training data while unsupervised works on unlabeled training data.
  • Unsupervised learning explores the data and finds interesting features.
  • Supervised learning, as the name suggests, has a supervisor: the labels tell the model what the correct output is.
  • Unsupervised learning uses algorithms like K-means, hierarchical clustering while supervised learning uses algorithms like SVM, linear regression, logistic regression, etc.
  • Supervised learning can be applied in the field of risk assessment, image classification, fraud detection, object detection, etc.
  • Unsupervised learning can be applied in the field of delivery store optimization, semantic clustering, market basket analysis, etc.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Calculating Screen Time of an Actor using Deep Learning

The screen time of an actor in a movie or an episode is very important. Many actors get paid according to their total screen time. Moreover, we may also want to know how long our favorite character was on screen. So, have you ever wondered how you can calculate the total screen time of an actor? One plausible answer is deep learning.

With the advancement of deep learning, it is now possible to solve various difficult problems. In this blog, we will learn how to use the transfer learning and image classification concepts of deep learning to calculate the screen time of an actor.

To solve any problem with deep learning, the first requirement is the data. For this tutorial, we will use a video clip from the famous TV show “Friends”. We are going to calculate the screen time of my favorite character “Ross”.

Creating Dataset

First, we need to get a video. To do this, I downloaded a video from YouTube using the pytube library. For more on pytube, you can follow this blog to get started.

Now we have our data in the form of a video, which is nothing but a sequence of frames (images). Since we are going to solve this problem using image classification, we need to extract the images from this video. For this task, I used OpenCV.
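A minimal sketch of the frame extraction with OpenCV could look like this. The function names, the file naming scheme and the `every_nth` parameter are my own assumptions for illustration, not necessarily what the original code used.

```python
import os


def frame_filename(index, prefix="frame", ext="jpg"):
    """Zero-padded name so that the frames sort in playback order."""
    return f"{prefix}_{index:06d}.{ext}"


def extract_frames(video_path, out_dir, every_nth=1):
    """Save every `every_nth` frame of the video as an image; return how many were saved."""
    import cv2  # imported here so the helper above works without OpenCV installed

    os.makedirs(out_dir, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    saved = 0
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of the video, or the file could not be read
            break
        if index % every_nth == 0:
            cv2.imwrite(os.path.join(out_dir, frame_filename(index)), frame)
            saved += 1
        index += 1
    capture.release()
    return saved
```

Calling `extract_frames("friends.mp4", "frames")` (both names are placeholders) would write one image per frame into the `frames` folder.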

The video is now converted into individual frames. This is a binary classification problem with two classes, “Ross” and “No Ross”. To create a dataset, we need to separate the images into these two classes manually. For this, I created a folder named “data” with two sub-folders, “ross” and “no_ross”, and then manually added the images to them. After creating the dataset, we are ready to dive into the code and concepts.

Input Data and Preprocessing

We have data in the form of images. To prepare this data as input for our neural network, we need some preprocessing with the following steps:

  • Read all images one by one using OpenCV
  • Resize each image to (224, 224, 3) for the input to the model
  • Divide the pixel values by 255 so that all input features lie in the same range, 0 to 1
  • Append each image to the list for its class
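The steps above can be sketched as follows. This assumes OpenCV and NumPy are installed and that the images live in the “data/ross” and “data/no_ross” folders described earlier; the function name is hypothetical.

```python
import os

IMAGE_SIZE = (224, 224)  # (width, height) expected by the model input


def load_class_images(folder):
    """Read, resize and scale every readable image in one class folder."""
    import cv2
    import numpy as np

    images = []
    for name in sorted(os.listdir(folder)):
        image = cv2.imread(os.path.join(folder, name))  # 3-channel uint8 array
        if image is None:  # skip files OpenCV cannot decode
            continue
        image = cv2.resize(image, IMAGE_SIZE)            # -> shape (224, 224, 3)
        images.append(image.astype("float32") / 255.0)   # scale pixels to 0..1
    return np.array(images)
```

With this helper, `load_class_images("data/ross")` and `load_class_images("data/no_ross")` would give one array per class.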

Transfer Learning

Since we have only 6814 images, it will be difficult to train a neural network from scratch on such a small dataset. This is where transfer learning comes in.

With the help of transfer learning, we can reuse the features learned by a model trained on a large dataset in our own model. Here we will use the VGG16 model trained on the “imagenet” dataset. For this, we are using Keras, the high-level API of TensorFlow. With Keras, you can import the VGG16 model directly.
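A minimal sketch of loading VGG16 as a feature extractor with `tensorflow.keras` is given below. The helper names are my own; the small shape function is only there to show what comes out when the top layers are left off.

```python
INPUT_SHAPE = (224, 224, 3)


def vgg16_feature_shape(input_shape=INPUT_SHAPE):
    """Output shape of VGG16 without its top: five 2x2 poolings, 512 channels."""
    height, width, _ = input_shape
    return (height // 2 ** 5, width // 2 ** 5, 512)


def build_feature_extractor():
    """Load VGG16 with ImageNet weights and without the fully connected layers."""
    from tensorflow.keras.applications import VGG16  # weights download on first use

    return VGG16(weights="imagenet", include_top=False, input_shape=INPUT_SHAPE)


def extract_features(vgg_model, images, batch_size=32):
    """Run preprocessed images through VGG16 and return the feature maps."""
    return vgg_model.predict(images, batch_size=batch_size)


print(vgg16_feature_shape())  # (7, 7, 512)
```

So each 224 x 224 x 3 image is turned into a 7 x 7 x 512 block of features instead of 1000 ImageNet class scores.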

The VGG16 model trained on the imagenet dataset predicts a large number of classes, but in this problem we only have two classes, “Ross” and “No Ross”. That’s why we use include_top = False when loading the model, which means we are not including the fully connected layers from the VGG16 model. Now we will pass our input data to vgg_model and generate the features.

Network Architectures

Since we are not including the fully connected layers from the VGG16 model, we need to create a model with some fully connected layers and an output layer with a single unit that predicts “Ross” or “No Ross”. The output features from the VGG16 model have shape 7*7*512, which will be the input shape for our model. Here I am also using a dropout layer to reduce over-fitting.
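A sketch of such a network in Keras could look like the following. The number of hidden units (256) and the dropout rate (0.5) are assumptions for illustration, since the post does not state the values that were used.

```python
FEATURE_SHAPE = (7, 7, 512)  # output shape of VGG16 without its top layers


def flattened_size(shape):
    """Number of values per sample once the feature map is flattened."""
    size = 1
    for dim in shape:
        size *= dim
    return size


def build_classifier(feature_shape=FEATURE_SHAPE, hidden_units=256, dropout_rate=0.5):
    """Small fully connected head on top of the VGG16 features."""
    from tensorflow.keras import layers, models

    return models.Sequential([
        layers.Input(shape=feature_shape),
        layers.Flatten(),                              # 7*7*512 -> 25088 values
        layers.Dense(hidden_units, activation="relu"),
        layers.Dropout(dropout_rate),                  # randomly drops units during training
        layers.Dense(1, activation="sigmoid"),         # probability of "Ross"
    ])


print(flattened_size(FEATURE_SHAPE))  # 25088
```

The single sigmoid unit outputs a value between 0 and 1 that can be read as the probability of “Ross” being in the frame.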

Splitting Data into Train and Validation

Now we have the input features from the VGG16 model and our own network architecture defined above. The next thing is to train this neural network, but we do not have any validation data yet. We have 6814 images, so we will split them into 5000 training images and 1814 validation images.

Based on the two classes and the training/validation split, we then create our output labels y.
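The split and the labels can be sketched in plain Python as below. Shuffling before splitting and the fixed seed are my assumptions; the post only gives the 5000/1814 sizes.

```python
import random


def make_labels(n_ross, n_no_ross):
    """1 for "Ross" frames and 0 for "No Ross" frames, in that order."""
    return [1] * n_ross + [0] * n_no_ross


def train_val_indices(n_samples, n_train, seed=0):
    """Shuffle the sample indices and split them into training and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = train_val_indices(6814, 5000)
print(len(train_idx), len(val_idx))  # 5000 1814
print(make_labels(2, 3))             # [1, 1, 0, 0, 0]
```

If the features and labels are stored as NumPy arrays, `features[train_idx]` and `labels[train_idx]` then select the training part, and the same with `val_idx` for validation.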

Training the Network

All set, we are ready to train our model. Here, we will use stochastic gradient descent as the optimizer and binary cross-entropy as our loss function. We are also going to save a checkpoint of the best model according to its accuracy on the validation dataset.

I am using a batch size of 64 and training for 10 epochs.

The training and validation accuracy look quite pleasing. Now let’s calculate the screen time of “Ross”.
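A sketch of this training setup with `tensorflow.keras` follows. The checkpoint file name and the default SGD learning rate are assumptions, and the metric name `val_accuracy` assumes a recent TensorFlow 2 release.

```python
import math

BATCH_SIZE = 64
EPOCHS = 10


def steps_per_epoch(n_samples, batch_size=BATCH_SIZE):
    """Number of weight updates in one pass over the training data."""
    return math.ceil(n_samples / batch_size)


def train_classifier(model, x_train, y_train, x_val, y_val,
                     checkpoint_path="best_model.keras"):
    """Compile and fit the model, keeping only the best checkpoint on disk."""
    from tensorflow.keras.callbacks import ModelCheckpoint
    from tensorflow.keras.optimizers import SGD

    model.compile(optimizer=SGD(), loss="binary_crossentropy", metrics=["accuracy"])
    checkpoint = ModelCheckpoint(
        checkpoint_path,
        monitor="val_accuracy",   # judge the model on validation accuracy
        save_best_only=True,      # overwrite only when the metric improves
        mode="max",
    )
    return model.fit(
        x_train, y_train,
        validation_data=(x_val, y_val),
        batch_size=BATCH_SIZE,
        epochs=EPOCHS,
        callbacks=[checkpoint],
    )


print(steps_per_epoch(5000))  # 79
```

With 5000 training images and a batch size of 64, each epoch consists of 79 updates, the last one on a smaller batch.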

Calculating Screen Time

To test our trained model and calculate the screen time, I downloaded another “Friends” video clip from YouTube and extracted its frames. First, I used the trained model to predict the class of each frame, either “Ross” or “No Ross”. Since the video runs at 24 frames per second, we count the number of frames predicted to contain “Ross” and divide it by 24 to get the number of seconds “Ross” was on screen.

This test video clip runs at 24 frames per second, and the number of frames predicted to contain “Ross” is 4715. So the screen time for Ross is 4715/24 ≈ 196 seconds.
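The calculation itself is simple enough to sketch in a few lines. The 0.5 threshold on the predicted probability is an assumption, and the 100 “No Ross” frames in the example are made up just to show that they are not counted.

```python
def screen_time_seconds(predictions, fps=24, threshold=0.5):
    """Count the frames predicted as "Ross" and convert the count to seconds."""
    ross_frames = sum(1 for p in predictions if p > threshold)
    return ross_frames / fps


# 4715 frames predicted as "Ross" at 24 frames per second, as in the clip above
predictions = [1.0] * 4715 + [0.0] * 100
print(int(screen_time_seconds(predictions)))  # 196
```

That is a little over three minutes of screen time for this clip.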

Summary

We see good accuracy on the training and validation datasets, but when I tested the model on the test dataset, the accuracy was only about 65%. One reason I could figure out is the small amount of training data; with more data, the accuracy could be higher. Another reason could be covariate shift, which means the test dataset is quite different from the training dataset, here due to different video quality.

This type of technique can be very helpful in calculating screen time of a particular character.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Snake Game Using Tensorflow Object Detection API – Part IV

In the last blog, we trained the model and saved the inference graph. In this blog, we will learn how to use this inference graph for object detection and how to run our snake game using the trained object detection model.

To play the snake game using this trained model, you first need a snake game. But don’t worry, you do not need to develop it from scratch; you can clone this repository. And if you want to know the algorithm behind this code, you can follow this blog.

Now that we have our snake game, the next thing is to use the object detection model to play it. To do this, we need to run both the snake game file and the object detection script from the models/research folder simultaneously.

In that script, we need to specify the path to our inference graph using the “PATH_TO_CKPT” variable. We also need to set the “PATH_TO_LABELS” variable to the path of the object-detection.pbtxt file. Then we specify the number of classes, i.e. 4 in our case.

The script uses “pyautogui” to press a key when the hand gesture for a particular direction is detected.
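The key-press part can be sketched like this. The four class names and the 0.8 score threshold are hypothetical; the real class names come from the object-detection.pbtxt file, and the detection loop itself is not shown here.

```python
# Hypothetical mapping from detected gesture class to arrow key
GESTURE_TO_KEY = {"up": "up", "down": "down", "left": "left", "right": "right"}


def key_for_detection(class_name, score, min_score=0.8):
    """Return the arrow key for a confident detection, or None to do nothing."""
    if score < min_score:
        return None
    return GESTURE_TO_KEY.get(class_name)


def press_for_detection(class_name, score):
    """Press the arrow key that matches the detected gesture, if any."""
    import pyautogui  # imported here so the mapping can be used without a display

    key = key_for_detection(class_name, score)
    if key is not None:
        pyautogui.press(key)
    return key


print(key_for_detection("left", 0.95))  # left
print(key_for_detection("left", 0.30))  # None
```

Ignoring low-score detections keeps the snake from turning on every uncertain frame.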

Finally, you can play the snake game using your hand gestures. Let’s see some of the results.

Works pretty well, yeah. This is all for playing the snake game using the TensorFlow object detection API. Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.