Tag Archives: Machine Learning

Machine Learning Quiz-5

Q1. The optimizer is an important part of training neural networks. Which of the following is not a purpose of using optimizers?

  1. Speed up algorithm convergence
  2. Reduce the difficulty of manual parameter setting
  3. Avoid overfitting
  4. Avoid local extremes

Answer: 3
Explanation: To avoid overfitting, we use regularization and not optimizers.

Q2. Which of the following is not a regularization technique used in machine learning?

  1. L1 regularization
  2. R-square
  3. L2 regularization
  4. Dropout

Answer: 2
Explanation: Of all the above mentioned, R-square is not a regularization technique. R-squared is a statistical measure of how close the data are to the fitted regression line.

Q3. Which of the following are hyperparameters in the context of deep learning?

  1. Learning Rate, α
  2. Momentum parameter, β1
  3. Number of units in a layer
  4. All of the above

Answer: 4
Explanation: According to Wikipedia, “In machine learning, a hyperparameter is a parameter whose value is used to control the learning process”. So, all of the above are hyperparameters.

Q4. Which of the following statement is not true with respect to batch normalization?

  1. Batch normalization helps in decreasing training time
  2. Batch normalization adds a slight regularization effect
  3. After using batch normalization there is no need to use dropout
  4. Batch normalization helps in reducing the covariate shift

Answer: 3
Explanation: Although batch normalization has a slight regularization effect, that is not why we use it. It is used to make the neural network more robust (by reducing covariate shift) and easier to train, while dropout is used for regularization (reducing overfitting). So, the third option is not true.

Q5. In a machine learning project, modelling is an iterative process but deployment is not.

  1. True
  2. False

Answer: 2
Explanation: Deployment is an iterative process, where you should expect to make multiple adjustments (such as metrics monitored using dashboards or percentage of traffic served) to work towards optimizing the system.

Q6. Which of the following activation function works better for hidden layers?

  1. Sigmoid
  2. Tanh

Answer: 2
Explanation: The Tanh activation function usually works better than sigmoid activation function for hidden units because the mean of its output is closer to zero, so it centers the data better for the next layer and the gradients are not restricted to move in a certain direction.

Q7. The softmax function is used to calculate the probability distribution over a discrete variable with n possible values?

  1. True
  2. False

Answer: 1
Explanation: The softmax function is used to calculate the probability distribution over a discrete variable with n possible values. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.

Q8. Let's say you want to use transfer learning from task A to task B. Which of the following scenarios would support using transfer learning?

  1. Tasks A and B have the same input x
  2. You have a lot more data for task A than for task B
  3. Low-level features from task A could be helpful for learning task B
  4. All of the above

Answer: 4
Explanation: All of the things mentioned above are prerequisites for performing transfer learning. Refer to this beautiful explanation by Andrew Ng to know more.

Machine Learning Quiz-4

Q1. Which of the following is an example of unstructured data?

  1. Audio
  2. Images
  3. Text
  4. All of the above

Answer: 4
Explanation: All of these are examples of unstructured data. Refer to this link to know more.

Q2. Which of the following is a model-centric AI development?

  1. Hold the data fixed and iteratively improve the code/model
  2. Hold the code/model fixed and iteratively improve the data

Answer: 1
Explanation: As the name suggests, in model-centric AI development we hold the data fixed and iteratively improve the code/model.

Q3. What is Semi-Supervised Learning?

  1. where for each example we have the correct answer/label and we infer a mapping function from these examples
  2. where for each example we don’t have the correct answer/label and we try to find some sort of structure or pattern in the dataset
  3. where for some examples we have the correct answer/label while for others we don’t have correct answer/label

Answer: 3
Explanation: As the name suggests, in semi-supervised learning we have the correct answer/label for some examples, while for others we don't. Because nowadays we are able to collect huge amounts of data, and labelling all of it takes enormous effort, the focus is shifting towards semi-supervised learning. A closely related idea is self-supervised learning, where the data is unlabelled but itself provides the context needed to construct the labels, for instance the CBOW model for creating word embeddings.

Q4. Which of the following is the reason to use non-linear activation function on neural networks?

  1. If you use only linear activation functions, then no matter how many layers you use, it will be the same as not using any hidden layers
  2. A hidden layer with linear activation functions is of no use as it does not add any non-linearity to the network, so the network will not be able to learn complex functions
  3. Adding n hidden layers with linear activation functions ends up being equivalent to a single linear function
  4. All of the above

Answer: 4
Explanation: All of the above are possible reasons. Refer to this beautiful explanation by Andrew Ng to know more.

Q5. Which of the following activation functions can be used in neural network?

  1. ReLU
  2. Tanh
  3. Sigmoid
  4. All of the above

Answer: 4
Explanation: All of the above activation functions can be used in neural networks. Refer to this beautiful explanation by Andrew Ng to know more.

Q6. RMSprop resolves the limitation of AdaGrad optimizer?

  1. True
  2. False

Answer: 1
Explanation: RMSprop divides the learning rate by exponentially decaying average of squared gradients whereas AdaGrad divides the learning rate by sum of squared gradients. This in turn causes the learning rate to shrink in AdaGrad and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge. Refer to this link to know more.

Q7. If you increase the value of lambda (regularization parameter), then the model will always perform better as it helps in reducing the overfitting of the model.

  1. True
  2. False

Answer: 2
Explanation: As we increase the regularization hyperparameter lambda, the weights start becoming smaller. This can also be verified from the weight update equation in gradient descent (with L2 regularization), which is w = w·(1 − α·λ/m) − α·∂Loss/∂w. So, as you increase λ to a very high value, the weights move closer to 0. This leads to a model that is too simple and ends up underfitting the data, thus decreasing the performance of the model. Refer to this beautiful explanation by Andrew Ng to know more.

Q8. What is a multi-task learning in deep learning?

  1. Train n different neural networks to learn n tasks
  2. Train a single neural network to learn n tasks simultaneously

Answer: 2
Explanation: In multi-task learning, we train a single neural network to learn n tasks simultaneously. For instance, a self-driving car has to detect pedestrians, cars, traffic lights, etc.

Machine Learning Quiz-3

Q1. In neural networks, where do we apply batch normalization?

  1. Before applying activation function
  2. After applying activation function

Answer: 1
Explanation: We generally apply batch normalization before applying activation function. Refer to this beautiful explanation by Andrew Ng to know more.

Q2. In Mini-batch gradient descent, if the mini-batch size is set equal to the training set size it will become Stochastic gradient descent, and if the mini-batch size is set equal to 1 training example it will become batch gradient descent?

  1. True
  2. False

Answer: 2
Explanation: It is actually the opposite. In Mini-batch gradient descent, if the mini-batch size is set equal to the training set size it becomes Batch gradient descent, and if the mini-batch size is set equal to 1 training example it becomes Stochastic gradient descent.

Q3. If we have enough computation power, it would be wiser to train multiple models in parallel and then choose the best one, instead of babysitting a single model.

  1. True
  2. False

Answer: 1
Explanation: In deep learning, there is no general rule for finding the best set of hyperparameters for any task. So, one needs to follow the iterative process of Idea -> Code -> Experiment, and being able to try out many ideas quickly in parallel works better than babysitting a single model.

Q4. Vectorization allows you to compute forward propagation in an L-layer neural network without an explicit for-loop (or any other explicit iterative loop) over the layers l=1, 2, …, L?

  1. True
  2. False

Answer: 2
Explanation: We cannot avoid the for-loop iteration over the computations among layers. Refer to this beautiful explanation by Andrew Ng to know more.

Q5. Suppose you ran logistic regression twice, once with regularization parameter λ=0, and once with λ=1. One of the times, you got weight parameters w=[26.29 65.41], and the other time you got w=[2.75 1.32]. However, you forgot which value of λ corresponds to which value of w. Which one do you think corresponds to λ=1?

  1. w=[26.29 65.41]
  2. w=[2.75 1.32]

Answer: 2
Explanation: λ=0 means no regularization is used, whereas λ=1 means regularization is used. And as we know, regularization results in weight shrinkage, so without regularization you will get larger weights compared to with regularization.

Q6. What is the value of Sigmoid activation function (let’s denote by g(z)) at an input value of z=0?

  1. 0
  2. 0.5
  3. -∞
  4. +∞

Answer: 2
Explanation: As we know, sigmoid is given by g(z) = 1/(1 + exp(-z)), so at an input value of z=0 it outputs 0.5. Refer to this beautiful explanation by Andrew Ng to know more.

Q7. Suppose you have built a neural network having 1 input, 1 hidden and 1 output layer. You decide to initialize the weights and biases to be zero. Which of the following statements is true?

  1. Each neuron in the hidden layer will perform the same computation in the first iteration. But after one iteration of gradient descent they will learn to compute different things because we have “broken symmetry”.
  2. The hidden layer’s neurons will perform different computations from each other even in the first iteration; their parameters will thus keep evolving in their own way.
  3. Each neuron in the hidden layer will perform the same computation. So even after multiple iterations of gradient descent each neuron in the layer will be computing the same thing as other neurons.

Answer: 3
Explanation: By initializing the weights and biases to 0, each neuron in the hidden layer will perform the same computation. So even after multiple iterations of gradient descent, each neuron in the layer will be computing the same thing as the other neurons. Refer to this beautiful explanation by Andrew Ng to know more.

Q8. Suppose we have a neural network having 10 nodes in the input layer, 5 nodes in the hidden layer and 1 node in the output layer. What will be the dimension of b1 (first layer bias) and b2 (second layer bias)?

  1. b1:5×1, b2:1×1
  2. b1:1×10, b2:1×5
  3. b1:1×5, b2:5×10
  4. b1:5×10, b2:1×5

Answer: 1
Explanation: Generally, the bias dimension for a layer is (number of nodes in that layer × 1), so the answer is b1:5×1, b2:1×1. Refer to this beautiful explanation by Andrew Ng to know more.

Machine Learning Quiz-2

Q1. Which of the following is a good choice for image related tasks such as Image classification or object detection?

  1. Multilayer Perceptron (MLP)
  2. Convolutional Neural Network (CNN)
  3. Recurrent Neural Network (RNN)
  4. All of the above

Answer: 2
Explanation: A Convolutional Neural Network (CNN) is a good choice for image-related tasks such as image classification or object detection. There are two main reasons for this. The first is parameter sharing: a feature detector that is useful in one part of an image is probably useful in another part of the same image, and because of this a CNN has fewer parameters. The second is sparsity of connections: in each layer, each output value depends only on a small number of inputs (equal to the filter size).

Q2. Which of the following statement is correct?

  1. RMSprop divides the learning rate by an exponentially decaying average of squared gradients
  2. RMSprop divides the learning rate by an exponentially increasing average of squared gradients
  3. RMSprop has a constant learning rate
  4. RMSprop decays the learning rate by a constant value

Answer: 1
Explanation: The weight update equation in RMSprop is given by w = w − α·dw/(√(S_dw) + ε), where S_dw is an exponentially weighted (decaying) average of the squared gradients. Thus, RMSprop divides the learning rate by an exponentially decaying average of squared gradients. Refer to this beautiful explanation by Andrew Ng to know more.

Q3. _____ is a type of gradient descent which processes 1 training example per iteration?

  1. Stochastic Gradient Descent
  2. Batch Gradient Descent
  3. Mini-batch Gradient Descent
  4. None of the above.

Answer: 1
Explanation: Stochastic Gradient Descent processes 1 training example per iteration of gradient descent.

Q4. Let's say you have trained a cat classifier on 10 million cat images and it is performing well in a live environment. Now, in the live environment, you have encountered a new cat species, and because of that your deployed model has started degrading. You have only 1000 images of the newly identified cat species. Which of the following steps should you take first?

  1. Put all 1000 images in the training set and start training asap
  2. Try data augmentation on these 1000 images to get more data
  3. Split the 1000 images into train/test set and start the training
  4. Use the data you have to define a new evaluation metric (using a new dev/test set) taking into account the new species, and use that to drive further progress with the model

Answer: 4
Explanation: Because we have very little data for the new cat species (1000 images) compared to 10 million, putting these 1000 images into training, or splitting them, will not make any difference. Also, with augmentation we will not be able to grow the dataset to that extent (10 million). So the only option left is to build a new evaluation metric and penalize the model more for making false predictions on the new species.

Q5. Which of the following is an example of supervised learning?

  1. Given the data of house prices and house sizes, predict house price as a function of house size
  2. Given 50 spam and 50 non-spam emails, predict whether the new email is spam/non-spam
  3. Given the data consisting of 1000 images of cats and dogs each, we need to classify to which class the new image belongs
  4. All of the above

Answer: 4
Explanation: Because for each of the above options we have the correct answer/label, all of these are examples of supervised learning.

Q6. Which of the following is True for Structured Data?

  1. Structured Data has clear, definable relationships between the data points, with a pre-defined model containing it
  2. Structured data is quantitative, highly organized, and each feature has a well-defined meaning
  3. Structured data is generally contained in relational databases (RDBMS)
  4. All of the above

Answer: 4
Explanation: All of the above is True for Structured Data. Refer to this link to know more.

Q7. You have built a network using the sigmoid activation for all the hidden units. You initialize the weights to relatively large values, using np.random.randn(..,..)*10000. What will happen?

  1. This will cause the inputs to the sigmoid to be very large, causing the units to be “highly activated” and thus speed up learning compared to if the weights had to start from small values
  2. It doesn’t matter as long as you initialize the weights randomly gradient descent is not affected by whether the weights are large or small
  3. This will cause the inputs to the sigmoid to be very large, thus causing gradients to also become large. You therefore have to set α to be very small to prevent divergence; this will slow down learning
  4. This will cause the inputs to the sigmoid to be very large, thus causing gradients to be close to zero and slows down the learning

Answer: 4
Explanation: When we initialize the weights to very large values, the input to the sigmoid function (calculated as z = w*x + b) also becomes very large. As we know, for large inputs the sigmoid curve is quite flat, so the gradients will be close to 0, which slows down gradient descent and learning.

Q8. Let's say you are working on a cat classifier and have been asked to work with three different metrics: 1. accuracy, 2. inference time and 3. memory size. What will you say about the following statement: "Having three evaluation metrics will make it easier for you to quickly choose between two different algorithms, and your team can work faster."

  1. True
  2. False

Answer: 2
Explanation: It is always good to have a single real-number evaluation metric. If you have more than one evaluation metric, it becomes very difficult to assess performance. For instance, if in one case the precision and recall are 60% and 40% while in another case they are 30% and 70%, it is a very tedious task to judge which one is better. That's why we have the F1 score, as it combines precision and recall into one metric.

Machine Learning Quiz-1

Q1. Let's say you have a dataset of 10 million examples, and it would take two weeks to train your model on it. Which of the following statements do you most agree with?

  1. If you have already trained a model on a different dataset and it is performing well with 98% dev accuracy on that dataset, just use that model instead of training on the current dataset for two weeks
  2. If 10 million examples are enough to build a good model, you might be better off training with just 1 million examples to gain a 10x improvement in how quickly you can run experiments, even if each model performs a bit worse because it is trained on less data
  3. You will go with the complete dataset and run the model for two weeks to see the first results
  4. All of the above

Answer: 2
Explanation: In machine learning, the best approach is to build an initial model quickly using a random subset of the data and then use Bias/Variance analysis and error analysis to prioritize the next steps.

Q2. In a Multi-layer Perceptron (MLP), each node is connected to all the previous layer nodes?

  1. True
  2. False

Answer: 1
Explanation: Since a Multi-Layer Perceptron (MLP) is a Fully Connected Network, each node in one layer connects with a certain weight to every node in the following layer.

Q3. Identify the following activation function: g(z) = (exp(z) - exp(-z))/(exp(z) + exp(-z))?

  1. Tanh activation function
  2. Sigmoid activation function
  3. ReLU activation function
  4. Leaky ReLU activation function

Answer: 1
Explanation: This is the Tanh activation function. Similar to sigmoid, the tanh function is continuous and differentiable at all points; the only difference is that it is symmetric around the origin. Refer to this beautiful explanation by Andrew Ng to know more.

Q4. Suppose we have a neural network having 10 nodes in the input layer, 5 nodes in the hidden layer and 1 node in the output layer. What will be the dimension of W1 (first layer weights) and W2 (second layer weights)?

  1. W1:5×1, W2:1×1
  2. W1:1×10, W2:1×5
  3. W1:1×5, W2:5×10
  4. W1:5×10, W2:1×5

Answer: 4
Explanation: Generally, the weights dimensions for a layer is (next layer nodes x previous layer nodes) so the answer is W1:5×10, W2:1×5. Refer to this beautiful explanation by Andrew Ng to know more.

Q5. In Dropout, what will happen if we increase the keep probability (keep_prob) from (say) 0.5 to 0.8?

  1. Reducing the regularization effect.
  2. Causing the neural network to end up with a lower training set error.
  3. Both of the above.
  4. None of the above.

Answer: 3
Explanation: Increasing keep_prob from 0.5 to 0.8 means fewer units are dropped, so the regularization effect becomes weaker and the network fits the training set more closely, ending up with a lower training set error.

Q6. Finding good hyperparameter values is very time-consuming. So typically you should do it once at the start of the project, and try to find very good hyperparameters so that you don’t ever have to revisit tuning them again.

  1. True
  2. False

Answer: 2
Explanation: You can’t really know beforehand which set of hyperparameters will work best for your case. You need to follow the iterative process of Idea -> Code -> Experiment.

Q7. In a deep neural network, what is the general rule for the dimensions of the weights and biases of layer l, where n[l] denotes the number of units in layer l?

  1. w[l] : (n[l], n[l])
    b[l] : (n[l], 1)
  2. w[l] : (n[l+1], n[l])
    b[l] : (n[l-1], 1)
  3. w[l] : (n[l], n[l-1])
    b[l] : (n[l], 1)
  4. w[l] : (n[l], n[l-1])
    b[l] : (n[l-1], 1)

Answer: 3
Explanation: The dimensions of weights of layer l is given by (n[l], n[l-1]) and biases is given by (n[l], 1). Refer to this beautiful explanation by Andrew Ng to know more.

Q8. Which of the following method can be used for hyperparameter tuning?

  1. Random Search
  2. Grid Search
  3. Bayesian optimization
  4. All of the above.

Answer: 4
Explanation: All of the above methods can be used for hyperparameter tuning.

Supervised And Unsupervised Learning

With the advancements in the field of artificial intelligence, we are able to solve problems in many different fields, some of which you may be using in your daily life. The two major categories in this field are supervised and unsupervised learning.

Suppose you get a bunch of e-mails, each labelled with the category it falls into, either "spam" or "not spam", and then you train a model to categorize a new e-mail. This type of learning is called supervised learning.

Now suppose you are invited to a party and meet total strangers. You will group them without any prior knowledge, and this grouping can be on the basis of gender, age group, dressing, educational qualification or whatever way you would like. This is unsupervised learning, since you are exploring the data and finding groups by exploration.

In supervised learning, we teach the computer how to do something, while in unsupervised learning we let the computer figure it out by itself. Does that make sense? Let's look into this using some examples.

Supervised Learning

Let’s say we need to predict whether an image is of a "cat" or not.

To make computers learn this type of problem, we need to provide them a dataset having both the input images and their corresponding labels, i.e. whether each one is a cat or not. So, if the dataset has the output labels in it, the problem can be classified as a supervised learning problem.

Supervised learning follows this pattern: input -> hypothesis -> output

Here the inputs are our training data, for example images of a "cat"; the hypothesis can be one of the machine learning algorithms, for example SVM or Decision Trees; and the output is the corresponding label, for example "cat" or "not cat".

Supervised learning can be further classified into Classification and Regression.

Classification: In classification problems we predict results as a discrete output, for example predicting whether an email is "spam" or "non-spam".

Regression: In regression problems we predict results as a continuous output, for example predicting house prices.

Unsupervised Learning

Let's say we have a bunch of T-shirts.

We do not have labels telling us which class each T-shirt belongs to. In unsupervised learning, the model will discover information from this data on its own. Let's say the model discovers T-shirt size as a feature and clusters these T-shirts according to their sizes into three categories: small, medium and large.

So in unsupervised learning problems, output labels are not provided and the computer has to find some hidden structure in the data and group it accordingly.

Unsupervised learning is further classified into Clustering and association problems.

Clustering: In clustering, the algorithm forms groups within the data. For example, grouping news articles according to their headlines, as Google News does.

Association: In association, the algorithm discovers interesting relationships within the data. For example, recommending a similar product to a user on an e-commerce website.
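
To make the clustering idea concrete, here is a minimal sketch using scikit-learn's KMeans on some hypothetical T-shirt measurements (the numbers and feature choice are made up for illustration):

```python
# Minimal clustering sketch: group hypothetical T-shirt measurements into 3 sizes.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (height_cm, weight_kg) measurements, with no labels provided
tshirt_data = np.array([
    [150, 50], [152, 53], [155, 55],   # likely "small"
    [165, 65], [168, 68], [170, 70],   # likely "medium"
    [180, 85], [183, 88], [185, 90],   # likely "large"
])

# Ask KMeans to discover 3 groups in the unlabeled data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(tshirt_data)
print(kmeans.labels_)   # cluster index assigned to each T-shirt
```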

Summary

  • Supervised learning works on labeled training data while unsupervised works on unlabeled training data.
  • Unsupervised learning explores the data and finds interesting features.
  • Supervised learning as the name suggests has a supervisor.
  • Unsupervised learning uses algorithms like K-means, hierarchical clustering while supervised learning uses algorithms like SVM, linear regression, logistic regression, etc.
  • Supervised learning can be applied in the field of risk assessment, image classification, fraud detection, object detection, etc.
  • Unsupervised learning can be applied in the field of delivery store optimization, semantic clustering, market basket analysis, etc.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Calculating Screen Time of an Actor using Deep Learning

The screen time of an actor in a movie or an episode is very important. Many actors get paid according to their total screen time. Moreover, we also want to know how much time our favorite character appeared on screen. So, have you ever wondered how you can calculate the total screen time of an actor? One plausible answer is with deep learning.

With the advancement of deep learning, it is now possible to solve various difficult problems. In this blog, we will learn how to use the transfer learning and image classification concepts of deep learning to calculate the screen time of an actor.

To solve any problem with deep learning, the first requirement is the data. For this tutorial, we will use a video clip from the famous TV show “Friends”. We are going to calculate the screen time of my favorite character “Ross”.

Creating Dataset

First, we need to get a video. To do this, I have downloaded a video from YouTube using the pytube library. For a better understanding of pytube, you can follow this blog or use the following code to get started.
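
Here is a minimal sketch of the download step with pytube (the URL below is a placeholder, not the actual clip, and the exact stream-selection calls may vary between pytube versions):

```python
from pytube import YouTube

video_url = 'https://www.youtube.com/watch?v=XXXXXXXXXXX'  # placeholder URL, not the real clip
yt = YouTube(video_url)

# Pick the highest-resolution progressive mp4 stream and save it as friends.mp4
stream = (yt.streams
            .filter(progressive=True, file_extension='mp4')
            .order_by('resolution')
            .desc()
            .first())
stream.download(filename='friends.mp4')
```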

Now we have our data in the form of a video, which is nothing but a group of frames (images). Since we are going to solve this problem using image classification, we need to extract the images from this video. For this task, I have used OpenCV as shown below.
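
A minimal frame-extraction sketch with OpenCV could look like this (file and folder names are assumptions):

```python
import cv2

count = 0
cap = cv2.VideoCapture('friends.mp4')     # the downloaded video clip
while True:
    ret, frame = cap.read()               # read one frame at a time
    if not ret:                           # no frames left to read
        break
    # Save each frame as a jpg (assumes a 'frames' directory already exists)
    cv2.imwrite('frames/frame%d.jpg' % count, frame)
    count += 1
cap.release()
```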

The video is now converted into individual frames. In this problem, there are only two classes, "Ross" and "No Ross". To create a dataset, we need to separate the images into these two classes manually. For this, I have created a folder named "data" which has two sub-folders, "ross" and "no_ross", and then manually added the images to these two sub-folders. After creating the dataset, we are ready to dive into the code and concepts.

Input Data and Preprocessing

We have our data in the form of images. To prepare this data as input to our neural network, we need to do some preprocessing, with the following steps (sketched in code after this list):

  • Read all images one by one using OpenCV
  • Resize each image to (224, 224, 3) for input to the model
  • Divide the pixel values by 255 so the input features to the neural network are in the same range
  • Append each image to its corresponding class
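
Here is a minimal sketch of these preprocessing steps; it assumes the "data/ross" and "data/no_ross" folders created earlier and uses label 1 for "Ross" and 0 for "No Ross":

```python
import os
import cv2
import numpy as np

def load_class(folder, label):
    images, labels = [], []
    for name in os.listdir(folder):
        img = cv2.imread(os.path.join(folder, name))   # read image with OpenCV
        if img is None:
            continue
        img = cv2.resize(img, (224, 224))              # resize to VGG16 input size
        images.append(img / 255.0)                     # scale pixels to [0, 1]
        labels.append(label)
    return images, labels

ross_imgs, ross_labels = load_class('data/ross', 1)            # class "Ross"
no_ross_imgs, no_ross_labels = load_class('data/no_ross', 0)   # class "No Ross"

X = np.array(ross_imgs + no_ross_imgs, dtype='float32')
y = np.array(ross_labels + no_ross_labels, dtype='float32')
```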

Transfer Learning

Since we have only 6814 images, it will be difficult to train a neural network with such a small dataset. Here comes the concept of transfer learning.

With the help of transfer learning, we can use the features generated by a model trained on a large dataset in our model. Here we will use the VGG16 model trained on the "imagenet" dataset. For this, we are using TensorFlow's high-level API, Keras. With Keras, you can directly import the VGG16 model as shown in the code below.
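
A minimal sketch of the import looks like this:

```python
# Import the pre-trained VGG16 model without its fully connected layers
from tensorflow.keras.applications.vgg16 import VGG16

vgg_model = VGG16(weights='imagenet',        # weights learned on ImageNet
                  include_top=False,         # drop the fully connected layers
                  input_shape=(224, 224, 3))
vgg_model.summary()
```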

The VGG16 model trained on the imagenet dataset predicts over 1000 classes, but in this problem we only have two classes, "Ross" and "No Ross". That's why above we are using include_top=False, which signifies that we are not including the fully connected layers from the VGG16 model. Now we will pass our input data to vgg_model and generate the features.
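
A short sketch of the feature-generation step, reusing the X array from the preprocessing sketch above:

```python
# Generate bottleneck features by passing the preprocessed images through VGG16
features = vgg_model.predict(X)                               # shape: (num_images, 7, 7, 512)
features = features.reshape(features.shape[0], 7 * 7 * 512)   # flatten for the dense layers
```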

Network Architectures

Since we are not including the fully connected layers from the VGG16 model, we need to create a model with some fully connected layers and an output layer with a single node deciding between "Ross" and "No Ross". The output features from the VGG16 model have shape 7*7*512, which will be the input shape for our model. Here I am also using dropout layers to make the model less prone to over-fitting. Let's see the code:
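
Here is a minimal sketch of such a classifier head; the sizes of the dense layers and the dropout rate are assumptions, not necessarily the exact values used originally:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(1024, activation='relu', input_shape=(7 * 7 * 512,)),  # flattened VGG16 features
    Dropout(0.5),                                                # reduce over-fitting
    Dense(256, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')    # single output node: "Ross" vs "No Ross"
])
model.summary()
```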

Splitting Data into Train and Validation

Now we have the input features from the VGG16 model and our own network architecture defined above. The next thing is to train this neural network, but we still need validation data. We have 6814 images, so we will split them into 5000 training images and 1814 validation images.

According to the two classes and the training/validation split, we also create our output y labels.
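
A minimal sketch of the split, reusing the features and y arrays from the sketches above (the shuffling strategy is an assumption):

```python
from sklearn.utils import shuffle

# Shuffle images and labels together so both classes appear in both splits
features, y = shuffle(features, y, random_state=0)

X_train, y_train = features[:5000], y[:5000]   # 5000 training examples
X_valid, y_valid = features[5000:], y[5000:]   # remaining 1814 validation examples
```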

Training the Network

All set, we are ready to train our model. Here, we will use stochastic gradient descent as the optimizer and binary cross-entropy as our loss function. We are also going to save a checkpoint of the best model according to its validation accuracy.

I am using a batch size of 64 and 10 epochs to train.
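
A minimal sketch of the compile/checkpoint/fit step (the learning rate and checkpoint file name are assumptions):

```python
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ModelCheckpoint

model.compile(optimizer=SGD(learning_rate=0.01),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Keep only the weights with the best validation accuracy
# ('val_acc' instead of 'val_accuracy' in older Keras versions)
checkpoint = ModelCheckpoint('best_model.h5',
                             monitor='val_accuracy',
                             save_best_only=True)

model.fit(X_train, y_train,
          validation_data=(X_valid, y_valid),
          batch_size=64, epochs=10,
          callbacks=[checkpoint])
```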

The training and validation accuracy look quite pleasing. Now let's calculate the screen time of "Ross".

Calculating Screen Time

To test our trained model and calculate the screen time, I have downloaded another "Friends" video clip from YouTube and extracted the images. To calculate the screen time, I first used the trained model to predict the class of each image, either "Ross" or "No Ross". Since the video is made up of 24 frames per second, we count the number of frames predicted as having "Ross" in them and then divide by 24 to get the number of seconds "Ross" was on screen.
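
A minimal sketch of this calculation, assuming the test frames have already been preprocessed and passed through VGG16 into a test_features array just like the training data:

```python
predictions = model.predict(test_features)     # probability of "Ross" for each frame
ross_frames = int((predictions > 0.5).sum())   # frames classified as "Ross"

fps = 24                                       # the clip has 24 frames per second
print('Screen time of Ross: %d seconds' % (ross_frames // fps))
```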

This test video clip is made up of 24 frames per second, and the number of images predicted as having "Ross" in them is 4715. So the screen time for Ross is 4715/24 ≈ 196 seconds.

Summary

We can see good accuracy on the training and validation datasets, but when I tested the model on the test dataset, the accuracy was about 65%. One reason I figured out is too little training data; if you can get more data, the accuracy can be higher. Another reason can be covariate shift, which means the test dataset is quite different from the training dataset due to different video quality.

This type of technique can be very helpful in calculating screen time of a particular character.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Densely Connected Convolutional Networks – DenseNet

When we see a machine learning problem related to an image, the first thing that comes to our mind is a CNN (convolutional neural network). Different convolutional networks like LeNet, AlexNet, VGG16, VGG19, ResNet, etc. are used to solve different problems, whether supervised (classification) or unsupervised (image generation). Over the years, deeper and deeper CNN architectures have been used: as problems become more complex, deeper convolutional networks are preferred. But with deeper networks, the problem of vanishing gradients arises.

To solve this problem, Gao Huang et al. introduced Densely Connected Convolutional Networks (DenseNets). DenseNets have several compelling advantages:

  1. alleviate the vanishing-gradient problem
  2. strengthen feature propagation
  3. encourage feature reuse, and substantially reduce the number of parameters.

How does DenseNet work?

Recent architectures like ResNet also try to solve the problem of vanishing gradients. ResNet passes information from one layer to another via identity connections. In ResNet, features are combined through summation before being passed into the next layer.

DenseNet, on the other hand, introduces connections from one layer to all its subsequent layers in a feed-forward fashion (as shown in the figure below). These connections are made using concatenation, not summation.

source: DenseNet

The ResNet architecture preserves information explicitly through identity connections; also, recent variations of ResNet show that many layers contribute very little and can in fact be randomly dropped during training. The DenseNet architecture explicitly differentiates between information that is added to the network and information that is preserved.

In DenseNet, each layer has direct access to the gradients from the loss function and the original input signal, leading to an improved flow of information and gradients throughout the network. DenseNets also have a regularizing effect, which reduces overfitting on tasks with smaller training set sizes.

An important difference between DenseNet and existing network architectures is that DenseNet can have very narrow layers, e.g., k = 12. The hyperparameter k is referred to as the growth rate of the network. It means each layer in a dense block will only produce k feature-maps, and these k feature-maps will be concatenated with the previous layers' features and given as input to the next layer.

DenseNet Architecture

The best way to illustrate any architecture is with the help of code. So, I have implemented the DenseNet architecture in Keras using the MNIST dataset.

A DenseNet consists of dense blocks, and each dense block consists of convolution layers. After a dense block, a transition layer is added to proceed to the next dense block (as shown in the figure below).

Every layer in a dense block is directly connected to all its subsequent layers. Consequently, each layer receives the feature-maps of all preceding layers.

Each convolution layer consists of three consecutive operations: batch normalization (BN), followed by a rectified linear unit (ReLU) and a 3 × 3 convolution (Conv). Dropout can also be added, depending on your architecture requirements.

An essential part of convolutional networks is the down-sampling layers that change the size of the feature-maps. To facilitate down-sampling, the DenseNet architecture divides the network into multiple densely connected dense blocks (as shown in the figure earlier).

The layers between blocks are transition layers, which perform convolution and pooling. The transition layers consist of a batch normalization layer and a 1×1 convolutional layer followed by a 2×2 average pooling layer.
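
Here is a minimal Keras sketch of these building blocks; the growth rate and filter counts are illustrative and not necessarily the exact values from my MNIST implementation:

```python
from tensorflow.keras.layers import (BatchNormalization, Activation, Conv2D,
                                     AveragePooling2D, Concatenate)

def conv_block(x, growth_rate):
    # BN -> ReLU -> 3x3 Conv, producing `growth_rate` new feature-maps
    y = BatchNormalization()(x)
    y = Activation('relu')(y)
    y = Conv2D(growth_rate, (3, 3), padding='same')(y)
    return y

def dense_block(x, num_layers, growth_rate=12):
    # Each layer's output is concatenated with everything that came before it
    for _ in range(num_layers):
        y = conv_block(x, growth_rate)
        x = Concatenate()([x, y])
    return x

def transition_layer(x, num_filters):
    # BN -> 1x1 Conv -> 2x2 average pooling between dense blocks
    x = BatchNormalization()(x)
    x = Conv2D(num_filters, (1, 1), padding='same')(x)
    x = AveragePooling2D((2, 2), strides=2)(x)
    return x
```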

DenseNets can scale naturally to hundreds of layers, while exhibiting no optimization difficulties. Because of their compact internal representations and reduced feature redundancy, DenseNets may be good feature extractors for various computer vision tasks that build on convolutional features.

The full code can be found here.

Referenced research paper: Densely Connected Convolutional Networks

Hope you enjoy reading. If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Autoencoders

Let’s start with a simple definition of autoencoders: 'Autoencoders are neural networks trained to reconstruct their original input.'

Now, you might be thinking: what’s the use of reconstructing the same data? Let me give you an example. If you want to transfer gigabytes of data, and you can somehow compress it into megabytes and then reconstruct it back to the original size, isn’t that a better way to transfer the data? This is one of the applications of autoencoders.

Autoencoders generally consist of two parts: an encoder and a decoder. The encoder downscales the data to a smaller number of features, and the decoder upscales the extracted features back to the original.

There are some practical applications of autoencoders:

  1. Dimensionality reduction for data visualization
  2. Image Denoising
  3. Generative Models

Visualizing a 10-dimensional vector is difficult. To overcome this problem, we need to reduce the 10-dimensional vector to 2-D or 3-D. One famous algorithm, PCA (Principal Component Analysis), tries to solve this problem. PCA uses linear transformations, while autoencoders can use both linear and non-linear transformations for dimensionality reduction. This allows autoencoders to generate more complex and interesting features than PCA.

Autoencoders can be used to remove the noise present in an image. They can also be used to generate new images required for a specific task. We will see more about these two applications in the next blog.

Now, let’s start with a simple implementation of an autoencoder in Keras using the MNIST data. First, let’s download the MNIST training and test data and reshape it.
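
A minimal sketch of this step with the Keras datasets API:

```python
import numpy as np
from tensorflow.keras.datasets import mnist

# Load MNIST; the labels are ignored since the autoencoder reconstructs its input
(x_train, _), (x_test, _) = mnist.load_data()

# Scale pixels to [0, 1] and reshape to (num_images, 28, 28, 1)
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)
```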

Encoder

The MNIST data consists of images of digits, so it is better to use a convolutional neural network in our encoder and decoder. In the encoder, I have used conv and max-pooling layers to extract the compressed representation, then flattened the encoder output to 32 features, which will be the input to the decoder.
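
A minimal sketch of such an encoder (the filter counts are assumptions):

```python
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense

input_img = Input(shape=(28, 28, 1))

x = Conv2D(16, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2))(x)                   # 28x28 -> 14x14
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2))(x)                   # 14x14 -> 7x7
x = Flatten()(x)
encoded = Dense(32, activation='relu')(x)     # compressed 32-feature representation
```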

Decoder

In the decoder, we need to upsample the extracted 32 features back to the original size of the image. To achieve this, I have used the Conv2DTranspose layers from Keras. The final layer of the decoder gives the reconstructed output, which should be similar to the original input.
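
A minimal sketch of the decoder, continuing from the encoded tensor above (layer sizes are assumptions):

```python
from tensorflow.keras.layers import Dense, Reshape, Conv2DTranspose, Conv2D
from tensorflow.keras.models import Model

x = Dense(7 * 7 * 8, activation='relu')(encoded)
x = Reshape((7, 7, 8))(x)
x = Conv2DTranspose(8, (3, 3), strides=2, activation='relu', padding='same')(x)    # 7x7 -> 14x14
x = Conv2DTranspose(16, (3, 3), strides=2, activation='relu', padding='same')(x)   # 14x14 -> 28x28
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)               # reconstructed image

autoencoder = Model(input_img, decoded)
```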

To minimize the reconstruction loss, we train the network on a large dataset and update the weights. Now that our model is created, the next thing is to compile and train it.
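
A minimal sketch of the compile-and-train step; note that the inputs are also the targets, since the network learns to reconstruct them (the optimizer, epochs and batch size are assumptions):

```python
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# The model learns to map each image back to itself
autoencoder.fit(x_train, x_train,
                epochs=10, batch_size=128,
                validation_data=(x_test, x_test))
```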

Below are the results from the autoencoder trained above. The first line of digits shows the original input (test images), while the second line shows the reconstructed outputs from the model.

The full code can be found here.

Hope you now understand the basics of autoencoders, where they can be used, and how a simple autoencoder can be implemented. In the next blog, we will see how to denoise an image using autoencoders. Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: http://proceedings.mlr.press/v27/baldi12a/baldi12a.pdf

Genetic Algorithm and its usage in neural networks

You might have heard about the theory of evolution by natural selection. If not, then read this quote attributed to Charles Darwin: "It is not the strongest of the species that survives, nor the most intelligent; it is the one most adaptable to change." The genetic algorithm is also based on this theory.

In the 1970s, John Holland tried to mimic some processes observed in natural evolution by introducing the genetic algorithm. This algorithm can be used for both optimization and search problems.

A typical genetic algorithm requires a population in the solution domain and a fitness function to find the fittest individuals. To evolve the individuals in the population, the genetic algorithm uses operations like crossover, mutation, and selection.

The genetic algorithm starts with a random initial population and then tries to produce offspring from the best individuals in the population. The idea is that if the fittest individuals are selected, the chances of producing better offspring are higher. This process keeps iterating until your target is achieved. Each iteration is known as a generation.

Initial Population

The initial population refers to a set of possible solutions. Each member (individual) of the population is usually known as a chromosome (phenotype) and represents a solution for the problem being investigated. The chromosome is represented as a set of parameters (features, genes or weights) that defines the individual. The size of the population depends entirely on your problem. Random selection of the initial population makes sure that it covers a wide range of possible solutions.

Evaluation and Fitness Function

Now that we have a random initial population, the next thing is to evaluate the fitness of these individuals. To do this, you need to define a fitness function, chosen according to your problem. The fitness function measures the quality of each individual.

Selection

The best individuals are selected from the evaluated population. These selected individuals are mated to produce new offspring.

Crossover

Each individual selected in the previous step has some quality. Our objective is to produce better offspring so that our algorithm can evolve and find a better solution to the problem. To do that, two individuals from the selected parents are chosen and a new child (offspring) is produced with features of both, as shown above. This is known as crossover.

Mutation

Mutation is applied to maintain the diversity within the population and inhibit premature convergence. With some low probability, a portion of the new individual is subjected to mutation as shown in the figure above.

Replacement

The new population replaces the previous one for the next generation. This process keeps iterating until a certain target is achieved.
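
To tie the steps together, here is a minimal genetic algorithm sketch that maximizes the number of 1s in a bit string; all parameter values are illustrative:

```python
import random

GENES, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 50, 100, 0.01

def fitness(individual):
    return sum(individual)                     # fitness = number of 1s in the bit string

def crossover(parent1, parent2):
    point = random.randint(1, GENES - 1)       # single-point crossover
    return parent1[:point] + parent2[point:]

def mutate(individual):
    # Flip each gene with a small probability to maintain diversity
    return [1 - g if random.random() < MUTATION_RATE else g for g in individual]

# Initial population: random bit strings
population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]

for generation in range(GENERATIONS):
    # Selection: keep the fittest half of the population as parents
    parents = sorted(population, key=fitness, reverse=True)[:POP_SIZE // 2]
    # Crossover + mutation produce the new population (replacement)
    population = [mutate(crossover(*random.sample(parents, 2)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print('Best fitness after %d generations: %d' % (GENERATIONS, fitness(best)))
```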

Applications of the genetic algorithm in neural networks:

  1. Training of a neural network (instead of using gradient descent, Adam, etc.)
  2. Selection of the neural network architecture (hyperparameter selection)

Now you should have some feeling for the genetic algorithm. In the next blog, we will see how this concept can be applied to train a neural network to play a snake game. Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.