Q1. The optimizer is an important part of training neural networks. Which of the following is not a purpose of using optimizers?
Speed up algorithm convergence
Reduce the difficulty of manual parameter setting
Avoid overfitting
Avoid local extremes
Answer: 3 Explanation: To avoid overfitting, we use regularization and not optimizers.
Q2. Which of the following is not a regularization technique used in machine learning?
L1 regularization
R-square
L2 regularization
Dropout
Answer: 2 Explanation: Of the options above, R-squared is not a regularization technique; it is a statistical measure of how close the data are to the fitted regression line.
Q3. Which of the following are hyperparameters in the context of deep learning?
Learning Rate, α
Momentum parameter, β1
Number of units in a layer
All of the above
Answer: 4 Explanation: According to Wikipedia, “In machine learning, a hyperparameter is a parameter whose value is used to control the learning process”. So, all of the above are hyperparameters.
Q4. Which of the following statements is not true with respect to batch normalization?
Batch normalization helps in decreasing training time
After using batch normalization, there is no need to use dropout
Batch normalization helps in reducing the covariate shift
Answer: 2 Explanation: Although batch normalization has a slight regularization effect, that is not why we use it. It is used to make the neural network more robust (reduce covariate shift) and easier to train, while dropout is used for regularization (reducing overfitting). So, the second option is the statement that is not true.
Q5. In a machine learning project, modelling is an iterative process but deployment is not.
True
False
Answer: 2 Explanation: Deployment is an iterative process, where you should expect to make multiple adjustments (such as metrics monitored using dashboards or percentage of traffic served) to work towards optimizing the system.
Q6. Which of the following activation functions works better for hidden layers?
Sigmoid
Tanh
Answer: 2 Explanation: The tanh activation function usually works better than the sigmoid activation function for hidden units because the mean of its output is closer to zero, so it centers the data better for the next layer and the gradients are not restricted to move in a certain direction.
Q7. The softmax function is used to calculate the probability distribution over a discrete variable with n possible values?
True
False
Answer: 1 Explanation: The softmax function is used to calculate the probability distribution over a discrete variable with n possible values. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.
Q8. Suppose you want to use transfer learning from task A to task B. Which of the following scenarios would support using transfer learning?
Tasks A and B have the same input x
You have a lot more data for task A than for task B
Low level features from task A could be helpful for learning B
All of the above
Answer: 4 Explanation: All of the scenarios mentioned above support using transfer learning from task A to task B. Refer to this beautiful explanation by Andrew Ng to know more.
Q1. Which of the following is an example of unstructured data?
Audio
Images
Text
All of the above
Answer: 4 Explanation: All of these are examples of unstructured data. Refer to this link to know more.
Q2. Which of the following describes model-centric AI development?
Hold the data fixed and iteratively improve the code/model
Hold the code/model fixed and iteratively improve the data
Answer: 1 Explanation: As the name suggests, in model-centric AI development we hold the data fixed and iteratively improve the code/model.
Q3. What is Semi-Supervised Learning?
where for each example we have the correct answer/label and we infer a mapping function from these examples
where for each example we don’t have the correct answer/label and we try to find some sort of structure or pattern in the dataset
where for some examples we have the correct answer/label while for others we don’t have correct answer/label
Answer: 3 Explanation: As the name suggests, in semi-supervised learning we have the correct answer/label for some examples but not for others. Nowadays we can collect huge amounts of data, and labelling all of it takes enormous effort, so the focus is shifting towards semi-supervised learning. A closely related idea is self-supervised learning, where the data itself provides the context that makes up the labels; for instance, the CBOW model for creating word embeddings.
Q4. Which of the following is a reason to use non-linear activation functions in neural networks?
If you use only linear activation functions, then no matter how many layers you use, it will be the same as not using any hidden layers
A hidden layer with linear activation functions is of no use, as it does not add any non-linearity to the network, so the network will not be able to learn complex functions
Stacking any number of hidden layers with linear activation functions just composes them into another linear function
All of the above
Answer: 4 Explanation: All of the above are possible reasons. Refer to this beautiful explanation by Andrew Ng to know more.
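A quick NumPy sketch (with hypothetical layer sizes) confirming the third point: two stacked linear layers collapse into a single equivalent linear layer.

```python
import numpy as np

# Two "hidden layers" with linear (identity) activations, hypothetical sizes
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(2, 4), np.random.randn(2, 1)

x = np.random.randn(3, 1)

# Forward pass with linear activations: a = W x + b at every layer
a1 = W1 @ x + b1
a2 = W2 @ a1 + b2

# The same result from a single equivalent linear layer W_eq x + b_eq
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
assert np.allclose(a2, W_eq @ x + b_eq)  # no extra expressiveness gained
```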
Q5. Which of the following activation functions can be used in neural network?
ReLU
Tanh
Sigmoid
All of the above
Answer: 4 Explanation: All of the above activation functions can be used in neural networks. Refer to this beautiful explanation by Andrew Ng to know more.
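A minimal NumPy sketch of the three activations listed above (definitions only; the values in the comments are approximate):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + exp(-z)); output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # g(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)); output in (-1, 1), zero-centered
    return np.tanh(z)

def relu(z):
    # g(z) = max(0, z); cheap to compute and does not saturate for z > 0
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # approx [0.119, 0.5, 0.881]
print(tanh(z))     # approx [-0.964, 0.0, 0.964]
print(relu(z))     # [0.0, 0.0, 2.0]
```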
Q6. RMSprop resolves a limitation of the AdaGrad optimizer?
True
False
Answer: 1 Explanation: RMSprop divides the learning rate by an exponentially decaying average of squared gradients, whereas AdaGrad divides the learning rate by the sum of squared gradients. This sum keeps growing, which causes the learning rate in AdaGrad to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge. Refer to this link to know more.
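A rough single-parameter sketch (hypothetical function names) contrasting the two optimizers; only the line that builds the accumulator s differs.

```python
import numpy as np

def adagrad_step(w, grad, s, lr=0.01, eps=1e-8):
    # AdaGrad: s is the running SUM of squared gradients, so it only grows
    s = s + grad ** 2
    w = w - lr * grad / (np.sqrt(s) + eps)
    return w, s

def rmsprop_step(w, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    # RMSprop: s is an exponentially decaying AVERAGE of squared gradients
    s = beta * s + (1 - beta) * grad ** 2
    w = w - lr * grad / (np.sqrt(s) + eps)
    return w, s
```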
Q7. If you increase the value of lambda (the regularization parameter), the model will always perform better, as it helps in reducing the overfitting of the model.
True
False
Answer: 2 Explanation: As we increase the regularization hyperparameter lambda, the weights start becoming smaller. This can also be verified from the weight update equation in gradient descent (with L2 regularization), which is w = w*(1 − α*λ/m) − α*dLoss/dw. So, as you increase λ to a very high value, the weights get closer to 0. This leads to a model that is too simple and ends up underfitting the data, thus decreasing the performance of the model. Refer to this beautiful explanation by Andrew Ng to know more.
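A one-function sketch of that update rule (hypothetical variable names), showing how a large lambda shrinks the weights at every step:

```python
def l2_gd_step(w, dloss_dw, lr, lam, m):
    # w = w * (1 - lr * lam / m) - lr * dLoss/dw
    # The (1 - lr*lam/m) factor shrinks the weights on every step;
    # the larger lam is, the stronger the shrinkage (and the higher the risk of underfitting).
    return w * (1 - lr * lam / m) - lr * dloss_dw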
Q8. What is multi-task learning in deep learning?
Train n different neural networks to learn n tasks
Train a single neural network to learn n tasks simultaneously
Answer: 2 Explanation: In multi-task learning, we train a single neural network to learn n tasks simultaneously. For instance, a self-driving car has to detect pedestrians, cars, traffic lights, etc.
Q1. In neural networks, where do we apply batch normalization?
Before applying activation function
After applying activation function
Answer: 1 Explanation: We generally apply batch normalization before applying the activation function. Refer to this beautiful explanation by Andrew Ng to know more.
Q2. In mini-batch gradient descent, if the mini-batch size is set equal to the training set size it will become stochastic gradient descent, and if the mini-batch size is set to 1 training example it will become batch gradient descent?
True
False
Answer: 2 Explanation: It is actually the opposite. In mini-batch gradient descent, if the mini-batch size is set equal to the training set size it becomes batch gradient descent, and if the mini-batch size is set to 1 training example it becomes stochastic gradient descent.
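A schematic training loop (assuming a hypothetical compute_gradient helper) in which only the mini-batch size distinguishes the three variants:

```python
import numpy as np

def minibatch_gd(X, y, w, compute_gradient, lr=0.01, batch_size=64, epochs=10):
    """batch_size = len(X)      -> batch gradient descent
       batch_size = 1           -> stochastic gradient descent
       1 < batch_size < len(X)  -> mini-batch gradient descent"""
    m = len(X)
    for _ in range(epochs):
        perm = np.random.permutation(m)           # shuffle each epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]  # current mini-batch
            w = w - lr * compute_gradient(X[idx], y[idx], w)
    return w
```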
Q3. If we have enough computational power, it would be wiser to train multiple models in parallel and then choose the best one instead of babysitting a single model.
True
False
Answer: 1 Explanation: In deep learning, there is no general rule for finding the best set of hyperparameters for a given task. So one needs to follow the iterative process of Idea -> Code -> Experiment, and being able to try out many ideas quickly in parallel is better suited to this than babysitting a single model.
Q4. Vectorization allows you to compute forward propagation in an L-layer neural network without an explicit for-loop (or any other explicit iterative loop) over the layers l = 1, 2, …, L?
True
False
Answer: 2 Explanation: We cannot avoid the for-loop over the layers, since each layer's activations are needed to compute the next layer's. Refer to this beautiful explanation by Andrew Ng to know more.
Q5. Suppose you ran logistic regression twice, once with regularization parameter λ=0, and once with λ=1. One of the times, you got weight parameters w=[26.29 65.41], and the other time you got w=[2.75 1.32]. However, you forgot which value of λ corresponds to which value of w. Which one do you think corresponds to λ=1?
w=[26.29 65.41]
w=[2.75 1.32]
Answer: 2 Explanation: λ=0 means no regularization is used, whereas λ=1 means regularization is used. Since regularization results in weight shrinkage, the run without regularization ends up with larger weights than the run with regularization, so w=[2.75 1.32] corresponds to λ=1.
Q6. What is the value of the sigmoid activation function (let’s denote it by g(z)) at an input value of z = 0?
0
0.5
−∞
+∞
Answer: 2 Explanation: The sigmoid function is given by g(z) = 1/(1 + exp(−z)), so at an input value of z = 0 it outputs g(0) = 1/(1 + 1) = 0.5. Refer to this beautiful explanation by Andrew Ng to know more.
Q7. Suppose you have built a neural network having 1 input, 1 hidden and 1 output layer. You decide to initialize the weights and biases to be zero. Which of the following statements is true?
Each neuron in the hidden layer will perform the same computation in the first iteration. But after one iteration of gradient descent they will learn to compute different things because we have “broken symmetry”.
The hidden layer’s neurons will perform different computations from each other even in the first iteration; their parameters will thus keep evolving in their own way.
Each neuron in the hidden layer will perform the same computation. So even after multiple iterations of gradient descent each neuron in the layer will be computing the same thing as other neurons.
Answer: 3 Explanation: By initializing the weights and biases to 0, each neuron in the hidden layer performs the same computation and receives the same gradient, so even after multiple iterations of gradient descent every neuron in the layer keeps computing the same thing as the others. Refer to this beautiful explanation by Andrew Ng to know more.
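A small sketch (hypothetical 2-3-1 network with sigmoid units and a logistic loss) showing that with zero initialization every hidden unit computes the same activation and receives the same gradient, so symmetry is never broken:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2 inputs -> 3 hidden units -> 1 output, all parameters initialized to zero
W1, b1 = np.zeros((3, 2)), np.zeros((3, 1))
W2, b2 = np.zeros((1, 3)), np.zeros((1, 1))

x, y = np.random.randn(2, 1), np.array([[1.0]])

a1 = sigmoid(W1 @ x + b1)           # every hidden unit outputs the same value
a2 = sigmoid(W2 @ a1 + b2)

dz2 = a2 - y                        # output-layer gradient (logistic loss)
dW2 = dz2 @ a1.T                    # identical entries -> identical updates
dz1 = (W2.T @ dz2) * a1 * (1 - a1)  # zero, because W2 is zero
dW1 = dz1 @ x.T                     # identical rows (all zeros here)

print(a1.ravel())  # [0.5 0.5 0.5] -- all hidden units identical
print(dW1)         # identical rows, so the units stay identical after the update
```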
Q8. Suppose we have a neural network having 10 nodes in the input layer, 5 nodes in the hidden layer and 1 node in the output layer. What will be the dimension of b1 (first layer bias) and b2 (second layer bias)?
b1:5×1, b2:1×1
b1:1×10, b2:1×5
b1:1×5, b2:5×10
b1:5×10, b2:1×5
Answer: 1 Explanation: Generally, the bias dimensions for a layer are (number of nodes in that layer × 1), so the answer is b1:5×1, b2:1×1. Refer to this beautiful explanation by Andrew Ng to know more.
Q1. Which of the following is a good choice for image related tasks such as Image classification or object detection?
Multilayer Perceptron (MLP)
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
All of the above
Answer: 2 Explanation: A Convolutional Neural Network (CNN) is a good choice for image-related tasks such as image classification or object detection. There are two main reasons for this. The first is parameter sharing: a feature detector that is useful in one part of an image is probably useful in another part of the same image, and because of this a CNN has far fewer parameters. The second is sparsity of connections: in each layer, each output value depends only on a small number of inputs (equal to the filter size).
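As a rough illustration of the parameter-sharing point (hypothetical layer sizes), compare the parameter counts of a fully connected layer and a convolutional layer on a 32x32x3 input:

```python
# Fully connected layer: 32*32*3 inputs -> 1000 hidden units
fc_params = (32 * 32 * 3) * 1000 + 1000  # weights + biases
# Convolutional layer: 100 filters of size 5x5x3 (weights shared across the image)
conv_params = (5 * 5 * 3 + 1) * 100      # each filter: 75 weights + 1 bias
print(fc_params)    # 3073000 parameters
print(conv_params)  # 7600 parameters
```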
Q2. Which of the following statements is correct?
RMSprop divides the learning rate by an exponentially decaying average of squared gradients
RMSprop divides the learning rate by an exponentially increasing average of squared gradients
RMSprop has a constant learning rate
RMSprop decays the learning rate by a constant value
Answer: 1 Explanation: The weight update equation in RMSprop is w = w − α*dW/(sqrt(S_dW) + ε), where S_dW is an exponentially weighted (decaying) average of the squared gradients. Thus, RMSprop divides the learning rate by an exponentially decaying average of squared gradients. Refer to this beautiful explanation by Andrew Ng to know more.
Q3. _____ is a type of gradient descent which processes 1 training example per iteration?
Stochastic Gradient Descent
Batch Gradient Descent
Mini-batch Gradient Descent
None of the above.
Answer: 1 Explanation: Stochastic Gradient Descent processes 1 training example per iteration of gradient descent.
Q4. Suppose you have trained a cat classifier on 10 million cat images and it is performing well in the live environment. Now, in the live environment, you have encountered a new cat species, and because of that your deployed model has started degrading. You have only 1000 images of the newly identified cat species. Which of the following steps should you take first?
Put all 1000 images in the training set and start training asap
Try data augmentation on these 1000 images to get more data
Split the 1000 images into train/test set and start the training
Use the data you have to define a new evaluation metric (using a new dev/test set) taking into account the new species, and use that to drive further progress with the model
Answer: 4 Explanation: Because we have very little data for the new cat species (1000 images) compared to the original 10 million, putting these 1000 images into training or splitting them will not make any difference. Also, augmentation will not increase the dataset to that extent (10 million). So the only option left is to build a new evaluation metric and penalize the model more for making false predictions on the new species.
Q5. Which of the following is an example of supervised learning?
Given the data of house prices and house sizes, predict house price as a function of house size
Given 50 spam and 50 non-spam emails, predict whether the new email is spam/non-spam
Given the data consisting of 1000 images of cats and dogs each, we need to classify to which class the new image belongs
All of the above
Answer: 4 Explanation: Because for each of the above options we have the correct answer/label, all of these are examples of supervised learning.
Q6. Which of the following is True for Structured Data?
Structured Data has clear, definable relationships between the data points, with a pre-defined model containing it
Structured data is quantitative, highly organized, and each feature has a well-defined meaning
Structured data is generally contained in relational databases (RDBMS)
All of the above
Answer: 4 Explanation: All of the above are true for structured data. Refer to this link to know more.
Q7. You have built a network using the sigmoid activation for all the hidden units. You initialize the weights to relatively large values, using np.random.randn(..,..)*10000. What will happen?
This will cause the inputs to the sigmoid to be very large, causing the units to be “highly activated” and thus speed up learning compared to if the weights had to start from small values
It doesn’t matter as long as you initialize the weights randomly gradient descent is not affected by whether the weights are large or small
This will cause the inputs to the sigmoid to be very large, thus causing gradients to also become large. You therefore have to set α to be very small to prevent divergence; this will slow down learning
This will cause the inputs to the sigmoid to be very large, thus causing gradients to be close to zero, which slows down learning
Answer: 4 Explanation: When we initialize the weights to very large values, the input to the sigmoid function (computed as z = w*x + b) will also become very large. For such large inputs the sigmoid curve is quite flat, so the gradients will be close to 0, which slows down gradient descent and hence learning.
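A quick sketch of why learning slows down: the sigmoid derivative g(z)*(1 − g(z)) collapses toward zero for large z (the values in the comments are approximate).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# With very large weights, z = w*x + b lands far out on the flat tails of the sigmoid
for z in [0.0, 5.0, 50.0]:
    grad = sigmoid(z) * (1 - sigmoid(z))  # derivative of the sigmoid at z
    print(z, grad)
# z = 0.0  -> grad = 0.25
# z = 5.0  -> grad ~ 0.0066
# z = 50.0 -> grad ~ 2e-22, so the gradient has essentially vanished
```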
Q8. Suppose you are working on a cat classifier and have been asked to consider three different metrics: 1. accuracy, 2. inference time, and 3. memory size. What will you say about the following statement: "Having three evaluation metrics will make it easier for you to quickly choose between two different algorithms, and your team can work faster."
True
False
Answer: 2 Explanation: It is always good to have a single real-number evaluation metric. If you have more than one evaluation metric, it becomes very difficult to assess performance. For instance, if in one case precision and recall are 60% and 40% while in another case they are 30% and 70%, it is a tedious task to judge which one is better. That is why we have the F1 score, as it combines precision and recall into one metric.
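A tiny sketch of the F1 score mentioned above (the harmonic mean of precision and recall), applied to the two example cases:

```python
def f1(precision, recall):
    # Harmonic mean: punishes a large gap between precision and recall
    return 2 * precision * recall / (precision + recall)

print(f1(0.60, 0.40))  # 0.48
print(f1(0.30, 0.70))  # 0.42 -> the first classifier wins on the single metric
```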
Q1. Suppose you have a dataset of 10 million examples, and it would take 2 weeks to train your model on it. Which of the following statements do you most agree with?
If you have already trained a model on a different dataset and it is performing well with 98% dev accuracy on that dataset, you just use that model instead of training on the current dataset for two weeks
If 10 million examples are enough to build a good model, you might be better off training with just 1 million examples to gain a 10x improvement in how quickly you can run experiments, even if each model performs a bit worse because it is trained on less data
You will go with the complete dataset and run the model for two weeks to see the first results
All of the above
Answer: 2 Explanation: In machine learning, the best approach is to build an initial model quickly using a random subset of the data and then use bias/variance analysis and error analysis to prioritize the next steps.
Q2. In a Multi-Layer Perceptron (MLP), each node is connected to all the nodes of the previous layer?
True
False
Answer: 1 Explanation: Since a Multi-Layer Perceptron (MLP) is a Fully Connected Network, each node in one layer connects with a certain weight to every node in the following layer.
Q3. Identify the following activation function: g(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z))?
Tanh activation function
Sigmoid activation function
ReLU activation function
Leaky ReLU activation function
Answer: 1 Explanation: This is the tanh activation function. Similar to the sigmoid, the tanh function is continuous and differentiable at all points; the main difference is that it is symmetric around the origin. Refer to this beautiful explanation by Andrew Ng to know more.
Q4. Suppose we have a neural network having 10 nodes in the input layer, 5 nodes in the hidden layer and 1 node in the output layer. What will be the dimension of W1 (first layer weights) and W2 (second layer weights)?
W1:5×1, W2:1×1
W1:1×10, W2:1×5
W1:1×5, W2:5×10
W1:5×10, W2:1×5
Answer: 4 Explanation: Generally, the weight dimensions for a layer are (nodes in the current layer × nodes in the previous layer), so the answer is W1:5×10, W2:1×5. Refer to this beautiful explanation by Andrew Ng to know more.
Q5. In dropout, what will happen if we increase the keep probability (keep_prob) from (say) 0.5 to 0.8?
Reducing the regularization effect.
Causing the neural network to end up with a lower training set error.
Both of the above.
None of the above.
Answer: 3 Explanation: Increasing the keep probability means fewer units are dropped during training, so the regularization effect becomes weaker and the network fits the training set more closely, ending up with a lower training set error.
Q6. Finding good hyperparameter values is very time-consuming. So typically you should do it once at the start of the project, and try to find very good hyperparameters so that you don’t ever have to revisit tuning them again.
True
False
Answer: 2 Explanation: You can’t really know beforehand which set of hyperparameters will work best for your case. You need to follow the iterative process of Idea -> Code -> Experiment and be ready to revisit the hyperparameters over time.
Q7. In a deep neural network, what is the general rule for the dimensions of the weights and biases of layer l, where n[l] is the number of units in layer l?
w[l] : (n[l], n[l]) b[l] : (n[l], 1)
w[l] : (n[l+1], n[l]) b[l] : (n[l-1], 1)
w[l] : (n[l], n[l-1]) b[l] : (n[l], 1)
w[l] : (n[l], n[l-1]) b[l] : (n[l-1], 1)
Answer: 3 Explanation: The dimensions of weights of layer l is given by (n[l], n[l-1]) and biases is given by (n[l], 1). Refer to this beautiful explanation by Andrew Ng to know more.
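A short sketch applying this rule, using the 10-5-1 network from the earlier questions as hypothetical layer sizes:

```python
import numpy as np

layer_sizes = [10, 5, 1]  # n[0] = 10 inputs, n[1] = 5 hidden units, n[2] = 1 output

params = {}
for l in range(1, len(layer_sizes)):
    # W[l] : (n[l], n[l-1]),  b[l] : (n[l], 1)
    params[f"W{l}"] = np.random.randn(layer_sizes[l], layer_sizes[l - 1]) * 0.01
    params[f"b{l}"] = np.zeros((layer_sizes[l], 1))

print(params["W1"].shape, params["b1"].shape)  # (5, 10) (5, 1)
print(params["W2"].shape, params["b2"].shape)  # (1, 5) (1, 1)
```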
Q8. Which of the following methods can be used for hyperparameter tuning?
Random Search
Grid Search
Bayesian optimization
All of the above.
Answer: 4 Explanation: All of the above methods can be used for hyperparameter tuning.
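A minimal random-search sketch, assuming a hypothetical train_and_evaluate function that returns a dev-set score; the learning rate is sampled on a log scale, which is common practice:

```python
import numpy as np

def random_search(train_and_evaluate, n_trials=20):
    best_score, best_cfg = -np.inf, None
    for _ in range(n_trials):
        cfg = {
            "learning_rate": 10 ** np.random.uniform(-4, -1),  # log-uniform in [1e-4, 1e-1]
            "batch_size": int(np.random.choice([32, 64, 128])),
        }
        score = train_and_evaluate(cfg)  # e.g. dev-set accuracy for this configuration
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```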