Category Archives: Machine Learning Quiz

Machine Learning Quiz-5

Q1. The optimizer is an important part of training neural networks. Which of the following is not a purpose of using optimizers?

  1. Speed up algorithm convergence
  2. Reduce the difficulty of manual parameter setting
  3. Avoid overfitting
  4. Avoid local extrema

Answer: 3
Explanation: To avoid overfitting, we use regularization and not optimizers.

Q2. Which of the following is not a regularization technique used in machine learning?

  1. L1 regularization
  2. R-square
  3. L2 regularization
  4. Dropout

Answer: 2
Explanation: Of the options above, R-squared is not a regularization technique; it is a statistical measure of how close the data are to the fitted regression line.
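
For intuition, here is a minimal numpy sketch of R-squared computed from the residual and total sums of squares (the data values are made up for illustration):

    import numpy as np

    y_true = np.array([3.0, 5.0, 7.0, 9.0])           # observed values
    y_pred = np.array([2.8, 5.3, 6.9, 9.2])           # values on the fitted regression line

    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    r_squared = 1 - ss_res / ss_tot                   # closer to 1 = closer to the fitted line
    print(r_squared)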

Q3. Which of the following are hyperparameters in the context of deep learning?

  1. Learning Rate, α
  2. Momentum parameter, β1
  3. Number of units in a layer
  4. All of the above

Answer: 4
Explanation: According to Wikipedia, “In machine learning, a hyperparameter is a parameter whose value is used to control the learning process”. So, all of the above are hyperparameters.

Q4. Which of the following statements is not true with respect to batch normalization?

  1. Batch normalization helps in decreasing training time
  2. Batch normalization adds a slight regularization effect
  3. After using batch normalization there is no need to use dropout
  4. Batch normalization helps in reducing the covariate shift

Answer: 3
Explanation: Although batch normalization has a slight regularization effect, that is not why we use it. It is used to make the neural network more robust (by reducing covariate shift) and easier to train, while dropout is used for regularization (reducing overfitting). So, the third statement is not true.
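
For intuition, here is a minimal numpy sketch of the batch normalization transform applied to a batch of pre-activations (the epsilon value and the gamma/beta defaults are illustrative assumptions):

    import numpy as np

    def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
        # normalize each unit's pre-activations over the batch dimension
        mu = z.mean(axis=0)
        var = z.var(axis=0)
        z_norm = (z - mu) / np.sqrt(var + eps)
        return gamma * z_norm + beta  # learnable gamma/beta can undo normalization if needed

    z = np.random.randn(32, 5) * 4 + 10  # batch of 32 examples, 5 units, shifted and scaled
    print(batch_norm(z).mean(axis=0))    # ~0 per unit
    print(batch_norm(z).std(axis=0))     # ~1 per unit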

Q5. In a machine learning project, modelling is an iterative process but deployment is not.

  1. True
  2. False

Answer: 2
Explanation: Deployment is an iterative process, where you should expect to make multiple adjustments (such as metrics monitored using dashboards or percentage of traffic served) to work towards optimizing the system.

Q6. Which of the following activation functions works better for hidden layers?

  1. Sigmoid
  2. Tanh

Answer: 2
Explanation: The Tanh activation function usually works better than sigmoid activation function for hidden units because the mean of its output is closer to zero, so it centers the data better for the next layer and the gradients are not restricted to move in a certain direction.
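
A quick numpy check of this claim (a sketch with randomly generated inputs): tanh outputs are roughly zero-centered while sigmoid outputs are not.

    import numpy as np

    z = np.random.randn(10000)      # zero-mean pre-activations
    sigmoid = 1 / (1 + np.exp(-z))
    tanh = np.tanh(z)

    print(sigmoid.mean())  # ~0.5: shifted away from zero
    print(tanh.mean())     # ~0.0: centered, which helps the next layer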

Q7. The softmax function is used to calculate the probability distribution over a discrete variable with n possible values.

  1. True
  2. False

Answer: 1
Explanation: The softmax function is used to calculate the probability distribution over a discrete variable with n possible values. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.
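
As a minimal sketch (subtracting the max is a numerical-stability detail assumed here), softmax in numpy, with a check that the n=2 case reduces to the sigmoid:

    import numpy as np

    def softmax(z):
        # probability distribution over n possible values
        e = np.exp(z - np.max(z))  # subtract max for numerical stability
        return e / e.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1
    # With n=2, softmax([z, 0])[0] equals sigmoid(z):
    print(softmax(np.array([1.5, 0.0]))[0], 1 / (1 + np.exp(-1.5)))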

Q8. Let's say you want to use transfer learning from task A to task B. Which of the following scenarios would support using transfer learning?

  1. Tasks A and B have the same input x
  2. You have a lot more data for task A than for task B
  3. Low-level features from task A could be helpful for learning task B
  4. All of the above

Answer: 4
Explanation: All of the conditions mentioned above support performing transfer learning. Refer to this beautiful explanation by Andrew Ng to know more.
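
As a sketch of how this typically looks in practice (using tf.keras with MobileNetV2 purely as an example; the input shape, layers and dataset names are assumptions, not part of the quiz):

    import tensorflow as tf

    # Task A: ImageNet classification (lots of data). Task B: a small binary task.
    base = tf.keras.applications.MobileNetV2(weights="imagenet",
                                             include_top=False,
                                             input_shape=(160, 160, 3))
    base.trainable = False  # reuse low-level features learned on task A

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # new head trained for task B
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(task_b_images, task_b_labels, epochs=5)  # hypothetical task B data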

Machine Learning Quiz-4

Q1. Which of the following is an example of unstructured data?

  1. Audio
  2. Images
  3. Text
  4. All of the above

Answer: 4
Explanation: All of these are examples of unstructured data. Refer to this link to know more.

Q2. Which of the following is a model-centric AI development?

  1. Hold the data fixed and iteratively improve the code/model
  2. Hold the code/model fixed and iteratively improve the data

Answer: 1
Explanation: As is clear from the name, in model-centric AI development we hold the data fixed and iteratively improve the code/model.

Q3. What is Semi-Supervised Learning?

  1. where for each example we have the correct answer/label and we infer a mapping function from these examples
  2. where for each example we don’t have the correct answer/label and we try to find some sort of structure or pattern in the dataset
  3. where for some examples we have the correct answer/label while for others we don’t have correct answer/label

Answer: 3
Explanation: As is clear from the name, in semi-supervised learning we have the correct answer/label for some examples but not for others. Because nowadays we are able to collect huge amounts of data, and labelling all of it takes enormous effort, the focus is shifting towards semi-supervised learning. A closely related paradigm is self-supervised learning, where the data is unlabelled but the data itself provides the context that makes up the labels. For instance, the CBOW model for creating word embeddings.

Q4. Which of the following is a reason to use non-linear activation functions in neural networks?

  1. If you use only linear activation functions, then no matter how many layers you use, it will be the same as not using any hidden layers
  2. A hidden layer with linear activation functions is of no use as it does not add any non-linearity to the network, so the network will not be able to learn complex functions
  3. Stacking n hidden layers with linear activation functions just collapses into another linear function
  4. All of the above

Answer: 4
Explanation: All of the above are possible reasons. Refer to this beautiful explanation by Andrew Ng to know more.
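
A small numpy demonstration of the third point (shapes chosen only for illustration): two stacked linear layers compute exactly the same function as one linear layer.

    import numpy as np

    np.random.seed(0)
    x = np.random.randn(10, 4)  # batch of 10 inputs
    W1 = np.random.randn(4, 6)  # "hidden" layer with a linear activation
    W2 = np.random.randn(6, 3)  # output layer

    two_layers = (x @ W1) @ W2  # no non-linearity in between
    one_layer = x @ (W1 @ W2)   # a single equivalent linear layer

    print(np.allclose(two_layers, one_layer))  # True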

Q5. Which of the following activation functions can be used in neural network?

  1. ReLU
  2. Tanh
  3. Sigmoid
  4. All of the above

Answer: 4
Explanation: All of the above activation functions can be used in neural networks. Refer to this beautiful explanation by Andrew Ng to know more.

Q6. RMSprop resolves a limitation of the AdaGrad optimizer.

  1. True
  2. False

Answer: 1
Explanation: RMSprop divides the learning rate by exponentially decaying average of squared gradients whereas AdaGrad divides the learning rate by sum of squared gradients. This in turn causes the learning rate to shrink in AdaGrad and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge. Refer to this link to know more.
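
For concreteness, here is a minimal numpy sketch of the two update rules side by side (hyperparameter values are illustrative):

    import numpy as np

    def adagrad_step(w, dw, cache, lr=0.01, eps=1e-8):
        cache += dw ** 2                             # sum of squared gradients: only grows
        return w - lr * dw / (np.sqrt(cache) + eps), cache

    def rmsprop_step(w, dw, cache, lr=0.01, beta=0.9, eps=1e-8):
        cache = beta * cache + (1 - beta) * dw ** 2  # exponentially decaying average
        return w - lr * dw / (np.sqrt(cache) + eps), cache

Because AdaGrad's cache only grows, its effective step size keeps shrinking, while RMSprop's decaying average keeps the step size from collapsing.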

Q7. If you increase the value of lambda (the regularization parameter), then the model will always perform better, as it helps in reducing the overfitting of the model.

  1. True
  2. False

Answer: 2
Explanation: As we increase the regularization hyperparameter lambda, the weights start becoming smaller. This can also be verified from the weight update equation of gradient descent with L2 regularization, which is w = w*(1 - α*λ/m) - α*dLoss/dw. So, as you increase λ to a very high value, the weights become closer to 0. This leads to a model that is too simple and ends up underfitting the data, thus decreasing the performance of the model. Refer to this beautiful explanation by Andrew Ng to know more.
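
The same update written out as a numpy sketch (the learning rate, λ and m values are made up for illustration):

    import numpy as np

    alpha, lam, m = 0.1, 10.0, 100   # learning rate, regularization strength, batch size
    w = np.array([2.0, -3.0])
    dloss_dw = np.array([0.5, 0.2])  # gradient of the unregularized loss

    # L2-regularized gradient descent: the (1 - alpha*lam/m) factor shrinks the weights
    w = w * (1 - alpha * lam / m) - alpha * dloss_dw
    print(w)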

Q8. What is multi-task learning in deep learning?

  1. Train n different neural networks to learn n tasks
  2. Train a single neural network to learn n tasks simultaneously

Answer: 2
Explanation: In multi-task learning, we train a single neural network to learn n tasks simultaneously. For instance, a self-driving car has to detect pedestrians, cars, traffic lights, etc.
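
As a sketch of what this looks like (tf.keras functional API; the input size, layer sizes and head names are illustrative assumptions): one shared trunk with a separate output head per task.

    import tensorflow as tf

    inputs = tf.keras.Input(shape=(128,))
    shared = tf.keras.layers.Dense(64, activation="relu")(inputs)  # shared representation

    pedestrian = tf.keras.layers.Dense(1, activation="sigmoid", name="pedestrian")(shared)
    car = tf.keras.layers.Dense(1, activation="sigmoid", name="car")(shared)
    light = tf.keras.layers.Dense(1, activation="sigmoid", name="traffic_light")(shared)

    model = tf.keras.Model(inputs, [pedestrian, car, light])
    model.compile(optimizer="adam", loss="binary_crossentropy")  # one loss per head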

Machine Learning Quiz-3

Q1. In neural networks, where do we apply batch normalization?

  1. Before applying activation function
  2. After applying activation function

Answer: 1
Explanation: We generally apply batch normalization before applying the activation function. Refer to this beautiful explanation by Andrew Ng to know more.

Q2. In Mini-batch gradient descent, if the mini-batch size is set equal to the training set size it will become Stochastic gradient descent, and if the mini-batch size is set equal to 1 training example it will become Batch gradient descent.

  1. True
  2. False

Answer: 2
Explanation: It is actually opposite. In Mini-batch gradient descent, if the mini-batch size is set equal to training set size it will become Batch gradient descent and if the mini-batch size is set equal to 1 training example it will become Stochastic gradient descent.
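
A minimal sketch of how the mini-batch size selects the variant (data and sizes are made up):

    import numpy as np

    m = 1024                  # training set size
    X = np.random.randn(m, 3)

    batch_size = 64           # mini-batch gradient descent
    # batch_size = m          # -> batch gradient descent
    # batch_size = 1          # -> stochastic gradient descent

    for start in range(0, m, batch_size):
        mini_batch = X[start:start + batch_size]
        # ...compute gradients on mini_batch and update the weights...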

Q3. If we have enough computational power, it would be wiser to train multiple models in parallel and then choose the best one instead of babysitting a single model.

  1. True
  2. False

Answer: 1
Explanation: In deep learning, there is no general rule for finding the best set of hyperparameters for a given task. So, one needs to follow the iterative process of Idea -> Code -> Experiment, and being able to try out different ideas quickly is better suited to this than babysitting a single model.

Q4. Vectorization allows you to compute forward propagation in an L-layer neural network without an explicit for-loop (or any other explicit iterative loop) over the layers l = 1, 2, …, L.

  1. True
  2. False

Answer: 2
Explanation: Vectorization removes explicit loops over training examples and units within a layer, but we still cannot avoid the for-loop over the layers themselves. Refer to this beautiful explanation by Andrew Ng to know more.

Q5. Suppose you ran logistic regression twice, once with regularization parameter λ=0, and once with λ=1. One of the times, you got weight parameters w=[26.29 65.41], and the other time you got w=[2.75 1.32]. However, you forgot which value of λ corresponds to which value of w. Which one do you think corresponds to λ=1?

  1. w=[26.29 65.41]
  2. w=[2.75 1.32]

Answer: 2
Explanation: λ=0 means no regularization is used, whereas λ=1 means regularization is used. And as we know that regularization results in weight shrinkage, without regularization you will get larger weights than with it.

Q6. What is the value of the Sigmoid activation function (let's denote it by g(z)) at an input value of z=0?

  1. 0
  2. 0.5
  3. -∞
  4. +∞

Answer: 2
Explanation: As we know, the sigmoid is given by g(z) = 1/(1 + exp(-z)), so at an input value of z=0 it outputs 1/(1+1) = 0.5. Refer to this beautiful explanation by Andrew Ng to know more.

Q7. Suppose you have built a neural network having 1 input, 1 hidden and 1 output layer. You decide to initialize the weights and biases to be zero. Which of the following statements is true?

  1. Each neuron in the hidden layer will perform the same computation in the first iteration. But after one iteration of gradient descent they will learn to compute different things because we have “broken symmetry”.
  2. The hidden layer’s neurons will perform different computations from each other even in the first iteration; their parameters will thus keep evolving in their own way.
  3. Each neuron in the hidden layer will perform the same computation. So even after multiple iterations of gradient descent each neuron in the layer will be computing the same thing as other neurons.

Answer: 3
Explanation: By initializing the weights and biases to 0, each neuron in the hidden layer will perform the same computation. So even after multiple iterations of gradient descent, each neuron in the layer will be computing the same thing as the other neurons. Refer to this beautiful explanation by Andrew Ng to know more.
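
A small numpy sketch of why (a 4-input, 3-hidden-unit layer assumed for illustration): with all-zero parameters, every hidden unit computes the same activation, so each receives identical gradients and they stay identical.

    import numpy as np

    x = np.random.randn(4, 1)  # one input example
    W1 = np.zeros((3, 4))      # 3 hidden units, all initialized to zero
    b1 = np.zeros((3, 1))

    a1 = np.tanh(W1 @ x + b1)
    print(a1.ravel())          # all hidden activations identical (here all 0)
    # identical activations -> identical gradients -> identical updates, every iteration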

Q8. Suppose we have a neural network having 10 nodes in the input layer, 5 nodes in the hidden layer and 1 node in the output layer. What will be the dimension of b1 (first layer bias) and b2 (second layer bias)?

  1. b1:5×1, b2:1×1
  2. b1:1×10, b2:1×5
  3. b1:1×5, b2:5×10
  4. b1:5×10, b2:1×5

Answer: 1
Explanation: Generally, the bias dimension for a layer is (number of nodes in that layer × 1), so the answer is b1: 5×1, b2: 1×1. Refer to this beautiful explanation by Andrew Ng to know more.

Machine Learning Quiz-2

Q1. Which of the following is a good choice for image-related tasks such as image classification or object detection?

  1. Multilayer Perceptron (MLP)
  2. Convolutional Neural Network (CNN)
  3. Recurrent Neural Network (RNN)
  4. All of the above

Answer: 2
Explanation: A Convolutional Neural Network (CNN) is a good choice for image-related tasks such as image classification or object detection. There are two main reasons for this. The first is parameter sharing: a feature detector that is useful in one part of an image is probably useful in another part of the same image, and because of this a CNN has fewer parameters. The second is sparsity of connections: in each layer, each output value depends only on a small number of inputs (equal to the filter size).
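
A quick parameter count illustrating parameter sharing (layer sizes are made up): a 3x3 convolutional filter is reused across the whole image, while a dense layer needs one weight per input-output pair.

    # Dense: a 32x32x3 image flattened to 3072 inputs, feeding 100 hidden units
    dense_params = 3072 * 100 + 100     # weights + biases = 307,300

    # Conv: 16 filters of size 3x3x3, each shared across all image positions
    conv_params = 16 * (3 * 3 * 3 + 1)  # weights + biases = 448

    print(dense_params, conv_params)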

Q2. Which of the following statements is correct?

  1. RMSprop divides the learning rate by an exponentially decaying average of squared gradients
  2. RMSprop divides the learning rate by an exponentially increasing average of squared gradients
  3. RMSprop has a constant learning rate
  4. RMSprop decays the learning rate by a constant value

Answer: 1
Explanation: The weight update equation in RMSprop is given by w = w - α*dw/(Sdw + ε)^0.5, where Sdw is an exponentially weighted (decaying) average of the squared gradients. Thus, RMSprop divides the learning rate by an exponentially decaying average of squared gradients. Refer to this beautiful explanation by Andrew Ng to know more.

Q3. _____ is a type of gradient descent which processes 1 training example per iteration.

  1. Stochastic Gradient Descent
  2. Batch Gradient Descent
  3. Mini-batch Gradient Descent
  4. None of the above.

Answer: 1
Explanation: Stochastic Gradient Descent processes 1 training example per iteration of gradient descent.

Q4. Let's say you have trained a cat classifier on 10 million cat images and it is performing well in the live environment. Now, in the live environment, you have encountered a new cat species, and because of that your deployed model has started degrading. You have only 1000 images of the newly identified cat species. Which of the following steps should you take first?

  1. Put all 1000 images in the training set and start training asap
  2. Try data augmentation on these 1000 images to get more data
  3. Split the 1000 images into train/test set and start the training
  4. Use the data you have to define a new evaluation metric (using a new dev/test set) taking into account the new species, and use that to drive further progress with the model

Answer: 4
Explanation: Because we have very little data for the new cat species (1000 images) compared to the 10 million, putting these 1000 images in the training set or splitting them into train/test sets will not make any difference. Also, augmentation will not be able to grow the dataset to that extent (10 million). So the only option left is to build a new evaluation metric and penalize the model more for making false predictions on the new species.

Q5. Which of the following is an example of supervised learning?

  1. Given the data of house prices and house sizes, predict house price as a function of house size
  2. Given 50 spam and 50 non-spam emails, predict whether the new email is spam/non-spam
  3. Given the data consisting of 1000 images of cats and dogs each, we need to classify to which class the new image belongs
  4. All of the above

Answer: 4
Explanation: Because for each of the above options we have the correct answer/label, all of these are examples of supervised learning.

Q6. Which of the following is True for Structured Data?

  1. Structured Data has clear, definable relationships between the data points, with a pre-defined model containing it
  2. Structured data is quantitative, highly organized, and each feature has a well-defined meaning
  3. Structured data is generally contained in relational databases (RDBMS)
  4. All of the above

Answer: 4
Explanation: All of the above is True for Structured Data. Refer to this link to know more.

Q7. You have built a network using the sigmoid activation for all the hidden units. You initialize the weights to relatively large values, using np.random.randn(..,..)*10000. What will happen?

  1. This will cause the inputs to the sigmoid to be very large, causing the units to be “highly activated” and thus speed up learning compared to if the weights had to start from small values
  2. It doesn’t matter as long as you initialize the weights randomly gradient descent is not affected by whether the weights are large or small
  3. This will cause the inputs to the sigmoid to be very large, thus causing gradients to also become large. You therefore have to set α to be very small to prevent divergence; this will slow down learning
  4. This will cause the inputs to the sigmoid to be very large, thus causing gradients to be close to zero and slows down the learning

Answer: 4
Explanation: When we initialize the weights to very large values, the input to the sigmoid function (calculated as z = w*x + b) will also become very large. As we know, for large inputs the sigmoid curve is quite flat, so the gradients will be close to 0, which slows down gradient descent (learning).
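
A numpy check of this effect (the weight scale is copied from the question; numpy may warn about overflow, which is the saturation in action):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    w = np.random.randn(100) * 10000  # very large initial weights
    x = np.random.randn(100)
    z = w * x                         # huge |z| values

    grad = sigmoid(z) * (1 - sigmoid(z))  # derivative of the sigmoid
    print(grad.max())                     # ~0: the curve is flat, so learning stalls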

Q8. Let's say you are working on a cat classifier and have been asked to work with three different metrics: 1. accuracy, 2. inference time, and 3. memory size. What will you say about the following statement: “Having three evaluation metrics will make it easier for you to quickly choose between two different algorithms, and your team can work faster.”

  1. True
  2. False

Answer: 2
Explanation: It is always good to have a single real-number evaluation metric. If you have more than one evaluation metric, it becomes very difficult to assess performance. For instance, if in one case the precision and recall are 60% and 40%, while in another case they are 30% and 70%, it is a tedious task to judge which one is better. That's why we have the F1 score, as it combines precision and recall into one metric.
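
The F1 score from the explanation, as a small sketch using the numbers above:

    def f1(precision, recall):
        # harmonic mean: punishes an imbalance between precision and recall
        return 2 * precision * recall / (precision + recall)

    print(f1(0.60, 0.40))  # case 1 -> 0.48
    print(f1(0.30, 0.70))  # case 2 -> 0.42, so case 1 wins on the single metric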

Machine Learning Quiz-1

Q1. Let's say you have a dataset of 10 million examples, and it would take 2 weeks to train your model on it. Which of the following statements do you most agree with?

  1. If you have already trained a model on a different dataset and it is performing well with 98% dev accuracy on that dataset, just use that model instead of training on the current dataset for two weeks
  2. If 10 million examples are enough to build a good model, you might be better off training with just 1 million examples to gain a 10x improvement in how quickly you can run experiments, even if each model performs a bit worse because it's trained on less data
  3. Go with the complete dataset and run the model for two weeks to see the first results
  4. All of the above

Answer: 2
Explanation: In machine learning, the best approach is to build an initial model quickly using a random subset of the data and then use bias/variance analysis and error analysis to prioritize the next steps.

Q2. In a Multi-layer Perceptron (MLP), each node is connected to all the nodes of the previous layer.

  1. True
  2. False

Answer: 1
Explanation: Since a Multi-Layer Perceptron (MLP) is a Fully Connected Network, each node in one layer connects with a certain weight to every node in the following layer.

Q3. Identify the following activation function: g(z) = (exp(z) - exp(-z))/(exp(z) + exp(-z))

  1. Tanh activation function
  2. Sigmoid activation function
  3. ReLU activation function
  4. Leaky ReLU activation function

Answer: 1
Explanation: This is the Tanh activation function. Similar to the sigmoid, the tanh function is continuous and differentiable at all points; the only difference is that it is symmetric around the origin. Refer to this beautiful explanation by Andrew Ng to know more.

Q4. Suppose we have a neural network having 10 nodes in the input layer, 5 nodes in the hidden layer and 1 node in the output layer. What will be the dimension of W1 (first layer weights) and W2 (second layer weights)?

  1. W1:5×1, W2:1×1
  2. W1:1×10, W2:1×5
  3. W1:1×5, W2:5×10
  4. W1:5×10, W2:1×5

Answer: 4
Explanation: Generally, the weight dimensions for a layer are (number of nodes in that layer × number of nodes in the previous layer), so the answer is W1: 5×10, W2: 1×5. Refer to this beautiful explanation by Andrew Ng to know more.

Q5. In Dropout, what will happen if we increase the keep probability (keep_prob) from (say) 0.5 to 0.8?

  1. Reducing the regularization effect.
  2. Causing the neural network to end up with a lower training set error.
  3. Both of the above.
  4. None of the above.

Answer: 3
Explanation: Increasing keep_prob from 0.5 to 0.8 means each unit is kept with a higher probability, so fewer units are dropped. This reduces the regularization effect, and with less regularization the network fits the training set more closely, causing a lower training set error.
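
A minimal numpy sketch of inverted dropout (the formulation used in Andrew Ng's course), where a higher keep_prob means fewer dropped units and therefore weaker regularization:

    import numpy as np

    def dropout_forward(a, keep_prob=0.8):
        mask = np.random.rand(*a.shape) < keep_prob  # keep each unit with prob keep_prob
        a = a * mask
        return a / keep_prob  # inverted dropout: rescale to preserve expected activations

    a = np.random.randn(5, 4)
    print(dropout_forward(a, keep_prob=0.8))  # fewer zeros than with keep_prob=0.5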

Q6. Finding good hyperparameter values is very time-consuming. So typically you should do it once at the start of the project, and try to find very good hyperparameters so that you don’t ever have to revisit tuning them again.

  1. True
  2. False

Answer: 2
Explanation: You can't really know beforehand which set of hyperparameters will work best for your case. You need to follow the iterative process of Idea -> Code -> Experiment.

Q7. In a deep neural network, what is the general rule for the dimensions of the weights and biases of layer l? Here n[l] is the number of units in layer l.

  1. w[l] : (n[l], n[l])
    b[l] : (n[l], 1)
  2. w[l] : (n[l+1], n[l])
    b[l] : (n[l-1], 1)
  3. w[l] : (n[l], n[l-1])
    b[l] : (n[l], 1)
  4. w[l] : (n[l], n[l-1])
    b[l] : (n[l-1], 1)

Answer: 3
Explanation: The dimensions of weights of layer l is given by (n[l], n[l-1]) and biases is given by (n[l], 1). Refer to this beautiful explanation by Andrew Ng to know more.
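
A numpy sketch that checks this rule for the 10-5-1 network from the earlier questions:

    import numpy as np

    layer_sizes = [10, 5, 1]  # n[0]=10 inputs, n[1]=5 hidden units, n[2]=1 output unit
    params = {}
    for l in range(1, len(layer_sizes)):
        params[f"W{l}"] = np.random.randn(layer_sizes[l], layer_sizes[l - 1]) * 0.01
        params[f"b{l}"] = np.zeros((layer_sizes[l], 1))

    for name, p in params.items():
        print(name, p.shape)  # W1 (5, 10), b1 (5, 1), W2 (1, 5), b2 (1, 1)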

Q8. Which of the following method can be used for hyperparameter tuning?

  1. Random Search
  2. Grid Search
  3. Bayesian optimization
  4. All of the above.

Answer: 4
Explanation: All of the above methods can be used for hyperparameter tuning.
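
As a sketch of the first two methods using scikit-learn (the estimator and parameter grids are chosen only for illustration):

    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    from sklearn.svm import SVC

    param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

    grid = GridSearchCV(SVC(), param_grid, cv=5)                  # tries all 9 combinations
    rand = RandomizedSearchCV(SVC(), param_grid, n_iter=5, cv=5)  # samples 5 of them

    # grid.fit(X_train, y_train); print(grid.best_params_)  # hypothetical training data
    # rand.fit(X_train, y_train); print(rand.best_params_)

Bayesian optimization is not part of scikit-learn itself but is available in libraries such as scikit-optimize.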