Category Archives: Computer Vision Quiz

Computer Vision Quiz-5

Q1. For a multi-channel input feature map, we apply Max-pooling independently on each channel and then concatenate the results along the channel axis?

  1. True
  2. False

Answer: 1
Explanation: The max-pooling operation is applied independently to each channel, and the results are then concatenated along the channel axis to form the final output. Refer to this beautiful explanation by Andrew Ng to understand more.
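
As a minimal pure-Python sketch (the array sizes and values here are illustrative): 2×2 max-pooling with stride 2 is applied to each channel separately, and the pooled channels are then stacked back along the channel axis.

```python
def max_pool_2x2(channel):
    """Max-pool a single 2-D channel with a 2x2 window and stride 2."""
    h, w = len(channel), len(channel[0])
    return [[max(channel[i][j], channel[i][j + 1],
                 channel[i + 1][j], channel[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

def max_pool_multichannel(feature_map):
    """Pool each channel independently, then stack the results along the channel axis."""
    return [max_pool_2x2(ch) for ch in feature_map]

# Two 4x4 channels -> two 2x2 channels.
x = [
    [[1, 3, 2, 1], [4, 2, 0, 5], [6, 1, 3, 2], [0, 7, 1, 4]],
    [[9, 0, 1, 1], [2, 8, 3, 3], [5, 5, 2, 0], [4, 6, 1, 1]],
]
y = max_pool_multichannel(x)
print(y)  # [[[4, 5], [7, 4]], [[9, 3], [6, 2]]]
```

Note that the number of channels is unchanged; only the spatial size shrinks.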

Q2. A fully convolutional network can take an image of any size as input?

  1. True
  2. False

Answer: 1
Explanation: Because a fully convolutional network does not contain any fully connected (Dense) layers, it can take an image of any size as input.

Q3. In R-CNN, the bounding box loss is only calculated for positive samples (samples that contain classes present in the dataset)?

  1. True
  2. False

Answer: 1
Explanation: In R-CNN, the bounding box loss is calculated only for positive samples (samples that contain classes present in the dataset), as it makes no sense to fine-tune a bounding box that doesn't contain an object.

Q4. In the VGG16 model, all the Conv layers have the same padding and filter size, and downsampling is done by MaxPooling only?

  1. True
  2. False

Answer: 1
Explanation: Earlier models like AlexNet used large filter sizes in the initial layers, and downsampling was done either by max-pooling or by strided convolution. But in the VGG16 model, all the Conv layers use the same padding and filter size, and downsampling is done by MaxPooling only. So what have we gained by using, for instance, a stack of three 3×3 conv. layers instead of a single 7×7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters. Refer to Section 2.3 of this research paper to understand more.
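
The parameter arithmetic behind that claim can be checked directly. In this sketch, `C` is an illustrative channel count and biases are ignored: a stack of three 3×3 conv layers has 27C² weights, versus 49C² for a single 7×7 layer.

```python
def conv_params(f, in_ch, out_ch):
    """Weight count of a conv layer with f x f filters (biases ignored)."""
    return f * f * in_ch * out_ch

C = 64  # illustrative channel count
three_3x3 = 3 * conv_params(3, C, C)  # 27 * C^2
one_7x7 = conv_params(7, C, C)        # 49 * C^2
print(three_3x3, one_7x7)  # 110592 200704
```

Both configurations have the same 7×7 effective receptive field, yet the stacked version is cheaper and adds two extra non-linearities.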

Q5. 1×1 convolution can also help in decreasing the computation cost of a convolution operation?

  1. True
  2. False

Answer: 1
Explanation: 1×1 convolution can also help decrease the computational cost of a convolution operation. Refer to this beautiful explanation by Andrew Ng to understand more.
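
A rough sketch of the savings, counting only multiplications; the sizes below (a 28×28×192 input, 32 output channels, 5×5 filters, a 16-channel bottleneck) are illustrative assumptions, not from this post.

```python
def conv_mults(out_h, out_w, out_ch, f, in_ch):
    """Multiplications in a conv layer: one f*f*in_ch dot product per output value."""
    return out_h * out_w * out_ch * f * f * in_ch

# 5x5 conv applied directly on the 192-channel input:
direct = conv_mults(28, 28, 32, 5, 192)

# 1x1 conv down to 16 channels first, then the 5x5 conv:
bottleneck = (conv_mults(28, 28, 16, 1, 192)
              + conv_mults(28, 28, 32, 5, 16))

print(direct, bottleneck)  # 120422400 12443648
```

Here the 1×1 "bottleneck" cuts the multiplication count by roughly a factor of ten.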

Q6. Can we use Fully Convolutional Neural Networks for object detection?

  1. Yes
  2. No

Answer: 1
Explanation: Yes, a fully convolutional network can be used for object detection; YOLO is one example.

Q7. Which of the following networks can be used for object detection?

  1. Overfeat
  2. Faster RCNN
  3. YOLO
  4. All of the above

Answer: 4
Explanation: All of the above networks can be used for object detection. Faster RCNN belongs to the region-based methods, whereas YOLO and Overfeat belong to the sliding-window-based methods.

Q8. AlexNet was one of the first networks that used the ReLU activation function in the hidden layers instead of tanh/sigmoid (which were quite common at that time)?

  1. True
  2. False

Answer: 1
Explanation: This was one of the revolutionary ideas that fueled the deep learning boom: using the ReLU activation function in the hidden layers instead of tanh/sigmoid (which were quite common at that time).

Computer Vision Quiz-4

Q1. The values in a filter/mask are called

  1. Coefficients
  2. Weights
  3. Both of the above
  4. None of the above

Answer: 3
Explanation: The values in a filter/mask are called either coefficients or weights.

Q2. Which of the following networks uses the idea of Depthwise Separable Convolutions?

  1. AlexNet
  2. MobileNet
  3. ResNet
  4. VGG16

Answer: 2
Explanation: As mentioned in the MobileNet paper, MobileNets are based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks that work even in low-compute environments, such as mobile phones. Refer to this research paper to understand more.

Q3. What is the output of a Region Proposal Network (RPN) at each sliding window location if we have k anchor boxes?

  1. 2k scores and 4k bounding box coordinates
  2. 4k scores and 2k bounding box coordinates
  3. k scores and 4k bounding box coordinates
  4. 4k scores and 4k bounding box coordinates

Answer: 1
Explanation: In a Region Proposal Network (RPN), for k anchor boxes we get 2k scores (that estimate probability of object or not) and 4k bounding box coordinates corresponding to each sliding window location. Refer to Figure 3 of this research paper to understand more.
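
The arithmetic is simple enough to sketch; k = 9 below is the default used in the Faster R-CNN paper (3 scales × 3 aspect ratios).

```python
def rpn_head_sizes(k):
    """Outputs of the RPN head at one sliding-window location, for k anchor boxes:
    2k objectness scores (object / not object) and 4k box coordinates."""
    return {"scores": 2 * k, "coords": 4 * k}

print(rpn_head_sizes(9))  # {'scores': 18, 'coords': 36}
```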

Q4. Which of the following networks uses Skip-connections?

  1. DenseNet
  2. ResNet
  3. U-Net
  4. All of the above

Answer: 4
Explanation: All of the above networks use skip-connections.

Q5. For binary classification, we generally use ________ activation function in the output layer?

  1. Tanh
  2. ReLU
  3. Sigmoid
  4. Leaky ReLU

Answer: 3
Explanation: For binary classification, we want the output (y) to be either 0 or 1. Because sigmoid outputs P(y=1|x), a value between 0 and 1, it is appropriate for binary classification.
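
A minimal sketch of that reading: sigmoid squashes any real-valued score into (0, 1), which we interpret as P(y=1|x) and threshold for the final decision. The 0.5 threshold is the usual convention, not anything specific to this quiz.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(z, threshold=0.5):
    """Binary decision from a raw score."""
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0.0))                 # 0.5
print(predict(2.0), predict(-2.0))  # 1 0
```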

Q6. In ResNet’s Skip-connection, the output from the previous layer is ________ to the layer ahead?

  1. added
  2. concatenated
  3. convoluted
  4. multiplied

Answer: 1
Explanation: In ResNet’s Skip-connection, the output from the previous layer is added to the layer ahead. Refer to Figure 2 of this research paper to understand more.

Q7. In Fast R-CNN, we extract feature maps from the input image only once as compared to R-CNN where we extract feature maps from each region proposal separately?

  1. True
  2. False

Answer: 1
Explanation: Earlier, in R-CNN, we extracted features from each region proposal separately using a CNN, which was very time consuming. To counter this, in Fast R-CNN we extract feature maps from the input image only once and then project the region proposals onto this feature map. This saves a lot of time. Refer to this link to understand more.

Q8. For Multiclass classification, we generally use ________ activation function in the output layer?

  1. Tanh
  2. ReLU
  3. Sigmoid
  4. Softmax

Answer: 4
Explanation: For multiclass classification, we generally use the softmax activation function in the output layer. Refer to this beautiful explanation by Andrew Ng to understand more.
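
A minimal sketch of softmax on a vector of class scores (the scores below are illustrative): it outputs one probability per class, and the probabilities sum to 1.

```python
import math

def softmax(logits):
    """Turn a vector of class scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # highest score -> highest probability
print(sum(probs))  # 1.0 (up to float rounding)
```

The predicted class is then simply the index of the largest probability.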

Computer Vision Quiz-3

Q1. Which of the following object detection networks uses a ROI Pooling layer?

  1. R-CNN
  2. Fast R-CNN
  3. YOLO
  4. All of the above

Answer: 2
Explanation: Out of the above networks, only Fast R-CNN uses a ROI Pooling layer. Because of this, Fast R-CNN can take an image of any size as input, whereas in R-CNN we need to resize region proposals before passing them into the CNN. Refer to this research paper to understand more.

Q2. Which of the following techniques can be used to reduce the number of channels/feature maps?

  1. Pooling
  2. Padding
  3. 1×1 convolution
  4. Batch Normalization

Answer: 3
Explanation: 1×1 convolution can be used to reduce the number of channels/feature maps. Refer to this beautiful explanation by Andrew Ng to understand more.

Q3. Which of the following networks has the fastest prediction time?

  1. R-CNN
  2. Fast R-CNN
  3. Faster R-CNN

Answer: 3
Explanation: As is clear from the name, Faster R-CNN has the fastest prediction time. Refer to this research paper to understand more.

Q4. Max-Pooling makes the Convolutional Neural Network translation invariant (for small translations of the input)?

  1. True
  2. False

Answer: 1
Explanation: According to Ian Goodfellow, Max pooling achieves partial invariance to small translations because the max of a region depends only on the single largest element. If a small translation doesn’t bring in a new largest element at the edge of the pooling region and also doesn’t remove the largest element by taking it outside the pooling region, then the max doesn’t change.
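
That argument can be demonstrated in one dimension (the signal values below are illustrative): shifting the input by one position leaves the max-pooled output unchanged, because each window's max survives the shift.

```python
def max_pool_1d(x, size=2, stride=2):
    """1-D max-pooling: the max of each window of `size`, stepping by `stride`."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]

x       = [0, 9, 0, 7, 0, 5, 0, 3]
shifted = [9, 0, 7, 0, 5, 0, 3, 0]  # x translated left by one position

print(max_pool_1d(x))        # [9, 7, 5, 3]
print(max_pool_1d(shifted))  # [9, 7, 5, 3] -- unchanged despite the shift
```

The invariance is only partial: a large enough shift would move a maximum across a window boundary and change the output.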

Q5. What do you mean by the term “Region Proposals” as used in the R-CNN paper?

  1. regions of an image that could possibly contain an object of interest
  2. regions of an image that could possibly contain information other than the object of interest
  3. final bounding boxes given by the R-CNN

Answer: 1
Explanation: As the name suggests, region proposals are a set of candidate regions that could possibly contain an object of interest. These region proposals are fed to a CNN, which extracts features from each proposal, and these features are then fed to an SVM classifier to determine what type of object (if any) is contained within the proposal. The main reason for extracting region proposals beforehand is that, instead of searching for objects at all image locations, we search only those locations where an object is likely to be present. This reduces false positives, since we search only regions that might actually contain an object. Refer to this research paper to understand more.

Q6. Because Pooling layer has no parameters, they don’t affect the gradient calculation during backpropagation?

  1. True
  2. False

Answer: 2
Explanation: It is true that a pooling layer has no parameters, and hence no learning takes place in it during backpropagation. But it's wrong to say that it doesn't affect the gradient calculation, because the pooling layer routes the gradient back to the input from which the pooled output came. Refer to this link to know more.

Q7. Which of the following techniques was used by Traditional computer vision object detection algorithms to locate objects in images at varying scales and locations?

  1. image pyramids for varying scale and sliding windows for varying locations
  2. image pyramids for varying locations and sliding windows for varying scale

Answer: 1
Explanation: Because an object can be of any size and can be present at any location, for object detection we need to search at different locations and scales. Image pyramids (multi-resolution representations of images) handle the scale dependency, and a sliding window handles the varying locations, so traditional computer vision algorithms use these for object detection. For instance, refer to the Overfeat paper, which shows how a multiscale and sliding window approach can be efficiently implemented within a ConvNet.

Q8. How do you introduce non-linearity in a Convolutional Neural Network (CNN)?

  1. Using ReLU
  2. Using a Max-Pooling layer
  3. Both of the above
  4. None of the above

Answer: 3
Explanation: Non-linearity can be introduced by either using ReLU (non-linear activation function) or by using a Max-Pooling layer (as max is a non-linear function).

Computer Vision Quiz-2

Q1. Suppose we have an image of size 4×4 and we apply Max-pooling with a filter of size 2×2 and a stride of 2. The resulting image will be of size:

  1. 2×2
  2. 2×3
  3. 3×3
  4. 2×4

Answer: 1
Explanation: Because in Max-pooling we take the maximum value at each filter location, the output image will be of size 2×2 (the number of filter locations). Refer to this beautiful explanation by Andrew Ng to understand more.
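
The general rule can be sketched with the usual output-size formula, floor((n − f) / s) + 1, for input size n, filter size f, and stride s (assuming no padding):

```python
def pool_out_size(n, f, s):
    """Output size along one dimension for a pooling (or conv) layer, no padding."""
    return (n - f) // s + 1

print(pool_out_size(4, 2, 2))  # 2 -> a 4x4 input pooled 2x2 with stride 2 gives 2x2
print(pool_out_size(6, 2, 2))  # 3
```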

Q2. In Faster R-CNN, which loss function is used in the bounding box regressor?

  1. L2 Loss
  2. Smooth L1 Loss
  3. Log Loss
  4. Huber Loss

Answer: 2
Explanation: In Faster R-CNN, Smooth L1 loss is used in the bounding box regressor. This is a robust L1 loss that is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet. Refer to Section 3.1.2 of this research paper to understand more.
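
A minimal sketch of that loss in the Fast R-CNN form, with the transition at |x| = 1: quadratic for small errors and linear for large ones, so outliers contribute a bounded gradient.

```python
def smooth_l1(x):
    """Smooth L1 loss on a single coordinate error x (Fast R-CNN form)."""
    if abs(x) < 1:
        return 0.5 * x * x   # quadratic region: behaves like L2
    return abs(x) - 0.5      # linear region: behaves like L1, robust to outliers

print(smooth_l1(0.5))  # 0.125
print(smooth_l1(3.0))  # 2.5
```

Under an L2 loss the same error of 3.0 would cost 0.5 × 9 = 4.5, which is why large box errors dominate training less under smooth L1.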

Q3. For binary classification, we generally use ________ loss function?

  1. Binary crossentropy
  2. mean squared error
  3. mean absolute error
  4. ctc

Answer: 1
Explanation: For binary classification, we generally use Binary crossentropy loss function. Refer to this beautiful explanation by Andrew Ng to understand more.

Q4. How do we perform the convolution operation in computer vision?

  1. we multiply the filter weights with the corresponding image pixels, and then sum them up
  2. we multiply the filter weights with the corresponding image pixels, and then subtract them
  3. we add the filter weights and the corresponding image pixels, and then multiply them
  4. we add the filter weights and the corresponding image pixels, and then sum them up

Answer: 1
Explanation: In convolution, we multiply the filter weights with the corresponding image pixels and then sum them up.
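
A minimal pure-Python sketch of that operation as CNN libraries implement it (technically cross-correlation, since the filter is not flipped). The image and kernel values are illustrative.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image; at each location, multiply
    weights with the pixels under them and sum (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for a in range(kh):
                for b in range(kw):
                    acc += kernel[a][b] * image[i + a][j + b]  # multiply, then sum
            row.append(acc)
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]  # a simple diagonal-difference filter

print(conv2d(image, kernel))  # [[-4, -4], [-4, -4]]
```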

Q5. In a Region Proposal Network (RPN), what is used in the last layer for calculating the objectness scores at each sliding window position?

  1. Softmax
  2. Linear SVM
  3. ReLU
  4. Sigmoid

Answer: 1
Explanation: In a Region Proposal Network (RPN), the authors of the Faster R-CNN paper use a two-class softmax layer to calculate the objectness score for each proposal at each sliding window position.

Q6. In R-CNN, the regression model outputs the actual absolute coordinates of the bounding boxes?

  1. Yes
  2. No

Answer: 2
Explanation: In R-CNN, the regression model outputs the deltas or the relative coordinate change of the bounding boxes instead of absolute coordinates. Refer to Appendix C of this research paper to understand more.

Q7. Is Dropout a form of Regularization?

  1. Yes
  2. No

Answer: 1
Explanation: Dropout, applied to a layer, consists of randomly dropping out (setting to zero) a number of output features of the layer during training. Because any node can be zeroed out, the network can't rely on any single feature and has to spread out the weights, which has a regularizing effect.

Q8. A fully convolutional network can be used for

  1. Image Segmentation
  2. Object Detection
  3. Image Classification
  4. All of the above

Answer: 4
Explanation: We can use a fully convolutional network for all of the above mentioned tasks. For instance, for image segmentation we have U-Net, for object detection we have YOLO etc.

Computer Vision Quiz-1

Q1. Which of the following is not a good evaluation metric for Multi-label classification?

  1. Mean Average Precision at K
  2. Hamming Score
  3. Accuracy
  4. Top k categorical accuracy

Answer: 3
Explanation: Accuracy is not a good evaluation metric for multi-label classification. In multi-label classification each example can be assigned to multiple classes, so if the predicted output is [0, 0, 0, 0, 1, 1, 0] and the correct output is [1, 1, 0, 0, 0, 0, 0], the element-wise accuracy is still 3/7 (from the matching zeros), even though not a single class was predicted correctly.
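
A sketch of that failure mode, using the vectors from the explanation above: element-wise accuracy is propped up by the labels that are 0 in both vectors, while a stricter exact-match metric scores 0.

```python
def elementwise_accuracy(y_true, y_pred):
    """Fraction of label positions that agree -- inflated by shared zeros."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def exact_match(y_true, y_pred):
    """Stricter multi-label metric: 1.0 only if every label matches."""
    return float(y_true == y_pred)

y_pred = [0, 0, 0, 0, 1, 1, 0]
y_true = [1, 1, 0, 0, 0, 0, 0]

print(elementwise_accuracy(y_true, y_pred))  # 3/7, despite zero correct positives
print(exact_match(y_true, y_pred))           # 0.0
```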

Q2. Which of the following are the hyperparameters for a Pooling layer?

  1. filter size
  2. stride
  3. which type of Pooling to use (max or average)
  4. All of the above

Answer: 4
Explanation: All of the above are hyperparameters of a Pooling layer.

Q3. Images are an example of ________ data?

  1. Structured
  2. Unstructured

Answer: 2
Explanation: Structured data refers to data where each feature has a well-defined meaning; the opposite is true for unstructured data. So images are an example of unstructured data.

Q4. For image classification, MaxPooling tends to work better than average pooling?

  1. Yes
  2. No

Answer: 1
Explanation: Because in image classification our main aim is to identify whether a feature is present or not, MaxPooling tends to work better than average pooling.

Q5. What is Pointwise Convolution?

  1. 1×1 convolution
  2. Strided Convolution
  3. convolution followed by MaxPool
  4. convolution followed by Dropout

Answer: 1
Explanation: According to the MobileNet paper, "The pointwise convolution then applies a 1×1 convolution to combine the outputs of the depthwise convolution." Refer to Section 3.1 of this research paper to understand more.

Q6. What is a Region Proposal network?

  1. a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position
  2. a fully connected network that simultaneously predicts object bounds and objectness scores at each position
  3. a fully convolutional network that predicts only the objectness scores at each position
  4. a fully connected network that predicts only the object bounds at each position

Answer: 1
Explanation: According to the Faster R-CNN paper, a Region Proposal Network (RPN) is a fully convolutional network that takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. Refer to Section 3.1 of this research paper to understand more.

Q7. In MobileNetv2, the Depthwise Separable Convolutions are replaced by _________ ?

  1. Normal Convolution
  2. Strided Convolution
  3. Bottleneck Residual Block (Inverted Residuals and Linear Bottleneck)
  4. Residual Blocks

Answer: 3
Explanation: In MobileNetv2, the Depthwise Separable Convolutions are replaced by Bottleneck Residual Block (Inverted Residuals and Linear Bottleneck). Refer to Table 1 of this research paper to understand more.

Q8. Can we use Convolutional Neural Networks for image classification?

  1. Yes
  2. No

Answer: 1
Explanation: Generally, Convolutional Neural Networks are preferred for any image-related task, such as image classification, object detection, etc.