Q1. For a multi-channel input feature map, we apply Max-pooling independently on each channel and then concatenate the results along the channel axis?
True
False
Answer: 1 Explanation: Max-pooling operation is applied independently on each channel and then the results are concatenated along the channel axis to form the final output. Refer to this beautiful explanation by Andrew Ng to understand more.
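The per-channel behaviour described above can be sketched in plain Python (a minimal, hypothetical helper, not any library's API): each channel is pooled on its own, so the channel count is preserved.

```python
# Minimal sketch: 2x2 max-pooling (stride 2) applied to each channel of a
# multi-channel feature map independently, then the pooled channels stacked.

def max_pool_channel(channel, size=2, stride=2):
    """Max-pool a single 2-D channel given as a list of lists."""
    h, w = len(channel), len(channel[0])
    out = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            window = [channel[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        out.append(row)
    return out

def max_pool(feature_map):
    """Pool each channel independently; the results form the output channels."""
    return [max_pool_channel(c) for c in feature_map]

# Two 4x4 channels in -> two 2x2 channels out (channel count preserved).
fmap = [
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    [[16, 15, 14, 13], [12, 11, 10, 9], [8, 7, 6, 5], [4, 3, 2, 1]],
]
pooled = max_pool(fmap)
```

Note that no information crosses channels: each output channel depends only on its corresponding input channel.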
Q2. A fully convolutional network can take an input image of any size?
True
False
Answer: 1 Explanation: Because a fully convolutional network does not contain any fully connected (Dense) layers, it can take an input image of any size.
Q3. In R-CNN, the bounding box loss is only calculated for positive samples (samples that contain classes present in the dataset)?
True
False
Answer: 1 Explanation: In R-CNN, the bounding box loss is only calculated for positive samples (samples that contain classes present in the dataset), as it makes no sense to fine-tune a bounding box that doesn’t contain an object.
Q4. In the VGG16 model, we have all the Conv layers with same padding and filter size and the downsampling is done by MaxPooling only?
True
False
Answer: 1 Explanation: Earlier models like AlexNet used large filter sizes in the beginning, and downsampling was done either by max-pooling or by strided convolution. But in the VGG16 model, all the Conv layers have the same padding and filter size, and the downsampling is done by MaxPooling only. So what have we gained by using, for instance, a stack of three 3×3 conv. layers instead of a single 7×7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters. Refer to Section 2.3 of this research paper to understand more.
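The parameter saving claimed above is easy to verify with arithmetic. A sketch (the helper name and C = 64 are illustrative choices, not from the paper): with C input and C output channels, three stacked 3×3 layers cost 27C² weights versus 49C² for one 7×7 layer.

```python
# Parameter count (ignoring biases) for a conv layer with a square kernel,
# following the arithmetic in Section 2.3 of the VGG paper.

def conv_params(kernel, in_ch, out_ch):
    # Each output channel has one kernel*kernel filter per input channel.
    return kernel * kernel * in_ch * out_ch

C = 64  # illustrative channel count
three_3x3 = 3 * conv_params(3, C, C)   # 3 * 9 * C^2 = 27 * C^2
one_7x7 = conv_params(7, C, C)         # 49 * C^2
```

So the stack of three 3×3 layers uses roughly 45% fewer parameters while also adding two extra non-linearities.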
Q5. 1×1 convolution can also help in decreasing the computation cost of a convolution operation?
True
False
Answer: 1 Explanation: 1×1 convolution can also help in decreasing the computation cost of a convolution operation. Refer to this beautiful explanation by Andrew Ng to understand more.
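The cost saving can be made concrete with the 28×28×192 example commonly used in Andrew Ng's lecture on 1×1 convolutions (the numbers below reproduce that illustrative example; the helper name is mine): shrinking to 16 channels with a 1×1 "bottleneck" before the 5×5 convolution cuts the multiplication count by roughly 10×.

```python
# Multiplication count of a convolution: each of the h*w*out_ch output
# values needs k*k*in_ch multiply-accumulates.

def conv_mults(h, w, out_ch, k, in_ch):
    return h * w * out_ch * k * k * in_ch

# 28x28x192 input -> 28x28x32 output, 5x5 conv applied directly:
direct = conv_mults(28, 28, 32, 5, 192)                  # ~120M multiplications

# Same input/output, but via a 1x1 bottleneck down to 16 channels first:
bottleneck = (conv_mults(28, 28, 16, 1, 192)             # 1x1 "bottleneck"
              + conv_mults(28, 28, 32, 5, 16))           # 5x5 on fewer channels
```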
Q6. Can we use Fully Convolutional Neural Networks for object detection?
Yes
No
Answer: 1 Explanation: Yes, a Fully Convolutional Neural Network can be used for object detection; YOLO is one example.
Q7. Which of the following networks can be used for object detection?
Overfeat
Faster RCNN
YOLO
All of the above
Answer: 4 Explanation: All of the above mentioned networks can be used for object detection. For instance, Faster RCNN belongs to the region-based methods, whereas YOLO and Overfeat belong to the sliding-window-based methods.
Q8. AlexNet was one of the first networks that used the ReLU activation function in the hidden layers instead of tanh/sigmoid (which were quite common at that time)?
True
False
Answer: 1 Explanation: This was one of the revolutionary ideas that fueled the deep learning boom, i.e., using the ReLU activation function in the hidden layers instead of tanh/sigmoid (which were quite common at that time).
Q2. Which of the following networks uses the idea of Depthwise Separable Convolutions?
AlexNet
MobileNet
ResNet
VGG16
Answer: 2 Explanation: As mentioned in the MobileNet paper, MobileNets are based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks that work even in low-compute environments, such as mobile phones. Refer to this research paper to understand more.
Q3. What is the output of a Region Proposal Network (RPN) at each sliding window location if we have k anchor boxes?
2k scores and 4k bounding box coordinates
4k scores and 2k bounding box coordinates
k scores and 4k bounding box coordinates
4k scores and 4k bounding box coordinates
Answer: 1 Explanation: In a Region Proposal Network (RPN), for k anchor boxes we get 2k scores (estimating the probability of object vs. not object) and 4k bounding box coordinates at each sliding window location. Refer to Figure 3 of this research paper to understand more.
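A tiny sketch of those output sizes (the function is a hypothetical illustration; the paper uses k = 9 anchors per location):

```python
# Output sizes of the RPN head per sliding window location, for k anchors.

def rpn_outputs(k):
    # 2 scores per anchor (object vs. not object) from the cls layer,
    # 4 box coordinates per anchor from the reg layer.
    return {"scores": 2 * k, "coords": 4 * k}

head = rpn_outputs(9)  # with the paper's k = 9: 18 scores, 36 coordinates
```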
Q4. Which of the following networks uses Skip-connections?
DenseNet
ResNet
U-Net
All of the above
Answer: 4 Explanation: All of the above mentioned networks use Skip-connections.
Q5. For binary classification, we generally use ________ activation function in the output layer?
Tanh
ReLU
Sigmoid
Leaky ReLU
Answer: 3 Explanation: For binary classification, we want the output (y) to be either 0 or 1. Because sigmoid outputs P(y=1|x) and has a value between 0 and 1, it is appropriate for binary classification.
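The property relied on above is that sigmoid squashes any real input into (0, 1), so its output can be read as a probability. A minimal sketch:

```python
import math

def sigmoid(x):
    # Maps any real number into (0, 1), so the output can be read as P(y=1|x).
    return 1.0 / (1.0 + math.exp(-x))

# sigmoid(0) is exactly 0.5; large positive/negative inputs saturate near 1/0.
```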
Q6. In ResNet’s Skip-connection, the output from the previous layer is ________ to the layer ahead?
added
concatenated
convoluted
multiplied
Answer: 1 Explanation: In ResNet’s Skip-connection, the output from the previous layer is added to the layer ahead. Refer to the Figure 2 of this research paper to understand more.
Q7. In Fast R-CNN, we extract feature maps from the input image only once as compared to R-CNN where we extract feature maps from each region proposal separately?
True
False
Answer: 1 Explanation: Earlier, in R-CNN, we were extracting features from each region proposal separately using a CNN, which was very time consuming. To counter this, in Fast R-CNN we extract the feature map from the input image only once and then project the region proposals onto this feature map. This saves a lot of time. Refer to this link to understand more.
Q8. For Multiclass classification, we generally use ________ activation function in the output layer?
Tanh
ReLU
Sigmoid
Softmax
Answer: 4 Explanation: For Multiclass classification, we generally use softmax activation function in the output layer. Refer to this beautiful explanation by Andrew Ng to understand more.
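Softmax turns a vector of raw scores into a probability distribution over the classes, which is why it fits the multiclass output layer. A minimal sketch (the max-subtraction is the standard numerical-stability trick and does not change the result):

```python
import math

def softmax(z):
    # Subtracting max(z) keeps exp() numerically stable; the result is
    # unchanged because the shift cancels in the ratio.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])  # sums to 1; larger score -> larger probability
```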
Q1. Which of the following object detection networks uses a ROI Pooling layer?
R-CNN
Fast R-CNN
YOLO
All of the above
Answer: 2 Explanation: Out of the above mentioned networks, only Fast R-CNN uses a ROI Pooling layer. Because of this, Fast R-CNN can take an image of any size as input, as compared to R-CNN where we need to resize region proposals before passing them into the CNN. Refer to this research paper to understand more.
Q2. Which of the following techniques can be used to reduce the number of channels/feature maps?
Pooling
Padding
1×1 convolution
Batch Normalization
Answer: 3 Explanation: 1×1 convolution can be used to reduce the number of channels/feature maps. Refer to this beautiful explanation by Andrew Ng to understand more.
Q3. Which of the following networks has the fastest prediction time?
R-CNN
Fast R-CNN
Faster R-CNN
Answer: 3 Explanation: As the name suggests, Faster R-CNN has the fastest prediction time of the three: it replaces the slow external region proposal step with a learned Region Proposal Network that shares convolutional features with the detector. Refer to this research paper to understand more.
Q4. Max-Pooling makes the Convolutional Neural Network translation invariant (for small translations of the input)?
True
False
Answer: 1 Explanation: According to Ian Goodfellow, Max pooling achieves partial invariance to small translations because the max of a region depends only on the single largest element. If a small translation doesn’t bring in a new largest element at the edge of the pooling region and also doesn’t remove the largest element by taking it outside the pooling region, then the max doesn’t change.
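A 1-D toy sketch of this partial invariance (hypothetical helper, not a library function): the input is shifted by one pixel, but because neither maximum crosses a pooling-window boundary, the pooled output is identical.

```python
# Max-pooling over non-overlapping windows of size 2, stride 2 (1-D case).

def max_pool_1d(x, size=2, stride=2):
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]

row = [0, 9, 0, 0, 0, 7, 0, 0]
shifted = [9, 0, 0, 0, 7, 0, 0, 0]  # same signal translated left by one pixel

# Both maxima (9 and 7) stay inside their original pooling windows,
# so the pooled outputs are equal: the pooling output is unchanged.
```

If the shift were large enough to push a maximum into a neighbouring window, the outputs would differ, which is exactly why the invariance is only partial.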
Q5. What do you mean by the term “Region Proposals” as used in the R-CNN paper?
regions of an image that could possibly contain an object of interest
regions of an image that could possibly contain information other than the object of interest
final bounding boxes given by the R-CNN
Answer: 1 Explanation: As clear from the name, Region Proposals are a set of candidate regions that could possibly contain an object of interest. These region proposals are fed to a CNN which extracts features from each of them, and these features are then fed to an SVM classifier to determine what type of object (if any) is contained within the proposal. The main reason behind extracting these region proposals beforehand is that instead of searching for the object at all image locations, we should search only those locations where there is a possibility of an object. This reduces false positives, as we are only searching in the regions where an object is likely to be. Refer to this research paper to understand more.
Q6. Because Pooling layer has no parameters, they don’t affect the gradient calculation during backpropagation?
True
False
Answer: 2 Explanation: It is true that the Pooling layer has no parameters and hence no learning takes place in it during backpropagation. But it’s wrong to say it doesn’t affect the gradient calculation, because the pooling layer routes the gradient back to the input from which the pooling output came (e.g., in Max-pooling, only the maximum element of each window receives a gradient). Refer to this link to know more.
Q7. Which of the following techniques was used by Traditional computer vision object detection algorithms to locate objects in images at varying scales and locations?
image pyramids for varying scale and sliding windows for varying locations
image pyramids for varying locations and sliding windows for varying scale
Answer: 1 Explanation: Because an object can be of any size and can be present at any location, object detection requires searching over both locations and scales. Image pyramids (multi-resolution representations of an image) handle the scale dependency, and a sliding window handles the varying locations, so traditional computer vision algorithms used these for object detection. For instance, refer to the Overfeat paper, which shows how a multiscale and sliding window approach can be efficiently implemented within a ConvNet.
Q8. How do you introduce non-linearity in a Convolutional Neural Network (CNN)?
Using ReLU
Using a Max-Pooling layer
Both of the above
None of the above
Answer: 3 Explanation: Non-linearity can be introduced by either using ReLU (non-linear activation function) or by using a Max-Pooling layer (as max is a non-linear function).
Q1. Suppose we have an image of size 4×4 and we apply Max-pooling with a filter of size 2×2 and a stride of 2. The resulting image will be of size:
2×2
2×3
3×3
2×4
Answer: 1 Explanation: Because in Max-pooling, we take the maximum value for each filter location so the output image size will be 2×2 (the number of filter locations). Refer to this beautiful explanation by Andrew Ng to understand more.
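The number of filter locations per dimension follows the standard formula floor((W − F) / S) + 1. A sketch (the helper name is illustrative):

```python
def pool_output_size(w, f, s):
    # Number of filter positions along one dimension:
    # floor((input - filter) / stride) + 1
    return (w - f) // s + 1

side = pool_output_size(4, 2, 2)  # 4x4 input, 2x2 filter, stride 2 -> 2x2 output
```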
Q2. In Faster R-CNN, which loss function is used in the bounding box regressor?
L2 Loss
Smooth L1 Loss
Log Loss
Huber Loss
Answer: 2 Explanation: In Faster R-CNN, Smooth L1 loss is used in the bounding box regressor. This is a robust L1 loss that is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet. Refer to Section 3.1.2 of this research paper to understand more.
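The Smooth L1 loss from the Fast R-CNN paper is quadratic for small errors and linear for large ones, which is what makes it less sensitive to outliers than L2. A minimal element-wise sketch:

```python
def smooth_l1(x):
    # Quadratic near zero (like L2), linear for |x| >= 1 (like L1),
    # so large regression errors don't dominate the gradient.
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5
```

In practice this is summed over the four box coordinates of each positive sample.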
Q3. For binary classification, we generally use ________ loss function?
Binary cross-entropy
Mean squared error
Mean absolute error
CTC
Answer: 1 Explanation: For binary classification, we generally use the binary cross-entropy loss function. Refer to this beautiful explanation by Andrew Ng to understand more.
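A minimal sketch of binary cross-entropy for a single example (the clipping constant is a common implementation detail, assumed here to avoid log(0)):

```python
import math

def binary_crossentropy(y, p, eps=1e-12):
    # -[y*log(p) + (1-y)*log(1-p)]; p clipped away from 0 and 1 for stability.
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# The loss shrinks as the predicted probability approaches the true label.
```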
Q4. How do we perform the convolution operation in computer vision?
we multiply the filter weights with the corresponding image pixels, and then sum them up
we multiply the filter weights with the corresponding image pixels, and then subtract them
we add the filter weights and the corresponding image pixels, and then multiply them
we add the filter weights to the corresponding image pixels, and then sum them up
Answer: 1 Explanation: In convolution, we multiply the filter weights with the corresponding image pixels, and then sum them up.
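That multiply-and-sum step can be sketched directly (this is the cross-correlation form used in CNNs, with no padding and stride 1; the helper is illustrative):

```python
# At each filter location: multiply weights with the underlying pixels, sum up.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(
                kernel[di][dj] * image[i + di][j + dj]
                for di in range(kh) for dj in range(kw)
            )
    return out

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
k = [[1, 0], [0, -1]]        # a simple diagonal-difference filter
result = conv2d(img, k)
```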
Q5. In a Region Proposal Network (RPN), what is used in the last layer for calculating the objectness scores at each sliding window position?
Softmax
Linear SVM
ReLU
Sigmoid
Answer: 1 Explanation: In a Region Proposal Network (RPN), the authors of the Faster R-CNN paper use a 2-class softmax layer for calculating the objectness scores for each proposal at each sliding window position.
Q6. In R-CNN, the regression model outputs the actual absolute coordinates of the bounding boxes?
Yes
No
Answer: 2 Explanation: In R-CNN, the regression model outputs the deltas or the relative coordinate change of the bounding boxes instead of absolute coordinates. Refer to Appendix C of this research paper to understand more.
Q7. Is Dropout a form of Regularization?
Yes
No
Answer: 1 Explanation: Dropout, applied to a layer, consists of randomly dropping out (setting to zero) a number of output features of the layer during training. Because any node can become zero, the network can’t rely on any one feature and has to spread out the weights, which acts as a form of regularization.
Q8. A fully convolutional network can be used for
Image Segmentation
Object Detection
Image Classification
All of the above
Answer: 4 Explanation: We can use a fully convolutional network for all of the above mentioned tasks. For instance, for image segmentation we have U-Net, for object detection we have YOLO etc.
Q1. Which of the following is not a good evaluation metric for Multi-label classification?
Mean Average Precision at K
Hamming Score
Accuracy
Top k categorical accuracy
Answer: 3 Explanation: Accuracy is not a good evaluation metric for Multi-label classification. In multi-label classification each example can be assigned to multiple classes, so if the predicted output was [0, 0, 0, 0, 1, 1, 0] and the correct output was [1, 1, 0, 0, 0, 0, 0], the label-wise accuracy would still be 3/7 (from the matching zeros), but it should be 0, as the model failed to predict any of the present classes correctly.
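The pitfall described above, and how the Hamming score avoids it, can be sketched with that same example (variable names are illustrative):

```python
# Per-label "accuracy" rewards correctly predicted absent labels even when
# no present label is found; the Hamming score does not.

y_true = [1, 1, 0, 0, 0, 0, 0]
y_pred = [0, 0, 0, 0, 1, 1, 0]

label_matches = sum(t == p for t, p in zip(y_true, y_pred))
label_accuracy = label_matches / len(y_true)     # 3/7, from matching zeros only

# Hamming score for one example: |true AND pred| / |true OR pred|
inter = sum(t and p for t, p in zip(y_true, y_pred))
union = sum(t or p for t, p in zip(y_true, y_pred))
hamming_score = inter / union if union else 1.0  # 0.0 here, as it should be
```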
Q2. Which of the following are the hyperparameters for a Pooling layer?
filter size
stride
which type of Pooling to use (max or average)
All of the above
Answer: 4 Explanation: All of the above mentioned are the hyperparameters for a Pooling layer.
Q3. Images are an example of ________ data?
Structured
Unstructured
Answer: 2 Explanation: Structured data refers to data where each feature has a well-defined meaning; the opposite is true for unstructured data. So, images are an example of unstructured data.
Q4. For image classification, MaxPooling tends to work better than average pooling?
Yes
No
Answer: 1 Explanation: Because in image classification our main aim is to identify whether a feature is present or not, MaxPooling tends to work better than average pooling.
Q5. What is Pointwise Convolution?
1×1 convolution
Strided Convolution
convolution followed by MaxPool
convolution followed by Dropout
Answer: 1 Explanation: According to the MobileNet paper, “The pointwise convolution then applies a 1×1 convolution to combine the outputs of the depthwise convolution.” Refer to Section 3.1 of this research paper to understand more.
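The cost reduction from splitting a standard convolution into depthwise + pointwise parts follows the arithmetic in Section 3.1 of the MobileNet paper; the concrete sizes below are illustrative choices. For kernel size DK, M input channels, N output channels, and a DF×DF feature map, the ratio of costs is 1/N + 1/DK².

```python
# Multiplication counts: standard conv vs. depthwise separable conv
# (depthwise per-channel filtering + pointwise 1x1 channel mixing).

def standard_conv_cost(dk, m, n, df):
    return dk * dk * m * n * df * df

def depthwise_separable_cost(dk, m, n, df):
    depthwise = dk * dk * m * df * df   # one DKxDK filter per input channel
    pointwise = m * n * df * df         # 1x1 conv combining the channels
    return depthwise + pointwise

dk, m, n, df = 3, 64, 128, 14           # illustrative layer sizes
ratio = depthwise_separable_cost(dk, m, n, df) / standard_conv_cost(dk, m, n, df)
# ratio equals 1/n + 1/dk**2: roughly an 8-9x saving for 3x3 kernels
```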
Q6. What is a Region Proposal network?
a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position
a fully connected network that simultaneously predicts object bounds and objectness scores at each position
a fully convolutional network that predicts only the objectness scores at each position
a fully connected network that predicts only the object bounds at each position
Answer: 1 Explanation: According to the Faster R-CNN paper, a Region Proposal Network (RPN) is a fully convolutional network that takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. Refer to Section 3.1 of this research paper to understand more.
Q7. In MobileNetv2, the Depthwise Separable Convolutions are replaced by _________ ?
Normal Convolution
Strided Convolution
Bottleneck Residual Block (Inverted Residuals and Linear Bottleneck)
Residual Blocks
Answer: 3 Explanation: In MobileNetv2, the Depthwise Separable Convolutions are replaced by Bottleneck Residual Block (Inverted Residuals and Linear Bottleneck). Refer to Table 1 of this research paper to understand more.
Q8. Can we use Convolutional Neural Networks for image classification?
Yes
No
Answer: 1 Explanation: Generally, Convolutional Neural Networks are preferred for any image related tasks such as image classification, object detection etc.