
Computer Vision Quiz-5

Q1. For a multi-channel input feature map, we apply Max-pooling independently on each channel and then concatenate the results along the channel axis?

  1. True
  2. False

Answer: 1
Explanation: The max-pooling operation is applied independently to each channel, and the results are concatenated along the channel axis to form the final output. Refer to this beautiful explanation by Andrew Ng to understand more.

Q2. A fully convolutional network can take as input the image of any size?

  1. True
  2. False

Answer: 1
Explanation: Because a fully convolutional network does not contain any fully connected (Dense) layers, it can take an image of any size as input.

Q3. In R-CNN, the bounding box loss is only calculated for positive samples (samples that contain classes present in the dataset)?

  1. True
  2. False

Answer: 1
Explanation: In R-CNN, the bounding box loss is only calculated for positive samples (samples that contain classes present in the dataset), as it makes no sense to fine-tune a bounding box that doesn't contain an object.

Q4. In the VGG16 model, all the Conv layers use the same padding and filter size, and downsampling is done by MaxPooling only?

  1. True
  2. False

Answer: 1
Explanation: Earlier models like AlexNet used large filter sizes in the initial layers, and downsampling was done either by max-pooling or by strided convolution. But in the VGG16 model, all the Conv layers use the same padding and filter size, and downsampling is done by MaxPooling only. So what have we gained by using, for instance, a stack of three 3×3 conv. layers instead of a single 7×7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters. Refer to Section 2.3 of this research paper to understand more.

Q5. 1×1 convolution can also help in decreasing the computation cost of a convolution operation?

  1. True
  2. False

Answer: 1
Explanation: 1×1 convolution can also help in decreasing the computation cost of a convolution operation. Refer to this beautiful explanation by Andrew Ng to understand more.

Q6. Can we use Fully Convolutional Neural Networks for object detection?

  1. Yes
  2. No

Answer: 1
Explanation: Yes, a fully convolutional neural network can be used for object detection; YOLO is one example.

Q7. Which of the following networks can be used for object detection?

  1. Overfeat
  2. Faster RCNN
  3. YOLO
  4. All of the above

Answer: 4
Explanation: All of the above networks can be used for object detection. For instance, Faster RCNN belongs to the region-proposal based family of methods, Overfeat is a sliding-window based method, and YOLO is a single-shot detector.

Q8. AlexNet was one of the first networks that used the ReLU activation function in the hidden layers instead of tanh/sigmoid (which were quite common at that time)?

  1. True
  2. False

Answer: 1
Explanation: This was one of the ideas that helped fuel the deep learning boom, i.e. using the ReLU activation function in the hidden layers instead of tanh/sigmoid (which were quite common at that time).

Computer Vision Quiz-4

Q1. The values in a filter/mask are called

  1. Coefficients
  2. Weights
  3. Both of the above
  4. None of the above

Answer: 3
Explanation: The values in a filter/mask are called either coefficients or weights.

Q2. Which of the following networks uses the idea of Depthwise Separable Convolutions?

  1. AlexNet
  2. MobileNet
  3. ResNet
  4. VGG16

Answer: 2
Explanation: As mentioned in the MobileNet paper, MobileNets are based on a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks that work even in low-compute environments such as mobile phones. Refer to this research paper to understand more.

Q3. What is the output of a Region Proposal Network (RPN) at each sliding window location if we have k anchor boxes?

  1. 2k scores and 4k bounding box coordinates
  2. 4k scores and 2k bounding box coordinates
  3. k scores and 4k bounding box coordinates
  4. 4k scores and 4k bounding box coordinates

Answer: 1
Explanation: In a Region Proposal Network (RPN), for k anchor boxes we get 2k scores (estimating the probability of object vs. not object) and 4k bounding box coordinates at each sliding window location. Refer to Figure 3 of this research paper to understand more.

Q4. Which of the following networks uses Skip-connections?

  1. DenseNet
  2. ResNet
  3. U-Net
  4. All of the above

Answer: 4
Explanation: All of the above mentioned networks use skip-connections.

Q5. For binary classification, we generally use ________ activation function in the output layer?

  1. Tanh
  2. ReLU
  3. Sigmoid
  4. Leaky ReLU

Answer: 3
Explanation: For binary classification, we want the output (y) to be either 0 or 1. Because sigmoid outputs P(y=1|x), which lies between 0 and 1, it is appropriate for binary classification.

Q6. In ResNet’s Skip-connection, the output from the previous layer is ________ to the layer ahead?

  1. added
  2. concatenated
  3. convoluted
  4. multiplied

Answer: 1
Explanation: In ResNet’s Skip-connection, the output from the previous layer is added to the layer ahead. Refer to the Figure 2 of this research paper to understand more.

Q7. In Fast R-CNN, we extract feature maps from the input image only once as compared to R-CNN where we extract feature maps from each region proposal separately?

  1. True
  2. False

Answer: 1
Explanation: Earlier, in R-CNN, we extracted features from each region proposal separately using a CNN, which was very time consuming. To counter this, in Fast R-CNN we extract feature maps from the input image only once and then project the region proposals onto this feature map. This saves a lot of time. Refer to this link to understand more.

Q8. For Multiclass classification, we generally use ________ activation function in the output layer?

  1. Tanh
  2. ReLU
  3. Sigmoid
  4. Softmax

Answer: 4
Explanation: For Multiclass classification, we generally use softmax activation function in the output layer. Refer to this beautiful explanation by Andrew Ng to understand more.

Computer Vision Quiz-3

Q1. Which of the following object detection networks uses a ROI Pooling layer?

  1. R-CNN
  2. Fast R-CNN
  3. YOLO
  4. All of the above

Answer: 2
Explanation: Out of the above mentioned networks, only Fast R-CNN uses a ROI Pooling layer. Because of this, Fast R-CNN can take an image of any size as input, whereas in R-CNN we need to resize region proposals before passing them into the CNN. Refer to this research paper to understand more.

Q2. Which of the following techniques can be used to reduce the number of channels/feature maps?

  1. Pooling
  2. Padding
  3. 1×1 convolution
  4. Batch Normalization

Answer: 3
Explanation: 1×1 convolution can be used to reduce the number of channels/feature maps. Refer to this beautiful explanation by Andrew Ng to understand more.

Q3. Which of the following networks has the fastest prediction time?

  1. R-CNN
  2. Fast R-CNN
  3. Faster R-CNN

Answer: 3
Explanation: As the name suggests, Faster R-CNN has the fastest prediction time of the three. Refer to this research paper to understand more.

Q4. Max-Pooling makes the Convolutional Neural Network translation invariant (for small translations of the input)?

  1. True
  2. False

Answer: 1
Explanation: According to Ian Goodfellow, Max pooling achieves partial invariance to small translations because the max of a region depends only on the single largest element. If a small translation doesn’t bring in a new largest element at the edge of the pooling region and also doesn’t remove the largest element by taking it outside the pooling region, then the max doesn’t change.

Q5. What do you mean by the term “Region Proposals” as used in the R-CNN paper?

  1. regions of an image that could possibly contain an object of interest
  2. regions of an image that could possibly contain information other than the object of interest
  3. final bounding boxes given by the R-CNN

Answer: 1
Explanation: As the name suggests, region proposals are a set of candidate regions that could possibly contain an object of interest. These region proposals are fed to a CNN which extracts features from each proposal, and these features are then fed to an SVM classifier to determine what type of object (if any) is contained within the proposal. The main reason for extracting these region proposals beforehand is that, instead of searching for the object at all image locations, we only search those locations where there is a possibility of an object being present. This reduces false positives. Refer to this research paper to understand more.

Q6. Because Pooling layer has no parameters, they don’t affect the gradient calculation during backpropagation?

  1. True
  2. False

Answer: 2
Explanation: It is true that a pooling layer has no parameters and hence no learning takes place in it during backpropagation. But it is wrong to say that it doesn't affect the gradient calculation, because the pooling layer routes the gradient back to the input from which the pooled output came. Refer to this link to know more.

Q7. Which of the following techniques was used by Traditional computer vision object detection algorithms to locate objects in images at varying scales and locations?

  1. image pyramids for varying scale and sliding windows for varying locations
  2. image pyramids for varying locations and sliding windows for varying scale

Answer: 1
Explanation: Because an object can be of any size and can appear at any location, object detection needs to search over both locations and scales. Image pyramids (multi-resolution representations of an image) handle the varying scale, and a sliding window handles the varying locations, so traditional computer vision algorithms used these for object detection. For instance, refer to the Overfeat paper, which shows how a multiscale and sliding window approach can be efficiently implemented within a ConvNet.

Q8. How do you introduce non-linearity in a Convolutional Neural Network (CNN)?

  1. Using ReLU
  2. Using a Max-Pooling layer
  3. Both of the above
  4. None of the above

Answer: 3
Explanation: Non-linearity can be introduced by either using ReLU (non-linear activation function) or by using a Max-Pooling layer (as max is a non-linear function).

Computer Vision Quiz-2

Q1. Suppose we have an image of size 4×4 and we apply Max-pooling with a filter of size 2×2 and a stride of 2. The resulting image will be of size:

  1. 2×2
  2. 2×3
  3. 3×3
  4. 2×4

Answer: 1
Explanation: Because in Max-pooling we take the maximum value at each filter location, the output image size will be 2×2 (the number of filter locations). Refer to this beautiful explanation by Andrew Ng to understand more.
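
As a quick sanity check, here is a minimal NumPy sketch (my own example, not from the original post) that applies 2×2 max-pooling with stride 2 to a 4×4 array and confirms the 2×2 output size:

```python
import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [9, 2, 3, 1],
              [4, 5, 6, 7]])

# Group the 4x4 array into non-overlapping 2x2 blocks and take the max of each block
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled.shape)  # (2, 2)
print(pooled)        # [[6 8]
                     #  [9 7]]
```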

Q2. In Faster R-CNN, which loss function is used in the bounding box regressor?

  1. L2 Loss
  2. Smooth L1 Loss
  3. Log Loss
  4. Huber Loss

Answer: 2
Explanation: In Faster R-CNN, Smooth L1 loss is used in the bounding box regressor. This is a robust L1 loss that is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet. Refer to Section 3.1.2 of this research paper to understand more.

Q3. For binary classification, we generally use ________ loss function?

  1. Binary crossentropy
  2. mean squared error
  3. mean absolute error
  4. ctc

Answer: 1
Explanation: For binary classification, we generally use Binary crossentropy loss function. Refer to this beautiful explanation by Andrew Ng to understand more.

Q4. How do we perform the convolution operation in computer vision?

  1. we multiply the filter weights with the corresponding image pixels, and then sum these up
  2. we multiply the filter weights with the corresponding image pixels, and then subtract these up
  3. we add the filter weights and the corresponding image pixels, and then multiply these up
  4. we add the filter weights with the corresponding image pixels, and then sum these up

Answer: 1
Explanation: In Convolution, we multiply the filter weights with the corresponding image pixels, and then sum these up.
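
As a tiny illustration (my own example, not from the post), here is a single step of the operation on a 3×3 image patch:

```python
import numpy as np

patch = np.array([[1, 0, 2],
                  [3, 1, 0],
                  [0, 2, 1]])

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# One output value: multiply the filter weights with the corresponding
# image pixels, then sum everything up
value = np.sum(patch * kernel)
print(value)  # (1 + 3 + 0) - (2 + 0 + 1) = 1
```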

Q5. In a Region Proposal Network (RPN), what is used in the last layer for calculating the objectness scores at each sliding window position?

  1. Softmax
  2. Linear SVM
  3. ReLU
  4. Sigmoid

Answer: 1
Explanation: In a Region Proposal Network (RPN), the authors of the Faster R-CNN paper use a two-class softmax layer to calculate the objectness scores for each proposal at each sliding window position.

Q6. In R-CNN, the regression model outputs the actual absolute coordinates of the bounding boxes?

  1. Yes
  2. No

Answer: 2
Explanation: In R-CNN, the regression model outputs the deltas or the relative coordinate change of the bounding boxes instead of absolute coordinates. Refer to Appendix C of this research paper to understand more.

Q7. Is Dropout a form of Regularization?

  1. Yes
  2. No

Answer: 1
Explanation: Dropout, applied to a layer, consists of randomly dropping out (setting to zero) a number of output features of the layer during training. Because any node can be zeroed out, the network cannot rely on any single feature and has to spread out its weights, which has an effect similar to regularization.

Q8. A fully convolutional network can be used for

  1. Image Segmentation
  2. Object Detection
  3. Image Classification
  4. All of the above

Answer: 4
Explanation: We can use a fully convolutional network for all of the above mentioned tasks. For instance, for image segmentation we have U-Net, for object detection we have YOLO etc.

Computer Vision Quiz-1

Q1. Which of the following is not a good evaluation metric for Multi-label classification?

  1. Mean Average Precision at K
  2. Hamming Score
  3. Accuracy
  4. Top k categorical accuracy

Answer: 3
Explanation: Accuracy is not a good evaluation metric for multi-label classification. In multi-label classification each example can be assigned to multiple classes, so if the predicted output was [0, 0, 0, 0, 1, 1, 0] and the correct output was [1, 1, 0, 0, 0, 0, 0], the element-wise accuracy would still be 3/7, even though it should be 0 because none of the positive classes were predicted correctly.
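
A small NumPy sketch (my own example) of why element-wise accuracy is misleading here:

```python
import numpy as np

y_true = np.array([1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([0, 0, 0, 0, 1, 1, 0])

# Element-wise ("Hamming") accuracy looks deceptively reasonable...
print((y_true == y_pred).mean())       # 0.428... (3 of 7 positions match)

# ...even though no positive label was predicted correctly and the
# exact-match (subset) accuracy is 0
print(np.array_equal(y_true, y_pred))  # False
```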

Q2. Which of the following are the hyperparameters for a Pooling layer?

  1. filter size
  2. stride
  3. which type of Pooling to use (max or average)
  4. All of the above

Answer: 4
Explanation: All of the above are hyperparameters of a Pooling layer.

Q3. Images are an example of ________ data?

  1. Structured
  2. Unstructured

Answer: 2
Explanation: Structured data refers to data where each feature has a well-defined meaning; the opposite is true for unstructured data. So, images are an example of unstructured data.

Q4. For image classification, MaxPooling tends to work better than average pooling?

  1. Yes
  2. No

Answer: 1
Explanation: Because in image classification our main aim is to identify whether a feature is present or not, MaxPooling tends to work better than average pooling.

Q5. What is Pointwise Convolution?

  1. 1×1 convolution
  2. Strided Convolution
  3. convolution followed by MaxPool
  4. convolution followed by Dropout

Answer: 1
Explanation: According to the MobileNet paper, "the pointwise convolution then applies a 1×1 convolution to combine the outputs of the depthwise convolution". Refer to Section 3.1 of this research paper to understand more.

Q6. What is a Region Proposal network?

  1. a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position
  2. a fully connected network that simultaneously predicts object bounds and objectness scores at each position
  3. a fully convolutional network that predicts only the objectness scores at each position
  4. a fully connected network that predicts only the object bounds at each position

Answer: 1
Explanation: According to the Faster R-CNN paper, Region Proposal network (RPN) is a fully convolutional network that takes an image(of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. Refer to Section 3.1 of this research paper to understand more.

Q7. In MobileNetv2, the Depthwise Separable Convolutions are replaced by _________ ?

  1. Normal Convolution
  2. Strided Convolution
  3. Bottleneck Residual Block (Inverted Residuals and Linear Bottleneck)
  4. Residual Blocks

Answer: 3
Explanation: In MobileNetv2, the Depthwise Separable Convolutions are replaced by Bottleneck Residual Block (Inverted Residuals and Linear Bottleneck). Refer to Table 1 of this research paper to understand more.

Q8. Can we use Convolutional Neural Networks for image classification?

  1. Yes
  2. No

Answer: 1
Explanation: Generally, Convolutional Neural Networks are preferred for any image related tasks such as image classification, object detection etc.

Multi-Class Classification

In the previous blog, we discussed the binary classification problem, where each image can contain only one class out of two classes. So, in this blog, we will extend this to the multi-class classification problem, where we classify each image into one of three or more classes. So, let's get started.

Here, we will use the CIFAR-10 dataset, developed by the Canadian Institute for Advanced Research (CIFAR). The CIFAR-10 dataset consists of 60000 (32×32) color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The classes are completely mutually exclusive. Below are the classes in the dataset, as well as 10 random images from each class.

Source: CIFAR-10

1. Load the Data

The CIFAR-10 dataset can be downloaded using either of the following two methods:

  • Using Keras builtin datasets
  • From the official website

Method-1

Downloading using the Keras builtin datasets is pretty straightforward and simple. The data is already arranged in the shape appropriate for CNN input. No headache, just write one line of code and you are done.
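
For reference, that one line looks roughly like this (the original snippet is not shown in the post, but this is the standard Keras API):

```python
from keras.datasets import cifar10

# x_train: (50000, 32, 32, 3), y_train: (50000, 1); x_test/y_test: 10000 images
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
```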

Method-2

The data can also be downloaded from the official website. The only catch is that it is not in a format that can be fed directly to the model. Let's see how the dataset is arranged.

The training set is broken into 5 files so as to prevent your machine from running out of memory. Each file contains a dictionary of the data and the corresponding labels. The data is a 10000×3072 array, where 10000 is the number of images and 3072 are the pixel values in row-major order. So, the first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. You need to convert each row into a (32, 32, 3) color image.

Steps:

  • First, unpickle all the train and test files
  • Then convert the image format to (width x height x num_channel)

Then append all the unpickled train files into one array.
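
The original code is not reproduced here, but a minimal sketch of these steps, following the file layout described on the official CIFAR-10 page, could look like this (the paths assume you extracted the official archive in the working directory):

```python
import pickle
import numpy as np

def unpickle(file):
    # Each CIFAR-10 batch file is a pickled dict with b'data' and b'labels' keys
    with open(file, 'rb') as fo:
        return pickle.load(fo, encoding='bytes')

train_batches = ['cifar-10-batches-py/data_batch_%d' % i for i in range(1, 6)]

images, labels = [], []
for name in train_batches:
    batch = unpickle(name)
    # 10000x3072 row-major rows -> (10000, 3, 32, 32) -> (10000, 32, 32, 3)
    images.append(batch[b'data'].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1))
    labels.extend(batch[b'labels'])

x_train = np.concatenate(images)
y_train = np.array(labels)
```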

Split the data into train and validation

Because the training data contains images in random order, a simple split will be sufficient. Another way is to take some percentage of images from each of the 5 train files to constitute a validation set.
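
A minimal sketch of such a split; the validation size (5000) and the pixel scaling to [0, 1] are my choices here, not necessarily what the original post used:

```python
# Scale pixel values, then hold out the last 5000 training images for validation
x_train = x_train.astype('float32') / 255.0
x_val, y_val = x_train[-5000:], y_train[-5000:]
x_train, y_train = x_train[:-5000], y_train[:-5000]
```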

To make sure that this split leads to a uniform proportion of examples for each class, we can plot the counts of each class in the validation dataset. Below is the bar plot; it looks like all the classes are uniformly distributed in the validation set.
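
The plot itself is not reproduced here; a sketch of how one might generate it:

```python
import numpy as np
import matplotlib.pyplot as plt

classes, counts = np.unique(y_val, return_counts=True)
plt.bar(classes, counts)
plt.xlabel('class id')
plt.ylabel('images in validation set')
plt.title('Class distribution of the validation set')
plt.show()
```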

Model Architecture

Since the images contain a diverse amount of information, we will need a bigger network. But the bigger the network, the higher the chances of overfitting, so to prevent this we may need to apply some regularization techniques.
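
The exact architecture used in the post is not shown here; an illustrative Keras model in the same spirit (stacked Conv blocks with dropout as regularization) might look like this:

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

# Illustrative architecture only; layer sizes and dropout rates are assumptions
model = Sequential([
    Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=(32, 32, 3)),
    Conv2D(32, (3, 3), padding='same', activation='relu'),
    MaxPooling2D((2, 2)),
    Dropout(0.25),
    Conv2D(64, (3, 3), padding='same', activation='relu'),
    Conv2D(64, (3, 3), padding='same', activation='relu'),
    MaxPooling2D((2, 2)),
    Dropout(0.25),
    Flatten(),
    Dense(512, activation='relu'),
    Dropout(0.5),
    Dense(10, activation='softmax'),
])
```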

Data Augmentation
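
The augmentation code from the original post is not shown; a typical setup with Keras' ImageDataGenerator (the specific transforms below are assumptions) could be:

```python
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,        # small random rotations
    width_shift_range=0.1,    # random horizontal shifts
    height_shift_range=0.1,   # random vertical shifts
    horizontal_flip=True,     # mirror images left/right
)
datagen.fit(x_train)
```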

Fit the model using the fit_generator
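
A sketch of compiling the model and fitting it with the generator defined above (the optimizer choice and epoch count are my assumptions):

```python
from keras.utils import to_categorical

# One-hot encode the integer labels for the 10 classes
y_train_cat = to_categorical(y_train, 10)
y_val_cat = to_categorical(y_val, 10)

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit_generator(
    datagen.flow(x_train, y_train_cat, batch_size=64),
    steps_per_epoch=len(x_train) // 64,
    epochs=50,
    validation_data=(x_val, y_val_cat),
)
```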

Let’s visualize the training events using the History() callback.
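
Since fit_generator already returns a History object, a plotting sketch could be:

```python
import matplotlib.pyplot as plt

# Older Keras versions log 'acc'/'val_acc'; newer ones use 'accuracy'/'val_accuracy'
plt.plot(history.history['acc'], label='train accuracy')
plt.plot(history.history['val_acc'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()
```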

That seems pretty nice. You can play with the architecture, optimizers and other hyperparameters to obtain even more accuracy. Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Calculating Screen Time of an Actor using Deep Learning

The screen time of an actor in a movie or an episode is very important. Many actors get paid according to their total screen time. Moreover, we may also want to know how much time our favorite character was on screen. So, have you ever wondered how you can calculate the total screen time of an actor? One plausible answer is deep learning.

With the advancement of deep learning, it is now possible to solve various difficult problems. In this blog, we will learn how to use transfer learning and image classification concepts from deep learning to calculate the screen time of an actor.

To solve any problem with deep learning, the first requirement is the data. For this tutorial, we will use a video clip from the famous TV show “Friends”. We are going to calculate the screen time of my favorite character “Ross”.

Creating Dataset

First, we need to get a video. To do this, I downloaded a video from YouTube using the pytube library. For a deeper understanding of pytube you can follow this blog, or use the following code to get started.
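
The original snippet is not reproduced here; a rough sketch with pytube (the URL is a placeholder, and the streams API below applies to recent pytube versions):

```python
from pytube import YouTube

# Placeholder URL; replace with the clip you want to download
url = 'https://www.youtube.com/watch?v=XXXXXXXXXXX'

yt = YouTube(url)
stream = yt.streams.filter(progressive=True, file_extension='mp4').first()
stream.download(filename='friends.mp4')  # filename handling differs across pytube versions
```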

Now we have our data in the form of a video, which is nothing but a sequence of frames (images). Since we are going to solve this problem using image classification, we need to extract the images from this video. For this task, I have used OpenCV as shown below.
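
A minimal sketch of the frame extraction with OpenCV (the file and folder names are assumptions):

```python
import os
import cv2

os.makedirs('frames', exist_ok=True)

cap = cv2.VideoCapture('friends.mp4')
count = 0
while True:
    ret, frame = cap.read()   # ret becomes False at the end of the video
    if not ret:
        break
    cv2.imwrite('frames/frame%d.jpg' % count, frame)
    count += 1
cap.release()
```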

The video is now converted into individual frames. In this problem there are two classes, "Ross" and "No Ross", and to create a dataset we need to separate the images into these two classes manually. For this, I have created a folder named "data" which has two sub-folders, "ross" and "no_ross", and then manually added images to these two sub-folders. After creating the dataset, we are ready to dive into the code and concepts.

Input Data and Preprocessing

We have our data in the form of images. To prepare this data as input to our neural network, we need to do some preprocessing with the following steps (a sketch of the code follows the list):

  • Read all images one by one using OpenCV
  • Resize each image to (224, 224, 3), the input size for the model
  • Divide the pixel values by 255 so that the input features are in the same range
  • Append each image to its corresponding class
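
A sketch of these preprocessing steps (the label encoding 1 = "Ross", 0 = "No Ross" is my assumption):

```python
import os
import cv2
import numpy as np

def load_images(folder, label):
    # Read every image in `folder`, resize it to the VGG16 input size and scale to [0, 1]
    images, labels = [], []
    for name in os.listdir(folder):
        img = cv2.imread(os.path.join(folder, name))
        if img is None:
            continue
        images.append(cv2.resize(img, (224, 224)).astype('float32') / 255.0)
        labels.append(label)
    return images, labels

ross_x, ross_y = load_images('data/ross', 1)
no_ross_x, no_ross_y = load_images('data/no_ross', 0)

X = np.array(ross_x + no_ross_x)
y = np.array(ross_y + no_ross_y)
```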

Transfer Learning

Since we have only 6814 images, it would be difficult to train a neural network on such a small dataset. Here comes the concept of transfer learning.

With the help of transfer learning, we can use the features generated by a model trained on a large dataset in our own model. Here we will use the VGG16 model trained on the "imagenet" dataset. For this, we are using Keras, the TensorFlow high-level API. With Keras, you can directly import the VGG16 model as shown in the code below.
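
The import would look roughly like this (a sketch using the standard Keras API, not necessarily the post's exact snippet):

```python
from keras.applications.vgg16 import VGG16

# Load VGG16 with imagenet weights and without the fully connected top layers
vgg_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
```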

The VGG16 model trained on the imagenet dataset predicts over a thousand classes, but in this problem we only have the binary "Ross" / "No Ross" decision. That's why we use include_top = False above, which signifies that we are not including the fully connected layers from the VGG16 model. Now we will pass our input data through vgg_model and generate the features.
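
A sketch of that feature-extraction step (the batch size is an assumption):

```python
# Run all preprocessed images through VGG16 once; each image yields a 7x7x512 feature map
features = vgg_model.predict(X, batch_size=64, verbose=1)
```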

Network Architectures

Since we are not including the fully connected layers from the VGG16 model, we need to create a model with some fully connected layers and an output layer with a single unit for the "Ross" / "No Ross" decision. The output features from the VGG16 model have shape 7×7×512, which will be the input shape for our model. Here I am also using a dropout layer to make the model less prone to overfitting. Let's see the code:
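
The exact head used in the post is not shown; an illustrative version with a dropout layer and a single sigmoid output could be:

```python
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout

# Layer sizes and the dropout rate are assumptions
model = Sequential([
    Flatten(input_shape=(7, 7, 512)),
    Dense(1024, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid'),   # "Ross" vs "No Ross"
])
```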

Splitting Data into Train and Validation

Now we have the input features from the VGG16 model and our own network architecture defined above. The next thing is to train this neural network, but we still need validation data. We have 6814 images, so we will split them into 5000 training images and 1814 validation images.

According to the two classes and the training/validation split, we also create the corresponding output y labels.
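
One way to do this split, using scikit-learn (the original post may well have split the arrays manually):

```python
from sklearn.model_selection import train_test_split

# 5000 training / 1814 validation examples, shuffled
X_train, X_val, y_train, y_val = train_test_split(
    features, y, train_size=5000, test_size=1814, random_state=42)
```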

Training the Network

All set, we are ready to train our model. Here we will use stochastic gradient descent as the optimizer and binary cross-entropy as the loss function. We will also save a checkpoint of the best model according to its validation accuracy.

I am using a batch size of 64 and 10 epochs to train.
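
A sketch of the training setup described above (the checkpoint file name is an assumption; newer Keras versions monitor 'val_accuracy' instead of 'val_acc'):

```python
from keras.optimizers import SGD
from keras.callbacks import ModelCheckpoint

model.compile(optimizer=SGD(), loss='binary_crossentropy', metrics=['accuracy'])

# Save the weights of the best model according to validation accuracy
checkpoint = ModelCheckpoint('best_model.h5', monitor='val_acc',
                             save_best_only=True, mode='max')

model.fit(X_train, y_train,
          batch_size=64,
          epochs=10,
          validation_data=(X_val, y_val),
          callbacks=[checkpoint])
```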

The training and validation accuracy look quite pleasing. Now let's calculate the screen time of "Ross".

Calculating Screen Time

To test the trained model and calculate the screen time, I downloaded another "Friends" video clip from YouTube and extracted the images. To calculate the screen time, I first used the trained model to predict the class of each image, either "Ross" or "No Ross". Since the video is made up of 24 frames per second, we count the number of frames predicted as containing "Ross" and then divide that count by 24 to get the number of seconds "Ross" was on screen.
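
A sketch of this counting step (the folder name and the 0.5 decision threshold are assumptions):

```python
import os
import cv2
import numpy as np

ross_frames = 0
for name in os.listdir('test_frames'):
    img = cv2.imread(os.path.join('test_frames', name))
    img = cv2.resize(img, (224, 224)).astype('float32') / 255.0
    feat = vgg_model.predict(np.expand_dims(img, axis=0))  # 1x7x7x512 features
    if model.predict(feat)[0][0] > 0.5:                    # classified as "Ross"
        ross_frames += 1

screen_time = ross_frames / 24.0   # the clip runs at 24 frames per second
print('Ross was on screen for about %d seconds' % screen_time)
```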

This test video clip is made up of 24 frames per second, and the number of frames predicted as containing "Ross" is 4715. So the screen time for Ross is 4715/24 ≈ 196 seconds.

Summary

We see good accuracy on the train and validation datasets, but when I tested the model on the test dataset, the accuracy was about 65%. One reason I identified is the small amount of training data; if you can get more data, the accuracy can be higher. Another reason can be covariate shift, meaning the test dataset is quite different from the training dataset due to different video quality.

This type of technique can be very helpful in calculating the screen time of a particular character.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.