Screen time of an actor in a movie or an episode is very important. Many actors get paid according to their total screen time. Moreover, we also want to know how much time our favorite character appeared on screen. So, have you ever wondered how you can calculate the total screen time of an actor? One plausible answer is with deep learning.
With the advancement of deep learning, it is now possible to solve many difficult problems. In this blog, we will learn how to use the transfer learning and image classification concepts of deep learning to calculate the screen time of an actor.
To solve any problem with deep learning, the first requirement is data. For this tutorial, we will use a video clip from the famous TV show “Friends”. We are going to calculate the screen time of my favorite character, “Ross”.
Creating Dataset
First, we need to get a video. To do this, I downloaded a video from YouTube using the pytube library. For more understanding of pytube, you can follow this blog or use the following code to get started.
```python
from pytube import YouTube as yt

video_link = 'https://www.youtube.com/watch?v=jbRVoTL5djs'
vid = yt(video_link)
stream = vid.streams.first()
stream.download()
```
Now we have our data in the form of a video, which is nothing but a sequence of frames (images). Since we are going to solve this problem using image classification, we need to extract the images from this video. For this task, I have used OpenCV as shown below.
```python
import cv2
import os

# Open the video file and save every frame as a JPEG image
cap = cv2.VideoCapture('Friends - Unagi.mp4')
image_folder = 'img'
os.makedirs(image_folder, exist_ok=True)
i = 0

while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imwrite(image_folder + '/' + str(i) + '.jpg', frame)
    i += 1

cap.release()
cv2.destroyAllWindows()
```
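The screen-time calculation later in this post assumes a frame rate of 24 fps. Rather than hard-coding that number, you can read the actual frame rate from the video itself. A minimal sketch, assuming OpenCV 3 or newer (which exposes the CAP_PROP_FPS property):

```python
import cv2

cap = cv2.VideoCapture('Friends - Unagi.mp4')
fps = cap.get(cv2.CAP_PROP_FPS)  # frames per second reported by the video container
print('Frame rate: %.2f fps' % fps)
cap.release()
```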
The video is now converted into individual frames. In this problem, there are two classes: “Ross” and “No Ross”. To create a dataset, we need to separate the images into these two classes manually. For this, I have created a folder named “data” which has two sub-folders, “ross” and “no_ross”, and then manually sorted the extracted frames into them.
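If you want to script the folder setup before sorting, here is a small sketch; the “data”, “ross”, and “no_ross” names simply follow the structure described above:

```python
import os

# Create the folder structure used for manual labeling:
# data/ross     -> frames containing Ross
# data/no_ross  -> frames without Ross
for folder in ['data/ross', 'data/no_ross']:
    os.makedirs(folder, exist_ok=True)
```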
Input Data and Preprocessing
We have our data in the form of images. To prepare this data as input to our neural network, we need some preprocessing, with the following steps:
- Read all images one by one using OpenCV
- Resize each image to (224, 224, 3), the input size expected by the model
- Divide the pixel values by 255 so that all input features to the neural network are in the same range
- Append each image to a list for its corresponding class
```python
from tqdm import tqdm
import cv2
import os
import numpy as np

img_path = 'D:/Downloads/youtube/train/data_1'
class1_data = []
class2_data = []

for classes in os.listdir(img_path):
    fin_path = os.path.join(img_path, classes)
    for fin_classes in tqdm(os.listdir(fin_path)):
        img = cv2.imread(os.path.join(fin_path, fin_classes))
        img = cv2.resize(img, (224, 224))
        img = img / 255.
        if classes == 'ross':
            class1_data.append(img)
        else:
            class2_data.append(img)

class1_data = np.array(class1_data)
class2_data = np.array(class2_data)
```
Transfer Learning
Since we have only 6,814 images, it would be difficult to train a neural network from scratch on such a small dataset. Here comes the concept of transfer learning.
With the help of transfer learning, we can use the features generated by a model trained on a large dataset in our own model. Here we will use the VGG16 model trained on the ImageNet dataset. For this, we are using the Keras high-level API. With Keras, you can import the VGG16 model directly, as shown in the code below.
```python
import keras
from keras.applications import VGG16

vgg_model = VGG16(include_top=False, weights='imagenet')
```
The VGG16 model trained on ImageNet predicts over a large number of classes, but our problem is binary: “Ross” or “No Ross”. That is why we use include_top=False above, which means we do not include the fully connected layers from the VGG16 model. Now we will pass our input data through vgg_model to generate the features.
```python
vgg_class1 = vgg_model.predict(class1_data)
vgg_class2 = vgg_model.predict(class2_data)
```
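It is worth sanity-checking the shape of these features before building the classifier. With include_top=False, VGG16 on 224x224 inputs produces a (7, 7, 512) feature map per image, which is why the next section flattens to 7*7*512:

```python
# Each entry should be (num_images_in_class, 7, 7, 512)
print(vgg_class1.shape)
print(vgg_class2.shape)
```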
Network Architecture
Since we are not including the fully connected layers from the VGG16 model, we need to create a model with some fully connected layers and an output layer with a single unit, predicting “Ross” or “No Ross”. The output features from the VGG16 model have shape (7, 7, 512), which we flatten to 7*7*512 = 25,088 values as the input shape for our model. I am also using dropout layers to make the model less prone to overfitting. Let's see the code:
```python
from keras.layers import Input, Dense, Dropout
from keras.models import Model

inputs = Input(shape=(7*7*512,))
dense1 = Dense(1024, activation='relu')(inputs)
drop1 = Dropout(0.5)(dense1)
dense2 = Dense(512, activation='relu')(drop1)
drop2 = Dropout(0.5)(dense2)
outputs = Dense(1, activation='sigmoid')(drop2)

model = Model(inputs, outputs)
model.summary()
```
Splitting Data into Train and Validation
Now we have the input features from the VGG16 model and our own network architecture defined above. The next step is to train this neural network, but we still need validation data. We have 6,814 images, so we will split them into 5,000 training images and 1,814 validation images.
```python
train_data = np.concatenate((vgg_class1[:3000], vgg_class2[:2000]), axis=0)
train_data = train_data.reshape(train_data.shape[0], 7*7*512)
valid_data = np.concatenate((vgg_class1[3000:], vgg_class2[2000:]), axis=0)
valid_data = valid_data.reshape(valid_data.shape[0], 7*7*512)
```
Matching the order in which we concatenated the class 1 and class 2 features for the training and validation data, we create the corresponding output labels (0 for “Ross”, 1 for “No Ross”).
```python
train_label = np.array([0]*vgg_class1[:3000].shape[0] + [1]*vgg_class2[:2000].shape[0])
valid_label = np.array([0]*vgg_class1[3000:].shape[0] + [1]*vgg_class2[2000:].shape[0])
```
Training the Network
All set, we are ready to train our model. Here, we will use stochastic gradient descent as the optimizer and binary cross-entropy as the loss function. We are also going to save a checkpoint of the best model according to validation accuracy.
```python
import tensorflow as tf
from keras.callbacks import ModelCheckpoint

tf.logging.set_verbosity(tf.logging.ERROR)

model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])

filepath = "best_model.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1,
                             save_best_only=True, mode='max')
callbacks_list = [checkpoint]
```
I am using a batch size of 64 and training for 10 epochs.
```python
model.fit(train_data, train_label, epochs=10, batch_size=64,
          validation_data=(valid_data, valid_label), verbose=2,
          callbacks=callbacks_list)
```
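To see how training went, you can plot the learning curves. This is a sketch under two assumptions: you stored the return value of model.fit above in a variable named history (Keras returns a History object, and this Keras version uses the 'acc'/'val_acc' keys, consistent with the checkpoint above), and matplotlib is installed:

```python
import matplotlib.pyplot as plt

# `history` is assumed to be the return value of model.fit above
plt.plot(history.history['acc'], label='train accuracy')
plt.plot(history.history['val_acc'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()
```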
The training and validation accuracy look quite pleasing. Now let's calculate the screen time of “Ross”.
Calculating Screen Time
To test the trained model and calculate the screen time, I downloaded another “Friends” video clip from YouTube and extracted its images. To calculate the screen time, I first used the trained model to predict the class of each image: “Ross” or “No Ross”. Since the video runs at 24 frames per second, we count the number of frames predicted as containing “Ross” and then divide by 24 to get the number of seconds “Ross” was on screen.
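Since we saved a checkpoint of the best model during training, it is safer to evaluate with those weights rather than whatever the model ended up with after the last epoch. A small sketch, assuming the best_model.hdf5 file written by the ModelCheckpoint callback above:

```python
from keras.models import load_model

# Restore the weights that achieved the best validation accuracy
model = load_model('best_model.hdf5')
```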
```python
import os
import cv2
import numpy as np
from tqdm import tqdm

ross_images = []
no_ross_images = []
test_path = 'D:/Downloads/youtube/test/data_4/test_images'

for test in tqdm(os.listdir(test_path)):
    test_img = cv2.imread(os.path.join(test_path, test))
    test_img = cv2.resize(test_img, (224, 224))
    test_img = test_img / 255.
    test_img = np.expand_dims(test_img, 0)
    pred_img = vgg_model.predict(test_img)
    pred_feat = pred_img.reshape(1, 7*7*512)
    out_class = model.predict(pred_feat)
    # Label 0 corresponds to "Ross", so predictions below 0.5 count as Ross frames
    if out_class < 0.5:
        ross_images.append(out_class)
    else:
        no_ross_images.append(out_class)
```
This test video clip runs at 24 frames per second, and the number of frames predicted as containing “Ross” is 4,715. So the screen time for Ross is 4715/24 ≈ 196 seconds.
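Putting those numbers together in code (24 is the frame rate of this clip; alternatively, use the value read via CAP_PROP_FPS earlier):

```python
fps = 24  # frame rate of the test clip
screen_time_seconds = len(ross_images) / fps
print('Screen time of Ross: %d seconds' % screen_time_seconds)
```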
Summary
We saw good accuracy on the training and validation datasets, but when I tested the model on the test dataset, the accuracy was only about 65%. One reason I identified is too little training data; if you can get more data, the accuracy can be higher. Another possible reason is covariate shift, meaning the test dataset is quite different from the training dataset, for example due to different video quality.
This type of technique can be very helpful in calculating the screen time of a particular character.
Hope you enjoyed reading.
If you have any doubts or suggestions, please feel free to ask, and I will do my best to help or improve. Good-bye until next time.