Screen time of an actor in a movie or an episode is very important. Many actors get paid according to their total screen time. Moreover, we also want to know how much time our favorite character appeared on screen. So, have you ever wondered how you can calculate the total screen time of an actor? One plausible answer is with deep learning.
With the advancement of deep learning, it is now possible to solve many difficult problems. In this blog, we will learn how to use the transfer learning and image classification concepts of deep learning to calculate the screen time of an actor.
To solve any problem with deep learning, the first requirement is data. For this tutorial, we will use a video clip from the famous TV show “Friends”. We are going to calculate the screen time of my favorite character, “Ross”.
Creating Dataset
First, we need to get a video. To do this, I downloaded a video from YouTube using the pytube library. For more understanding of pytube, you can follow this blog or use the following code to get started.
```python
from pytube import YouTube as yt

video_link = 'https://www.youtube.com/watch?v=jbRVoTL5djs'
vid = yt(video_link)
stream = vid.streams.first()
stream.download()
```
Now we have our data in the form of a video, which is nothing but a sequence of frames (images). Since we are going to solve this problem using image classification, we need to extract the images from this video. For this task, I have used OpenCV as shown below.
```python
import cv2
import os

# Open the video file and save every frame as a JPEG image
cap = cv2.VideoCapture('Friends - Unagi.mp4')
image_folder = 'img'
os.makedirs(image_folder, exist_ok=True)
i = 0

while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imwrite(image_folder + '/' + str(i) + '.jpg', frame)
    i += 1

cap.release()
cv2.destroyAllWindows()
```
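The screen-time calculation later in this post assumes a frame rate of 24 fps. Rather than hard-coding that number, you can read the actual frame rate from the video itself. A minimal sketch, assuming OpenCV 3 or newer (which exposes the CAP_PROP_FPS property):

```python
import cv2

cap = cv2.VideoCapture('Friends - Unagi.mp4')
fps = cap.get(cv2.CAP_PROP_FPS)  # frames per second reported by the video container
print('Frame rate: %.2f fps' % fps)
cap.release()
```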
The video is now converted into individual frames. In this problem, there are two classes: “Ross” and “No Ross”. To create a dataset, we need to separate the images into these two classes manually. For this, I have created a folder named “data” which has two sub-folders, “ross” and “no_ross”, and then manually sorted the extracted frames into them.
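If you want to script the folder setup before sorting, here is a small sketch; the “data”, “ross”, and “no_ross” names simply follow the structure described above:

```python
import os

# Create the folder structure used for manual labeling:
# data/ross     -> frames containing Ross
# data/no_ross  -> frames without Ross
for folder in ['data/ross', 'data/no_ross']:
    os.makedirs(folder, exist_ok=True)
```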
Input Data and Preprocessing
We have our data in the form of images. To prepare this data as input to our neural network, we need some preprocessing, with the following steps:
- Read all images one by one using OpenCV
- Resize each image to (224, 224, 3), the input size expected by the model
- Divide the pixel values by 255 so that all input features to the neural network are in the same range
- Append each image to a list for its corresponding class
```python
from tqdm import tqdm
import cv2
import os
import numpy as np

img_path = 'D:/Downloads/youtube/train/data_1'
class1_data = []
class2_data = []

for classes in os.listdir(img_path):
    fin_path = os.path.join(img_path, classes)
    for fin_classes in tqdm(os.listdir(fin_path)):
        img = cv2.imread(os.path.join(fin_path, fin_classes))
        img = cv2.resize(img, (224, 224))
        img = img / 255.
        if classes == 'ross':
            class1_data.append(img)
        else:
            class2_data.append(img)

class1_data = np.array(class1_data)
class2_data = np.array(class2_data)
```
Transfer Learning
Since we have only 6,814 images, it would be difficult to train a neural network from scratch on such a small dataset. Here comes the concept of transfer learning.
With the help of transfer learning, we can use the features generated by a model trained on a large dataset in our own model. Here we will use the VGG16 model trained on the ImageNet dataset. For this, we are using the Keras high-level API. With Keras, you can import the VGG16 model directly, as shown in the code below.
```python
import keras
from keras.applications import VGG16

vgg_model = VGG16(include_top=False, weights='imagenet')
```
The VGG16 model trained on ImageNet predicts over a large number of classes, but our problem is binary: “Ross” or “No Ross”. That is why we use include_top=False above, which means we do not include the fully connected layers from the VGG16 model. Now we will pass our input data through vgg_model to generate the features.
```python
vgg_class1 = vgg_model.predict(class1_data)
vgg_class2 = vgg_model.predict(class2_data)
```
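It is worth sanity-checking the shape of these features before building the classifier. With include_top=False, VGG16 on 224x224 inputs produces a (7, 7, 512) feature map per image, which is why the next section flattens to 7*7*512:

```python
# Each entry should be (num_images_in_class, 7, 7, 512)
print(vgg_class1.shape)
print(vgg_class2.shape)
```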
Network Architecture
Since we are not including the fully connected layers from the VGG16 model, we need to create a model with some fully connected layers and an output layer with a single unit, predicting “Ross” or “No Ross”. The output features from the VGG16 model have shape (7, 7, 512), which we flatten to 7*7*512 = 25,088 values as the input shape for our model. I am also using dropout layers to make the model less prone to overfitting. Let's see the code:
```python
from keras.layers import Input, Dense, Dropout
from keras.models import Model

inputs = Input(shape=(7*7*512,))
dense1 = Dense(1024, activation='relu')(inputs)
drop1 = Dropout(0.5)(dense1)
dense2 = Dense(512, activation='relu')(drop1)
drop2 = Dropout(0.5)(dense2)
outputs = Dense(1, activation='sigmoid')(drop2)

model = Model(inputs, outputs)
model.summary()
```
Splitting Data into Train and Validation
Now we have the input features from the VGG16 model and our own network architecture defined above. The next step is to train this neural network, but we still need validation data. We have 6,814 images, so we will split them into 5,000 training images and 1,814 validation images.
```python
train_data = np.concatenate((vgg_class1[:3000], vgg_class2[:2000]), axis=0)
train_data = train_data.reshape(train_data.shape[0], 7*7*512)
valid_data = np.concatenate((vgg_class1[3000:], vgg_class2[2000:]), axis=0)
valid_data = valid_data.reshape(valid_data.shape[0], 7*7*512)
```
Matching the order in which we concatenated the class 1 and class 2 features for the training and validation data, we create the corresponding output labels (0 for “Ross”, 1 for “No Ross”).
```python
train_label = np.array([0]*vgg_class1[:3000].shape[0] + [1]*vgg_class2[:2000].shape[0])
valid_label = np.array([0]*vgg_class1[3000:].shape[0] + [1]*vgg_class2[2000:].shape[0])
```
Training the Network
All set, we are ready to train our model. Here, we will use stochastic gradient descent as the optimizer and binary cross-entropy as the loss function. We are also going to save a checkpoint of the best model according to validation accuracy.
```python
import tensorflow as tf
from keras.callbacks import ModelCheckpoint

tf.logging.set_verbosity(tf.logging.ERROR)

model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])

filepath = "best_model.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1,
                             save_best_only=True, mode='max')
callbacks_list = [checkpoint]
```
I am using a batch size of 64 and training for 10 epochs.
```python
model.fit(train_data, train_label, epochs=10, batch_size=64,
          validation_data=(valid_data, valid_label), verbose=2,
          callbacks=callbacks_list)
```
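To see how training went, you can plot the learning curves. This is a sketch under two assumptions: you stored the return value of model.fit above in a variable named history (Keras returns a History object, and this Keras version uses the 'acc'/'val_acc' keys, consistent with the checkpoint above), and matplotlib is installed:

```python
import matplotlib.pyplot as plt

# `history` is assumed to be the return value of model.fit above
plt.plot(history.history['acc'], label='train accuracy')
plt.plot(history.history['val_acc'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()
```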
The training and validation accuracy look quite pleasing. Now let's calculate the screen time of “Ross”.
Calculating Screen Time
To test the trained model and calculate the screen time, I downloaded another “Friends” video clip from YouTube and extracted its images. To calculate the screen time, I first used the trained model to predict the class of each image: “Ross” or “No Ross”. Since the video runs at 24 frames per second, we count the number of frames predicted as containing “Ross” and then divide by 24 to get the number of seconds “Ross” was on screen.
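Since we saved a checkpoint of the best model during training, it is safer to evaluate with those weights rather than whatever the model ended up with after the last epoch. A small sketch, assuming the best_model.hdf5 file written by the ModelCheckpoint callback above:

```python
from keras.models import load_model

# Restore the weights that achieved the best validation accuracy
model = load_model('best_model.hdf5')
```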
```python
import os
import cv2
import numpy as np
from tqdm import tqdm

ross_images = []
no_ross_images = []
test_path = 'D:/Downloads/youtube/test/data_4/test_images'

for test in tqdm(os.listdir(test_path)):
    test_img = cv2.imread(os.path.join(test_path, test))
    test_img = cv2.resize(test_img, (224, 224))
    test_img = test_img / 255.
    test_img = np.expand_dims(test_img, 0)
    pred_img = vgg_model.predict(test_img)
    pred_feat = pred_img.reshape(1, 7*7*512)
    out_class = model.predict(pred_feat)
    # Label 0 corresponds to "Ross", so predictions below 0.5 count as Ross frames
    if out_class < 0.5:
        ross_images.append(out_class)
    else:
        no_ross_images.append(out_class)
```
This test video clip runs at 24 frames per second, and the number of frames predicted as containing “Ross” is 4,715. So the screen time for Ross is 4715/24 ≈ 196 seconds.
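Putting those numbers together in code (24 is the frame rate of this clip; alternatively, use the value read via CAP_PROP_FPS earlier):

```python
fps = 24  # frame rate of the test clip
screen_time_seconds = len(ross_images) / fps
print('Screen time of Ross: %d seconds' % screen_time_seconds)
```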
Summary
We saw good accuracy on the training and validation datasets, but when I tested the model on the test dataset, the accuracy was only about 65%. One reason I identified is too little training data; if you can get more data, the accuracy can be higher. Another possible reason is covariate shift, meaning the test dataset is quite different from the training dataset, for example due to different video quality.
This type of technique can be very helpful in calculating the screen time of a particular character.
Hope you enjoyed reading.
If you have any doubts or suggestions, please feel free to ask, and I will do my best to help or improve. Good-bye until next time.