
Detecting low contrast images using Scikit-image

In the previous blog, we discussed what contrast is in image processing and how histograms can help us distinguish between low and high contrast images. If you remember, for a high contrast image the histogram spans the entire dynamic range, while for a low contrast image it covers only a narrow range, as shown below.

So, just by looking at the histogram of an image, we can tell whether it has low or high contrast.

Problem

But what if you have a large number of images, for example when training a computer vision model? In that case, we generally want to remove the low contrast images as they don't provide enough information about the task. But manually examining the histogram of each image would be a tedious and time-consuming task. So, we need to find a way to automate this process.

Solution

Luckily, scikit-image provides a built-in function is_low_contrast() that determines whether an image is low contrast or not. This function returns a boolean, where True indicates low contrast. Below is the syntax of this function.
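At the time of writing, the signature (with its default values) looks like this; check the scikit-image documentation for your installed version:

```python
# skimage.exposure.is_low_contrast(image, fraction_threshold=0.05,
#                                  lower_percentile=1, upper_percentile=99,
#                                  method='linear')
#
# image              - input image (color images are converted to grayscale)
# fraction_threshold - minimum ratio of the image brightness range to the full
#                      range below which the image is called low contrast
# lower_percentile / upper_percentile - intensity percentiles kept for the check
# Returns True if the image is low contrast, False otherwise.
```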

Below is the algorithm that this function uses

  • First, the function converts the image to grayscale
  • Then it disregards the image intensity values below lower_percentile and above upper_percentile. This is similar to the percentile stretching that we did earlier (see here)
  • Then it calculates the full brightness range for the given image datatype. For instance, for 8-bit images, the full brightness range is [0, 255]
  • Finally, it calculates the ratio of the image brightness range to the full brightness range. If this ratio is less than a set threshold (see the fraction_threshold argument above), the image is considered low contrast. For instance, for an 8-bit image, if the image brightness range is [100, 150] and the threshold is 0.1, the ratio is 50/255, which is approximately 0.2. So, this image has high contrast. You should tune this threshold according to your application

I hope you understood this. Now, let’s take an example and see how to implement this.
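A minimal sketch of this (the file name 'image.jpg' and the threshold of 0.35 are just illustrative choices) could look like the following:

```python
from skimage.io import imread
from skimage.exposure import is_low_contrast

# Hypothetical input file; replace with your own image
image = imread('image.jpg')

# The threshold 0.35 is illustrative; tune it for your application
if is_low_contrast(image, fraction_threshold=0.35):
    print('image has low contrast')
else:
    print('image has high contrast')
```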

So, for the below image, this function outputs ‘image has low contrast’ corresponding to the given threshold.

I hope you understood this. Now, in the pre-processing step, you can check whether the image has high or low contrast and then take action accordingly. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Goodbye until next time.

Introduction to SIFT (Scale-Invariant Feature Transform)

In the previous blogs, we discussed some corner detectors such as Harris Corner, Shi-Tomasi, etc. If you remember, these corner detectors were rotation invariant, which basically means, even if the image is rotated we would still be able to detect the same corners. This is obvious because corners remain corners in the rotated image also. But when it comes to scaling, these algorithms suffer and don’t give satisfactory results. This is obvious because if we scale the image, a corner may not remain a corner. Let’s understand this with the help of the following image (Source: OpenCV)

See on the left we have a corner in the small green window. But when this corner is zoomed (see on the right), it no longer remains a corner in the same window. So, this is the issue that scaling poses. I hope you understood this.

So, to solve this, in 2004, D. Lowe of the University of British Columbia, in his paper Distinctive Image Features from Scale-Invariant Keypoints, came up with a new algorithm, the Scale-Invariant Feature Transform (SIFT). This algorithm not only detects the features but also describes them. And the best thing about these features is that they are invariant to changes in

  • Scale
  • Rotation
  • Illumination (partially)
  • Viewpoint (partially)
  • Minor image artifacts/ Noise/ Blur

That’s why this was a breakthrough in this field at that time. So, you can use these features to perform different tasks such as object recognition, tracking, image stitching, etc, and don’t need to worry about scale, rotation, etc. Isn’t this cool and that too around 2004!!!

There are mainly four steps involved in the SIFT algorithm to generate the set of image features

  • Scale-space extrema detection: As is clear from the name, first we search over all scales and image locations (space) and determine the approximate location and scale of feature points (also known as keypoints). In the next blog, we will discuss how this is done, but for now just remember that the first step simply finds the approximate location and scale of the keypoints
  • Keypoint localization: In this step, we take the keypoints detected in the previous step and refine their location and scale to subpixel accuracy. For instance, if the approximate location is 17, then after refinement this may become 17.35 (more precise). Don't worry, we will discuss how this is done in the next blogs. After the refinement step, we discard bad keypoints such as edge points and low contrast keypoints. So, after this step we get a robust set of keypoints.
  • Orientation assignment: Then we calculate the orientation for each keypoint using its local neighborhood. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location for each feature, thereby providing invariance to these transformations.
  • Keypoint descriptor: All the previous steps ensured invariance to image location, scale, and rotation. Finally, we create a descriptor vector for each keypoint such that the descriptor is highly distinctive and partially invariant to the remaining variations such as illumination, 3D viewpoint, etc. This helps in uniquely identifying features. Once we have obtained these features along with their descriptors, we can do whatever we want, such as object recognition, tracking, stitching, etc. This sums up the SIFT algorithm at a coarse level.

Because SIFT is an extensive algorithm, we won't be covering it in a single blog. We will understand each of these 4 steps in separate blogs and finally, we will implement it using OpenCV-Python. As we proceed, we will also understand how this algorithm achieves the scale, rotation, illumination, and viewpoint invariance discussed above.

So, in the next blog, let’s start with the scale-space extrema detection and understand this in detail. See you in the next blog. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Goodbye until next time.

Shi-Tomasi Corner Detector

In the previous blog, we discussed the Harris Corner Detector and saw how this uses a score function to evaluate whether a point is a corner or not. But this algorithm doesn’t always yield satisfactory results. So, in 1994, J. Shi and C. Tomasi in their paper Good Features to Track made a small modification to it which shows better results compared to Harris Corner Detector. So, let’s understand how they improved the algorithm.

As you might remember, the scoring function used by the Harris Corner Detector is R = λ1λ2 − k(λ1 + λ2)², where λ1 and λ2 are the eigenvalues of the matrix M and k is an empirically chosen constant.

Instead of this, Shi-Tomasi proposed a new scoring function: R = min(λ1, λ2)

So, for a pixel, if this score R is greater than a certain threshold then that pixel is considered as a corner. Similar to Harris Corner Detector if we plot this in λ1−λ2 space, then we will get the below plot

So, as we can see that

  • only when λ1 and λ2 are above a minimum value, λmin, is it considered a corner (green region)
  • when either λ1 or λ2 is below the minimum value λmin, it is considered an edge (orange region)
  • when both λ1 and λ2 are below the minimum value λmin, it is considered a flat region (grey region)

So, this is the improvement that Shi-Tomasi did to the Harris Corner Detector. Other than this, the entire algorithm is the same. Now, let’s see how to implement this using OpenCV-Python.

OpenCV

OpenCV provides a built-in function cv2.goodFeaturesToTrack() that finds the N strongest corners in the image using either the Shi-Tomasi or the Harris Corner Detector. Below is the algorithm that this function uses

  • First, this function calculates the corner quality score at every pixel using either the Shi-Tomasi or the Harris method
  • Then it performs non-maximum suppression (only the local maxima in a 3 x 3 neighborhood are retained).
  • After this, all the corners with a quality score less than qualityLevel × max(qualityScore) are rejected, where max(qualityScore) is the quality score of the best corner. For instance, if the best corner has a quality score of 1500 and qualityLevel = 0.01, then all the corners with a quality score less than 15 are rejected.
  • Now, all the remaining corners are sorted by quality score in descending order.
  • Finally, the function throws away each corner for which there is a stronger corner at a distance less than minDistance.

Here is the syntax of this function
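The signature looks roughly like this (defaults shown for the optional parameters):

```python
# corners = cv2.goodFeaturesToTrack(image, maxCorners, qualityLevel, minDistance,
#                                   mask=None, blockSize=3,
#                                   useHarrisDetector=False, k=0.04)
#
# image         - input single-channel (grayscale) image
# maxCorners    - maximum number of corners to return (strongest ones are kept)
# qualityLevel  - minimal accepted quality relative to the best corner score
# minDistance   - minimum possible Euclidean distance between returned corners
# useHarrisDetector, k - switch to the Harris score and its free parameter
```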

Now, let’s take the image we used in the previous blog and detect the top 20 corners. Below is the code for this
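Here is a minimal sketch (the file name 'corner_test.jpg' is a placeholder; replace it with the image from the previous blog):

```python
import cv2

# Hypothetical input file; replace with your own image
img = cv2.imread('corner_test.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect the 20 strongest corners using the Shi-Tomasi method
# (maxCorners=20, qualityLevel=0.01, minDistance=10)
corners = cv2.goodFeaturesToTrack(gray, 20, 0.01, 10)

# Mark each detected corner with a small filled circle
for corner in corners.reshape(-1, 2):
    x, y = int(corner[0]), int(corner[1])
    cv2.circle(img, (x, y), 3, (0, 255, 0), -1)

cv2.imshow('Shi-Tomasi corners', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
```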

Below is the result of this

You can also use the Harris Corner Detector by setting the useHarrisDetector flag and the k parameter in the above function, as shown below.
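For example (reusing the grayscale image from the snippet above):

```python
# Same call, but score corners with the Harris detector instead of Shi-Tomasi
corners = cv2.goodFeaturesToTrack(gray, 20, 0.01, 10,
                                  useHarrisDetector=True, k=0.04)
```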

So, that’s all about Shi-Tomasi Detector.

Limitations

Both the Shi-Tomasi and Harris Corner Detectors work well in most cases, but when the scale of the image changes, neither of these algorithms gives satisfactory results. So, in the next blog, we will discuss one of the famous algorithms for finding scale-invariant features, known as SIFT (Scale-Invariant Feature Transform). This algorithm was a breakthrough in this field. See you in the next blog. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Goodbye until next time.

Harris Corner Detection

In the previous blog, we discussed what features are and why corners are considered a better feature compared to edges and flat surfaces. In this blog, let's discuss one of the most famous and commonly used corner detection methods, known as Harris Corner Detection. This was one of the early attempts to find corners, made by Chris Harris & Mike Stephens in their 1988 paper A Combined Corner and Edge Detector. Now it is called the Harris Corner Detector. So, let's first understand the basic idea behind this algorithm, and then we will dive into the mathematics. Let's get started.

As discussed in the previous blog, corners are regions in the image with large variations in intensity in all directions. For instance, take a look at the below image. If you shift the window by a small amount, then corners will produce a significant change in all directions while edges will output no change if we move the window along the edge direction. And the flat region will output no change in all directions on window movement.

So, the authors took this simple idea of finding the difference in intensity for a displacement of (u,v) in all directions and put it into a mathematical form. This is expressed as
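In its standard form (with w(x,y) the window function and I the image intensity), the error function reads:

```latex
E(u,v) = \sum_{x,y} w(x,y)\,\bigl[\, I(x+u,\, y+v) - I(x,y) \,\bigr]^2
```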

Here,

  • the window function is either a rectangular window or a Gaussian window which gives weights to pixels underneath.
  • E(u,v) is the difference in intensities between the original and the moved window.

As can be clearly seen, for nearly constant patches the error function will be close to 0 while for distinctive patches this will be larger. Hence, our aim is to find patches where this error function is large. In other words, we need to maximize this error function for corner detection. That means we have to maximize the second term. We can do this by applying Taylor Expansion and using some mathematical steps as shown below

So, the final equation becomes
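In the standard derivation, this is the quadratic form below, where I_x and I_y denote the image derivatives in the x and y directions:

```latex
E(u,v) \approx \begin{bmatrix} u & v \end{bmatrix} M \begin{bmatrix} u \\ v \end{bmatrix},
\qquad
M = \sum_{x,y} w(x,y)
\begin{bmatrix} I_x^{2} & I_x I_y \\ I_x I_y & I_y^{2} \end{bmatrix}
```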

Then comes the main part. As we have already discussed, corners are regions in the image with large variations in intensity in all directions. In terms of the above matrix M, we can say that "a corner is characterized by a large variation of E in all directions of the vector [u,v]". And since the eigenvalues of M tell us about this variation, by simply analyzing the eigenvalues of the matrix M we can infer the results.

But the authors note that the exact computation of the eigenvalues is computationally expensive, since it requires the computation of a square root, so they instead suggest the following score function, which determines whether a window contains a corner or not. This is shown below
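This is the well-known Harris response, where det(M) = λ1λ2, trace(M) = λ1 + λ2, and k is an empirically chosen constant (typical values are around 0.04-0.06):

```latex
R = \det(M) - k\,\bigl(\operatorname{trace}(M)\bigr)^{2}
  = \lambda_1 \lambda_2 - k\,(\lambda_1 + \lambda_2)^{2}
```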

Therefore, the algorithm does not have to actually compute the eigenvalue decomposition of the matrix M and instead it is sufficient to evaluate the determinant and trace of matrix M to find the corners.

Now, depending upon the magnitude of the eigenvalues and the score (R), we can decide whether a region is a corner, an edge, or flat.

  • When |R| is small, which happens when λ1 and λ2 are small, the region is flat.
  • When R<0, which happens when λ1>>λ2 or vice versa, the region is an edge.
    • If λ1>>λ2, it is a vertical edge
    • otherwise, a horizontal edge
  • When R is large, which happens when λ1 and λ2 are large and λ1∼λ2, the region is a corner

This can also be represented by the below image

So, this algorithm will give us a score corresponding to each pixel. Then we need to do thresholding in order to find the corners.

Because we consider only the eigenvalues of the matrix M, we are considering quantities that are invariant to rotation, which is important because objects that we are tracking might rotate as well as move. So, this makes the algorithm rotation invariant.

So, this concludes the Harris Corner Detector. I hope you understood this. Now, let’s see how to do this using OpenCV-Python.

OpenCV

OpenCV provides a built-in function cv2.cornerHarris() that runs the Harris Corner Detector on the image. Below is the syntax for this.
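The signature looks roughly like this:

```python
# dst = cv2.cornerHarris(src, blockSize, ksize, k)
#
# src       - input single-channel float32 image
# blockSize - neighborhood size used to build the matrix M for each pixel
# ksize     - aperture parameter of the Sobel operator used for the derivatives
# k         - the Harris free parameter from the score function above
# dst       - output map of Harris scores, same size as src
```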

For each pixel (x,y), it calculates a 2×2 gradient covariance matrix M(x,y) over a blockSize×blockSize neighborhood. Then, using this matrix M, it calculates the score for each pixel. Below is the code for this
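A minimal sketch (again, 'corner_test.jpg' is a placeholder for your own image):

```python
import cv2
import numpy as np

# Hypothetical input file; replace with your own image
img = cv2.imread('corner_test.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# cornerHarris expects a single-channel float32 image
gray = np.float32(gray)

# blockSize=2, ksize=3 (Sobel aperture), k=0.04
dst = cv2.cornerHarris(gray, 2, 3, 0.04)

# Simple thresholding of the score map: mark pixels scoring above 1% of the
# maximum response in red (the threshold is application dependent)
img[dst > 0.01 * dst.max()] = [0, 0, 255]

cv2.imshow('Harris corners', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
```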

Below is the result of this.

So, this is how you can implement the Harris Corner Detector using OpenCV-Python. I hope you understood this. In the next blog, we will discuss the Shi-Tomasi algorithm that improves this Harris Corner Detector even further. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Goodbye until next time.

Feature Detection, Description, and Matching

In the previous blogs, we discussed different segmentation algorithms such as watershed, grabcut, etc. From this blog, we will start another interesting topic known as Feature Detection, Description, and Matching. This has many applications in the field of computer vision, such as image stitching and object tracking, and it serves as the first step for many computer vision applications. Over the past few decades, a number of algorithms have been proposed, but before diving into these algorithms, let's first understand what features are in general and why they are important. So, let's get started.

What is a Feature?

According to Wikipedia, a feature is any piece of information that is relevant for solving any task. For instance, let’s say we have the task of identifying an apple in the image. So, the features useful in this case can be shape, color, texture, etc.

Now that you know what features are, let's try to understand which features are more important than others. For this, let's take the example of image matching. Suppose you are given two images (see below) and your task is to match the rectangle present in the first image with the one in the other. And, let's say you are given 3 feature points: A - flat area, B - edge, and C - corner. So now the question is, which of these is a better feature for matching the rectangle?

Clearly, A is a flat area. So, it's difficult to find the exact location of this point in the other image. Thus, this is not a good feature point for matching. For B (edge), we can find the approximate location but not the accurate location. So, an edge is a better feature compared to the flat area, but still not good enough. But we can easily and accurately locate C (corner) in the other image, and it is thus considered a good feature. So, corners are considered to be good features in an image. These feature points are also known as interest points.

What is a good feature or interest point?

A good feature or interest point is one that is robust to changes in illumination or brightness and to scale, and that can be reliably computed with a high degree of repeatability. It should also give us enough knowledge about the task (see the corner feature points for matching above). Also, a good feature should be unique, distinctive, and global.

So, I hope now you have some idea about the features. Now, let’s take a look at some of the applications of Feature Detection, Description, and Matching.

Applications

  • Object tracking
  • Image matching
  • Object Recognition
  • 3D object reconstruction
  • Image stitching
  • Motion-based segmentation

All these applications follow the same general steps i.e. Feature Detection, Feature Description, and Feature Matching. All these steps are discussed below.

Steps

First, we detect all the feature points. This is known as Feature Detection. There are several algorithms developed for this such as

  • Harris Corner
  • SIFT(Scale Invariant Feature Transform)
  • SURF(Speeded Up Robust Feature)
  • FAST(Features from Accelerated Segment Test)
  • ORB(Oriented FAST and Rotated BRIEF)

We will discuss each of these algorithms in detail in the next blogs.

Then we describe each of these feature points. This is known as Feature Description. Suppose we have 2 images as shown below. Both of these contain corners. So, the question is are they the same or different.

Obviously, both are different, as the first one contains a green area to the lower right while the other one has a green area to the upper right. So, basically what you did is describe both these features, and that has led us to answer the question. Similarly, a computer should also describe the region around a feature so that it can find it in other images. This is feature description. There are also several algorithms for this, such as

  • SIFT(Scale Invariant Feature Transform)
  • SURF(Speeded Up Robust Feature)
  • BRISK (Binary Robust Invariant Scalable Keypoints)
  • BRIEF (Binary Robust Independent Elementary Features)
  • ORB(Oriented FAST and Rotated BRIEF)

As you might have noticed, some of the above algorithms were also listed under feature detection. These algorithms perform both feature detection and description. We will discuss each of these algorithms in detail in the next blogs.

Once we have the features and their descriptors, the next task is to match these features in the different images. This is known as Feature Matching. Below are some of the algorithms for this

  • Brute-Force Matcher
  • FLANN(Fast Library for Approximate Nearest Neighbors) Matcher

We will discuss each of these algorithms in detail in the next blogs. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Goodbye until next time.

Creating gif from video using OpenCV and imageio

In this blog, we will learn how to create gifs from videos using OpenCV and the imageio library. To install the imageio library, simply do pip install imageio. So, let's get started.

Steps

  • Open the video file using cv2.VideoCapture()
  • Read the frames one by one using the cap.read() method
  • Convert each frame to RGB. This is required because imageio accepts images in RGB format.
  • Save the frames in a list and close the video file
  • Convert the frames list to gif using the imageio.mimsave() method. Set the frames per second (fps) according to your application.

Below is the code for this
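A minimal sketch following these steps (the input file name 'sample_video.mp4' is a placeholder):

```python
import cv2
import imageio

# Hypothetical input file; replace with your own video
cap = cv2.VideoCapture('sample_video.mp4')

frames = []
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # OpenCV gives BGR frames; imageio expects RGB
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

cap.release()

# Write the gif; set fps per your application (newer imageio versions may
# prefer a duration argument instead of fps)
imageio.mimsave('output.gif', frames, fps=25)
```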

This is how you convert video to gif. Now, let’s see how to convert a specific part of a video to gif.

Converting a specific part of a video to gif

There might be a case where instead of converting the entire video to a gif, you only want to convert a specific part of the video to a gif. There are several ways you can do this.

Approach 1

Using the fps of the video, we can easily calculate the starting and ending frame number and then extract all the frames lying between these two. Once the specific frames are extracted, we can easily convert them to gifs using imageio as discussed above. Below is the code for this where the frames are extracted from 20 seconds to 25 seconds.
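A sketch of this approach, again assuming a placeholder input file 'sample_video.mp4':

```python
import cv2
import imageio

cap = cv2.VideoCapture('sample_video.mp4')   # hypothetical input file
fps = cap.get(cv2.CAP_PROP_FPS)

# Convert the 20 s - 25 s segment into frame indices using the video fps
start_frame, end_frame = int(20 * fps), int(25 * fps)

frames = []
index = 0
while True:
    ret, frame = cap.read()
    if not ret or index > end_frame:
        break
    if index >= start_frame:
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    index += 1

cap.release()
imageio.mimsave('clip.gif', frames, fps=int(fps))
```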

Approach 2

You can also save the frames manually by pressing some keys. For instance, you can start saving frames when key ‘s’ is pressed and stop saving when key ‘q’ is pressed. Once the specific frames are extracted, we can easily convert them to gifs using imageio as discussed above. Below is the code for this.
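A sketch of this approach (press 's' to start saving frames, 'q' to stop and quit):

```python
import cv2
import imageio

cap = cv2.VideoCapture('sample_video.mp4')   # hypothetical input file
frames = []
saving = False

while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imshow('video', frame)
    key = cv2.waitKey(25) & 0xFF
    if key == ord('s'):            # start saving frames
        saving = True
    elif key == ord('q'):          # stop saving and quit
        break
    if saving:
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

cap.release()
cv2.destroyAllWindows()
imageio.mimsave('clip.gif', frames, fps=25)
```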

Approach 3

This approach is comparatively more tedious. In this, you go over each frame one by one, and if you want to include that frame in the gif, you press the key 'a'. To exit, you press the key 'q'. Once the specific frames are extracted, we can easily convert them to gifs using imageio as discussed above. Below is the code for this.
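A sketch of this approach (press 'a' to keep the displayed frame, any other key to skip it, and 'q' to exit):

```python
import cv2
import imageio

cap = cv2.VideoCapture('sample_video.mp4')   # hypothetical input file
frames = []

while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imshow('video', frame)
    key = cv2.waitKey(0) & 0xFF     # wait on every frame for a key press
    if key == ord('a'):             # include this frame in the gif
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    elif key == ord('q'):           # exit
        break

cap.release()
cv2.destroyAllWindows()
imageio.mimsave('clip.gif', frames, fps=25)
```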

This is how you can convert a specific part of a video to gif. Hope you enjoy reading.

If you have any doubts/suggestions, please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

CTC – Problem Statement

In the previous blog, we had an overview of the text recognition step. There we discussed that, in order to avoid character segmentation, two major techniques have been adopted: one is CTC-based and the other is attention-based. So, in this blog, let's first discuss the intuition behind the CTC algorithm, such as why we even need it and where it is used. Then, in the next blog, we will discuss the algorithm in detail. Here, we will understand it using the text recognition case. Let's get started.

As we have already discussed, in text recognition we are given a segmented image and our task is to recognize what text is present in that segment. Thus, for the text recognition problem, the input is an image while the output is text, as shown below.

So, in order to solve the text recognition problem, we need to develop a model that takes an image as input and outputs the recognized text. If you have ever taken any deep learning class, you must know that Convolutional Neural Networks (CNNs) are good at handling image data, while for sequence data such as text, Recurrent Neural Networks (RNNs) are preferred.

So, for the text recognition problem, an obvious choice would be to use a combination of Convolutional Neural Network and Recurrent Neural Network. Now, let’s discuss how to combine CNN and RNN together for the text recognition task. Below is one such architecture that combines the CNN and RNN together. This is taken from the famous CRNN paper.

In this, first, the input image is fed through a number of convolutional layers to extract the feature maps. These feature maps are then divided into a sequence of feature vectors, as shown by the blue color. These are obtained by dividing the feature maps into columns of single-pixel width. Now, a question might come to your mind: why are we dividing the feature maps into columns? The answer to this question lies in the concept of the receptive field. The receptive field is defined as the region in the input image that a particular CNN feature is looking at. For instance, for the above input image, the receptive field of each feature vector corresponds to a rectangular region in the input image, as shown below.

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

And each of these rectangular regions is ordered from left to right. Thus, each feature vector can be considered as the image descriptor of that rectangular region. These feature vectors are then fed to a bi-directional LSTM. Because of the softmax activation function, this LSTM layer outputs the probability distribution at each time step over the character set. To obtain the per-timestep output, we can either take the max of the probability distribution at each time step or apply any other method.

But as you might have noticed in the above image, these feature vectors sometimes may not contain the complete character. For instance, see the below image where the 2 feature vectors marked in red contain some part of the character "S".

Thus, in the LSTM output, we may get repeated characters as shown below by the red box. We call these per-frame or per-timestep predictions.

Now, here comes the problem. As we have already discussed, for text recognition the training data consists of images and the corresponding text, as shown below.

Training Data for text recognition

Thus, we only know the final output and we don't know the per-timestep predictions. Now, in order to train this network, we either need to know the per-timestep output for each input image, or we need to develop a mechanism to convert the per-timestep output to the final output, or vice versa.

So, the problem is how to align the final output with the per-timestep predictions in order to train this network?

Approach 1

One thing we can do is devise a rule like “one character corresponds to some fixed time steps”. For instance, for the above image, if we have 10 timesteps, then we can repeat “State” as “S, S, T, T, A, A, T, T, E, E” (repeat each character twice) and then train the network. But this approach can be easily violated for different fonts, writing styles, etc.

Approach 2

Another approach can be to manually annotate the data for each time step as shown below. Then train the network using this annotated data. The problem with this approach is that this will be very time consuming for a reasonably sized dataset.

Annotation for each timestep

Clearly, both of the above naïve approaches have some serious downsides. So, isn't there any efficient way to solve this? This is where CTC comes into the picture.

Connectionist Temporal Classification (CTC)

This was introduced in 2006 and is used for training deep networks where alignment is a problem. With CTC, we need not worry about the alignments or the per-timestep predictions. CTC takes care of all the alignments, and we can train the network using only the final outputs. So, in the next blog, let's discuss how the CTC algorithm works. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Text Recognition Datasets

In the previous blog, we built our own text recognition system from scratch using the very famous CNN+RNN+CTC based approach. As you might remember, we got pretty decent results. In order to further fine-tune our model, one thing we can do is more training. But for that, we need more training data. So, in this blog, let's discuss some of the open-source text recognition datasets available and how to create synthetic data for text recognition. Let's get started.

Open Source Datasets

Below are some of the open source text recognition datasets available.

  • The ICDAR datasets: ICDAR stands for the International Conference on Document Analysis and Recognition, which is held every 2 years. It has brought about a series of scene text datasets that have shaped the research community, for instance, the ICDAR-2013 and ICDAR-2015 datasets.
  • MJSynth Dataset: This synthetic word dataset is provided by the Visual Geometry Group, University of Oxford. It consists of 9 million synthetically generated images covering 90k English words and includes training, validation, and test splits.
  • IIIT 5K-word dataset: This is one of the most challenging and largest recognition datasets available. It contains 5000 cropped word images from scene texts and born-digital images. A lexicon of more than 0.5 million dictionary words is also provided with this dataset.
  • The Street View House Numbers (SVHN) Dataset: This dataset contains cropped images of house numbers in natural scenes collected from Google Street View images. It is usually used for digit recognition. You can also use the MNIST handwritten digit dataset.

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Synthetic Data

Similar to text detection, the text recognition task is also not very rich when it comes to data. Thus, in order to further train or fine-tune the model, synthetic data can help. So, let's discuss how to create synthetic data containing different fonts using Python. Here, we will use the famous PIL library. Let's first import the libraries that will be used.

Then, we will create a list of characters that will be used in creating the dataset. This can be easily done using the string library as shown below.
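For example, using digits plus upper- and lower-case letters:

```python
import string

# Characters used to build the synthetic words
char_list = list(string.digits + string.ascii_letters)
```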

Similarly, create a list of fonts that you want to use. Here, I have used 10 different types of fonts as shown below.

Now, we will generate images corresponding to each font. Here, for each font, for each character in the char list, we will generate words. For this, first we choose a random word size as shown below.

Then, we will create a word of length word_size and starting with the current character as shown below.

Now, we need to draw that word onto the image. For that, first we will create a font object for a font of the given size. Here, I've used a font size of 14.

Now, we will create a new image of size (110,20) with white color (255,255,255). Then we will create a drawing context and draw the text at (5,0) with black color(0,0,0) as shown below.
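A sketch of these two steps for a single example word (the font file name is a placeholder; point it at any .ttf file on your system):

```python
from PIL import Image, ImageDraw, ImageFont

word = 'aXf7Qp'                                # example word to render
font = ImageFont.truetype('arial.ttf', 14)     # placeholder font file, size 14

# White 110x20 canvas, a drawing context, then the word in black at (5, 0)
img = Image.new('RGB', (110, 20), (255, 255, 255))
draw = ImageDraw.Draw(img)
draw.text((5, 0), word, (0, 0, 0), font=font)
```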

Finally, save the image and the corresponding text file as shown below.

Below is the full code
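Here is a self-contained sketch of the whole pipeline; the output folder, font file names, and label-file format are assumptions, so adapt them to your setup:

```python
import os
import random
import string
from PIL import Image, ImageDraw, ImageFont

output_dir = 'synthetic_words'                     # hypothetical output folder
os.makedirs(output_dir, exist_ok=True)

fonts = ['arial.ttf', 'times.ttf', 'verdana.ttf']  # placeholder .ttf files
char_list = list(string.digits + string.ascii_letters)

labels = []
count = 0
for font_path in fonts:
    font = ImageFont.truetype(font_path, 14)
    for ch in char_list:
        # Random word length, word starting with the current character
        word_size = random.randint(3, 10)
        word = ch + ''.join(random.choice(char_list) for _ in range(word_size - 1))

        # White 110x20 canvas, draw the word in black at (5, 0)
        img = Image.new('RGB', (110, 20), (255, 255, 255))
        draw = ImageDraw.Draw(img)
        draw.text((5, 0), word, (0, 0, 0), font=font)

        # Save the image and remember its ground-truth text
        img_name = '{}.png'.format(count)
        img.save(os.path.join(output_dir, img_name))
        labels.append('{}\t{}'.format(img_name, word))
        count += 1

# One text file mapping each image to the word drawn on it
with open(os.path.join(output_dir, 'labels.txt'), 'w') as f:
    f.write('\n'.join(labels))
```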

Below are some of the generated images.

To make it more realistic and challenging, you can add some geometric transformations (such as rotation, skewness, etc), or add some noise or even change the background color.

Now, using any of the above datasets, we can further fine-tune our recognition model. That's all for this blog. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Implementation of EAST

In the previous blog, we discussed the theory behind the EAST algorithm. If you remember, we stated that this algorithm is both accurate and efficient. So, in this blog, let's find out. For this, first, we will run the EAST algorithm using its GitHub repository, and then we will analyze the results. So, let's get started. Here, I'm using a Linux system.

Clone the Repository

First, search for "EAST GitHub" in your browser. You will find several EAST implementations, but in this blog we will use the one provided by argman. So, open it and clone the repository. To clone the repository, you can either use git or download it as a zip file. To install git, you can run the following command.
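On an Ubuntu/Debian system, this would typically be:

```bash
sudo apt-get update
sudo apt-get install git
```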

Once you have installed git, clone the repository using the following command.
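Assuming the argman implementation, the clone command looks like:

```bash
git clone https://github.com/argman/EAST.git
```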

This will clone the repository into your system as shown below.

Compile lanms

As you might remember, in the previous blog, we discussed that the EAST algorithm uses a Locality-Aware NMS (lanms) instead of the standard NMS. Now, you need to compile lanms. Why? Because this GitHub implementation contains the lanms code written in C++ (see the lanms folder). So, in order to make it work with Python, we need to generate an adaptor.so file. This can be done as follows.

First, we need to install the g++ compiler in order to compile the adaptor.cpp file. This can be done using the following command.
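On Ubuntu/Debian, the compiler and related build tools come with the build-essential package:

```bash
sudo apt-get install build-essential
```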

This contains the essential tools for building most other packages from source (e.g. C/C++ compiler, libc, and make).

Note: For more details on the Optical Character Recognition , please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Next, open the __init__.py file present inside the lanms folder and comment out the if condition as shown below.

Again open the terminal and change the directory to the lanms folder. After this, run the make command as shown below. This will generate the required adaptor.so file in the lanms folder.
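For example, from the EAST repository root:

```bash
cd lanms
make
```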

Test the model

Now, to test the model, we either need to first train it or find some pre-trained weights, if available. Luckily, pre-trained weights are available. You can download them from here. These are trained on the ICDAR-2013 and ICDAR-2015 datasets.

After downloading the pre-trained weights, extract them and place them inside the EAST folder. Now, to test the model, open the terminal and change the directory to the EAST folder. Also, activate the virtual environment, if any. Then type the following command, giving the arguments.

For arguments, first, we need to specify the test images path as a “test_data_path” argument. Second, we need to specify the recently downloaded checkpoints path as a “checkpoint_path” argument. And lastly, we need to specify the output directory path as an “output_dir” argument as shown below. This will automatically create the output directory if not present.
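The exact script name and flags can vary between EAST forks; in the argman implementation the evaluation script is invoked roughly like this (the paths below are placeholders):

```bash
python eval.py --test_data_path=/path/to/test_images \
               --checkpoint_path=/path/to/east_checkpoints \
               --output_dir=/path/to/output
```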

This will run the EAST algorithm on the test images we provided. Below an output image is shown.

In the next blog, we will explore different text detection datasets that are available. We will also learn how we can create our own text detection dataset. This will help us with training and fine-tuning our EAST model further. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition: Introduction and its Applications

Hello! and welcome to this series on Optical Character Recognition, also known as OCR for short. Most of you might already be familiar with the term OCR; if not, no worries, we will be discussing everything in detail in this series. So, in this blog, let's first start by giving you an introduction to OCR, followed by some motivation for why you should invest your time in learning this. So, let's get started.

What is OCR?

Optical character recognition is a method of converting the text present in images or scanned documents to a machine-readable format that can later be edited, searched, and used for further processing.

The term machine-readable format means the text in electronic form or simply the text that you can select, edit, process, etc. Let’s take an example to understand what this actually means.

Suppose we are given the below image. Clearly, as we can see, there is some text present in the image. But for a computer, this is nothing but an array of pixel values.

The computer doesn’t know whether the image contains text, car, or bus, etc. We can’t select, edit, or do any further processing on the text. Thus, this is not a machine-encoded text.

So, what the OCR system will do is, digitize the printed text, that is, take this image as an input, and outputs a text file containing all the text present in the image. Now, you can do anything with the text you want.

So, now you know coarsely what an OCR is. In the next section, let’s try to understand why you should invest your time in learning this?

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Applications

Let’s take a few real-life examples where OCR has made our life easy. You already might have encountered these in your day-to-day life but might not have thought about how these work.

Automatic Data Entry

This is one of the most prominent applications of OCR. Earlier people used to manually enter the details from business documents, invoices, passports, receipts, etc. But now with the help of OCR, most of these tasks are now automated. Also instead of managing a colossal pile of paper documents, now everything is archived digitally.

For instance, in banks instead of manually entering the cheque details, the cheque is first scanned, and then the OCR extracts all the useful information such as account number, amount, etc. thus leading to faster processing. Similarly, at airports, your passport information is extracted from the Machine-readable zone (MRZ) leading to faster processing.

Vehicle Number Plate Recognition

Almost everyone might have seen or heard about this application. OCR is used to recognise the vehicle registration plate, which can then be used for vehicle tracking, toll collection, etc. This was invented in 1976 in Britain but became popular only after the 1990s.

Self-driving cars

Most of you might be wondering how and where OCR is used in self-driving cars. The answer is recognizing traffic signs. The autonomous car uses OCR to recognize traffic signs and thus take action accordingly. Without this, the self-driving car would pose a risk to both pedestrians and other vehicles on the road.

Book Scanning

OCR is widely used in digitizing scanned documents. For instance, you might have heard about Project Gutenberg that tries to digitize and archive cultural works. Most of these items are available free of cost. Similarly, Google Books scans books, converts them to text using OCR, and stores them in its digital database.

For Visually Impaired persons

In this, we can use OCR to extract the text and use text-to-speech to read the extracted text. This approach was first used around 1976.

Your Personal Translator

Suppose you are roaming in a country whose language you don't speak. You find a signboard that you are not able to understand. Obviously, you can ask someone, but OCR can also help you out in this situation. Just click a photo of that signboard, run the OCR (constraint: the language must be known), extract the text, and then use Google Translate or any other API to translate it into your native language. Isn't this cool!!!

These are just a few of many OCR applications. From these applications, we can see that because of OCR most of the work has now been automated which helps in saving time, money, manpower, etc. Hope these applications have motivated you enough to learn OCR. From the next blog, we will start discussing how the OCR works. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.