SIFT: Scale-Space Extrema Detection

In the previous blog, we had an overview of the SIFT algorithm. We discussed the different steps involved and the invariance it offers against scale, rotation, illumination, viewpoint, etc. But we didn't discuss any of the steps in detail. So, in this blog, let's start with the first step, which is scale-space extrema detection. So, let's get started.

Before moving forward, let's quickly recap what we are doing, why we are doing it, and how we are doing it:

  • We saw how corner detectors like Harris, Shi-Tomasi, etc. suffer when it comes to scaling. (Why we are doing it)
  • So, we want to detect features that are invariant to scale changes and can be robustly detected. (What we are doing)
  • This can be done by searching for stable features (extrema) across all possible scales, using a continuous function of scale known as scale space. Hence the name scale-space extrema detection. (How we are doing it)

I hope you understood this. Now, let's understand what a scale-space is.

What is a Scale-Space?

As we all know, real-world objects are composed of different structures at different scales. For instance, the concept of a “tree” is appropriate at the scale of meters, while concepts such as leaves and molecules are more appropriate at finer scales. It would make no sense to analyze leaves or molecules at the scale of the tree (meters). So, this means that you need to analyze everything at an appropriate scale in order to make sense of it.

But given an image of an unknown scene, there is no a priori way to determine what scales are appropriate for describing the interesting structures in the image data. Hence, the only reasonable approach is to consider descriptions at multiple scales. This representation of images at multiple scales constitutes a so-called scale-space.

How to construct a Scale-Space?

Now, the next question is how to construct a scale-space. We know that as we increase the scale, fine details are lost and only coarser information prevails. Can you relate this to something you did in image processing? Does blurring an image with a low-pass filter sound similar? The answer is yes, but there is a catch. We can't use just any low-pass filter; only the Gaussian filter helps in mimicking a scale space. This is because the Gaussian filter has been shown to be the only filter that obeys the following properties:

  • Linearity
  • Shift-invariance
  • Non-creation of new structures: smoothing does not produce new structures when going from a fine to a coarser scale
  • Rotational symmetry, and some other properties (you can read about them on Wikipedia)

So, to create a scale space, you take the original image and generate progressively blurred-out versions of it using a Gaussian filter. Mathematically, the scale-space representation of an image can be expressed as

L(x,y,σ) = G(x,y,σ) * I(x,y)

where * denotes convolution and

  • L(x,y,σ) is the blurred image or scale space representation of an image
  • G(x,y,σ) is the Gaussian filter
  • I(x,y) is the image
  • σ is the scaling parameter, i.e. the amount of blur. As we increase σ, more and more details are removed from the image, i.e. the image becomes more blurred
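
To make this concrete, here is a minimal sketch of how such blurred versions could be generated with OpenCV-Python. The image path and the σ values here are just placeholders chosen for illustration:

    import cv2

    # Load a grayscale image (the path here is just a placeholder)
    img = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE)

    # Progressively larger sigma values remove more and more fine detail
    for sigma in [1.0, 2.0, 4.0, 8.0]:
        # L(x,y,sigma) = G(x,y,sigma) * I(x,y); passing ksize = (0, 0)
        # lets OpenCV derive the kernel size from sigma
        L = cv2.GaussianBlur(img, (0, 0), sigmaX=sigma)
        cv2.imwrite('blurred_sigma_{}.jpg'.format(sigma), L)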

Below, an image is shown at different scales (σ) (source: Wikipedia). See how, at larger scales, the fine details are lost.

So, I hope you understood what a scale-space is and how to construct it using a Gaussian filter.

Scale-Space in SIFT

In the SIFT paper, the authors modified the scale-space representation. Instead of creating the scale-space representation for the original image size only, they created scale-space representations for several image sizes. This helps in increasing the number of keypoints detected. The idea is shown below.

Take the original image and generate progressively blurred-out images from it. Then resize the original image to half its size, generate blurred-out images again, and keep repeating. This is shown below.

Here, we use the term octave to denote the scale-space representation for a particular image size. For instance, all the images of the same size in a vertical line form one octave. Here, we have 3 octaves, and each octave contains 4 images at different scales (blurred using a Gaussian filter).

Within an octave, the adjacent scales differ by a constant factor k. If an octave contains s+1 images, then k = 2^(1/s). The first image has scale σ0, the second image has scale kσ0, the third image has scale k^2·σ0, and the last image has scale k^s·σ0. In the paper, the values used are: number of octaves = 4, number of scale levels per octave = 5, initial σ0 = 1.6, and k = √2.
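
Putting the octaves and scales together, here is a minimal sketch of how such a pyramid could be built. Note that this is a simplified illustration under our own naming and parameter choices, not the exact construction from the paper (real implementations blur incrementally and derive each octave's base image from a specific already-blurred image of the previous octave):

    import cv2
    import numpy as np

    def build_scale_space(image, num_octaves=4, scales_per_octave=5, sigma0=1.6):
        # With s + 1 = scales_per_octave images per octave, k = 2^(1/s)
        s = scales_per_octave - 1
        k = 2 ** (1.0 / s)
        base = image.astype(np.float32)
        octaves = []
        for _ in range(num_octaves):
            # Blur the base image at scales sigma0, k*sigma0, k^2*sigma0, ...
            octave = [cv2.GaussianBlur(base, (0, 0), sigmaX=sigma0 * (k ** i))
                      for i in range(scales_per_octave)]
            octaves.append(octave)
            # Halve the image size for the next octave
            base = cv2.resize(base, (base.shape[1] // 2, base.shape[0] // 2))
        return octaves

Calling build_scale_space(img) on a grayscale image would then return 4 octaves, each containing 5 progressively blurred images, with the image size halved from one octave to the next.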

How to decide the number of octaves and number of scales per octave?

The number of octaves and the number of scales per octave depend on the size of the original image. You need to adjust these yourself depending upon the application.

But it has been found empirically that sampling 3 scales per octave provides optimal repeatability under downsampling/upsampling/rotation of the image, as well as under image noise. Adding more scales per octave will increase the number of detected keypoints, but this does not improve the repeatability (in fact, there is a small decrease), so we settle for the computationally less expensive option. See the plot below.

So, once we have constructed the scale-space, the next task is to detect the extrema in this scale-space. That’s why this step is called scale-space extrema detection. To keep this blog short, we will discuss this in the next blog. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Goodbye until next time.

Introduction to SIFT (Scale-Invariant Feature Transform)

In the previous blogs, we discussed some corner detectors such as Harris corner, Shi-Tomasi, etc. If you remember, these corner detectors are rotation invariant, which basically means that even if the image is rotated, we would still be able to detect the same corners. This is expected because corners remain corners in the rotated image as well. But when it comes to scaling, these algorithms suffer and don't give satisfactory results. This is because when we scale the image, a corner may not remain a corner. Let's understand this with the help of the following image (source: OpenCV).

On the left, we have a corner inside the small green window. But when this corner is zoomed in (see on the right), it no longer remains a corner within the same window. So, this is the issue that scaling poses. I hope you understood this.

So, to solve this, in 2004, D. Lowe of the University of British Columbia, in his paper Distinctive Image Features from Scale-Invariant Keypoints, came up with a new algorithm: the Scale-Invariant Feature Transform (SIFT). This algorithm not only detects features but also describes them. And the best thing about these features is that they are invariant to changes in

  • Scale
  • Rotation
  • Illumination (partially)
  • Viewpoint (partially)
  • Minor image artifacts / noise / blur

That's why this was a breakthrough in the field at that time. So, you can use these features to perform different tasks such as object recognition, tracking, image stitching, etc., and you don't need to worry about scale, rotation, etc. Isn't this cool, and that too back in 2004!

There are mainly four steps involved in the SIFT algorithm to generate the set of image features:

  • Scale-space extrema detection: As is clear from the name, we first search over all scales and image locations (space) and determine the approximate location and scale of feature points (also known as keypoints). In the next blog, we will discuss how this is done, but for now just remember that the first step simply finds the approximate location and scale of the keypoints.
  • Keypoint localization: In this step, we take the keypoints detected in the previous step and refine their location and scale to subpixel accuracy. For instance, if the approximate location is 17, then after refinement this may become 17.35 (more precise). Don't worry, we will discuss how this is done in the next blogs. After the refinement step, we discard bad keypoints such as edge points and low-contrast keypoints. So, after this step, we get a robust set of keypoints.
  • Orientation assignment: Then we calculate an orientation for each keypoint using its local neighborhood. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location of each feature, thereby providing invariance to these transformations.
  • Keypoint descriptor: All the previous steps ensured invariance to image location, scale, and rotation. Finally, we create a descriptor vector for each keypoint such that the descriptor is highly distinctive and partially invariant to the remaining variations such as illumination, 3D viewpoint, etc. This helps in uniquely identifying features. Once you have obtained these features along with their descriptors, you can do whatever you want, such as object recognition, tracking, stitching, etc. This sums up the SIFT algorithm at a coarse level.

Because SIFT is an extensive algorithm, we won't be covering it in a single blog. We will understand each of these 4 steps in separate blogs, and finally, we will implement it using OpenCV-Python. As we proceed, we will also understand how this algorithm achieves the scale, rotation, illumination, and viewpoint invariance discussed above.
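
As a quick preview of where we are heading, here is a minimal sketch of detecting and drawing SIFT features with OpenCV-Python. The image path is just a placeholder, and note that, depending on your OpenCV version/build, SIFT may be exposed as cv2.SIFT_create() or as cv2.xfeatures2d.SIFT_create():

    import cv2

    # Load a grayscale image (the path here is just a placeholder)
    img = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE)

    # In OpenCV >= 4.4.0, SIFT lives in the main module
    sift = cv2.SIFT_create()

    # Detect keypoints and compute a 128-dimensional descriptor for each
    keypoints, descriptors = sift.detectAndCompute(img, None)

    # Rich keypoint flags draw each keypoint's scale and orientation
    out = cv2.drawKeypoints(img, keypoints, None,
                            flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
    cv2.imwrite('sift_keypoints.jpg', out)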

So, in the next blog, let’s start with the scale-space extrema detection and understand this in detail. See you in the next blog. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Goodbye until next time.