SIFT: Scale-Space Extrema Detection

In the previous blog, we had an overview of the SIFT algorithm. We discussed different steps involved in this and the invariance that it offers against scale, rotation, illumination, viewpoint, etc. But we didn’t discuss it in detail. So, in this blog, let’s start with the first step which is scale-space extrema detection. So, let’s get started.

Before moving forward, let’s quickly recap what we are doing and why we are doing it and how we are doing it?

  • We saw how the corner detectors like Harris, Shi-Tomasi, etc suffer when it comes to scaling. (Why we are doing)
  • So, we want to detect features that are invariant to scale changes and can be robustly detected (What we are doing)
  • This can be done by searching for stable features (Extremas) across all possible scales, using a continuous function of scale known as scale space. That’s why the name scale-space extrema detection. (How we are doing)

I hope you understood this. Now, let’s understand what is scale-space.

What is a Scale-Space?

As we all know that the real-world objects are composed of different structures at different scales. For instance, the concept of a “tree” is appropriate at the scale of meters, while concepts such as leaves and molecules are more appropriate at finer scales. It would make no sense to analyze leaves or molecules at the scale of the tree (meters). So, this means that you need to analyze everything at an appropriate scale in order to make sense.

But given an image or an unknown scene, there is no apriori way to determine what scales are appropriate for describing the interesting structures in the image data. Hence, the only reasonable approach is to consider descriptions at multiple scales. This representation of images at multiple scales constitutes a so-called scale-space.

How to construct a Scale-Space?

Now, the next thing is how to construct a scale-space? So as we know if we increase the scale, the fine details will be lost and only coarser information prevails. Can you relate this with something you did in image processing? Does Blurring an image with a low pass filter sound similar to this? The answer is yes but there is a catch. We can’t use any low pass filter, only the Gaussian filter helps in mimicking a scale space. This is because the Gaussian filter is shown to be the only filter that obeys the following

  • Linearity
  • Shift-invariance
  • smoothing process does not produce new structures when going from fine to coarser scale
  • Rotational symmetry and some other properties (You can read about it on Wikipedia)

So to create a scale space, you take the original image and generate progressively blurred-out images using a Gaussian filter. Mathematically, the scale-space representation of an image can be expressed as

  • L(x,y,σ) is the blurred image or scale space representation of an image
  • G(x,y,σ) is the Gaussian filter
  • I(x,y) is the image
  • σ is the scaling parameter or the amount of blur. As we increase σ, more and more details are removed from the image i.e. more blur

See below where an image is shown at different scales(σ) (source: Wikipedia). See how at larger scales, the fine details got lost.

So, I hope you understood what is a scale-space and how to construct it using Gaussian filter.

Scale-Space in SIFT

In the SIFT paper, the authors modified the scale-space representation. Instead of creating the scale-space representation for the original image only, they created the scale-space representations for different image sizes. This helps in increasing the number of keypoints detected. The idea is shown below

Take the original image, and generate progressively blurred out images. Then, resize the original image to half size. And generate blurred out images again. And keep repeating. This is shown below

Here, we use the term octave to denote the scale-space representation for a particular image size. For instance, all the same size images in vertical line forms one octave. Here, we have 3 octaves and all the octaves contain 4 images at different scales (blurred using Gaussian filter).

Within an octave, the adjacent scales differ by a constant factor k. If an octave contains s+1 images, then k = 2(1/s). The first image has scale σ0, the second image has scale kσ0, the third image has scale k2σ0, and the last image has scale ksσ0. In the paper, they have used the values as number of octaves = 4, number of scale levels = 5, initial σ0 =1.6, k=√2 etc.

How to decide the number of octaves and number of scales per octave?

The number of octaves and scale depends on the size of the original image. You need to adjust this yourself depending upon the application.

But it has been found out empirically that 3 number of scales sampled per octave provide optimal repeatability under downsampling/upsampling/rotation of the image as well as image noise. Also, Adding more scales per octave will increase the number of detected keypoints, but this does not improve the repeatability (in fact there is a small decrease) – so we settle for the computationally less expensive option. See the below plot

So, once we have constructed the scale-space, the next task is to detect the extrema in this scale-space. That’s why this step is called scale-space extrema detection. To keep this blog short, we will discuss this in the next blog. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Goodbye until next time.

Leave a Reply