W h i t e p a p e r s
products solutions services support about news
Contract Services

Whitepapers

Porting to VR

Commodity VR

Sobel Filter

Technological Challenge
To increase the computational performance of the vertical edge detection convolution algorithm by optimizing software program execution.

Algorithm Background
Convolution in 2-dimensions can be calculated using the following formula
where f(x,y) is the output convolved image data, g(x,y) is the input image data, and k(x,y) is the convolution kernel. One particularily simple method of applying this technique that yields useful results is the 'discrete 3x3 convolution kernel'. In the discrete space of 2D raster images, this 3x3 convolution formula can be expressed as
or in the expanded form (after offsetting and inverting the kernel indicies)
Using this 3x3 convolution kernel formula, the vertical edges (or, gradients along the horizontal) of an image can be efficiently detected using the horizontal variant of the Sobel kernel
As an example, the following images show the results of applying this filter after rescaling (original/input image is on the left, filtered/output image is on the right)

Simplest Code
The most direct and simplest form of coding this convolution kernel in C would most likely resemble the following
  void SobelFilterImage( unsigned char** img, int imgWidth, int imgHeight,
                         short** filteredData )
  {
    register int i, j;
    
    for( i = 1; i < ( imgHeight - 1 ); i++ )
      for( j = 1; j < ( imgWidth - 1 ); j++ )
        filteredData[i-1][j-1] =
	    ( short )img[i-1][j+1] - ( short )img[i-1][j-1] +
	    2 * ( ( short )img[i][j+1] - ( short )img[i][j-1] ) +
	    ( short )img[i+1][j+1] - ( short )img[i+1][j-1]; 
  }
where 'img' is a 2D-array of unsigned 8-bit data values representing the input monochrome image and 'filteredData' is a 2D-array of signed 16-bit data values representing the output filtered data. Benchmarking this code was performed by using a 512x512 input image and executing the function on this input image data 1000 times in a continuous loop. The code was compiled using GCC version 3.0.2 with the flags '-O3 -fomit-frame-pointer -ffast-math -fstrength-reduce' and executed on a Intel PentiumII 366MHz PC using Linux version 2.4.8. The performance of the software was evaluated using the 'time' command to ensure that equal amounts of resource usage.

The performance evaluation of the software using the above system and methods yielded the following output

  0:08.79elapsed 99% CPU

First Software Modification
The most common method of increasing software performance is to focus on the area of code that requires the most processing needs. In this case, the nested loop within a loop is clearly the most time consuming execution area, so we will concentrate on specifically that. Just by looking at the formulation of the Sobel gradient kernel, one can recognize duplicate calculations that will be computed as the algorithm progresses. Specifically, since the kernel is applied in an 'overlapping' fashion on the input image, there are calculations that are common between overlapping regions due to the formulation of the Sobel kernel. Without going into too much detail on exactly how this algorithm is formulated, the primary focus is that it requires less calculations within the nested inner loop at the expense of some temporary arrays to store previous calculations.
  void SobelFilterImage( unsigned char** img, int imgWidth, int imgHeight,
                         short** filteredData )
  {
    short *prevDiff, *thisDiff, tmp;
    register int i, j;
    
    prevDiff = new short[ imgWidth - 2 ];
    thisDiff = new short[ imgWidth - 2 ];
    
    for( i = 0; i < ( imgHeight - 2 ); i++ )
    {
      thisDiff[i] = ( short )img[1][i+2] - ( short )img[1][i];
      prevDiff[i] = ( short )img[0][i+2] - ( short )img[0][i] + thisDiff[i];
    }
    
    for( i = 0; i < ( imgHeight - 2 ); i++ )
    {
      for( j = 0; j < ( imgWidth - 2 ); j++ )
      {
        tmp = ( short )img[i+2][j+2] - ( short )img[i+2][j];
        filteredData[i][j] = prevDiff[j];
        prevDiff[j] = tmp + thisDiff[j];
        filteredData[i][j] += prevDiff[j];
        thisDiff[j] = tmp;
      }
    }
    
    delete[] prevDiff;
    delete[] thisDiff;
  }
By utilizing performance evaluation of the software, we get the following output
  0:11.52elapsed 99% CPU
Yikes! The performance is actually worse than before! How is that possible? There are fewer calculations in the innermost loop!

The answer is actually quite simple: we ran out of registers. This is a very typical scenario when dealing with the x86 architecture as the number of general purpose registers is very limited. This limitation means that these extra variables must be referenced from the stack frame pointer which puts a stress on the CPU cache. This burden can become quite severe, as can be noticed in the performance decrease of over 30%.

In addition to this, we did not pay attention to memory alignment issues associated with the temporary arrays which were utilized. In particular, the memory referenced by indexing the 'prevDiff' and 'thisDiff' arrays is not ideal since each index pair is the length of these arrays apart.

Improving on Previous Modification
By taking the previously mentioned areas of limited registers and memory alignment issues in mind, the following improvement to the previous modification is as follows
  typedef struct
  {
    short prevDiff, thisDiff;
  
  } Diff_t;
  
  void SobelFilterImage( unsigned char** img, int imgWidth, int imgHeight,
                         short** filteredData )
  {
    Diff_t *diff, *diffPtr;
    unsigned char *imgPtr;
    short *dataPtr, tmp;
    register int i;
  
    diff = new Diff_t[ imgWidth - 2 ];
      
    for( i = 0; i < ( imgHeight - 2 ); i++ )
    {
      diff[i].thisDiff = ( short )img[1][i+2] - ( short )img[1][i];
      diff[i].prevDiff = ( short )img[0][i+2] - ( short )img[0][i] +
          diff[i].thisDiff;
    }

    for( i = 1; i < ( imgHeight - 1 ); i++ )
    {
      imgPtr = img[i+1];
      dataPtr = filteredData[i-1];
      diffPtr = diff;
        
      for( diffPtr = diff; diffPtr < &diff[imgWidth - 2]; diffPtr++, dataPtr++ )
      {
        tmp = ( short )( *( imgPtr + 2 ) ) - ( short )( *imgPtr );
        imgPtr++; 
      
        *dataPtr = diffPtr->prevDiff;
        diffPtr->prevDiff = tmp + diffPtr->thisDiff;
        diffPtr->thisDiff = tmp;
        *dataPtr += diffPtr->prevDiff;
      }
    }
  
    delete[] diff;
  }
The most noticeable modification is the use of a structure type definition to store previous calculations rather than 2 separate arrays which improves the memory utilization issues. Secondly, an attempt to reduce the amount of data needed in immediate registers was implemented by removing indexing values and directly incrementing data pointers. The performance evaluation of this version gives the following results
  0:08.51elapsed 99% CPU
Well, now thats more like it, but it is only roughly 3% faster than the original code.

Utilizing MMX Instructions
With the data reduction techniques and other general performance 'tricks' incorporated, the next step in performance enhancement is to drop down to the assembly level. Most of the time, this step will not be necessary as GCC and most other compilers contain pretty good optimzation techniques. Specific to this case, there is a large potential for speed increases by utilizing SIMD techniques of the newer Intel and AMD processors. SIMD is an ackronym for Single Instruction Multiple Data and it is a general classification of CPU instructions that operate on a vector of data. These instructions have been around since the early Pentuim processors of 200MHz and above but have not been utilized by most programmers. This is primarily due to the fact that most programmers are not fluent with assembly level programming as it is architecture specific, plus the difficulty of incorporating these instructions in C-compilers. The reason for this is that most C/C++ codes are written around the concept of operating on scalar values, not vectors. In this next code modification, the use of inline assembly is utilized to access the SIMD instructions of the PentiumII and above.
  typedef struct
  {
    short prevDiff[8], thisDiff[8];
  
  } Diff8_t;


  void SobelFilterImage( unsigned char** img, int imgWidth, int imgHeight,
			 short** filteredData )
  {
    unsigned char *nextRow;
    Diff8_t *diffRow;
    short *dataRow, tmpData;
    register int i, j, k, l;
  
    diffRow = new Diff8_t[ ( ( imgWidth - 2 ) >> 3 ) + 1 ];
    
    for( j = k = 0; j < ( ( imgWidth - 2 ) >> 3 ); j++, k += 8 )
    {
      for( l = 0; l < 8; l++ )
      {
        diffRow[j].thisDiff[l] =
            ( short )img[1][k+l+2] - ( short )img[1][k+l];
        diffRow[j].prevDiff[l] =
            ( short )img[0][k+l+2] - ( short )img[0][k+l] +
            diffRow[j].thisDiff[l];
      }
    }
  
    for( i = 0; k < ( imgWidth - 2 ); k++, i++ )
    {
      diffRow[j].thisDiff[i] = ( short )img[1][k+2] - ( short )img[1][k];
      diffRow[j].prevDiff[i] =
          ( short )img[0][k+2] - ( short )img[0][k] + diffRow[j].thisDiff[i];
    }
  
    for( i = 1; i < ( imgHeight - 1 ); i++ )
    {
      nextRow = img[i+1];
      dataRow = filteredData[i-1];
    
      for( j = k = 0; j < ( ( imgWidth - 2 ) >> 3 ); j++, k += 8 )
      {
        __asm__ __volatile__(
          "pxor         %%mm7,  %%mm7\n\t"
          "movq         (%0),   %%mm0\n\t"
          "movq         %%mm0,  %%mm1\n\t"
          "movq         2(%0),  %%mm2\n\t"
          "movq         %%mm2,  %%mm3\n\t"
          "punpcklbw    %%mm7,  %%mm0\n\t"
          "punpckhbw    %%mm7,  %%mm1\n\t"
          "punpcklbw    %%mm7,  %%mm2\n\t"
          "punpckhbw    %%mm7,  %%mm3\n\t"
          "psubd        %%mm2,  %%mm0\n\t"
          "psubd        %%mm3,  %%mm1\n\t"
          "movq         (%1),   %%mm2\n\t"
          "movq         8(%1),  %%mm3\n\t"
          "movq         16(%1), %%mm4\n\t"
          "movq         24(%1), %%mm5\n\t"
          "paddd        %%mm0,  %%mm4\n\t"
          "paddd        %%mm1,  %%mm5\n\t"
          "paddd        %%mm4,  %%mm2\n\t"
          "paddd        %%mm5,  %%mm3\n\t"
          "movq         %%mm4,  (%1)\n\t"
          "movq         %%mm5,  8(%1)\n\t"
          "movq         %%mm0,  16(%1)\n\t"
          "movq         %%mm1,  24(%1)\n\t"
          "movq         %%mm2,  (%2)\n\t"
          "movq         %%mm3,  8(%2)\n\t"
          :
          : "r" ( &nextRow[k] ), "r" ( &diffRow[j] ), "r" ( &dataRow[k] )
          : "memory" );
      }
      __asm__ __volatile__( "emms\n\t" );
    
      for( l = 0; k < ( imgWidth - 2 ); k++, l++ )
      {
        tmpData = diffRow[j].thisDiff[l];
        diffRow[j].thisDiff[l] = ( short )nextRow[k+2] - ( short )nextRow[k];
        dataRow[k] = diffRow[j].prevDiff[l];
        diffRow[j].prevDiff[l] = diffRow[j].thisDiff[l] + tmpData;
        dataRow[k] += diffRow[j].prevDiff[l];
      }
    }
  
    delete[] diffRow;
  }
Although it looks like a big mess of code, the performance is a rather large increase from the latter
  0:03.62elapsed 99% CPU
which turns out to be more than a %140 speed increase from original. Unfortunately, this speed increase comes at the price of more memory requirements and the general 'ugliness' of the new code. It should be noted, however, that this software code can be effectively cleaned up by utilizing properly structured C++ classes with templates and specializations to optimize the code for certain data types as we have exploited above.
top of page
      Contact Copyright ©2004-2011 Inv3rsion, LLC. All rights reserved.