|
|
|
|
|
Technological Challenge
|
|
|
To increase the computational performance of the vertical edge
detection convolution algorithm by optimizing software program
execution.
|
|
|
Algorithm Background
|
|
|
Convolution in 2-dimensions can be calculated using the following
formula
|
|
|
|
|
where f(x,y) is the output convolved image data, g(x,y) is the input
image data, and k(x,y) is the convolution kernel. One particularily
simple method of applying this technique that yields useful results is
the 'discrete 3x3 convolution kernel'. In the discrete space of 2D
raster images, this 3x3 convolution formula can be expressed as
|
|
|
|
|
or in the expanded form (after offsetting and inverting the kernel
indicies)
|
|
|
|
|
Using this 3x3 convolution kernel formula, the vertical edges (or,
gradients along the horizontal) of an image can be efficiently detected
using the horizontal variant of the Sobel kernel
|
|
|
|
|
As an example, the following images show the results of applying this
filter after rescaling (original/input image is on the left,
filtered/output image is on the right)
|
|
|
|
|
|
Simplest Code
|
|
The most direct and simplest form of coding this convolution kernel
in C would most likely resemble the following
void SobelFilterImage( unsigned char** img, int imgWidth, int imgHeight,
short** filteredData )
{
register int i, j;
for( i = 1; i < ( imgHeight - 1 ); i++ )
for( j = 1; j < ( imgWidth - 1 ); j++ )
filteredData[i-1][j-1] =
( short )img[i-1][j+1] - ( short )img[i-1][j-1] +
2 * ( ( short )img[i][j+1] - ( short )img[i][j-1] ) +
( short )img[i+1][j+1] - ( short )img[i+1][j-1];
}
where 'img' is a 2D-array of unsigned 8-bit data values representing
the input monochrome image and 'filteredData' is a 2D-array of signed
16-bit data values representing the output filtered data.
Benchmarking this code was performed by using a 512x512 input image and
executing the function on this input image data 1000 times in a
continuous loop. The code was compiled using GCC version 3.0.2 with the
flags '-O3 -fomit-frame-pointer -ffast-math -fstrength-reduce' and
executed on a Intel PentiumII 366MHz PC using Linux version 2.4.8.
The performance of the software was evaluated using the 'time' command
to ensure that equal amounts of resource usage.
The performance evaluation of the software using the above system and
methods yielded the following output
0:08.79elapsed 99% CPU
|
|
|
First Software Modification
|
|
The most common method of increasing software performance is to focus
on the area of code that requires the most processing needs. In this
case, the nested loop within a loop is clearly the most time consuming
execution area, so we will concentrate on specifically that. Just by
looking at the formulation of the Sobel gradient kernel, one can
recognize duplicate calculations that will be computed as the algorithm
progresses. Specifically, since the kernel is applied in an
'overlapping' fashion on the input image, there are calculations that
are common between overlapping regions due to the formulation of the
Sobel kernel. Without going into too much detail on exactly how this
algorithm is formulated, the primary focus is that it requires less
calculations within the nested inner loop at the expense of some
temporary arrays to store previous calculations.
void SobelFilterImage( unsigned char** img, int imgWidth, int imgHeight,
short** filteredData )
{
short *prevDiff, *thisDiff, tmp;
register int i, j;
prevDiff = new short[ imgWidth - 2 ];
thisDiff = new short[ imgWidth - 2 ];
for( i = 0; i < ( imgHeight - 2 ); i++ )
{
thisDiff[i] = ( short )img[1][i+2] - ( short )img[1][i];
prevDiff[i] = ( short )img[0][i+2] - ( short )img[0][i] + thisDiff[i];
}
for( i = 0; i < ( imgHeight - 2 ); i++ )
{
for( j = 0; j < ( imgWidth - 2 ); j++ )
{
tmp = ( short )img[i+2][j+2] - ( short )img[i+2][j];
filteredData[i][j] = prevDiff[j];
prevDiff[j] = tmp + thisDiff[j];
filteredData[i][j] += prevDiff[j];
thisDiff[j] = tmp;
}
}
delete[] prevDiff;
delete[] thisDiff;
}
By utilizing performance evaluation of the software, we get the
following output
0:11.52elapsed 99% CPU
Yikes! The performance is actually worse than before! How is that
possible? There are fewer calculations in the innermost loop!
The answer is actually quite simple: we ran out of registers. This is
a very typical scenario when dealing with the x86 architecture as the
number of general purpose registers is very limited. This limitation
means that these extra variables must be referenced from the stack
frame pointer which puts a stress on the CPU cache. This burden can
become quite severe, as can be noticed in the performance decrease of
over 30%.
In addition to this, we did not pay attention to memory alignment
issues associated with the temporary arrays which were utilized. In
particular, the memory referenced by indexing the 'prevDiff' and
'thisDiff' arrays is not ideal since each index pair is the length of
these arrays apart.
|
|
|
Improving on Previous Modification
|
|
By taking the previously mentioned areas of limited registers and
memory alignment issues in mind, the following improvement to the
previous modification is as follows
typedef struct
{
short prevDiff, thisDiff;
} Diff_t;
void SobelFilterImage( unsigned char** img, int imgWidth, int imgHeight,
short** filteredData )
{
Diff_t *diff, *diffPtr;
unsigned char *imgPtr;
short *dataPtr, tmp;
register int i;
diff = new Diff_t[ imgWidth - 2 ];
for( i = 0; i < ( imgHeight - 2 ); i++ )
{
diff[i].thisDiff = ( short )img[1][i+2] - ( short )img[1][i];
diff[i].prevDiff = ( short )img[0][i+2] - ( short )img[0][i] +
diff[i].thisDiff;
}
for( i = 1; i < ( imgHeight - 1 ); i++ )
{
imgPtr = img[i+1];
dataPtr = filteredData[i-1];
diffPtr = diff;
for( diffPtr = diff; diffPtr < &diff[imgWidth - 2]; diffPtr++, dataPtr++ )
{
tmp = ( short )( *( imgPtr + 2 ) ) - ( short )( *imgPtr );
imgPtr++;
*dataPtr = diffPtr->prevDiff;
diffPtr->prevDiff = tmp + diffPtr->thisDiff;
diffPtr->thisDiff = tmp;
*dataPtr += diffPtr->prevDiff;
}
}
delete[] diff;
}
The most noticeable modification is the use of a structure type
definition to store previous calculations rather than 2 separate
arrays which improves the memory utilization issues. Secondly, an
attempt to reduce the amount of data needed in immediate registers
was implemented by removing indexing values and directly incrementing
data pointers. The performance evaluation of this version gives the
following results
0:08.51elapsed 99% CPU
Well, now thats more like it, but it is only roughly 3% faster than the
original code.
|
|
|
Utilizing MMX Instructions
|
|
With the data reduction techniques and other general performance
'tricks' incorporated, the next step in performance enhancement is to
drop down to the assembly level. Most of the time, this step will not
be necessary as GCC and most other compilers contain pretty good
optimzation techniques. Specific to this case, there is a large
potential for speed increases by utilizing SIMD techniques of the newer
Intel and AMD processors. SIMD is an ackronym for Single Instruction
Multiple Data and it is a general classification of CPU instructions
that operate on a vector of data.
These instructions have been around since the early Pentuim processors
of 200MHz and above but have not been utilized by most programmers.
This is primarily due to the fact that most programmers are not fluent
with assembly level programming as it is architecture specific, plus
the difficulty of incorporating these instructions in C-compilers.
The reason for this is that most C/C++ codes are written around the
concept of operating on scalar values, not vectors.
In this next code modification, the use of inline assembly is utilized
to access the SIMD instructions of the PentiumII and above.
typedef struct
{
short prevDiff[8], thisDiff[8];
} Diff8_t;
void SobelFilterImage( unsigned char** img, int imgWidth, int imgHeight,
short** filteredData )
{
unsigned char *nextRow;
Diff8_t *diffRow;
short *dataRow, tmpData;
register int i, j, k, l;
diffRow = new Diff8_t[ ( ( imgWidth - 2 ) >> 3 ) + 1 ];
for( j = k = 0; j < ( ( imgWidth - 2 ) >> 3 ); j++, k += 8 )
{
for( l = 0; l < 8; l++ )
{
diffRow[j].thisDiff[l] =
( short )img[1][k+l+2] - ( short )img[1][k+l];
diffRow[j].prevDiff[l] =
( short )img[0][k+l+2] - ( short )img[0][k+l] +
diffRow[j].thisDiff[l];
}
}
for( i = 0; k < ( imgWidth - 2 ); k++, i++ )
{
diffRow[j].thisDiff[i] = ( short )img[1][k+2] - ( short )img[1][k];
diffRow[j].prevDiff[i] =
( short )img[0][k+2] - ( short )img[0][k] + diffRow[j].thisDiff[i];
}
for( i = 1; i < ( imgHeight - 1 ); i++ )
{
nextRow = img[i+1];
dataRow = filteredData[i-1];
for( j = k = 0; j < ( ( imgWidth - 2 ) >> 3 ); j++, k += 8 )
{
__asm__ __volatile__(
"pxor %%mm7, %%mm7\n\t"
"movq (%0), %%mm0\n\t"
"movq %%mm0, %%mm1\n\t"
"movq 2(%0), %%mm2\n\t"
"movq %%mm2, %%mm3\n\t"
"punpcklbw %%mm7, %%mm0\n\t"
"punpckhbw %%mm7, %%mm1\n\t"
"punpcklbw %%mm7, %%mm2\n\t"
"punpckhbw %%mm7, %%mm3\n\t"
"psubd %%mm2, %%mm0\n\t"
"psubd %%mm3, %%mm1\n\t"
"movq (%1), %%mm2\n\t"
"movq 8(%1), %%mm3\n\t"
"movq 16(%1), %%mm4\n\t"
"movq 24(%1), %%mm5\n\t"
"paddd %%mm0, %%mm4\n\t"
"paddd %%mm1, %%mm5\n\t"
"paddd %%mm4, %%mm2\n\t"
"paddd %%mm5, %%mm3\n\t"
"movq %%mm4, (%1)\n\t"
"movq %%mm5, 8(%1)\n\t"
"movq %%mm0, 16(%1)\n\t"
"movq %%mm1, 24(%1)\n\t"
"movq %%mm2, (%2)\n\t"
"movq %%mm3, 8(%2)\n\t"
:
: "r" ( &nextRow[k] ), "r" ( &diffRow[j] ), "r" ( &dataRow[k] )
: "memory" );
}
__asm__ __volatile__( "emms\n\t" );
for( l = 0; k < ( imgWidth - 2 ); k++, l++ )
{
tmpData = diffRow[j].thisDiff[l];
diffRow[j].thisDiff[l] = ( short )nextRow[k+2] - ( short )nextRow[k];
dataRow[k] = diffRow[j].prevDiff[l];
diffRow[j].prevDiff[l] = diffRow[j].thisDiff[l] + tmpData;
dataRow[k] += diffRow[j].prevDiff[l];
}
}
delete[] diffRow;
}
Although it looks like a big mess of code, the performance is a rather
large increase from the latter
0:03.62elapsed 99% CPU
which turns out to be more than a %140 speed increase from original.
Unfortunately, this speed increase comes at the price of more memory
requirements and the general 'ugliness' of the new code. It should be
noted, however, that this software code can be effectively cleaned up
by utilizing properly structured C++ classes with templates and
specializations to optimize the code for certain data types as we
have exploited above.
|
|
top of page
|
| |