- Description:
- The benchmark performs the vertical pass of a 2D wavelet transform. It applies a vertical filter to 8 rows, which are addressed through an array of row pointers, and produces two lines of output: one low-pass result and one high-pass result. Instead of transposing the columns and reusing the wave_horz kernel, the vertical filter is traversed over the entire width of the line, and the low-pass and high-pass filtering kernels are performed together.
- This means that the low-pass and high-pass filters are overlapped in execution, so that the input data is loaded once and both filters are executed in parallel.
- Parameters:
-
| in_data | Array of (8) row pointers |
| qmf | Low pass QMF filter |
| mqmf | High pass QMF filter |
| out_ldata | Low pass output data |
| out_hdata | High pass output data |
| cols | Length of rows to process |
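For orientation, the behaviour described above can be written as a plain C loop over the parameters listed in the table. This is a minimal sketch, not the optimized kernel: the name `wave_vert_ref`, the 16-bit sample type, the Q15 rounding (add `1 << 14`, then shift right by 15), and the use of the same tap order for both filters are assumptions made for illustration. The library's actual coefficient ordering and the data alignment between the two outputs may differ.

```c
#include <stdint.h>

/* Hypothetical reference model of the vertical pass (not the library code).
 * Assumptions: 16-bit samples, Q15 coefficients, exactly 8 taps, and
 * coefficient magnitudes small enough that the sums fit in 32 bits. */
void wave_vert_ref(const int16_t *const in_data[8],
                   const int16_t *qmf, const int16_t *mqmf,
                   int16_t *out_ldata, int16_t *out_hdata, int cols)
{
    for (int x = 0; x < cols; x++) {
        int32_t sum_l = 1 << 14;   /* Q15 rounding term (assumed) */
        int32_t sum_h = 1 << 14;

        /* Each input sample is read once and feeds both filters. */
        for (int t = 0; t < 8; t++) {
            int32_t s = in_data[t][x];
            sum_l += s * qmf[t];
            sum_h += s * mqmf[t];
        }

        /* Right shift of a negative value is implementation-defined in C;
         * an arithmetic shift is assumed here. */
        out_ldata[x] = (int16_t)(sum_l >> 15);
        out_hdata[x] = (int16_t)(sum_h >> 15);
    }
}
```

Computing both sums inside one tap loop mirrors the point made in the description: the column data is loaded once and shared by the low-pass and high-pass filters.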
- Algorithm:
- The inner loop that advances along the filter taps is fully unrolled. Double-word loads are performed, and paired multiplies are used to compute four iterations of the low-pass filter in parallel.
- For the high-pass kernel, the same loop is reused in order to save code size. This is done by loading the filter coefficients in a special order.
- The kernels assume that the number of filter taps is exactly 8. In addition, the data loaded for producing out_ldata[0] and out_hdata[0] is not identical. The data loaded for producing out_hdata[0] produces results at the following locations:
out_lstart = o_im + ((rows >> 1) - 3) * cols
out_hstart = o_im + (rows >> 1) * cols
where o_im is the start of the output image, rows is the number of rows in the input image, and cols is the number of columns in the output image.
- The following table shows how ylptr and yhptr (the pointers passed as out_ldata and out_hdata) must be updated at the start of each call to this function:
| Call# | out_ldata | out_hdata |
| 1 | out_lstart | out_hstart |
| 2 | out_lstart + cols | out_hstart + cols |
| 3 | out_lstart + 2*cols | out_hstart + 2*cols |
At this point ylptr wraps around to become o_im, while yhptr proceeds as usual:
| 4 | o_im | out_hstart + 3*cols |
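The start addresses and the per-call update can be folded into one small helper. The sketch below is only an illustration under stated assumptions: the name `wave_vert_out_for_call`, the struct, the 16-bit output type, and the 0-based call index are hypothetical, and the modulo wrap over `rows >> 1` output rows is inferred from the fourth row of the table rather than stated by the library. It also assumes `rows >= 8`, so that `(rows >> 1) - 3` is not negative.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pair of output pointers for one call to the kernel. */
typedef struct {
    int16_t *out_ldata;
    int16_t *out_hdata;
} wave_vert_out_ptrs;

/* Returns the output pointers for the given 0-based call index
 * (call 0 corresponds to Call# 1 in the table). */
wave_vert_out_ptrs wave_vert_out_for_call(int16_t *o_im, int rows,
                                          int cols, int call)
{
    int half = rows >> 1;

    /* Low-pass rows start 3 rows before the end of the low-pass half
     * and wrap around to row 0 (assumed modulo behaviour). */
    int low_row = (half - 3 + call) % half;

    /* High-pass rows advance linearly from the middle of the image. */
    int high_row = half + call;

    wave_vert_out_ptrs p;
    p.out_ldata = o_im + (ptrdiff_t)low_row * cols;
    p.out_hdata = o_im + (ptrdiff_t)high_row * cols;
    return p;
}
```

With `rows = 16` and `cols = 8`, call index 0 gives element offsets 40 and 64 from `o_im`, and call index 3 gives offsets 0 and 88, matching the pattern in the table.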
- In addition, the kernel accepts a pointer to an array of pointers, one for each input line, so that a working buffer of 10 lines can be used to overlap DMA transfers with processing, as illustrated below:
ihptr LINE BUFFER
ptr0 ---->|-------------------------------------------------|
ptr1 ---->|-------------------------------------------------|
...
ptr7 ---->|-------------------------------------------------|
At the start of the kernel, 8 input lines are loaded into the first 8 lines of the line buffer and processing begins. In the background, the next two lines are fetched. The pointers are then moved up by 2, namely ptr[i] = ptr[i+2], so that the last two pointers now point to the ninth and tenth lines of the buffer, and processing starts again. In the background, the next two lines are brought into the first two lines of the line buffer. The pointers move up again by 2, but now the last two pointers wrap around to the first two lines of the buffer (lines 0 and 1). This pattern then repeats.
- The first line to begin filtering is always obtained from ptr[0], the next from ptr[1] and so on.
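The pointer rotation over the 10-line working buffer can be sketched as follows. This is an illustrative sketch only: the name `advance_row_ptrs`, the `pitch` parameter (distance between buffer lines in elements), and the `next` index bookkeeping are hypothetical and not part of the library. The DMA transfers themselves are omitted.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_TAPS   8    /* row pointers consumed by the kernel */
#define BUF_LINES 10    /* lines in the working buffer */

/* Moves the 8 row pointers up by two lines inside a circular buffer of
 * BUF_LINES lines. `next` is the buffer line index that the new ptr[6]
 * should reference; the updated index is returned for the following call. */
int advance_row_ptrs(const int16_t *ptr[NUM_TAPS],
                     const int16_t *line_buf, int pitch, int next)
{
    /* ptr[i] = ptr[i+2] for the first six pointers. */
    for (int i = 0; i < NUM_TAPS - 2; i++)
        ptr[i] = ptr[i + 2];

    /* The last two pointers take the two most recently fetched lines,
     * wrapping back to the start of the buffer when needed. */
    ptr[NUM_TAPS - 2] = line_buf + (size_t)next * pitch;
    ptr[NUM_TAPS - 1] = line_buf + (size_t)((next + 1) % BUF_LINES) * pitch;

    return (next + 2) % BUF_LINES;
}
```

Starting with `ptr[i]` on buffer line `i` and `next = 8`, the first call leaves the pointers on lines 2 to 9, and the second call leaves them on lines 4 to 9 followed by lines 0 and 1, which is the pattern described above.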
- Assumptions:
- The input image dimensions (rows and cols) are assumed to be powers of 2.
- The input filters qmf and mqmf are assumed to be word-aligned and have exactly 8 taps.
- The output data and the input data on any line are assumed to be double-word aligned.
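A caller may want to check these assumptions in a debug build before invoking the kernel. The helper below is a hedged sketch: `wave_vert_args_ok` is a hypothetical name, the 16-bit element type is assumed, and "word" and "double word" are taken to mean 4 and 8 bytes. The number of rows is not a parameter of the kernel, so only `cols` is checked here.

```c
#include <stdint.h>

/* True if v is a positive power of 2. */
static int is_pow2(int v)
{
    return v > 0 && (v & (v - 1)) == 0;
}

/* True if p is aligned to n bytes. */
static int is_aligned(const void *p, uintptr_t n)
{
    return ((uintptr_t)p % n) == 0;
}

/* Hypothetical pre-call check: returns 1 if the arguments satisfy the
 * assumptions listed above, otherwise 0. */
int wave_vert_args_ok(const int16_t *const in_data[8],
                      const int16_t *qmf, const int16_t *mqmf,
                      const int16_t *out_ldata, const int16_t *out_hdata,
                      int cols)
{
    if (!is_pow2(cols))
        return 0;
    if (!is_aligned(qmf, 4) || !is_aligned(mqmf, 4))       /* word */
        return 0;
    if (!is_aligned(out_ldata, 8) || !is_aligned(out_hdata, 8)) /* double word */
        return 0;
    for (int i = 0; i < 8; i++)
        if (!is_aligned(in_data[i], 8))
            return 0;
    return 1;
}
```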
- Implementation Notes:
- This code is a LITTLE ENDIAN implementation
- To eliminate bank conflicts, successive lines in the line buffer, or the pointers to these lines, are separated by exactly two banks (one word) so that loads from successive lines can be issued in parallel.
- This code is compatible with C66x processors
- Benchmarks:
- See IMGLIB_Test_Report.html for cycle and memory information.