Kernel Optimizations
Because the kernel runs in programmable logic on the target
platform, optimizing your kernel for that environment is an important element of application
design. Most of the optimization techniques discussed in C/C++ Kernels can be applied to OpenCL kernels. Instead of applying the HLS pragmas used for C/C++
kernels, you will use the __attribute__ keyword
described in OpenCL Attributes. Following is an
example:
// Process the whole image
__attribute__((xcl_pipeline_loop))
image_traverse: for (uint idx = 0, x = 0 , y = 0 ; idx < size ; ++idx, x+= DATA_SIZE)
{
...
}
The example above specifies that the for loop, image_traverse, should be
pipelined to improve the performance of the kernel. The target II in this case is 1. For
more information, refer to xcl_pipeline_loop.
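The benefit of an II of 1 can be estimated with a simple cycle-count model (a sketch using assumed latency and trip-count values, not figures from a synthesis report):

```python
# Rough cycle-count model for a pipelined loop. The latency and trip count
# below are assumed for illustration; real values come from the HLS report.
def loop_cycles(trip_count, iteration_latency, ii):
    """Total cycles for a pipelined loop: the first iteration takes the
    full latency, and each subsequent iteration starts II cycles later."""
    return iteration_latency + (trip_count - 1) * ii

trips = 1024    # e.g., one pass over the image
latency = 8     # assumed per-iteration latency in cycles

sequential = loop_cycles(trips, latency, latency)  # no pipelining: II = latency
pipelined = loop_cycles(trips, latency, 1)         # xcl_pipeline_loop target: II = 1

print(sequential)  # 8192
print(pipelined)   # 1031
```

With an II of 1, the loop finishes in roughly one cycle per iteration plus the pipeline fill time, instead of the full iteration latency for every trip.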
In the following code example, the watermark function uses the opencl_unroll_hint attribute to let the Vitis compiler unroll the loop to reduce latency and
improve performance. However, in this case the __attribute__ is only a suggestion that the compiler can ignore if needed.
For details, refer to opencl_unroll_hint.
//Unrolling below loop to process all 16 pixels concurrently
__attribute__((opencl_unroll_hint))
watermark: for ( int i = 0 ; i < DATA_SIZE ; i++)
{
...
}
For more information, review the OpenCL Attributes topics to see what specific optimizations are supported for OpenCL kernels, and review the C/C++ Kernels content to see how these optimizations can be applied in your kernel design.
Setting Data Width in OpenCL Kernels
For OpenCL kernels, the API provides attributes to support increasing the AXI data width. To eliminate manual code modifications, the following OpenCL attributes are interpreted to perform data path widening and vectorization of the algorithm: vec_type_hint, reqd_work_group_size, and xcl_zero_global_work_offset.
Examine their combined functionality in the following case:
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((vec_type_hint(int)))
__attribute__((xcl_zero_global_work_offset))
__kernel void vector_add(__global int* c, __global const int* a, __global const int* b) {
size_t idx = get_global_id(0);
c[idx] = a[idx] + b[idx];
}
In this case, the hard coded interface is a 32-bit wide data path (int *c, int* a, int *b), which drastically limits the
memory throughput if implemented directly. However, the automatic widening and
transformation is applied, based on the values of the three attributes.
- __attribute__((vec_type_hint(int))): Declares that int is the main type used for computation and memory transfer (32-bit). This knowledge is used to calculate the vectorization/widening factor based on the target bandwidth of the AXI interface (512 bits). In this example the factor would be 16 = 512 bits / 32 bits. This implies that, in theory, 16 values could be processed concurrently if vectorization can be applied.
- __attribute__((reqd_work_group_size(X, Y, Z))): Defines the total number of work items (where X, Y, and Z are positive constants). X*Y*Z is the maximum number of work items, and therefore defines the maximum possible vectorization factor that would saturate the memory bandwidth. In this example, the total number of work items is 64*1*1 = 64.
  The actual vectorization factor to be applied is the greatest common divisor of the vectorization factor defined by the actual coded type or the vec_type_hint, and the maximum possible vectorization factor defined through reqd_work_group_size.
  The quotient of the maximum possible vectorization factor divided by the actual vectorization factor gives the remaining loop count of the OpenCL description. Because this loop is pipelined, it can be advantageous to have several remaining loop iterations to take advantage of a pipelined implementation, especially if the vectorized OpenCL code has long latency.
- __attribute__((xcl_zero_global_work_offset)): Instructs the compiler that no global offset parameter is used at run time, and that all accesses are aligned. This gives the compiler valuable information about the alignment of the work groups, which in turn usually propagates to the alignment of the memory accesses (requiring less hardware).
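The factor calculation described above can be sketched numerically (the 512-bit AXI interface width and the attribute values are taken from the example; the helper below is purely illustrative, not a Vitis API):

```python
from math import gcd

# Values from the vector_add example above.
axi_width_bits = 512            # target AXI interface bandwidth
type_width_bits = 32            # from vec_type_hint(int)
work_items = 64 * 1 * 1         # from reqd_work_group_size(64, 1, 1)

# Widening factor from the computation type vs. the AXI width.
widening_factor = axi_width_bits // type_width_bits        # 512 / 32 = 16

# Actual vectorization factor: greatest common divisor of the widening
# factor and the maximum factor implied by the work-group size.
vectorization_factor = gcd(widening_factor, work_items)    # gcd(16, 64) = 16

# Remaining loop count of the OpenCL description after vectorization.
remaining_loop_count = work_items // vectorization_factor  # 64 / 16 = 4

print(widening_factor, vectorization_factor, remaining_loop_count)  # 16 16 4
```

Here the vectorized loop still has four remaining iterations, which, as noted above, lets the pipelined implementation amortize its latency.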
Note that the application of these transformations changes the actual design to be synthesized. Partially unrolled loops require reshaping of the local arrays in which data is stored. This usually behaves well, but can interact poorly in rare situations.
For example:
- For partitioned arrays, when the partition factor is not divisible by the unrolling/vectorization factor:
  - The resulting access requires a large number of multiplexers and creates a difficult problem for the scheduler (which might severely increase memory usage and compilation time). Xilinx recommends using partitioning factors that are powers of two (as the vectorization factor is always a power of two).
- If the loop being vectorized has an unrelated resource constraint, the scheduler reports that the II is not met:
  - This is not necessarily correlated with a loss of performance (the design usually still performs better), because the II is computed on the unrolled loop, which therefore has a multiplied throughput for each iteration.
  - The scheduler informs you of the possible resource constraints, and resolving them will further improve performance.
- A common occurrence is that a local array does not get automatically reshaped, usually because it is accessed in a later section of the code in a non-vectorizable way.
Reducing Kernel-to-Kernel Communication Latency in OpenCL Kernels
The OpenCL API 2.0 specification introduces a new memory object called a pipe. A pipe stores data organized as a FIFO. Pipe objects can only be accessed using built-in functions that read from and write to a pipe; they are not accessible from the host. Pipes can be used to stream data from one kernel to another inside the FPGA without using external memory, which greatly improves the overall system latency. For more information, see Pipe Functions in Version 2.0 of the OpenCL C Specification from the Khronos Group.
In the Vitis IDE, pipes must be statically
defined outside of all kernel functions. Dynamic pipe allocation using the OpenCL 2.x clCreatePipe
API is not supported. The depth of a pipe must be specified by using the OpenCL attribute xcl_reqd_pipe_depth in the pipe declaration. For more information, see
xcl_reqd_pipe_depth.
As specified in xcl_reqd_pipe_depth, the
valid depth values are as follows: 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192,
16384, 32768.
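The valid depths listed above are exactly the powers of two from 16 through 32768, which a small sketch can verify before a depth is used in a pipe declaration (illustrative only, not part of any Vitis tooling):

```python
# Valid xcl_reqd_pipe_depth values: powers of two from 2^4 = 16 up to
# 2^15 = 32768, matching the list in the documentation above.
VALID_PIPE_DEPTHS = [2 ** n for n in range(4, 16)]

def is_valid_pipe_depth(depth):
    """Return True if depth is an allowed xcl_reqd_pipe_depth value."""
    return depth in VALID_PIPE_DEPTHS

print(VALID_PIPE_DEPTHS[0], VALID_PIPE_DEPTHS[-1])  # 16 32768
print(is_valid_pipe_depth(32))    # True
print(is_valid_pipe_depth(1000))  # False
```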
For example:
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
Pipes must be accessed using the read_pipe_block() and write_pipe_block() functions in blocking mode; the read_pipe() and write_pipe() functions are not supported. The status of pipes can be queried using the OpenCL get_pipe_num_packets() and get_pipe_max_packets() built-in functions.
The supported pipe functions have the following signatures, where gentype indicates the built-in OpenCL C scalar integer or floating-point data types:
int read_pipe_block (pipe gentype p, gentype *ptr)
int write_pipe_block (pipe gentype p, const gentype *ptr)
The following "dataflow/dataflow_pipes_ocl" example from Xilinx Getting Started Examples on GitHub
uses pipes to pass data from one processing stage to the next using the blocking read_pipe_block() and write_pipe_block() functions:
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
pipe int p1 __attribute__((xcl_reqd_pipe_depth(32)));
// Input Stage Kernel : Read Data from Global Memory and write into Pipe P0
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void input_stage(__global int *input, int size)
{
__attribute__((xcl_pipeline_loop))
mem_rd: for (int i = 0 ; i < size ; i++)
{
//blocking Write command to pipe P0
write_pipe_block(p0, &input[i]);
}
}
// Adder Stage Kernel: Read Input data from Pipe P0 and write the result
// into Pipe P1
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void adder_stage(int inc, int size)
{
__attribute__((xcl_pipeline_loop))
execute: for(int i = 0 ; i < size ; i++)
{
int input_data, output_data;
//blocking read command to Pipe P0
read_pipe_block(p0, &input_data);
output_data = input_data + inc;
//blocking write command to Pipe P1
write_pipe_block(p1, &output_data);
}
}
// Output Stage Kernel: Read result from Pipe P1 and write the result to Global
// Memory
kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void output_stage(__global int *output, int size)
{
__attribute__((xcl_pipeline_loop))
mem_wr: for (int i = 0 ; i < size ; i++)
{
//blocking read command to Pipe P1
read_pipe_block(p1, &output[i]);
}
}
The Device Traceline view shows the detailed activities and stalls on the OpenCL pipes after hardware emulation is run. This information can be used to choose the correct FIFO sizes to achieve the optimal application area and performance.