Advanced Topics: Multiple Compute Units and Kernel Streaming

The prior discussions of host code and kernel migration were based on a simple application, so you could clearly see the steps required to move from the SDSoC™ environment to the Vitis environment. The following examples use a more complex application, with multiple kernels running concurrently, to illustrate some of the more advanced design patterns you might encounter.

This section shows how to build a more complex application with multiple hardware accelerators, and how to migrate other aspects of the SDSoC environment to the Vitis environment. The example design has two transpose accelerators streaming data into a matrix multiply kernel. This section discusses:

  • Replicating the kernels
  • Using AXI4-Stream connections to move data between kernels
  • Coding flow control into the host application
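Before looking at the hardware flow, the overall dataflow of the example design can be sketched as a plain software model. This is only a functional reference, assuming square matrices; the names transpose_sw and mmult_sw are illustrative and are not the hardware kernel code.

```cpp
#include <cassert>
#include <vector>

// Software-only model of the example dataflow: two transpose stages
// feeding a matrix multiply. Assumes square col x row matrices stored
// row-major in flat vectors.
using Mat = std::vector<int>;

// Mirrors the transpose indexing used by the hardware kernel snippet.
Mat transpose_sw(const Mat& a, int col, int row) {
    Mat at(col * row);
    for (int i = 0; i < col; ++i)
        for (int j = 0; j < row; ++j)
            at[i * col + j] = a[j * col + i];
    return at;
}

// Plain O(n^3) matrix multiply of two n x n matrices.
Mat mmult_sw(const Mat& a, const Mat& b, int n) {
    Mat c(n * n, 0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}
```

In the accelerated design, the two transpose_sw calls become the two transpose compute units, and mmult_sw becomes the mmult compute unit fed by their output streams.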

Multiple Compute Units

A key aspect of an acceleration environment is the ability to specify multiple hardware functions to implement the algorithm, including parallelization through multiple instances of the same kernel.

In the SDSoC environment, the tool assumes you want one copy of each hardware accelerator. To duplicate an accelerator, add resource pragmas at its call sites in the main function, as follows.

#pragma SDS resource(1) 
transpose(matA, matAT, col, row);
#pragma SDS resource(2)
transpose(matB, matBT, col, row);
mmult(matAT, matBT, matC, col, row);

In the host application, create a different cl::Kernel() object to access each compute unit, using its individual name as shown below.

cl::Kernel kernel_mmult(program,"mmult");
cl::Kernel kernel_transpose1(program,"transpose:{transpose_1}");
cl::Kernel kernel_transpose2(program,"transpose:{transpose_2}");

In the Vitis environment, each instance of a kernel is known as a compute unit. You specify the number of compute units during v++ linking with the --connectivity.nk switch, which defines how many instances of a given kernel to instantiate in the xclbin file, along with the name of each instance. The following command defines one instance of the mmult kernel named mmult_1, and two instances of the transpose kernel named transpose_1 and transpose_2.

v++ -l --connectivity.nk mmult:1:mmult_1 --connectivity.nk transpose:2:transpose_1.transpose_2

Kernel-to-Kernel Streaming

As in the SDSoC environment, the Vitis environment makes it possible to transfer data from one kernel to another, bypassing global memory. The changes required to the kernels, the build script, and the host application are discussed below.

Kernel Coding Guidelines

A kernel streaming interface that directly sends data to, or receives data from, another kernel's streaming interface should be defined as an hls::stream with the ap_axiu<D,0,0,0> data type. Using the ap_axiu<D,0,0,0> data type requires including ap_axi_sdata.h. The transpose kernel code snippet implements an m_axi input and a streaming output, starting by adding the new includes.

#include <ap_int.h>
#include <hls_stream.h>
#include <ap_axi_sdata.h>
#include "transpose.h"
 
void transpose(int A[BUFFER_SIZE*BUFFER_SIZE], int AT[BUFFER_SIZE*BUFFER_SIZE], int col, int row) {
    for (int i = 0; i < col; i++) {
#pragma HLS PIPELINE II=1
        for (int j = 0; j < row; j++) {
            AT[i*col+j] = A[j*col+i];
        }
    }
}

Then, define the ports and port interfaces.

void transpose(int *A, hls::stream<ap_axiu<32, 0, 0, 0> >& stream, const int col, const int row) {
#pragma HLS INTERFACE m_axi port=A offset=slave bundle=gmem
#pragma HLS INTERFACE axis port=stream
#pragma HLS INTERFACE s_axilite port=col bundle=control
#pragma HLS INTERFACE s_axilite port=row bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

The pragma HLS INTERFACE axis port=stream directive defines the stream port as an AXI4-Stream interface.

The mmult function implements two streaming inputs with a memory mapped output to the host.
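A sketch of what the corresponding mmult interface could look like follows. This is an assumed signature, not code from the example design; the port names inputa and inputb match the input ports named in the --connectivity.sc command shown in the next section, and the kernel body is omitted.

```cpp
#include <ap_axi_sdata.h>
#include <hls_stream.h>

// Hypothetical mmult interface: two AXI4-Stream inputs fed by the
// transpose compute units, and one memory-mapped (m_axi) output
// buffer read back by the host.
void mmult(hls::stream<ap_axiu<32, 0, 0, 0> >& inputa,
           hls::stream<ap_axiu<32, 0, 0, 0> >& inputb,
           int *C, const int col, const int row) {
#pragma HLS INTERFACE axis port=inputa
#pragma HLS INTERFACE axis port=inputb
#pragma HLS INTERFACE m_axi port=C offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=col bundle=control
#pragma HLS INTERFACE s_axilite port=row bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control
    // ... matrix multiply over the two input streams ...
}
```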

Connecting the Kernels

When building the application, use the v++ --connectivity.sc switch to define connectivity between the kernels. The following line shows how to connect the streaming output ports of transpose_1 and transpose_2 to the input ports on mmult.

v++ -l ... --connectivity.sc transpose_1.stream:mmult_1.inputa --connectivity.sc transpose_2.stream:mmult_1.inputb

Host Coding Guidelines

As in the SDSoC environment, the application needs to implement flow control between the host and the accelerators. In this example, you use OpenCL APIs to set up an out-of-order command queue and events to control kernel execution. In the main() function code snippet, the queue is defined as follows.

// Create Command Queue 
cl::CommandQueue q(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_PROFILING_ENABLE);  

When using out-of-order queues, you must use events to specify the dependencies between each command. The XRT scheduler uses these events to determine when and in which order the enqueued command can be executed.

Make enqueueWriteBuffer() non-blocking by passing CL_FALSE, so that control returns to the host program while data is copied to the kernel, as shown below.

q.enqueueWriteBuffer(bufMatA, CL_FALSE, 0,
                     max_col*max_row*sizeof(int), matA, NULL, &events[0]);

q.enqueueWriteBuffer(bufMatB, CL_FALSE, 0,
                     max_col*max_row*sizeof(int), matB, NULL, &events[1]);

The design implements flow control between the host and the compute units with events as follows.

// Create Events
std::vector<cl::Event> events(4);
std::vector<cl::Event> kernel_events;
...

// Setup event dependencies for transpose_1
kernel_events.resize(0);
kernel_events.push_back(events[0]);
// place transpose_1 in the command queue with a ready-for-execution flag
q.enqueueTask(kernel_transpose1, &kernel_events, NULL);

// Setup event dependencies for transpose_2
kernel_events.resize(0);
kernel_events.push_back(events[1]);
// place transpose_2 in the command queue with a ready-for-execution flag
q.enqueueTask(kernel_transpose2, &kernel_events, NULL);

// place mmult in the command queue with an execution complete event
// the AXI4-Stream protocol determines data flow control between the transpose and mmult compute units
q.enqueueTask(kernel_mmult, NULL, &events[2]);

The second argument of enqueueTask() specifies a list of events that must complete before this command can execute. If the list is NULL, the command can execute immediately.

The third argument of enqueueTask() specifies the completion event of this particular task. In this example, events[0] and events[1] track the completion of the enqueueWriteBuffer() commands used to transfer matA and matB. These two events are then used as input dependencies for the transpose_1 and transpose_2 tasks. Because the transpose kernels operate on matA and matB, the transfer of these buffers must complete before the kernels can start.
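For reference, the enqueueTask() overload used above has the following shape in the OpenCL C++ bindings; the default arguments are what allow the NULL cases in the calls above.

```cpp
// Declaration from the OpenCL C++ bindings (cl::CommandQueue), shown
// for reference; not code from the example application.
cl_int enqueueTask(const cl::Kernel& kernel,
                   const std::vector<cl::Event>* events = NULL,
                   cl::Event* event = NULL);
```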

The mmult kernel uses the AXI4-Stream protocol to control the data flow on its inputs at the hardware level, with no dependency on events[] to start execution.
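To complete the flow, the host can read the result back once mmult finishes. The following is a hedged sketch, assuming a bufMatC buffer object and matC host array defined like those for matA and matB, and reusing events[2], the mmult completion event from the snippet above.

```cpp
// Wait on events[2] (mmult completion) before reading the result back.
// bufMatC, matC, max_col, and max_row are assumed to be defined as for
// the matA/matB transfers shown earlier.
std::vector<cl::Event> read_wait;
read_wait.push_back(events[2]);

cl::Event read_done;
q.enqueueReadBuffer(bufMatC, CL_FALSE, 0,
                    max_col*max_row*sizeof(int), matC,
                    &read_wait, &read_done);

// Block until the read, and therefore the whole pipeline, completes.
read_done.wait();
```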

For more information on working with multiple kernels, refer to Methodology for Accelerating Applications with the Vitis Software Platform.