Streaming Connections
Streaming Data Between the Host and Kernel (H2K)
The Vitis core development kit provides a programming model that supports direct streaming of data from host to kernel and from kernel to host, without the need to migrate data through global memory as an intermediate step. This programming model uses minimal storage compared to the larger and slower global memory bank, which can improve performance and reduce power consumption.
By using data streams, you can realize some of the following advantages:
- The host application does not need to know the size of the data coming from the kernel.
- Data residing in host memory can be transferred to the kernel as soon as it is needed.
- Processed data can be transferred from the kernel back to the host program when it is required.
Host-to-kernel and kernel-to-host streaming are only supported in PCIe-based platforms, such as the Alveo Data Center accelerator cards. However, kernel-to-kernel streaming data transfer is supported for both PCIe-based and embedded platforms. In addition, this feature is only available on specific target platforms, such as the QDMA platform for the Alveo Data Center accelerator cards. If your platform is not configured to support streaming, your application will not run.
Host Coding Guidelines
Xilinx provides new OpenCL™ APIs for streaming operation as extension APIs.
- clCreateStream(): Creates a read or write stream.
- clReleaseStream(): Frees the created stream and its associated memory.
- clWriteStream(): Writes data to a stream.
- clReadStream(): Gets data from a stream.
- clPollStreams(): Polls for any stream on the device to finish. Required only for non-blocking stream operation.
The typical API flow is described below:
- Create the required number of read/write streams with clCreateStream.
  - Streams should be attached directly to the OpenCL device object because they do not use any command queue. A stream is itself a command queue that only passes data in one direction, either from host to kernel or from kernel to host.
  - An appropriate flag should be used to denote the stream as CL_STREAM_READ_ONLY or CL_STREAM_WRITE_ONLY (from the perspective of the host program).
  - To specify how the stream is connected to the device, a Xilinx extension pointer object (cl_mem_ext_ptr_t) is used to identify the kernel and the kernel argument that the stream is associated with.
  IMPORTANT: If the streaming kernel has multiple compute units, the host code needs to use a unique cl_kernel object for each compute unit. The host code must use clCreateKernel with <kernel_name>:{compute_unit_name} to get each compute unit, create streams for it, and enqueue it individually.
  In the following code example, a read_stream and a write_stream are created and associated with a cl_kernel object and the specified kernel arguments.
      #include <CL/cl_ext_xilinx.h> // Required for Xilinx extension pointer

      // Device connection specification of the stream through extension pointer
      cl_mem_ext_ptr_t ext;  // Extension pointer
      ext.param = kernel;    // The .param should be set to kernel (cl_kernel type)
      ext.obj = nullptr;

      // The .flags field should be used to denote the kernel argument
      // Create write stream for argument 3 of kernel
      ext.flags = 3;
      cl_stream write_stream = clCreateStream(device_id, CL_STREAM_WRITE_ONLY, CL_STREAM, &ext, &ret);

      // Create read stream for argument 4 of kernel
      ext.flags = 4;
      cl_stream read_stream = clCreateStream(device_id, CL_STREAM_READ_ONLY, CL_STREAM, &ext, &ret);
- Set the remaining non-streaming kernel arguments and enqueue the kernel. The following code block shows how to set typical kernel arguments (non-stream arguments, such as buffers and/or scalars) and enqueue the kernel:
      // Set kernel non-stream argument (if any)
      clSetKernelArg(kernel, 0,...,...);
      clSetKernelArg(kernel, 1,...,...);
      clSetKernelArg(kernel, 2,...,...);
      // Argument 3 and 4 are not set as those are already specified during
      // the clCreateStream through the extension pointer

      // Schedule kernel enqueue
      clEnqueueTask(commands, kernel, ...);
- Initiate read and write transfers with the clReadStream and clWriteStream commands.
  - Note the usage of the CL_STREAM_XFER_REQ attribute (the cl_stream_xfer_req structure in the code below) associated with each read and write request.
  - The .flags field is used to denote the transfer mechanism.
    - CL_STREAM_EOT: Currently, a successful stream transfer depends on identifying the end of the transfer by an End of Transfer signal. This flag is mandatory in the current release.
    - CL_STREAM_NONBLOCKING: By default, read and write transfers are blocking. For a non-blocking transfer, CL_STREAM_NONBLOCKING must be set.
  - The .priv_data field is used to specify a string (as a name for tagging purposes) associated with the transfer. This helps identify the completion of a specific transfer when polling for stream completion. It is required when using the non-blocking version of the API.
  In the following code block, the stream read and write transfers are executed with the non-blocking approach.
      // Initiate the READ transfer
      cl_stream_xfer_req rd_req {0};
      rd_req.flags = CL_STREAM_EOT | CL_STREAM_NONBLOCKING;
      rd_req.priv_data = (void*)"read"; // You can think of this as tagging the transfer with a name
      clReadStream(read_stream, host_read_ptr, max_read_size, &rd_req, &ret);

      // Initiate the WRITE transfer
      cl_stream_xfer_req wr_req {0};
      wr_req.flags = CL_STREAM_EOT | CL_STREAM_NONBLOCKING;
      wr_req.priv_data = (void*)"write";
      clWriteStream(write_stream, host_write_ptr, write_size, &wr_req, &ret);
- Poll all the streams for completion. For non-blocking transfers, a polling API is provided to ensure that the read/write transfers are completed. For the blocking version of the API, polling is not required.
  - The polling results are stored in the cl_streams_poll_req_completions array, which can be used to verify and check the results of the stream events.
  - clPollStreams is a blocking API. It returns execution to the host code as soon as it receives notification that all stream requests have completed, or when the specified timeout expires.
      // Checking the request completion
      cl_streams_poll_req_completions poll_req[2] {0, 0}; // 2 Requests
      auto num_compl = 2;
      // Blocking API, waits for 2 poll request completions or 5000 ms, whichever occurs first
      clPollStreams(device_id, poll_req, 2, 2, &num_compl, 5000, &ret);
- Read and use the stream data on the host.
  - After the poll request completes successfully, the host can read the data from the host pointer.
  - The host can also check the size of the data transferred to the host. To do this, the host needs to find the correct poll request by matching priv_data and then fetch nbytes (the number of bytes transferred) from the cl_streams_poll_req_completions structure.
      for (auto i=0; i<2; ++i) {
          if (rd_req.priv_data == poll_req[i].priv_data) { // Identifying the read transfer
              // Getting read size, data size from kernel is unknown
              ssize_t result_size = poll_req[i].nbytes;
          }
      }
The header file containing function prototype and argument description is available in the Xilinx Runtime GitHub repository.
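The size lookup in the last step above can be tried out without a device. The following is a minimal host-only sketch, not the Xilinx API: MockCompletion is a hypothetical stand-in for cl_streams_poll_req_completions that models only the priv_data and nbytes fields named in the text, and transferred_bytes is an illustrative helper name.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for cl_streams_poll_req_completions. Only the two
// fields named in the text are modeled; the real structure is defined in the
// Xilinx Runtime headers.
struct MockCompletion {
    const void* priv_data; // tag that was attached to the transfer request
    std::size_t nbytes;    // number of bytes actually transferred
};

// Returns the byte count of the completion whose tag matches `tag`, or -1 if
// no completion carries that tag. Tags are compared by pointer identity,
// mirroring the rd_req.priv_data == poll_req[i].priv_data check.
long long transferred_bytes(const MockCompletion* completions,
                            std::size_t count,
                            const void* tag) {
    assert(completions != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        if (completions[i].priv_data == tag) {
            return static_cast<long long>(completions[i].nbytes);
        }
    }
    return -1;
}
```

Because the comparison is on the pointer rather than on the string contents, the host should keep the exact pointer it stored in the request and reuse it for the lookup, rather than relying on two separate string literals having the same address.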
Kernel Coding Guidelines
The basic guidelines for developing a stream-based C kernel are as follows:
- Use hls::stream with the qdma_axis<D,0,0,0> data type. The qdma_axis data type needs the header file ap_axi_sdata.h.
- The qdma_axis<D,0,0,0> is a special class used for data transfer between host and kernel when using the streaming platform. It is only used in a streaming kernel interface that interacts with the host, not with another kernel. The template parameter <D> denotes the data width. The remaining three parameters should be set to 0 (they are not used in the current release).
- The following code block shows a simple kernel interface with one input stream and one output stream.
      #include "ap_axi_sdata.h"
      #include "hls_stream.h"

      // qdma_axis is the HLS class for stream data transfer between host and kernel for streaming platform
      // It contains "data" and two sideband signals (last and keep) exposed to the user via class member function.
      typedef qdma_axis<64,0,0,0> datap;

      void kernel_top (
          hls::stream<datap> &input,
          hls::stream<datap> &output,
          ..... , // Other Inputs/Outputs if any
      ) {
          #pragma HLS INTERFACE axis port=input
          #pragma HLS INTERFACE axis port=output
      }
- The qdma_axis data type contains three variables which should be used inside the kernel code:
  - data
    - Internally, qdma_axis contains an ap_uint<D> that should be accessed by the .get_data() and .set_data() methods.
    - D must be 8, 16, 32, 64, 128, 256, or 512 bits wide.
  - last
    - The last variable is used to indicate the last value of an incoming or outgoing stream. When reading from the input stream, last is used to detect the end of the stream. Similarly, when the kernel writes to an output stream transferred to the host, last must be set to indicate the end of the stream.
    - get_last/set_last: Gets and sets the last variable used to denote the last data in the stream.
  - keep
    - In some special situations, the keep signal can be used to truncate the last data to fewer bytes. However, keep should not be applied to any data other than the last data of the stream. So, in most cases, you should set keep to -1 for all outgoing data from the kernel.
    - get_keep/set_keep: Gets and sets the keep variable.
    - For all the data before the last data, keep must be set to -1 to denote that all bytes of the data are valid.
    - For the last data, the kernel has the flexibility to send fewer bytes. For example, for a four-byte data transfer, the kernel can truncate the last data by sending one byte, two bytes, or three bytes by using the following set_keep() function.
      - If the last data is one byte: .set_keep(1)
      - If the last data is two bytes: .set_keep(3)
      - If the last data is three bytes: .set_keep(7)
      - If the last data is all four bytes (similar to all non-last data): .set_keep(-1)
- The following code block shows how the stream input is read. Note the usage of last to determine the last data.
      // Stream Read
      // Using "last" flag to determine the end of input-stream
      // when kernel does not know the length of the input data
      hls::stream<ap_uint<64> > internal_stream;
      while (true) {
          datap temp = input.read();           // "input" -> Input stream
          internal_stream << temp.get_data();  // Getting data from the stream
          if (temp.get_last())                 // Getting last signal to determine the EOT (end of transfer).
              break;
      }
- The following code block shows how the stream output is written. set_keep is called with -1 for all data (the general case). The kernel also uses set_last() to specify the last data of the stream.
  IMPORTANT: For the host and kernel system to function properly, it is very important to set the last bit.
      // Stream Write
      for (int j = 0; j <....; j++) {
          datap t;
          t.set_data(...);
          t.set_keep(-1);      // keep flag -1, all bytes are valid
          if (... )            // check if this is the last data to be written
              t.set_last(1);   // Setting last data of the stream
          else
              t.set_last(0);
          output.write(t);     // output stream from the kernel
      }
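The keep and last rules above can be checked in plain software. The following is a rough sketch, not HLS code: MockBeat is a hypothetical stand-in for a four-byte qdma_axis word, keep_mask and packetize are illustrative names, and the sketch assumes that keep carries one bit per valid byte (as the 1, 3, and 7 values in the list suggest) and that byte 0 occupies the least significant position.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical software model of one stream word. It mirrors the three
// fields described above (data, keep, last) but is NOT the qdma_axis class.
struct MockBeat {
    std::uint32_t data; // payload, four bytes wide in this sketch
    std::uint8_t keep;  // one bit per valid byte (assumption)
    bool last;          // true only on the final word of the transfer
};

constexpr std::size_t kBytesPerBeat = 4;

// Keep mask with the low `valid_bytes` bits set: 1 -> 0x1, 2 -> 0x3,
// 3 -> 0x7, 4 -> 0xF. 0xF is the value -1 takes in a 4-bit field.
std::uint8_t keep_mask(std::size_t valid_bytes) {
    assert(valid_bytes >= 1 && valid_bytes <= kBytesPerBeat);
    return static_cast<std::uint8_t>((1u << valid_bytes) - 1u);
}

// Splits a byte buffer into words. Every word except the final one has all
// bytes valid; the final word sets `last` and may carry fewer bytes.
std::vector<MockBeat> packetize(const std::vector<std::uint8_t>& bytes) {
    std::vector<MockBeat> beats;
    for (std::size_t offset = 0; offset < bytes.size(); offset += kBytesPerBeat) {
        const std::size_t remaining = bytes.size() - offset;
        const std::size_t valid = remaining < kBytesPerBeat ? remaining : kBytesPerBeat;
        std::uint32_t word = 0;
        for (std::size_t b = 0; b < valid; ++b) {
            // Byte 0 goes in the least significant position (assumption).
            word |= static_cast<std::uint32_t>(bytes[offset + b]) << (8 * b);
        }
        MockBeat beat{word, keep_mask(valid), remaining <= kBytesPerBeat};
        beats.push_back(beat);
    }
    return beats;
}
```

For a six-byte buffer, packetize yields two words: the first with every keep bit set and last cleared, the second with keep equal to 3 and last set, which matches the two-byte case in the list above.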
Streaming Data Transfers Between Kernels (K2K)
Host Coding Guidelines
The kernel ports involved in kernel-to-kernel streaming do not require setup using
the clSetKernelArg from the host
code. All kernel arguments not involved in the streaming
connection should be set up using clSetKernelArg as described in Setting Kernel Arguments. However,
kernel ports involved in streaming will be defined within the
kernel itself, and are not addressed by the host program.
Streaming Kernel Coding Guidelines
The streaming interface in a kernel, directly sending or receiving data to another
kernel streaming interface, is defined by hls::stream
with the ap_axiu<D,0,0,0> data type. The ap_axiu<D,0,0,0> data type requires the use of the
ap_axi_sdata.h header file.
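In contrast, host-to-kernel and kernel-to-host streaming, as described in Streaming Data Between the Host and Kernel (H2K), uses the qdma_axis data type. Both the ap_axiu and qdma_axis data types are defined inside the ap_axi_sdata.h header file that is distributed with the Vitis software platform installation.
// Producer kernel - provides output as a data stream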
// The example kernel code does not show any other inputs or outputs.
void kernel1 (.... , hls::stream<ap_axiu<32, 0, 0, 0> >& stream_out) {
#pragma HLS interface axis port=stream_out
for(int i = 0; i < ...; i++) {
int a = ...... ; // Internally generated data
ap_axiu<32, 0, 0, 0> v; // temporary storage for ap_axiu
v.data = a; // Writing the data
stream_out.write(v); // Writing to the output stream.
}
}
// Consumer kernel - reads data stream as input
// The example kernel code does not show any other inputs or outputs.
void kernel2 (hls::stream<ap_axiu<32, 0, 0, 0> >& stream_in, .... ) {
#pragma HLS interface axis port=stream_in
for(int i = 0; i < ....; i++) {
ap_axiu<32, 0, 0, 0> v = stream_in.read(); // Reading the input stream
int a = v.data; // Extract the data
// Do further processing
}
}
The streaming connection from kernel1 to kernel2 must be defined during the kernel linking process as described in Specify Streaming Connections Between Compute Units.
Free-running Kernel
The Vitis core development kit provides support for one or more free-running kernels. Free-running kernels have no control signal ports, and cannot be started or stopped. The no-control signal feature of the free-running kernel results in the following characteristics:
- The free-running kernel has no memory input or output port, and therefore it interacts with the host or other kernels (regular kernels or other free-running kernels) only through streams.
- When the FPGA is programmed by the binary container (xclbin), the free-running kernel starts running on the FPGA, and therefore it does not need the clEnqueueTask command from the host code.
- The kernel works on the stream data as soon as it starts receiving from the host or other kernels, and it stalls when the data is not available.
- The free-running kernel needs a special interface pragma, ap_ctrl_none, inside the kernel body.
Host Coding for Free Running Kernels
If the free-running kernel interacts with the host, the host code should manage
the stream operation by clCreateStream/clReadStream/clWriteStream
as discussed in Host Coding Guidelines of Streaming Data Between the Host and Kernel (H2K).
As the free-running kernel has no other types of inputs or outputs, such as memory ports
or control ports, there is no need to specify clSetKernelArg. The clEnqueueTask is not
used because the kernel works on the stream data as soon as it starts receiving from the
host or other kernels, and it stalls when the data is not available.
Coding Guidelines for Free Running Kernels
As mentioned previously, the free-running kernel only contains hls::stream inputs and outputs. The recommended coding
guidelines include:
- Using hls::stream<ap_axiu<D,0,0,0> > if the port is interacting with a stream port of another kernel.
- Using hls::stream<qdma_axis<D,0,0,0> > if the port is interacting with the host.
The guidelines for using a pragma include:
- The kernel interface should not have any #pragma HLS interface s_axilite or #pragma HLS interface m_axi (as there should not be any memory or control port).
- The kernel interface must contain this special pragma: #pragma HLS interface ap_ctrl_none port=return
The following code example shows a free-running kernel with one input and one
output communicating with another kernel. The while(1) loop structure
contains the substance of the kernel code, which repeats as long as the kernel runs.
void kernel_top(hls::stream<ap_axiu<32, 0, 0, 0> >& input,
hls::stream<ap_axiu<32, 0, 0, 0> >& output) {
#pragma HLS interface axis port=input
#pragma HLS interface axis port=output
#pragma HLS interface ap_ctrl_none port=return // Special pragma for free-running kernel
#pragma HLS DATAFLOW // The kernel is using DATAFLOW optimization
while(1) {
...
}
}
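The run-and-stall behavior of a free-running kernel can be approximated in ordinary software. The following is a loose model under stated assumptions, not synthesizable HLS code: std::queue stands in for hls::stream, run_until_stalled is an illustrative name, and the add-one operation is an arbitrary placeholder for the kernel body.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>

// Software-only model; std::queue stands in for hls::stream.
// Processes every value currently available on `input` and returns the
// number of values consumed. Returning when the queue is empty models the
// kernel stalling until more data arrives; no start or stop call is involved.
std::size_t run_until_stalled(std::queue<std::int32_t>& input,
                              std::queue<std::int32_t>& output) {
    // Input and output must be distinct queues, otherwise the loop below
    // would keep consuming its own results.
    assert(&input != &output);
    std::size_t consumed = 0;
    while (!input.empty()) {
        const std::int32_t value = input.front();
        input.pop();
        output.push(value + 1); // placeholder operation, chosen arbitrarily
        ++consumed;
    }
    return consumed;
}
```

Calling the function again after pushing more values resumes processing from where it stalled, which is the software analogue of the kernel picking up new stream data without any clEnqueueTask from the host.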