Migrating the Host Application

In this section, you will review a simple SDSoC™ program, including both the main() function and the accelerated function, to identify the elements that must be changed. To begin migrating an application and its hardware functions to the Vitis environment platforms and tools, examine the main() function and the hardware function code. The code presented here is from the mmult example application.

The following code snippet is an example main() function in the original development application project.

#include <stdlib.h>
#include <iostream>
#include <ctime>      // for time(), used to seed rand()
#include "mmult.h"
#include "sds_lib.h"

#define NUM_TESTS 5

void printMatrix(int *mat, int col, int row) {
	for (int i = 0; i < col; i++) {
		for (int j = 0; j < row; j++) {
			std::cout << mat[i*row+j] << "\t";
		}
		std::cout << std::endl;
	}
	std::cout << std::endl;
}

int main() {
	int col = BUFFER_SIZE;
	int row = BUFFER_SIZE;
	int *matA = (int*)sds_alloc(col*row*sizeof(int));
	int *matB = (int*)sds_alloc(col*row*sizeof(int));
	int *matC = (int*)sds_alloc(col*row*sizeof(int));

	// Seed the random number generator once
	srand(time(NULL));

	// Run the hardware function multiple times
	for (int i = 0; i < NUM_TESTS; i++) {
		std::cout << "Test #: " << i << std::endl;

		// Populate matA and matB
		for (int j = 0; j < col*row; j++) {
			matA[j] = rand()%10;
			matB[j] = rand()%10;
		}

		std::cout << "Mat A" << std::endl;
		printMatrix(matA, col, row);

		std::cout << "Mat B" << std::endl;
		printMatrix(matB, col, row);

		std::cout << "MatA * MatB" << std::endl;
		mmult(matA, matB, matC, col, row);

		printMatrix(matC, col, row);
	}

	sds_free(matA);
	sds_free(matB);
	sds_free(matC);

	return 0;
}

The code allocates memory for three two-dimensional matrices stored as one-dimensional arrays, populates matA and matB with random numbers, and multiplies them to compute matC. The matrices are printed to the screen, and the test is run NUM_TESTS (five) times.

When moving to the Vitis environment, several of the tasks that are implicitly handled by the sds++ compiler and runtime need to instead be explicitly managed by the application developer.

Updating the Required #include Files

The following sections discuss the specific code changes in the main() function.

Make the following changes to the #include directives.

#include <stdlib.h>   
#include <iostream>   
#include "mmult.h"   
//#include "sds_lib.h"   
#include <fstream>  
#include <vector>  
#include <ctime> 

In this example, the main() function is compiled by the Arm® core cross-compiler. Comment out the sds_lib.h include line, as you no longer rely on the sds_alloc() function for memory allocation.

#define CL_HPP_CL_1_2_DEFAULT_BUILD  
#define CL_HPP_TARGET_OPENCL_VERSION 120  
#define CL_HPP_MINIMUM_OPENCL_VERSION 120  
#define CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY 1  
#include <CL/cl2.hpp> 

In this section, you #define pre-processor macros to specify the version of the OpenCL API to use for the application. The default settings specify the OpenCL API 2.0 framework, but the Xilinx tools support the OpenCL API 1.2 release. For more information on these pre-processor macros, refer to "OpenCL C++ Bindings" at https://github.khronos.org/OpenCL-CLHPP.

The OpenCL API provides bindings for both the C and C++ languages; this example uses the C++ bindings rather than the default C API. To use the OpenCL C++ bindings, you must #include the cl2.hpp header file, as shown above.

The following mmult kernel code snippet is compiled separately from the host application.

#include "mmult.h"
 
void mmult(int A[BUFFER_SIZE*BUFFER_SIZE], int B[BUFFER_SIZE*BUFFER_SIZE], int C[BUFFER_SIZE*BUFFER_SIZE], int col, int row) {
 
    int matA[BUFFER_SIZE*BUFFER_SIZE];
    int matB[BUFFER_SIZE*BUFFER_SIZE];
 
    readA: for(int i = 0; i < col*row; i++) {
#pragma HLS PIPELINE II=1
        matA[i] = A[i];
    }
 
    readB: for(int i = 0; i < col*row; i++) {
#pragma HLS PIPELINE II=1
        matB[i] = B[i];
    }
 
    for (int i = 0; i < col; i++) {
    #pragma HLS PIPELINE II=1
        for (int j = 0; j < row; j++) {
            int tmp = 0;
            for (int k = 0; k < row; k++) {
                tmp += matA[k+i*col] * matB[j+k*col];
            }
            C[i*row+j] = tmp;
        }
    }
}

Loading the Main Function

To initialize the OpenCL API environment, the software application needs to load the FPGA binary file (.xclbin). This example uses argc/argv to pass the name of this file through the command line argument of the application.

Note: This is just one possible approach. You could also hardcode the file name for the xclbin file in your application.

Given these changes, the application is run as follows.

host.exe ./binary_container_1.xclbin

Where:

host.exe
Compiled executable for the Arm core.
binary_container_1.xclbin
FPGA binary file generated by the Vitis compiler.

Next, add some error checking to ensure the required command-line arguments were specified.

// Check for valid arguments
if (argc != 2) {
    printf("Usage: %s binary_container_1.xclbin\n", argv[0]);
    exit(EXIT_FAILURE);
}

// Get xclbin name
char* xclbinFilename = argv[1];

The variable declarations for the input and output matrices also change, as the allocation of memory will be handled separately later in the code by creating OpenCL buffers. For now, you simply define the three vectors needed to hold the matrix data.

Using the OpenCL API

The primary difference between the SDSoC development environment and the Vitis core development kit is the use of the OpenCL API to manage interactions between the main() function and the hardware-accelerated kernels. This section of the code is marked by the following opening and closing comments.

//OPENCL HOST CODE AREA STARTS  
//OPENCL HOST CODE AREA ENDS

You need to modify the host code and use the OpenCL C++ API to direct XRT to coordinate execution of the kernel with the host application. These steps are coded in the following order:

  1. Setup
    1. Specify the platform.
    2. Select the OpenCL device to run the kernel.
    3. Create an OpenCL context.
    4. Create a command queue.
    5. Create an OpenCL program.
    6. Create a kernel object for a hardware kernel.
    7. Create memory buffers for the OpenCL device.
  2. Execution
    1. Define arguments for the kernel.
    2. Transfer data from the host CPU to the kernel.
    3. Run the kernel.
    4. Return data from the kernel to the host application.

The following section discusses each of these steps and required code changes in detail.

The following code identifies the platform and the device.

// Get Platform  
std::vector<cl::Platform> platforms;  
cl::Platform::get(&platforms);  
cl::Platform platform = platforms[0];  
  
// Get Device  
std::vector<cl::Device> devices;  
cl::Device device;  
platform.getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices);  
device = devices[0];

The platform is the Xilinx-specific implementation of the OpenCL framework, including XRT and the accelerator hardware. The device is the hardware that will run the OpenCL kernel. Note that this example simply selects the first platform and device returned; production code should search the platform list for the Xilinx platform rather than assume it is listed first.

With a device selected, you must create a context, which is used by the runtime to manage objects, such as command-queues, memory, programs, and kernels on one or more devices. You must also create the command-queue which executes the commands, either in the order presented or out-of-order to parallelize different requests and improve throughput. This is done as follows.

// Create Context  
cl::Context context(device);  

// Create Command Queue  
cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);  

As described above, the host and kernel code are compiled separately to create two different outputs. The kernel is compiled into an xclbin file using the Vitis compiler. The host application must identify and load the xclbin file as an OpenCL program object for XRT. You must create the program within a context and identify the kernels in the program. These steps are reflected in the following code.

// Load xclbin  
std::cout << "Loading: '" << xclbinFilename << "'\n";  
std::ifstream bin_file(xclbinFilename, std::ifstream::binary);  
bin_file.seekg (0, bin_file.end);  
unsigned nb = bin_file.tellg();  
bin_file.seekg (0, bin_file.beg);  
char *buf = new char [nb];  
bin_file.read(buf, nb);  
  
// Creating Program from Binary File  
cl::Program::Binaries bins;  
bins.push_back({buf,nb});  
cl::Program program(context, devices, bins);  
  
// Create Kernel object(s)  
cl::Kernel kernel_mmult(program,"mmult");

In the above example, the kernel_mmult object identifies a kernel called mmult specified in the program object (xclbin). In a later section, you will look at the specific steps for migrating the hardware function from the SDSoC environment to the Vitis environment.

Note: The xclbin can contain more than one kernel to be called by the host application and run on the device. An example of this is provided in Advanced Topics: Multiple Compute Units and Kernel Streaming.

Before executing the kernel, you must transfer data from the host application to the device. The SDSoC environment supports two types of transfers, data_copy and zero_copy; the Vitis environment only supports zero_copy. The OpenCL buffers are the conduit through which data is communicated between the host application and the kernels. To transfer data, the application must first declare OpenCL buffer objects, and then use API calls such as enqueueWriteBuffer() and enqueueReadBuffer() to perform the actual transfer. XRT copies data from user-space memory to a physically contiguous region of kernel-space memory that the hardware function accesses directly through an AXI bus interface.

Start by defining memory buffers for the kernel and specifying the kernel arguments as follows.

// Create Buffers  
// Inputs are read-only and the output is write-only, from the kernel's perspective  
cl::Buffer bufMatA = cl::Buffer(context, CL_MEM_READ_ONLY, col*row*sizeof(int), NULL, NULL);  
cl::Buffer bufMatB = cl::Buffer(context, CL_MEM_READ_ONLY, col*row*sizeof(int), NULL, NULL);  
cl::Buffer bufMatC = cl::Buffer(context, CL_MEM_WRITE_ONLY, col*row*sizeof(int), NULL, NULL);  
  
// Assign Kernel arguments  
int narg = 0;  
kernel_mmult.setArg(narg++, bufMatA);  
kernel_mmult.setArg(narg++, bufMatB);  
kernel_mmult.setArg(narg++, bufMatC);  
kernel_mmult.setArg(narg++, col);
kernel_mmult.setArg(narg++, row); 

The OpenCL API calls create data buffers in the specified context, defining the read/write abilities of each buffer from the kernel's perspective. These buffers are then specified as arguments for the hardware kernel, along with any scalar values that are passed directly, such as col and row in the example above.

The next section of code in the main() function is left unchanged. This implements the primary for loop to perform the specified number of tests (NUM_TESTS), randomly populates the input matrices (matA and matB), and then outputs the matrix values using the printMatrix function. From this point, the main() function runs the matrix multiplication (mmult()) in the hardware accelerator.

In the SDSoC environment, the hardware function is directly called. The hardware function call runs the accelerator as a task, and each of the arguments to the function is transferred between the Arm processor and the PL region. Data transfers are accomplished through data movers, such as a DMA engine, automatically inserted into the system by the sds++ compiler.

In the Vitis environment, you must enqueue the transfer of data from the host to the local memory, enqueue the kernel to be run, and then enqueue the transfer of data from the kernel back to the host, or on to another kernel as the program requires. In this simple example, the data is simply returned to the host.

In the following code snippet, the input matrices are transferred from the host to the device memory, the kernel is run, and the output matrix is transferred back to the host application. OpenCL enqueue commands are by default non-blocking, meaning they can return before the command completes; here the buffer transfers pass CL_TRUE as the blocking flag, so each transfer completes before the call returns, while enqueueTask() remains non-blocking. Calling q.finish() blocks further execution until all commands in the command queue have completed, ensuring the host waits for the kernel to finish and the data to be transferred back.

// Enqueue Buffers  
q.enqueueWriteBuffer(bufMatA, CL_TRUE, 0, col*row*sizeof(int), matA.data(), NULL, NULL);
q.enqueueWriteBuffer(bufMatB, CL_TRUE, 0, col*row*sizeof(int), matB.data(), NULL, NULL);

// Launch Kernel   
q.enqueueTask(kernel_mmult);  

// Read Data Back from Kernel  
q.enqueueReadBuffer(bufMatC, CL_TRUE, 0, col*row*sizeof(int), matC.data(), NULL, NULL);

q.finish(); 

After this, the output matrix is printed to validate the results of the matrix multiplication. When all NUM_TESTS iterations have been run, the main() function returns.

As these steps show, migrating your main application from the SDSoC environment to the Vitis environment is straightforward. The changes are primarily driven by XRT and the OpenCL APIs, which manage the interactions between the main() function and the kernels.