This tutorial demonstrates the acceleration advantages of the Vitis™ Vision Library, part of the Xilinx® Vitis™ unified software platform, and of kernel-to-kernel streaming, as well as how a single set of high-level code can be implemented on a data center (Alveo) or embedded (MPSoC) platform. The Vitis Vision Library provides hardware-accelerated OpenCV functions optimized for Xilinx SoCs and FPGAs; the functions are written entirely in C/C++ and targeted for high-level synthesis (HLS).
Vitis is a unified software platform that uses the same OpenCL API calls to target either an embedded or a data center platform. This tutorial uses OpenCV and OpenCL to demonstrate the performance gains obtained with the Vitis Vision Library and kernel-to-kernel streaming, and shows that the code is platform agnostic. Managing the dataflow in the kernel space is key to achieving high-performance accelerated designs.
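To illustrate that portability, the sketch below shows the kind of OpenCL host setup used throughout this tutorial; only the xclbin passed on the command line differs between an Alveo run and a ZCU102 run. The structure is an assumption based on standard XRT/OpenCL host code (using the OpenCL 1.2 C++ wrapper), not the tutorial's exact source; the kernel names match the host-code excerpts later in this article.

#include <CL/cl.hpp>      // OpenCL 1.2 C++ wrapper; Vitis host code may use cl2.hpp/xcl2.hpp instead
#include <fstream>
#include <iterator>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) return 1;                         // usage: <exe> <xclbin>

    // Pick the first accelerator device found (a production host would search
    // for the "Xilinx" platform explicitly).
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices);
    cl::Device device = devices[0];

    std::vector<cl::Device> one_device(1, device);
    cl::Context context(one_device);
    cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);

    // Load the xclbin named on the command line, e.g. binary_container_1.xclbin.
    // This is the only point where the Alveo and ZCU102 runs differ: the binary itself.
    std::ifstream bin_file(argv[1], std::ios::binary);
    std::vector<char> buf((std::istreambuf_iterator<char>(bin_file)),
                          std::istreambuf_iterator<char>());
    cl::Program::Binaries bins;
    bins.push_back(std::make_pair(buf.data(), buf.size()));
    cl::Program program(context, one_device, bins);

    // Kernel names as used later in this tutorial.
    cl::Kernel cvtcolor_bgr2gray(program, "cvtcolor_bgr2gray");
    cl::Kernel resize_accel(program, "resize_accel");

    // ... buffer setup and kernel launches follow (see later sections) ...
    return 0;
}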
The C code targets an AVB application and includes a color space converter (cvtColor) and a scaler (resize) built with the Vitis Vision Library. The input source is a file-based image (it could also be an MP4 video stream) that is read with OpenCV library functions and then fed into either an AXI memory-mapped or an AXI stream dataflow.
The overall dataflow through the acceleration kernels is implemented in two ways in this tutorial:
(1) Kernel to Memory (K2M2K) dataflow:
Host memory (image) -> kernel memory -> Vitis Vision cvtColor -> kernel memory -> Vitis Vision resize -> kernel memory -> host memory (see Figure 2 and the host-side sketch after these two flows)
(2) Kernel to Kernel streaming (K2K) dataflow:
Host memory (image) -> kernel memory -> Vitis Vision cvtColor -> AXI streaming -> Vitis Vision resize -> kernel memory -> host memory (see Figure 3)
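To make the K2M2K arrows above concrete, here is a hypothetical host-side helper (not the tutorial's exact code) that walks through that flow with an in-order command queue. The buffer names, kernel names, and argument order are taken from the host-code excerpts later in this tutorial; the function name, sizes, and cv::Mat images are illustrative.

#include <CL/cl.hpp>
#include <opencv2/opencv.hpp>

// Hypothetical helper: in_img is the BGR source frame (e.g. from cv::imread),
// out_img a preallocated 8-bit gray frame of the resized dimensions.
void run_k2m2k(cl::Context& context, cl::CommandQueue& q,
               cl::Kernel& cvtcolor_bgr2gray, cl::Kernel& resize_accel,
               const cv::Mat& in_img, cv::Mat& out_img,
               int in_height, int in_width, int out_height, int out_width) {
    size_t in_bytes   = in_img.total() * in_img.elemSize();
    size_t gray_bytes = size_t(in_height) * size_t(in_width);              // 8-bit gray intermediate
    size_t out_bytes  = out_img.total() * out_img.elemSize();

    cl::Buffer imageToDevice_cvt(context, CL_MEM_READ_ONLY,  in_bytes);
    cl::Buffer imageK2KinMem    (context, CL_MEM_READ_WRITE, gray_bytes);  // intermediate, stays in device DDR
    cl::Buffer imageFromDevice  (context, CL_MEM_WRITE_ONLY, out_bytes);

    q.enqueueWriteBuffer(imageToDevice_cvt, CL_TRUE, 0, in_bytes, in_img.data); // host memory -> kernel memory
    cvtcolor_bgr2gray.setArg(0, imageToDevice_cvt);
    cvtcolor_bgr2gray.setArg(1, imageK2KinMem);
    cvtcolor_bgr2gray.setArg(2, in_height);
    cvtcolor_bgr2gray.setArg(3, in_width);
    q.enqueueTask(cvtcolor_bgr2gray);                    // cvtColor, reads and writes kernel memory

    resize_accel.setArg(0, imageK2KinMem);
    resize_accel.setArg(1, imageFromDevice);
    resize_accel.setArg(2, in_height);
    resize_accel.setArg(3, in_width);
    resize_accel.setArg(4, out_height);
    resize_accel.setArg(5, out_width);
    q.enqueueTask(resize_accel);                         // resize, reads and writes kernel memory

    q.enqueueReadBuffer(imageFromDevice, CL_TRUE, 0, out_bytes, out_img.data); // kernel memory -> host memory
    q.finish();
}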
Before starting, set up the libraries and tools this tutorial depends on. The tutorial uses the OpenCV and Vitis Vision libraries; you will need to download and compile OpenCV for the targeted host, either x86 or Arm.
1. The OpenCV library is accessible from GitHub. Detailed instructions can be found in Appendix A.
2. The Vitis Accelerated Libraries are accessible from GitHub. Detailed instructions are provided in Appendix B.
3. To implement the design, you can either run the Makefile flow (easier; see instructions in the next section) or use the Vitis IDE (interactive, with many more steps) to point to the OpenCV and Vitis Vision libraries and add the required compiler switches.
4. Make sure to install the latest Vitis software and XRT; version 2019.2 was used for this tutorial. When targeting an Alveo card, ensure the card and software drivers have been correctly installed by following the instructions in the Getting Started with Alveo Data Center Accelerator Cards guide (UG1301). When targeting MPSoC, you will need the ZCU102 platform.
(A) Alveo - Kernel to Kernel Streaming (K2K) and Kernel to Memory (K2M2K):
Files located here
Directory structure for each dataflow:
image (example jpg files)
data (reference png/jpg files)
src (all source code)
bld_hw (implementation directory, Alveo)
bld_mpsoc (implementation directory, MPSoC/ZCU102)
Build directions
1. cd into bld_hw
2. Edit the Makefile
a. Edit the XILINX_* paths for your environment
b. Edit the VISION path to point to your Vitis Vision Library include directory
c. Select your platform of choice, for example SDX_PLATFORM = xilinx_u250_qdma_201910_1
3. Source the settings64.sh scripts for Vitis, Vivado and XRT
4. Run make all TARGETS=<sw_emu/hw> to build either for hardware or emulation
5. export XCL_EMULATION_MODE=sw_emu (only for running software emulation)
6. export LD_LIBRARY_PATH=<yourpath>/opencv/lib:$LD_LIBRARY_PATH
7. Run xf_k2k.exe binary_container_1.xclbin to execute
(B) Embedded/MPSoC - Kernel to Kernel Streaming (K2K) and Kernel to Memory (K2M2K):
Zip file located here. See Figure 4 for the high-level directory structure. The source files (src) are the same as the ones for Alveo; only the implementation directory differs, to account for the specific arguments/options of an embedded device and the platform differences.
Build directions
1. Unzip the reference design files
2. Change directory to one of the dataflow folders, for instance kso_k2m2k/bld_mpsoc
3. Set your Vitis 2019.2 environment, including XRT
source /opt/Xilinx/vitis/2019.2/settings64.sh
source /opt/xilinx/Vivado/2019.2/settings64.sh
source /opt/xilinx/xrt/setup.sh
4. Set DEVICE and SYSROOT
export DEVICE=<location of internal_platforms>/zcu102_base_dfx/zcu102_base_dfx.xpfm
export SYSROOT=<location of internal_platforms>/sw/sysroot/aarch64-xilinx-linux
5. Set XFOPENCV manually to point to your local repository of Vitis Vision Library
export XFOPENCV=<local repository>/xf_opencv/L1
6. Run the compilation and keep a log of all the commands and compilation messages
make TARGETS=hw &> make_k2m2k_hw.txt
7. Repeat steps 2 to 6 above for kso_k2m2k_8ppc, kso_k2k and kso_k2k_8ppc
Run the application on the ZCU102 board
1. Open an xterm serial terminal (115200 baud, 8 data bits, 1 stop bit, no parity, no flow control)
2. Copy the content of the kso_k2m2k/bld_mpsoc/sd_card folder onto an empty SD card
3. Eject the SD card from the PC, insert it into the ZCU102, power up, and monitor the serial terminal
4. Once Linux is booted, type the following commands to run the application:
ls /mnt
ls /mnt/apps
cd /mnt/apps
./xf_k2k binary_container_1.xclbin
5. After the application completes, observe the runtime results (in green); see Figure 5
a. Note both CSV files (profile_summary.csv, timeline_trace.csv) in “apps”
b. Vitis Analyzer can open these files to view the timeline and profile summary
6. Repeat steps 2 to 5 above for the other dataflows (kso_k2m2k_8ppc, kso_k2k and kso_k2k_8ppc)
Below are the kernel code differences when switching from the K2M2K to the K2K flow.
Add the following includes to the kernels:
#include "./common/xf_infra.h" //K2K
#include "./common/xf_axi_sdata.h" //K2K
For K2K streaming, inspect lines 2286–2289 and 2310–2311 of the cvtcolor kernel source. An HLS stream interface must be used for inter-kernel communication, and xfMat2AXIvideo must be used to convert the xf::Mat output to the newly created stream interface; a condensed sketch of the resulting kernel top level follows the fragments below.
hls::stream<ap_axiu<OUTPUT_PTR_WIDTH, 1, 1, 1> >& img_gray
#pragma HLS INTERFACE axis port=img_gray
xf::xfMat2AXIvideo<OUTPUT_PTR_WIDTH,XF_8UC1, HEIGHT, WIDTH, NPC1>(imgOutput0,img_gray);
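A condensed sketch of how these fragments fit together in the cvtcolor kernel top level for K2K is shown below. The frame size, pointer widths, XF_8UC3 input type, and the m_axi/s_axilite pragmas are illustrative placeholders rather than the tutorial's exact configuration (the real kernel is in src/).

#include "hls_stream.h"
#include "ap_int.h"
#include "common/xf_common.h"
#include "common/xf_utility.h"      // Array2xfMat / xfMat2Array
#include "common/xf_infra.h"        // AXIvideo2xfMat / xfMat2AXIvideo
#include "common/xf_axi_sdata.h"    // ap_axiu

#define HEIGHT 1080                 // placeholder maximum frame size
#define WIDTH  1920
#define NPC1   XF_NPPC1
#define INPUT_PTR_WIDTH  512        // memory-mapped input width
#define OUTPUT_PTR_WIDTH 8          // 8-bit gray stream at 1ppc

extern "C" void cvtcolor_bgr2gray(
        ap_uint<INPUT_PTR_WIDTH>* img_bgr,                            // frame read from device memory
        hls::stream<ap_axiu<OUTPUT_PTR_WIDTH, 1, 1, 1> >& img_gray,   // K2K: AXI stream to resize_accel
        int rows, int cols) {
#pragma HLS INTERFACE m_axi     port=img_bgr offset=slave bundle=gmem0
#pragma HLS INTERFACE axis      port=img_gray
#pragma HLS INTERFACE s_axilite port=rows   bundle=control
#pragma HLS INTERFACE s_axilite port=cols   bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    xf::Mat<XF_8UC3, HEIGHT, WIDTH, NPC1> imgInput0(rows, cols);
    xf::Mat<XF_8UC1, HEIGHT, WIDTH, NPC1> imgOutput0(rows, cols);
#pragma HLS DATAFLOW
    xf::Array2xfMat<INPUT_PTR_WIDTH, XF_8UC3, HEIGHT, WIDTH, NPC1>(img_bgr, imgInput0);
    // ... the tutorial's color-conversion call goes here, producing imgOutput0 from imgInput0 ...
    xf::xfMat2AXIvideo<OUTPUT_PTR_WIDTH, XF_8UC1, HEIGHT, WIDTH, NPC1>(imgOutput0, img_gray);
}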
For K2K streaming, inspect lines 36–39 and 65–66 of the resize kernel source. An HLS stream interface must be used for inter-kernel communication, and AXIvideo2xfMat must be used to convert the newly created stream interface back into an xf::Mat; a matching sketch follows the fragments below.
hls::stream<ap_axiu<INPUT_PTR_WIDTH, 1, 1, 1> >& img_inp
#pragma HLS INTERFACE axis port=img_inp //K2K offset=slave bundle=gmem1
xf::AXIvideo2xfMat<INPUT_PTR_WIDTH,TYPE,HEIGHT,WIDTH,NPC_T>(img_inp,in_mat); //K2K
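A matching sketch of the resize kernel top level is shown below: the AXI stream produced by cvtcolor is converted back into an xf::Mat before the resize runs. The output size, interpolation type and maximum down-scale factor are assumptions; check the Vitis Vision resize documentation for the template parameter order of your library version. Includes are the same as in the cvtcolor sketch above.

#define HEIGHT 1080                 // placeholder maximums, matching the sketch above
#define WIDTH  1920
#define NEWHEIGHT 540               // assumed output size (2x downscale)
#define NEWWIDTH  960
#define NPC1   XF_NPPC1
#define INPUT_PTR_WIDTH  8          // must match the stream width produced by cvtcolor
#define OUTPUT_PTR_WIDTH 512        // memory-mapped output width
#define MAXDOWNSCALE 2

extern "C" void resize_accel(
        hls::stream<ap_axiu<INPUT_PTR_WIDTH, 1, 1, 1> >& img_inp,     // K2K: AXI stream from cvtcolor
        ap_uint<OUTPUT_PTR_WIDTH>* img_out,                           // resized frame written to device memory
        int rows_in, int cols_in, int rows_out, int cols_out) {
#pragma HLS INTERFACE axis      port=img_inp
#pragma HLS INTERFACE m_axi     port=img_out offset=slave bundle=gmem1
#pragma HLS INTERFACE s_axilite port=rows_in  bundle=control
#pragma HLS INTERFACE s_axilite port=cols_in  bundle=control
#pragma HLS INTERFACE s_axilite port=rows_out bundle=control
#pragma HLS INTERFACE s_axilite port=cols_out bundle=control
#pragma HLS INTERFACE s_axilite port=return   bundle=control

    xf::Mat<XF_8UC1, HEIGHT, WIDTH, NPC1>       in_mat(rows_in, cols_in);
    xf::Mat<XF_8UC1, NEWHEIGHT, NEWWIDTH, NPC1> out_mat(rows_out, cols_out);
#pragma HLS DATAFLOW
    xf::AXIvideo2xfMat<INPUT_PTR_WIDTH, XF_8UC1, HEIGHT, WIDTH, NPC1>(img_inp, in_mat);   // K2K
    xf::resize<XF_INTERPOLATION_BILINEAR, XF_8UC1, HEIGHT, WIDTH,
               NEWHEIGHT, NEWWIDTH, NPC1, MAXDOWNSCALE>(in_mat, out_mat);
    xf::xfMat2Array<OUTPUT_PTR_WIDTH, XF_8UC1, NEWHEIGHT, NEWWIDTH, NPC1>(out_mat, img_out);
}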
Below are the host code differences when switching from the K2M2K to the K2K flow.
If you use an in-order command queue, a deadlock occurs: the second kernel waits for the first kernel to finish, but the first kernel never finishes because it is waiting for the second kernel to accept values from the stream. You need to create the command queue with out-of-order execution enabled, as shown below, so that the two kernels can run concurrently.
cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);
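With the out-of-order queue in place, both kernels are simply enqueued back to back and left to run concurrently. A minimal sketch is shown below; the event names are illustrative, buffer transfers are omitted, and with an out-of-order queue the input-buffer write must itself be synchronized via a blocking call or an event dependency.

cl::Event ev_cvt, ev_resize;
q.enqueueTask(cvtcolor_bgr2gray, NULL, &ev_cvt);    // starts producing on the AXI stream
q.enqueueTask(resize_accel,      NULL, &ev_resize); // starts consuming as data arrives
std::vector<cl::Event> both;
both.push_back(ev_cvt);
both.push_back(ev_resize);
cl::Event::waitForEvents(both);                     // wait for both kernels before reading results back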
If this change is not made, the kernels will lock up on the board. Running xbutil query then shows the compute unit status below, with the first kernel stuck in START and the second still IDLE:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compute Unit Status
CU[ 0]: cvtcolor_bgr2gray:cvtcolor_bgr2gray_1@0x1800000 (START)
CU[ 1]: resize_accel:resize_accel_1 @0x1810000 (IDLE)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Modify the kernel setup arguments in the main host code. For K2K, the intermediate buffer argument of each kernel is replaced with a null pointer, because the data now moves directly over the AXI stream instead of through kernel memory:
cvtcolor_bgr2gray.setArg(0, imageToDevice_cvt);
//K2K cvtcolor_bgr2gray.setArg(1, imageK2KinMem);
clSetKernelArg(cvtcolor_bgr2gray.get(), 1, sizeof(cl_mem), nullptr); //K2K
cvtcolor_bgr2gray.setArg(2, in_height);
cvtcolor_bgr2gray.setArg(3, in_width);
clSetKernelArg(resize_accel.get(), 0, sizeof(cl_mem), nullptr); //K2K
//K2K resize_accel.setArg(0, imageK2KinMem);
resize_accel.setArg(1, imageFromDevice);
resize_accel.setArg(2, in_height);
resize_accel.setArg(3, in_width);
resize_accel.setArg(4, out_height);
resize_accel.setArg(5, out_width);
Below is the kernel linking difference when switching from the K2M2K to the K2K flow.
For K2K only, the kernel linker connects the two kernel streaming interfaces together using the --sc switch:
--sc cvtcolor_bgr2gray_1.img_gray:resize_accel_1.img_inp
This section gathers the runtime results after running the K2M2K and K2K dataflows on both targeted platforms, Alveo and MPSoC. During execution, the host first runs the video functions using the OpenCV library, an optimized software-only version without fabric acceleration; this runtime is logged as “OpenCV duration (CPU)” in Tables 1 to 4 below. The host then runs the same video functions using the Vitis Vision Library on the acceleration platforms, with the cvtColor and resize functions executed in FPGA fabric; these runtimes are reported as “Total kernel runtime (fabric)” on the last line of Tables 1 to 4. Tables 1 and 2 contain two additional runtimes, one for cvtColor and one for resize: when running the kernel-to-memory application, the host can measure these runtimes independently. For kernel-to-kernel streaming, however, both kernels execute concurrently in the fabric, so the intermediate runtimes cannot be extracted and are not reported in Tables 3 and 4.
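For reference, one way the independent cvtColor and resize runtimes of Tables 1 and 2 can be captured on the host is through OpenCL event profiling, which the CL_QUEUE_PROFILING_ENABLE flag on the command queue already enables. This is an illustrative sketch, not the tutorial's exact measurement code:

cl::Event ev_cvt;
q.enqueueTask(cvtcolor_bgr2gray, NULL, &ev_cvt);
ev_cvt.wait();
cl_ulong t_start = ev_cvt.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong t_end   = ev_cvt.getProfilingInfo<CL_PROFILING_COMMAND_END>();
double cvtcolor_ms = (t_end - t_start) * 1e-6;      // profiling timestamps are in nanoseconds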
(1) Kernel to Memory (K2M2K)
(2) Kernel to Kernel streaming (K2K)
The first observation is that the above results may not be what you expect when compared to the host processing time: the OpenCV processing time of resize and cvtColor is lower than the accelerated kernel time in 1 pixel per clock (1ppc) mode. These results are expected, since only one frame is processed in this example. The second observation is that kernel-to-kernel streaming cuts the execution time in half, which is a significant improvement; however, the fabric runtime at 1ppc is still slower than running the same functions on the host CPU.
When running the kernels in 8ppc mode, the performance of the CPU-based OpenCV functions is exceeded even when processing a single frame. Keep in mind that the host runtime can vary depending on workload and system performance (note the differences for the Alveo board between K2M2K and K2K), whereas the kernel runtime is fixed because it is dedicated hardware. Three important observations for the 8ppc mode: (1) both K2M2K and K2K show a 7x runtime improvement over the 1ppc mode in the fabric, (2) the fabric runtime is faster than host execution (17-60% for K2M2K, 2x for K2K), and (3) the overall improvement achieved with 8ppc and K2K is roughly 14x compared to 1ppc and K2M2K.
The current MPSoC architecture limits the PS-PL data width to 128 bits (HPC interface), which could potentially impact the overall streaming dataflow from host memory to kernel memory; Alveo does not have the same limit and can maintain the 512-bit AXI stream through the whole dataflow. In our specific case, this limitation has no impact on the runtime because the total data bandwidth remains below the maximum supported by a single HPC interface, but it is something to keep in mind for other acceleration projects. Vitis and HLS take care of the data packing to transfer the pixels efficiently over the 128-bit interface: two consecutive beats are required to carry the 192 bits produced per clock (3 colors x 8 bits/color x 8 ppc = 192 bits).
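As a quick sanity check of that packing arithmetic, the illustrative snippet below reproduces the numbers; the constants are assumptions matching the text, not values from the tutorial's headers.

constexpr int kBitsPerPixel   = 3 * 8;                                       // 3 colors x 8 bits/color
constexpr int kPixelsPerClock = 8;                                           // 8ppc configuration
constexpr int kBitsPerClock   = kBitsPerPixel * kPixelsPerClock;             // 192 bits produced per clock
constexpr int kHpcBusWidth    = 128;                                         // MPSoC PS-PL HPC interface width
constexpr int kBeatsPerClock  = (kBitsPerClock + kHpcBusWidth - 1) / kHpcBusWidth;
static_assert(kBeatsPerClock == 2, "two 128-bit beats carry each 192-bit 8ppc transfer");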
At the time of this article, we have only implemented single-frame processing. There are two ways to show the real advantages of accelerating OpenCV functions with the Vitis Vision Library. The first is chaining many functions back to back: a CPU must process each frame serially as it moves through the OpenCV functions manipulating the single frame, whereas in the acceleration kernel space the frame being processed by the first kernel is immediately passed along to the next kernel during execution (see the kernel-to-kernel streaming timing diagram above, Figure 8). The more kernels in the acceleration domain, the more parallel processing occurs; you could also replicate the chain of kernels to process multiple frames at a time. Other functions in the OpenCV library are more CPU intensive, which would allow for an even higher performance gain in the fabric.
The second way to significantly decrease the kernel processing time is to buffer many frames in host memory and then transfer them to the kernel space for execution. This keeps the kernels processing continuously instead of waiting for a single frame to be sent over at a time. In the end it all comes down to memory management and keeping the kernels/pipeline fully utilized. You could also replicate the accelerated datapath multiple times in the fabric, allowing multiple frames to be processed in parallel.
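A conceptual host-side sketch of this frame-buffering idea is shown below. Everything in it (function name, pipeline depth, the frames container) is hypothetical and simplified; it only illustrates keeping several frames in flight on the out-of-order queue so the kernels never wait on the host. The size and null-pointer stream arguments are assumed to have been set once before the loop, and per-slot result read-back is omitted.

#include <CL/cl.hpp>
#include <vector>

void pump_frames(cl::Context& context, cl::CommandQueue& q,          // q is the out-of-order queue
                 cl::Kernel& cvtcolor_bgr2gray, cl::Kernel& resize_accel,
                 const std::vector<std::vector<unsigned char> >& frames,
                 size_t inBytes, size_t outBytes) {
    const int kFramesInFlight = 4;                                    // illustrative pipeline depth
    std::vector<cl::Buffer> inBuf, outBuf;
    for (int i = 0; i < kFramesInFlight; ++i) {
        inBuf.push_back (cl::Buffer(context, CL_MEM_READ_ONLY,  inBytes));
        outBuf.push_back(cl::Buffer(context, CL_MEM_WRITE_ONLY, outBytes));
    }
    std::vector<cl::Event> wrDone(kFramesInFlight), krnDone(kFramesInFlight);

    for (size_t f = 0; f < frames.size(); ++f) {
        int slot = int(f % kFramesInFlight);
        if (f >= size_t(kFramesInFlight)) krnDone[slot].wait();       // slot must be free before reuse
        // Non-blocking transfer of the next frame into its slot.
        q.enqueueWriteBuffer(inBuf[slot], CL_FALSE, 0, inBytes, frames[f].data(),
                             NULL, &wrDone[slot]);
        cvtcolor_bgr2gray.setArg(0, inBuf[slot]);
        resize_accel.setArg(1, outBuf[slot]);
        std::vector<cl::Event> deps(1, wrDone[slot]);
        q.enqueueTask(cvtcolor_bgr2gray, &deps, NULL);                // producer
        q.enqueueTask(resize_accel,      &deps, &krnDone[slot]);      // consumer, runs concurrently
    }
    q.finish();
}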
We are currently working on a design that will achieve the above enhancements.
Finally, we demonstrated that the exact same source code can be used to target either a data center/Alveo or an embedded/MPSoC platform, which makes it possible to move a design originally meant for one platform to the other without impacting your schedule.
Some additional ideas to improve performance:
(a) Optimize the existing kernel
a. Analyze the timelines
b. Increase the fabric target frequency (Alveo @ 300 MHz, MPSoC @ 150 MHz)
(b) Create your own C/C++ module, to replace or expand the capabilities of the existing ones.
(c) Reuse your own IP cores, either legacy RTL code or C/C++ HLS block designs.
(d) Data streaming into the accelerator cards could be rearchitected to use the QDMA instead of the XDMA. The QDMA shell would allow you to stream directly from the host to the kernel and back, saving the additional memory reads and writes to and from the kernel DDR.
Depending on the target platform, compile OpenCV 3.4.3 for an x86-based host (Alveo flow) using the instructions below, or use the pre-compiled Arm-based library (MPSoC flow) already available with the Vitis internal_platforms.
1. Open a fresh terminal (xterm); do not set up anything related to Vivado or Vitis, or CMake will fail.
2. Change directory to a working directory of your choice, for instance: cd ~/src_opencv
3. git clone --branch 3.4.3 https://github.com/opencv/opencv.git
4. git clone --branch 3.4.3 https://github.com/opencv/opencv_contrib.git
5. mkdir build -> this will create a sub-folder “build” -> ~/src_opencv/build
6. cd build
7. Run this command to create the Makefile required to compile the OpenCV library:
cmake -D CMAKE_BUILD_TYPE=Release -D OPENCV_EXTRA_MODULES_PATH=~/src_opencv/opencv_contrib/modules -D CMAKE_INSTALL_PREFIX=~/opencv ~/src_opencv/opencv
8. From the “build” directory, type these two commands to compile and install the library:
make
make install
9. Once the compilation/installation is complete, you will see all the OpenCV files in ~/opencv
Download the Vitis Vision Library. This library is made available through GitHub. Run the following git clone command to clone the Vitis Vision Library repository to your local disk:
git clone https://github.com/Xilinx/Vitis_Libraries.git
For more information, review UG1233 - Xilinx OpenCV User Guide (page 9), (/content/xilinx/en/support/documentation/sw_manuals/xilinx2019_1/ug1233-xilinx-opencv-user-guide.pdf)
1. Vitis Libraries 2019.2 Release: https://github.com/Xilinx/Vitis_Libraries.git
2. SDAccel - Getting Started Examples: https://github.com/Xilinx/SDAccel_Examples/tree/master/getting_started
3. 2019.1 SDAccel™ Development Environment Tutorials: https://github.com/Xilinx/SDAccel-Tutorials
4. SDAccel-Tutorials/docs/sdaccel-getting-started/: https://github.com/Xilinx/SDAccel-Tutorials/tree/master/docs/sdaccel-getting-started
5. SDAccel / RTL Kernel / VADD / host code: https://github.com/Xilinx/SDAccel_Examples/blob/master/getting_started/rtl_kernel/rtl_vadd/src/host.cpp
6. SDAccel – Tutorial - Getting Started with RTL Kernels: https://github.com/Xilinx/SDAccel-Tutorials/blob/master/docs/getting-started-rtl-kernels/README.md
7. 2019.1 SDAccel™ Development Environment Tutorials - Mixing C++ and RTL Kernels: https://github.com/Xilinx/SDAccel-Tutorials/tree/master/docs/mixing-c-rtl-kernels
8. SDAccel Examples – Concurrent kernel execution (Makefile): https://github.com/Xilinx/SDAccel_Examples/blob/master/getting_started/host/concurrent_kernel_execution_c/Makefile
9. SDAccel Examples – Xilinx xfopencv - SDAccel examples: https://github.com/Xilinx/xfopencv/tree/master/examples_sdaccel
10. SDSoC - Tutorials - Migrate OpenCV to xfOpenCV Labs: https://github.com/Xilinx/SDSoC-Tutorials/tree/master/opencv-to-xfopencv-migration-tutorial
11. xfOpenCV 2019.1 Release: https://github.com/Xilinx/xfopencv/releases
12. SDx Release Notes (interesting links to XMA Xilinx Media Acceleration): UG1238
13. Using Multiple Compute Units: https://github.com/Xilinx/SDAccel-Tutorials/tree/master/docs/using-multiple-cu
14. Install OpenCV in Linux: https://docs.opencv.org/3.3.0/d7/d9f/tutorial_linux_install.html
15. Cross compilation for ARM based Linux systems
Bill George is located near Toronto, Ontario and serves as an AVB focused Field Applications Engineer (FAE) for AMD. Bill has been in the FPGA industry for over 19 years, where the last 11 years have been at AMD with a focus on memory interfaces, acceleration, and Vivado-HLS. Bill and his wife have three kids and enjoy boating, traveling, music and off-roading.
Benoit Payette is located near Montreal, Quebec, Canada and supports local customers as an AMD Field Applications Engineer (FAE) for AVB, ProAV and ISM markets. Benoit guides engineers through technical solutions and pushes adoption of new techniques and methodologies. Prior to this, he was a Strategic Apps Engineer (SAE) for 16 years where he worked on custom designs, timing closure and to improve AMD solutions. Benoit holds two patents, wrote a few application notes, and gained expertise with VCO replacement and video interconnectivity solutions. Benoit loves to grow his mustache and beard annually during Movember, much to the chagrin of his wife Chantal Racette. Nevertheless, their relationship still holds strong after 27 years and they are proud parents of three boys and one girl. The rest of the time, Benoit enjoys cycling/camping, traveling the world, playing adventure games, and he tries to keep up with younger ball-hockey players.