Profiling the Application
The Vitis™ core development kit generates various system and kernel resource performance reports during compilation. These reports help you establish a performance baseline for your application, identify bottlenecks, and identify target functions that can be accelerated in hardware kernels, as discussed in Methodology for Architecting a Device Accelerated Application. The Xilinx Runtime (XRT) also collects profiling data during application execution in both emulation and system mode configurations. Examples of the reported data include:
- Host and device timeline events
- OpenCL™ API call sequence
- Kernel execution sequence
- FPGA trace data including AXI transactions
- Kernel start and stop signals
Together the reports and profiling data can be used to isolate performance bottlenecks in the application and optimize the design to improve performance. Optimizing an application requires optimizing both the application host code and any hardware accelerated kernels. The host code must be optimized to facilitate data transfers and kernel execution, while the kernel should be optimized for performance and resource usage.
There are four distinct areas to be considered when performing algorithm optimization in Vitis: System resource usage and performance, kernel optimization, host optimization, and data transfer optimization. The following Vitis reports and graphical tools support your efforts to profile and optimize these areas:
- Design Guidance
- System Estimate Report
- HLS Report
- Profile Summary Report
- Application Timeline
- Waveform View and Live Waveform Viewer
Reports are automatically generated after running the active build, either from the command line as described in Running an Application, or from the Vitis integrated design environment (IDE). Separate sets of reports are generated for all three build targets and can be found in the respective report directories. Refer to Directory Structure for more information on locating these reports.
Reports can be viewed in a web browser, a spreadsheet viewer, or from the Vitis IDE. To access these reports from the Vitis IDE, ensure the Assistant view is visible and double-click the desired report.
The following topics briefly describe the various reports and graphical visualization tools and how they can be used to profile your design.
Baselining Functionalities and Performance
It is very important to understand the performance of your application before you start any optimization effort. This is achieved by establishing a baseline for the application in terms of functionalities and performance.
Identify Bottlenecks
The first step is to identify the bottlenecks of your application running on your target platform. The most effective way is to run the application with profiling tools, like valgrind, callgrind, and GNU gprof. The profiling data generated by these tools shows the call graph with the number of calls to all functions and their execution times.
Run Software and Hardware Emulation
Run software and hardware emulation on the accelerated application, as described in Running an Application, to verify functional correctness and to generate profiling data on the host code and the kernels. Analyze the kernel compilation reports, profile summary, timeline trace, and device hardware transactions to understand the baseline performance estimates for timing, interval, and latency, and the resource utilization, such as DSP and block RAM.
Build and Run the Application
The last step in baselining is building and running the application on an FPGA acceleration card, like one of the Alveo™ Data Center accelerator cards. Analyze the reports from the system compilation, and the profiling data from application execution to see the actual performance and resource utilization on hardware.
Design Guidance
The Vitis core development kit has a comprehensive design guidance tool that provides immediate, actionable guidance to the software developer for issues detected in their designs. These issues might be related to the source code, or due to missed tool optimizations. The guidance rules are generic rules based on an extensive set of reference designs, and therefore might not be applicable to your specific design. It is up to you to understand the specific guidance rules and take appropriate action based on your specific algorithm and requirements.
Guidance is generated from Vivado HLS, the Vitis profiler, and the Vivado Design Suite when invoked by the v++ compiler. The generated design guidance can have several severity levels: errors, advisories, warnings, and critical warnings are provided during software emulation, hardware emulation, and system builds. The profile design guidance helps you interpret the profiling results, which allows you to focus on improving performance.
The guidance includes hyperlinks, examples, and links to documentation. This helps improve productivity by quickly highlighting issues and directing you to additional information on using the Vitis technology.
Design guidance is automatically generated after building or running a design in the Vitis IDE.
You can open the Guidance report as discussed in Using the Vitis Analyzer. To access the Guidance report, open the Compile summary, the Link summary, or the Run summary, and open the Guidance report. With the Guidance report open, hovering over the guidance highlights recommended resolutions.
The following image shows an example of the Guidance report displayed in the Vitis analyzer. Clicking a link displays an expanded view of the actionable guidance.
There is one HTML guidance report for each run of the v++ command, including compile and link. The report files are located in the --report_dir under the specific output name. For example:
- v++_compile_<output>_guidance.html for v++ compilation
- v++_link_<output>_guidance.html for v++ linking
Opening the Guidance Report
When kernels are compiled and when the FPGA binary is linked, guidance
reports are generated automatically by the v++
command. You can view these reports in the Vitis
analyzer by opening the <output_filename>.compile.summary or the <output_filename>.link.summary for the
application project. The <output_filename>
is the output of the v++ command.
As an example, launch the Vitis analyzer and open the report using this command:
vitis_analyzer <output_filename>.link.summary
When the Vitis analyzer opens, it displays the link summary report, as well as the compile summary, and a collection of reports generated during the compile and link processes. Both the compile and link steps generate Guidance reports to view by clicking the Build heading on the left-hand side. Refer to Using the Vitis Analyzer for more information.
Interpreting Guidance Data
The Guidance view places each entry in a separate row. Each row might contain the name of the guidance rule, threshold value, actual value, and a brief but specific description of the rule. The last field provides a link to reference material intended to assist in understanding and resolving any of the rule violations.
In the GUI Guidance view, guidance rules are grouped by categories and unique IDs in the Name column, and annotated with symbols representing the severity. In the HTML report, the rules are listed individually. In addition, because the HTML report does not show tooltips, it includes a Full Name column as well.
The following list describes all fields and their purpose as included in the HTML guidance reports.
- Id
- Each guidance rule is assigned a unique ID. Use this ID to uniquely identify a specific message from the guidance report.
- Name
- The Name column displays a mnemonic name uniquely identifying the guidance rule. These names are designed to assist in memorizing specific guidance rules in the view.
- Severity
- The Severity column allows the easy identification of the importance of a guidance rule.
- Full Name
- The Full Name provides a less cryptic name compared to the mnemonic name in the Name column.
- Categories
- Most messages are grouped within different categories. This allows the GUI to display groups of messages within logical categories under common tree nodes in the Guidance view.
- Threshold
- The Threshold column displays an expected threshold value, which determines whether or not a rule is met. The threshold values are determined from many applications that follow good design and coding practices.
- Actual
- The Actual column displays the values actually encountered on the specific design. This value is compared against the expected value to see if the rule is met.
- Details
- The Details column provides a brief but specific message describing the specifics of the current rule.
- Resolution
- The Resolution column provides a pointer to common ways the model source code or tool transformations can be modified to meet the current rule. Clicking the link brings up a pop-up window or the documentation with tips and code snippets that you can apply to the specific issue.
System Estimate Report
Building the hardware system and the FPGA binary to run on Xilinx devices is the process step with the longest execution time. The build time is also affected by the target device and the number of compute units instantiated onto the FPGA fabric. Therefore, it is useful to estimate the performance of an application without needing to build it for the system hardware.
The System Estimate report provides estimates of FPGA resource usage and the estimated frequency at which the hardware accelerated kernels can operate. The report is automatically generated for hardware emulation and system hardware builds. The report contains high-level details of the user kernels, including resource usage and estimated frequency. This report can be used to guide design optimization.
You can also force the generation of the System Estimate report with the following option:
v++ .. --report_level estimate
An example report is shown in the figure:
Opening the System Estimate Report
The System Estimate report can be opened in the Vitis analyzer tool, intended for viewing reports from the Vitis compiler when the application is built, and the XRT library when the application is run. You can launch the Vitis analyzer and open the report using the following command:
vitis_analyzer <output_filename>.link.summary
The <output_filename> is the
output of the v++ command. This opens the Link
Summary for the application project in the Vitis
analyzer tool. Then, select the System Estimate report. Refer to Using the Vitis Analyzer for more information.
Interpreting the System Estimate Report
The System Estimate report generated by the v++ command provides information on every binary container in the
application, as well as every compute unit in the design. The report is structured as
follows:
- Target device information
- Summary of every kernel in the application
- Detailed information on every binary container in the solution
The following example report file represents the information generated for the estimate report:
-------------------------------------------------------------------------------
Design Name: mmult.hw_emu.xilinx_u200_xdma_201830_2
Target Device: xilinx:u200:xdma:201830.2
Target Clock: 300.000000MHz
Total number of kernels: 1
-------------------------------------------------------------------------------
Kernel Summary
Kernel Name Type Target OpenCL Library Compute Units
----------- ---- ------------------ -------------------------------------- -------------
mmult c fpga0:OCL_REGION_0 mmult.hw_emu.xilinx_u200_xdma_201830_2 1
-------------------------------------------------------------------------------
OpenCL Binary: mmult.hw_emu.xilinx_u200_xdma_201830_2
Kernels mapped to: clc_region
Timing Information (MHz)
Compute Unit Kernel Name Module Name Target Frequency Estimated Frequency
------------ ----------- ----------- ---------------- -------------------
mmult_1 mmult mmult 300.300293 411.015198
Latency Information (clock cycles)
Compute Unit Kernel Name Module Name Start Interval Best Case Avg Case Worst Case
------------ ----------- ----------- -------------- --------- -------- ----------
mmult_1 mmult mmult 826 ~ 829 825 827 828
Area Information
Compute Unit Kernel Name Module Name FF LUT DSP BRAM URAM
------------ ----------- ----------- ----- ----- ---- ---- ----
mmult_1 mmult mmult 81378 35257 1036 2 0
-------------------------------------------------------------------------------
Design and Target Device Summary
All design estimate reports begin with an application summary and information about the target device. The device information is provided in the following section of the report:
-------------------------------------------------------------------------------
Design Name: mmult.hw_emu.xilinx_u200_xdma_201830_2
Target Device: xilinx:u200:xdma:201830.2
Target Clock: 300.000000MHz
Total number of kernels: 1
-------------------------------------------------------------------------------
For the design summary, the information provided includes the following:
- Target Device
- Name of the Xilinx device on the target platform that runs the FPGA binary built by the Vitis compiler.
- Target Clock
- Specifies the target operating frequency for the compute units (CUs) mapped to the FPGA fabric.
Kernel Summary
This section lists all of the kernels defined for the application project. The following example shows the kernel summary:
Kernel Summary
Kernel Name Type Target OpenCL Library Compute Units
----------- ---- ------------------ -------------------------------------- -------------
mmult c fpga0:OCL_REGION_0 mmult.hw_emu.xilinx_u200_xdma_201830_2 1
In addition to the kernel name, the summary also provides the execution target and the type of the input source. Because there is a difference in compilation and optimization methodology for OpenCL™ and C/C++ source files, the type of kernel source file is specified.
The Kernel Summary section is the last summary information in the report. From here, detailed information on each compute unit binary container is presented.
Timing Information
For each binary container, the detail section begins with the execution target of all compute units (CUs). It also provides timing information for every CU. As a general rule, if the estimated frequency for the FPGA binary is higher than the target frequency, the CU will be able to run in the device. If the estimated frequency is below the target frequency, the kernel code for the CU needs to be further optimized to run correctly on the FPGA fabric. This information is shown in the following example:
OpenCL Binary: mmult.hw_emu.xilinx_u200_xdma_201830_2
Kernels mapped to: clc_region
Timing Information (MHz)
Compute Unit Kernel Name Module Name Target Frequency Estimated Frequency
------------ ----------- ----------- ---------------- -------------------
mmult_1 mmult mmult 300.300293 411.015198
It is important to understand the difference between the target and estimated frequencies. CUs are not placed in isolation into the FPGA fabric. CUs are placed as part of a valid FPGA design that can include other components defined by the device developer to support a class of applications.
Because the CU custom logic is generated one kernel at a time, an estimated frequency that is higher than the target frequency indicates that the CU can run at the higher estimated frequency. Therefore, the CU should meet timing at the target frequency during implementation of the FPGA binary.
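As a quick sanity check, the comparison between target and estimated frequency can be restated in terms of clock periods. The following is a minimal sketch (the function name is illustrative; the values come from the example report above):

```python
def meets_timing(target_mhz: float, estimated_mhz: float) -> bool:
    # The CU is expected to meet timing when the estimated clock period
    # is no longer than the target clock period.
    target_period_ns = 1000.0 / target_mhz
    estimated_period_ns = 1000.0 / estimated_mhz
    return estimated_period_ns <= target_period_ns

# Values for mmult_1 from the example report: 411 MHz estimated vs. 300 MHz target
print(meets_timing(300.300293, 411.015198))  # True
```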
Latency Information
The latency information presents the execution profile of each CU in the binary container. When analyzing this data, it is important to recognize that all values are measured from the CU boundary through the custom logic. In-system latencies associated with data transfers to global memory are not reported as part of these values. Also, the latency numbers reported are only for CUs targeted at the FPGA fabric. The following is an example of the latency report:
Latency Information (clock cycles)
Compute Unit Kernel Name Module Name Start Interval Best Case Avg Case Worst Case
------------ ----------- ----------- -------------- --------- -------- ----------
mmult_1 mmult mmult 826 ~ 829 825 827 828
The latency report is divided into the following fields:
- Start interval
- Best case latency
- Average case latency
- Worst case latency
The start interval defines the amount of time that has to pass between invocations of a CU for a given kernel.
The best, average, and worst case latency numbers refer to how much time it takes the CU to generate the results of one NDRange data tile for the kernel. For cases where the kernel does not have data-dependent computation loops, the latency values will be the same. Data-dependent execution of loops introduces data-specific latency variation that is captured by the latency report.
Data-dependent latency variation typically occurs in the following cases:
- OpenCL kernels that do not have an explicit reqd_work_group_size(x,y,z) attribute
- Kernels that have loops with variable bounds
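The start interval also bounds the achievable throughput of a CU. The following is a hedged back-of-the-envelope sketch using the worst-case start interval and target clock from the example report (the function name is illustrative):

```python
def max_invocations_per_second(start_interval_cycles: int, clock_mhz: float) -> float:
    # A new invocation can begin at most once every start_interval_cycles,
    # so the invocation rate is bounded by clock frequency / start interval.
    return (clock_mhz * 1e6) / start_interval_cycles

# Worst-case start interval of 829 cycles at the 300 MHz target clock
print(max_invocations_per_second(829, 300.0))  # roughly 3.6e5 invocations per second
```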
Area Information
Although the FPGA can be thought of as a blank computational canvas, there are a limited number of fundamental building blocks available in each FPGA. These fundamental blocks (FF, LUT, DSP, block RAM) are used by the Vitis compiler to generate the custom logic for each CU in the design. The quantity of fundamental resources needed to implement the custom logic for a single CU determines how many CUs can be simultaneously loaded into the FPGA fabric. The following example shows the area information reported for a single CU:
Area Information
Compute Unit Kernel Name Module Name FF LUT DSP BRAM URAM
------------ ----------- ----------- ----- ----- ---- ---- ----
mmult_1 mmult mmult 81378 35257 1036 2 0
-------------------------------------------------------------------------------
HLS Report
The HLS report provides details about the high-level synthesis (HLS) process of a user kernel and is generated in hardware emulation and system builds. This process translates the C/C++ and OpenCL kernel into the hardware description language used for implementing the kernel logic on the FPGA. The report provides estimated FPGA resource usage, operating frequency, latency, and interface signals of the custom-generated hardware logic. These details provide many insights to guide kernel optimization.
When running from the command line, this report can be found in the following directory:
_x/<kernel_name>.<target>.<platform>/<kernel_name>/<kernel_name>/solution/syn/report
The HLS report can be opened from the Vitis analyzer by opening the Compile Summary, or the Link Summary as described in Using the Vitis Analyzer. An example of the HLS report is shown.
Generating and Opening the HLS Report
The --save-temps option must be specified to preserve the intermediate files produced by Vivado HLS, including the reports. The HLS report and HLS guidance are only generated for hardware emulation and system builds for C and OpenCL kernels. They are not generated for software emulation or RTL kernels.
The HLS report can be viewed through the Vitis analyzer by opening the <output_filename>.compile.summary or the <output_filename>.link.summary for the application project. The <output_filename> is the output of the v++ command.
You can launch the Vitis analyzer and open the report using the following command:
vitis_analyzer <output_filename>.compile.summary
When the Vitis analyzer opens, it displays the compile summary and a collection of reports generated during the compile process. The HLS report can be viewed by clicking the Build heading on the left-hand side. Refer to Using the Vitis Analyzer for more information.
Interpreting the HLS Report
The left pane of the HLS report shows the module hierarchy. Each module generated as part of the HLS run is represented in this hierarchy. You can select any of these modules to present the synthesis details of the module in the right-side of the Synthesis Report window. The Synthesis Report is separated into several sections:
- General Information
- Performance Estimates (timing and latency)
- Utilization Estimates
- Interface Information
If a module is part of a hierarchical block, the report sums up the information of the blocks contained in the hierarchy. Therefore, you can navigate the hierarchy from within the report to see which instance contributes what to the overall design.
Profile Summary Report
When properly configured, the Vitis Runtime library collects profiling data on host applications and kernels. After the application finishes execution, the Profile Summary report is saved as a .csv file in the directory where the compiled host code is executed.
The Profile Summary provides annotated details regarding the overall application performance. All data generated during the execution of the application is grouped into categories. The Profile Summary lets you examine kernel execution and data transfer statistics.
An example of the Profile Summary report is shown below.
The report has multiple tabs that can be selected. A description of each tab is given here:
- Top Operations
- Kernels and Global Memory. This tab shows a summary of top operations. It displays the profile data for top data transfers between FPGA and device memory.
- Kernels & Compute Units
- Displays the profile data for all kernels and compute units.
- Data Transfers
- Host and Global Memory. This displays the profile data for all read and write transfers between the host and device memory through the PCIe link. It also displays data transfers between kernels and global memory, if enabled.
- OpenCL APIs
- Displays the profile data for all OpenCL C host API function calls executed in the host application.
- Kernel Internals
- This tab is reported during hardware emulation if you have enabled launch_waveform in the [Emulation] section of the xrt.ini file, as described in xrt.ini File. The generated waveform data (.wdb) is reported in the profile_kernels.csv and timeline_kernels.csv files, and the Kernel Internals tab is populated with this information. The reported information applies to C/C++ and OpenCL kernels, and is not reported for RTL kernels.
For details on the profile summary, see Interpreting the Profile Summary.
Generating and Opening the Profile Summary Report
Capturing the data required for the Profile Summary requires a few steps prior to actually running the application.
- The FPGA binary (xclbin) file is configured for capturing profiling data by default. However, using the v++ --profile_kernel option during the linking process enables a greater level of detail in the profiling data captured. See the Vitis Compiler Command for more information on the --profile_kernel option.
- The runtime requires the presence of an xrt.ini file, as described in xrt.ini File, that includes the keyword for capturing profiling data:
[Debug]
profile = true
- To enable the profiling of Kernel Internals data, you must also add the launch_waveform tag in the [Emulation] section of the xrt.ini file:
[Emulation]
launch_waveform = batch
With profiling enabled in the FPGA binary and in the xrt.ini file, the runtime creates the profile_summary.csv report file when running the application, and also creates the profile_kernels.csv and timeline_kernels.csv files when Kernel Internals is enabled.
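Putting these settings together, a minimal xrt.ini enabling both the Profile Summary and Kernel Internals capture might look like the following sketch (only the keys named in this section are shown; other options are described in xrt.ini File):

```ini
[Debug]
profile = true

[Emulation]
launch_waveform = batch
```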
The CSV report can be viewed in a spreadsheet tool or utility, or can be opened in the Vitis analyzer tool, intended for viewing reports from the Vitis compiler when the application is built, and the XRT library when the application is run. You can launch the Vitis analyzer and open the report using the following command:
vitis_analyzer profile_summary.csv
Related Information
Interpreting the Profile Summary
The Profile Summary includes a number of useful statistics for your OpenCL application. This provides a general idea of the functional bottlenecks in your application. The Profile Summary consists of the following sections:
- Top Operations
- Top Data Transfer: Kernels and Global
Memory: This table displays the profile data for top data transfers
between FPGA and device memory.
- Device: Name of device
- Compute Unit: Name of compute unit
- Number of Transfers: Sum of write and read AXI transactions monitored on device
- Average Bytes per Transfer: (Total read bytes + Total write bytes) / (Total read AXI transactions + Total write AXI transactions)
- Transfer Efficiency (%): (Average bytes per transfer) / min(4K, (Memory bit width / 8 * 256)). The AXI4 specification limits the maximum burst length to 256 and the maximum burst size to 4K bytes.
- Total Data Transfer (MB): (Total read bytes + Total write bytes) / 1.0E6
- Total Write (MB): (Total write bytes) / 1.0E6
- Total Read (MB): (Total read bytes) / 1.0E6
- Transfer Rate (MB/s): (Total data transfer) / (Compute unit total time)
- Top Kernel Execution
- Kernel Instance Address: Host address of kernel instance (in hex)
- Kernel: Name of kernel
- Context ID: Context ID on host
- Command Queue ID: Command queue ID on host
- Device: Name of device where kernel was executed (format: <device>-<ID>)
- Start Time (ms): Start time of execution (in ms)
- Duration (ms): Duration of execution (in ms)
- Global Work Size: NDRange of kernel
- Local Work Size: Work group size of kernel
- Top Memory Writes: Host and Device Global
Memory
- Buffer Address: Host address of buffer (in hex)
- Context ID: Context ID on host
- Command Queue ID: Command queue ID on host
- Start Time (ms) : Start time of write transfer (in ms)
- Duration (ms): Duration of write transfer (in ms)
- Buffer Size (KB): Size of write transfer (in KB)
- Writing Rate (MB/s): Writing rate = (Buffer size) / (Duration)
- Top Memory Reads: Host and Device Global
Memory
- Buffer Address: Host address of buffer (in hex)
- Context ID: Context ID on host
- Command Queue ID: Command queue ID on host
- Start Time (ms): Start time of read transfer (in ms)
- Duration (ms): Duration of read transfer (in ms)
- Buffer Size (KB): Size of read transfer (in KB)
- Reading Rate (MB/s): Reading rate = (Buffer size) / (Duration)
- Kernels & Compute Units
- Kernel Execution (includes estimated device
times): This displays the profile data summary for all kernel functions
scheduled and executed.
- Kernel: Name of kernel
- Number of Enqueues: Number of times kernel is enqueued
- Total Time (ms): Sum of runtimes of all enqueues (measured from START to END in OpenCL execution model)
- Minimum Time (ms): Minimum runtime of all enqueues
- Average Time (ms): (Total time) / (Number of enqueues)
- Maximum Time (ms): Maximum runtime of all enqueues
- Compute Unit Utilization (includes
estimated device times): This displays the summary profile data for all
compute units on the FPGA.
- Device: Name of device (format: <device>-<ID>)
- Compute Unit: Name of compute unit
- Kernel: Kernel this compute unit is associated with
- Global Work Size: NDRange of kernel (format is x:y:z)
- Local Work Size: Local work group size (format is x:y:z)
- Number of Calls: Number of times the compute unit is called
- Total Time (ms): Sum of runtimes of all calls
- Minimum Time (ms): Minimum runtime of all calls
- Average Time (ms): (Total time) / (Number of work groups)
- Maximum Time (ms): Maximum runtime of all calls
- Clock Frequency (MHz): Clock frequency used for a given accelerator (in MHz)
- Data Transfers
- Data Transfer: Host and Global
Memory: This displays the profile data for all read and write transfers
between the host and device memory via PCI Express® link.
- Context: Number of Devices: Context ID and number of devices in context
- Transfer Type: READ or WRITE
- Number of Transfers: Number of host data transfers. Note: can contain printf transfers.
- Transfer Rate (MB/s): (Total bytes sent) / (Total time in µs), where the total time includes software overhead
- Average Bandwidth Utilization (%): (Transfer rate) / (Maximum transfer rate), where Maximum transfer rate = (256 / 8 bytes) * (300 MHz) = 9.6 GB/s
- Average Size (KB): (Total KB sent) / (Number of transfers)
- Total Time (ms): Sum of transfer times
- Average Time (ms): (Total time) / (Number of transfers)
- Data Transfer: Kernels and Global
Memory: This displays the profile data for all read and write transfers
between the FPGA and device memory.
- Device: Name of device
- Compute Unit/Port Name: <Name of compute unit>/<Name of port>
- Kernel Arguments: List of arguments connected to this port
- DDR Bank: DDR bank number this port is connected to
- Transfer Type: READ or WRITE
- Number of Transfers: Number of AXI transactions monitored on the device. Note: might contain printf transfers.
- Transfer Rate (MB/s): (Total bytes sent) / (Compute unit total time), where:
  - Compute unit total time = total execution time of the compute unit
  - Total bytes sent = sum of bytes across all transactions
- Average Bandwidth Utilization (%): (Transfer rate) / (0.6 * Maximum transfer rate), where Maximum transfer rate = (512 / 8 bytes) * (300 MHz) = 19200 MB/s
- Average Size (KB): (Total KB sent) / (number of AXI transactions)
- Average Latency (µs): (Total latency of all transactions) / (Number of AXI transactions)
- OpenCL API Calls: This displays
the profile data for all OpenCL host API function
calls executed in the host application.
- API Name: Name of API function (for example, clCreateProgramWithBinary, clEnqueueNDRangeKernel)
- Number of Calls: Number of calls to this API
- Total Time (ms): Sum of runtimes of all calls
- Minimum Time (ms): Minimum runtime of all calls
- Average Time (ms): (Total time) / (Number of calls)
- Maximum Time (ms): Maximum runtime of all calls
- Kernel Internals
- Compute Units: Running Time and
Stalls: Reports the running time for compute units in microseconds, and
reports stall time as a percent of the running time. TIP: The Kernel Internals tab reports time in microseconds (µs), while the rest of the Profile Summary reports time in milliseconds (ms).
- Compute Unit: Indicates the compute unit instance name.
- Running Time (µs): Reports the total running time for the CU.
- Intra-Kernel Stream Stalls (%): Reports the percentage of running time consumed in stalls when streaming data between kernels.
- External Memory Stalls (%): Reports the percentage of running time consumed in stalls for memory transfers outside the CU.
- External Stream Stalls (%): Reports the percentage of running time consumed in stalls when streaming data to or from outside the CU.
- Functions: Running Time and
Stalls: Reports the running time for executing functions inside the CU
in microseconds, and reports stall time as a percent of the running time.
- Compute Unit: Indicates the compute unit instance name.
- Function: Indicates the function name inside the CU.
- Running Time (µs): Reports the total running time for the function.
- Intra-Kernel Stream Stalls (%): Reports the percentage of running time consumed in stalls when streaming data between kernels.
- External Memory Stalls (%): Reports the percentage of running time consumed in stalls for memory transfers outside the CU.
- External Stream Stalls (%): Reports the percentage of running time consumed in stalls when streaming data to or from outside the CU.
- Compute Units: Port Data
Transfer: Reports the data transfer for specific ports on the compute
unit.
- Compute Unit: Indicates the compute unit instance name.
- Port: Indicates the port name on the compute unit.
- Write Time (µs): Specifies the total data write time on the port.
- Outstanding Write (%): Specifies the percentage of the runtime consumed in the write process.
- Read Time (µs): Specifies the total data read time on the port.
- Outstanding Read (%): Specifies the percentage of the runtime consumed in the read process.
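Because the stall metrics above are reported as percentages of a CU's running time, converting them to absolute stall time is a single multiplication. A minimal sketch of that arithmetic (the running time and stall percentage below are hypothetical figures, not taken from any real report):

```python
# Convert a Kernel Internals stall percentage into absolute stall time.
# Running times are in microseconds, matching the Kernel Internals tab.
def stall_time_us(running_time_us, stall_pct):
    """Absolute time (µs) a CU spent stalled, given its stall percentage."""
    return running_time_us * stall_pct / 100.0

# Hypothetical figures for illustration only.
cu_running_time_us = 1250.0
external_memory_stall_pct = 12.0

print(stall_time_us(cu_running_time_us, external_memory_stall_pct))  # 150.0
```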
Application Timeline
The Application Timeline collects and displays host and kernel events on a common timeline to help you understand and visualize the overall health and performance of your systems. The graphical representation lets you see issues regarding kernel synchronization and efficient concurrent execution. The displayed events include:
- OpenCL API calls from the host code.
- Device trace data including compute units, AXI transaction start/stop.
- Host events and kernel start/stops.
While this is useful for debugging and profiling the application, timeline and device trace data are not collected by default, because collecting them adds time to the application execution. However, the trace data is captured with dedicated resources in the kernel and does not affect kernel functionality; the data is offloaded only at the end of the run (v++ --trace_memory option).
The following is a snapshot of the Application Timeline window which displays host and device events on a common timeline. Host activity is displayed at the top of the image and kernel activity is shown on the bottom of the image. Host activities include creating the program, running the kernel and data transfers between global memory and the host. The kernel activities include read/write accesses and transfers between global memory and the kernel(s). This information helps you understand details of application execution and identify potential areas for improvements.
Timeline data can be enabled and collected through the command line flow. However, viewing must be done in the Vitis analyzer as described in Using the Vitis Analyzer.
Generating and Opening the Application Timeline
To generate the Application Timeline report, you must complete the following steps to enable timeline and device trace data collection in the command line flow:
- Instrument the FPGA binary during linking by adding Acceleration Monitors and AXI Performance Monitors to kernels using the v++ --profile_kernel option. This option has three distinct instrumentation options (data, stall, and exec), as described in the Vitis Compiler Command. As an example, add --profile_kernel to the v++ linking command line:
v++ -g -l --profile_kernel data:all:all:all ...
- After the kernels are instrumented during the build process, data gathering must also be enabled during the application runtime execution by editing the xrt.ini file. Refer to xrt.ini File for more information. The following xrt.ini file will enable maximum information gathering when the application is run:
[Debug]
profile=true
timeline_trace=true
data_transfer_trace=coarse
stall_trace=all
TIP: If you are collecting a large amount of trace data, you might need to use the --trace_memory option with the v++ command, and the trace_buffer_size keyword in the xrt.ini file.
After running the application, the Application Timeline data is captured in a CSV file called timeline_trace.csv.
- The CSV report can be viewed in a spreadsheet tool or utility, or can be opened in the Vitis analyzer, a tool intended for viewing reports generated by the Vitis compiler when the application is built and by the XRT library when the application is run. You can launch the Vitis analyzer and open the report using the following command:
vitis_analyzer timeline_trace.csv
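Because timeline_trace.csv is plain CSV, it can also be inspected with a short script. The sketch below assumes a simplified layout with Name, Start, and End columns in milliseconds; the real file has a more complex multi-section header, so treat the column names here as placeholders:

```python
import csv
import io

# Hypothetical, simplified trace layout: the real timeline_trace.csv uses a
# multi-section format, so the columns below are placeholders for illustration.
sample = """Name,Start,End
clEnqueueWriteBuffer,0.10,0.45
clEnqueueNDRangeKernel,0.50,2.80
clEnqueueReadBuffer,2.90,3.20
"""

def total_durations(csv_text):
    """Sum the duration (End - Start, in ms) of trace events per event name."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        duration = float(row["End"]) - float(row["Start"])
        totals[row["Name"]] = totals.get(row["Name"], 0.0) + duration
    return totals

print(total_durations(sample))
```

A summary like this can help spot which API calls dominate the timeline before opening the full report in the Vitis analyzer.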
Interpreting the Application Timeline
The Application Timeline window displays host and device events on a common timeline. This information helps you understand details of application execution and identify potential areas for improvements. The Application Timeline report has two main sections: Host and Device. The Host section shows the trace of all the activity originating from the host side. The Device section shows the activity of the CUs on the FPGA.
The report has the following structure:
- Host
- OpenCL API Calls
- All OpenCL API calls are traced here. The activity time is measured from the host perspective.
- Device "name"
- Binary Container "name"
- Binary container name.
Waveform View and Live Waveform Viewer
The Vitis core development kit can generate a Waveform view when running hardware emulation. It displays in-depth details at the system-level, CU level, and at the function level. The details include data transfers between the kernel and global memory and data flow through inter-kernel pipes. These details provide many insights into performance bottlenecks from the system-level down to individual function calls to help optimize your application.
The Live Waveform Viewer is similar to the Waveform view; however, it provides even lower-level details with some degree of interactivity. The Live Waveform Viewer can also be opened using the Vivado logic simulator, xsim.
Waveform View and Live Waveform Viewer data are not collected by default because it requires the runtime to generate simulation waveforms during hardware emulation, which consumes more time and disk space. Refer to Generating and Opening the Waveform Reports for instructions on enabling these features.
You can also open the waveform database (.wdb) file with the Vivado logic simulator through the Linux command line:
xsim -gui <filename.wdb> &
Generating and Opening the Waveform Reports
Follow these instructions to enable waveform data collection from the command line during hardware emulation and open the viewer:
- Enable debug code generation during compilation and linking using the -g option:
v++ -c -g -t hw_emu ...
- Create an xrt.ini file in the same directory as the host executable with the following contents (see xrt.ini File for more information):
[Debug]
profile=true
timeline_trace=true
[Emulation]
launch_waveform=batch
For the Live Waveform Viewer, launch_waveform is set as follows:
[Emulation]
launch_waveform=gui
TIP: If Live Waveform Viewer is enabled, the simulation waveform opens during the hardware emulation run.
- Run the hardware emulation build of the application as described in Running an Application. The hardware transaction data is collected in the waveform database file, <hardware_platform>-<device_id>-<xclbin_name>.wdb. Refer to Directory Structure for the location of this file.
- Open the Waveform view in the Vitis analyzer as described in Waveform View and Live Waveform Viewer.
Interpreting Data in the Waveform Views
The following image shows the Waveform view:
The Waveform and Live Waveform views are organized hierarchically for easy navigation.
- The Waveform view is based on the actual waveforms generated during hardware emulation (Kernel Trace). This allows the viewer to descend all the way down to the individual signals responsible for the abstracted data. However, because the Waveform view is generated from the post-processed data, no additional signals can be added to the report, and some of the runtime analysis cannot be visualized, such as DATAFLOW transactions.
- The Live Waveform viewer displays the live Vivado logic simulator (xsim) run, so you can add extra signals and internals of the register transfer level (RTL) design to the live view. Refer to the Vivado Design Suite User Guide: Logic Simulation (UG900) for information on working with the Waveform viewer.
The hierarchy of the Waveform and Live Waveform views include the following:
- Device "name"
- Target device name.
- Binary Container "name"
- Binary container name.
- Memory Data Transfers
- For each DDR Bank, this shows the trace of all the read and write request transactions arriving at the bank from the host.
- Kernel "name" 1:1:1
- For each kernel and for each compute
unit of that kernel, this section breaks down the
activities originating from the compute unit.
- Compute Unit: "name"
- Compute unit name.
- CU Stalls (%)
- Stall signals are provided by the Vivado HLS tool to inform you when a portion of the circuit is stalling because of external memory accesses, internal streams (that is, dataflow), or external streams (that is, OpenCL pipes). The stall bus shown in the detailed kernel trace compiles all of the lowest-level stall signals and reports the percentage that are stalling at any point in time. This provides a measure of how much of the kernel is stalling at any point in the simulation. For example, if there are 100 lowest-level stall signals and 10 are active on a given clock cycle, the CU Stall percentage is 10%. If one goes inactive, it is 9%.
- Data Transfers
- This shows the read/write data transfer accesses originating from each Master AXI port of the compute unit to the DDR.
- User Functions
- This information is available
for the HLS kernels and shows the user functions.
- Function: "name"
- Function name.
- Function: "name"
- Function name.
- Function: "name"
- Function name.
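The CU Stalls percentage described in the hierarchy above (active lowest-level stall signals divided by the total, per clock cycle) can be sketched directly; the signal values here are invented for illustration:

```python
# CU Stall (%) as described: the fraction of lowest-level stall signals
# active on a given clock cycle, expressed as a percentage.
def cu_stall_percentage(stall_signals):
    """stall_signals: list of booleans, one per lowest-level stall signal."""
    if not stall_signals:
        return 0.0
    return 100.0 * sum(stall_signals) / len(stall_signals)

# 100 stall signals, 10 of them active on this cycle -> 10%.
signals = [True] * 10 + [False] * 90
print(cu_stall_percentage(signals))  # 10.0
```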