Estimating Performance

Profiling and Instrumenting Code to Measure Performance

The first major task in profiling and instrumenting code is to identify the portions of application code that are suitable for implementation in hardware and that significantly improve overall performance when run in hardware. Compute-intensive regions of code are good candidates for hardware acceleration, especially when it is possible to stream data between hardware, the CPU, and memory to overlap the computation with the communication. Conversely, a function is a poor candidate for acceleration if it takes more time to transfer data to and from the accelerator than to compute the result. Software profiling is the standard way to identify the most CPU-intensive portions of your program. The SDSoC™ environment includes all of the performance and profiling capabilities of the Xilinx® SDK tool, including gprof, the non-intrusive Target Communication Framework (TCF) profiler, and the Performance Analysis perspective within Eclipse.

To run the TCF Profiler for a standalone application, use the following steps:

  1. Set the active build configuration to Debug by right-clicking the project in the Project Explorer and selecting Build Configurations > Set Active > Debug.
  2. Launch the debugger by right-clicking the project name in the Project Explorer and selecting Debug As > Launch on hardware (SDx Application Debugger).
    Note: The board must be connected to your computer and powered on. The application automatically breaks at the entry to main().
  3. Launch the TCF Profiler by selecting Window > Show View > Other. In the window that opens, expand Debug, and select TCF Profiler.
  4. To start the TCF Profiler, click the green Start button at the top of the TCF Profiler tab.
  5. Enable Aggregate per function in the Profiler Configuration dialog box.
  6. To start the profiling, click the Resume button or press F8. The program runs to completion and breaks at the exit() function.
  7. View the results in the TCF Profiler tab.

Profiling provides a statistical method for finding highly used regions of code based on sampling the CPU program counter and correlating to the program in execution. Another way to measure program performance is to instrument the application to determine the actual duration between different parts of a program in execution.

Using the TCF Profiler provides more in-depth information related to either a standalone or a Linux OS application. As seen in the previous steps, no additional compilation flags were needed to use the Profiler.

Note: This type of profiling for hardware requires a JTAG connection.

The sds_lib library included in the SDSoC environment provides a simple, source code annotation-based, time-stamping API that can be used to measure application performance, as shown in the following example:

/**
 * @return Value of the free-running 64-bit Zynq(TM) global counter.
 */
unsigned long long sds_clock_counter(void);

Using this API to collect timestamps and the differences between them, you can determine the duration of key parts of your program. For example, you can measure data transfer time or the overall round-trip execution time for hardware functions, as shown in the following code snippet:

class perf_counter
{
public:
     uint64_t tot, cnt, calls;
     perf_counter() : tot(0), cnt(0), calls(0) {};
     inline void reset() { tot = cnt = calls = 0; }
     inline void start() { cnt = sds_clock_counter(); calls++; };
     inline void stop() { tot += (sds_clock_counter() - cnt); };
     inline uint64_t avg_cpu_cycles() { return (tot / calls); };
};

extern void f();
void measure_f_runtime()
{
     perf_counter f_ctr;
     f_ctr.start();
     f();
     f_ctr.stop();
     std::cout << "Cpu cycles f(): " << f_ctr.avg_cpu_cycles()
     	       << std::endl;
}
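
On a development host without the sds_lib runtime, the same pattern can be exercised by substituting a stand-in for sds_clock_counter() based on std::chrono. The following sketch is for illustration only: sds_clock_counter_host and measure_avg_ticks are hypothetical names, not part of the SDSoC API, and the counter here returns nanoseconds rather than Zynq global-counter ticks.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical host-side stand-in for sds_clock_counter(): a free-running
// nanosecond count from std::chrono instead of Zynq global-counter ticks.
static uint64_t sds_clock_counter_host()
{
    using namespace std::chrono;
    return (uint64_t)duration_cast<nanoseconds>(
        steady_clock::now().time_since_epoch()).count();
}

// Same structure as the perf_counter class from the text, retargeted to
// the host counter, with a guard against dividing by zero calls.
class perf_counter
{
public:
    uint64_t tot, cnt, calls;
    perf_counter() : tot(0), cnt(0), calls(0) {}
    void reset() { tot = cnt = calls = 0; }
    void start() { cnt = sds_clock_counter_host(); calls++; }
    void stop()  { tot += sds_clock_counter_host() - cnt; }
    uint64_t avg_ticks() { return calls ? tot / calls : 0; }
};

// Measure the average cost of 'iters' calls to a work function; on the
// target, fn would be the caller of a hardware function.
template <typename Fn>
uint64_t measure_avg_ticks(Fn fn, int iters)
{
    perf_counter ctr;
    for (int i = 0; i < iters; i++) {
        ctr.start();
        fn();
        ctr.stop();
    }
    return ctr.avg_ticks();
}
```

Averaging over several calls, as measure_avg_ticks does, smooths out cache and OS scheduling noise that a single timed call would pick up.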

The performance estimation feature within the SDSoC environment employs this API by automatically instrumenting functions selected for hardware implementation, measuring actual runtimes by running the application on the target, and then comparing actual times with estimated times for the hardware functions.

Note: While off-loading CPU-intensive functions is one of the most reliable heuristics to partition your application, it is not guaranteed to improve system performance without algorithmic modification to optimize memory accesses. A CPU almost always has much faster random access to external memory than you can achieve from programmable logic, due to multi-level caching and a faster clock speed (typically 2x to 8x faster than programmable logic). Extensive manipulation of pointer variables over a large address range, for example, a sort routine that sorts indices over a large index set, while very well-suited for a CPU, could become a liability when moving a function into programmable logic. This does not mean that such compute functions are not good candidates for hardware, only that code or algorithm restructuring could be required. This is a known issue for DSP and GPU coprocessors.

SDSCC/SDS++ Performance Estimation Flow Options

A full bitstream compile can take much more time than a software compile, so the sds++/sdscc (referred to as sds++) compilers provide performance estimation options to compute the estimated runtime improvement for a set of hardware function calls.

In the Application Project Settings pane, to invoke the estimator, select the Estimate Performance check box. This enables performance estimation for the current build configuration and builds the project.

Figure: Setting Estimate Performance in Application Project Settings

Estimating the speed-up is a two-phase process:

  1. The SDSoC environment compiles the hardware functions and generates the system. Instead of synthesizing the system to a bitstream, sds++ computes a performance estimate based on estimated latencies for the hardware functions and data transfer time estimates for the callers of hardware functions.
  2. In the generated Performance Report, select Click Here to run an instrumented version of the software on the target; this establishes the performance baseline that is compared against the performance estimate.

See the SDSoC Environment Getting Started Tutorial (UG1028) for a tutorial on how to use the Performance Report.

You can also generate a performance estimate from the command line. As a first pass to gather data about software runtime, use the -perf-funcs option to specify functions to profile and -perf-root to specify the root function encompassing calls to the profiled functions.

The sds++ system compiler then automatically instruments these functions to collect runtime data when the application is run on a board. When you run an instrumented application on the target, the program creates a file on the SD card called swdata.xml, which contains the runtime performance data for the run.

Copy swdata.xml to the host, and run a build that estimates the performance gain on a per hardware function caller basis and for the top-level function specified by the -perf-root function in the first-pass run. Use the -perf-est option to specify swdata.xml as input data for this build.
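
Under stated assumptions, the two-pass flow described above might look like the following sketch. The source file names (main.cpp, mmult.cpp), the profiled function name (mmult), the platform passed to -sds-pf (zc702), and the output names are all illustrative placeholders; only the -perf-funcs, -perf-root, and -perf-est options come from this document.

  # Pass 1: build an instrumented application that collects software runtimes
  sds++ -perf-funcs mmult -perf-root main \
        -sds-pf zc702 main.cpp mmult.cpp -o instrumented.elf

  # Run instrumented.elf on the board from the SD card; the run writes
  # swdata.xml to the card. Copy swdata.xml back to the host.

  # Pass 2: rebuild with the collected data to produce the performance estimate
  sds++ -perf-est swdata.xml \
        -sds-pf zc702 main.cpp mmult.cpp -o estimate.elf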

The following table specifies the sds++ system compiler options normally used to build an application.

Table 1. Commonly used sds++ options

-perf-funcs function_name_list
    Specifies a comma-separated list of all functions to be profiled in the instrumented software application.
-perf-root function_name
    Specifies the root function encompassing all calls to the profiled functions. The default is the function main.
-perf-est data_file
    Specifies the file containing runtime data generated by the instrumented software application when run on the target, and estimates performance gains for the hardware-accelerated functions. The default name for this file is swdata.xml.
-perf-est-hw-only
    Runs the estimation flow without first running the instrumented pass to collect software run data. This option provides hardware latency and resource estimates without a comparison against the software baseline.

Note: After running the sd_card image on the board to collect profile data, type cd /; sync; umount /mnt;. This ensures that the swdata.xml file is written out to the SD card.