Many of today's workloads and applications, such as AI, data analytics, live video transcoding, and genomic analytics, require an increasing amount of memory bandwidth. Traditional DDR memory solutions have not been able to keep up, and for these compute- and memory-bandwidth-intensive workloads, data movement and memory access are becoming the bottlenecks. This figure shows compute capacity growth vs. traditional DDR bandwidth growth.
High-bandwidth memory (HBM) helps alleviate this bottleneck by providing greater storage capacity and data bandwidth, using system-in-package (SiP) memory technology to stack DRAM dies vertically and connect them through a wide (1024-bit) interface.
The Virtex UltraScale+ HBM-enabled devices (VU+ HBM) close the bandwidth gap with greatly improved bandwidth capabilities of up to 460 GB/s delivered by two HBM2 stacks. These devices also include up to 2.85 million logic cells and up to 9,024 DSP slices capable of delivering 28.1 peak INT8 TOPs. For more details on how Xilinx's VU+ HBM devices are accelerating applications, refer to WP508.
The purpose of this article is to discuss which design aspects can negatively impact memory bandwidth and what options are available to improve it, and then to walk through one way to profile the HBM bandwidth and illustrate the trade-offs. These same techniques can be used to profile HBM bandwidth on the Alveo U280, the VCU128, and any Xilinx UltraScale+ HBM device. They can also be used on any accelerated application built on a pre-existing DSA or a custom DSA. We’ll explain the process for creating a custom DSA in Vivado and how to use the Xilinx® Vitis™ unified software platform to create C/C++ kernels and memory traffic to profile the HBM stacks.
Before discussing what impacts memory bandwidth, let's explain how bandwidth is calculated. Using VU+ HBM as an example, with two HBM2 stacks available, these devices can provide a theoretical bandwidth of up to 460 GB/s:
2 HBM2 stacks
Each stack has 16 channels
Each channel is 64 data (DQ) bits wide
Data can be transferred at up to 1,800 Mb/s per pin

Theoretical bandwidth = 2 x 16 x 64 x 1,800 Mb/s = 3.686 Tb/s, or 460 GB/s
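As a quick sanity check, the same arithmetic fits in a few lines of C. This is just an illustrative sketch: the constants are the VU+ HBM figures above, and measured_gbps is a placeholder you would replace with your own profiled number to compute efficiency.

```c
#include <stdio.h>

int main(void)
{
    /* VU+ HBM figures from the calculation above */
    const double stacks    = 2.0;     /* HBM2 stacks                */
    const double channels  = 16.0;    /* pseudo channels per stack  */
    const double dq_bits   = 64.0;    /* data (DQ) bits per channel */
    const double rate_mbps = 1800.0;  /* Mb/s per DQ pin            */

    double tbps = stacks * channels * dq_bits * rate_mbps / 1.0e6;  /* Tb/s */
    double gBps = tbps * 1.0e12 / 8.0 / 1.0e9;                      /* GB/s */

    /* Placeholder measured value -- substitute your own result */
    double measured_gbps = 400.0;

    printf("Theoretical: %.3f Tb/s (%.1f GB/s)\n", tbps, gBps);
    printf("Efficiency : %.1f %%\n", 100.0 * measured_gbps / gBps);
    return 0;
}
```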
Anyone who has worked with external DRAM interfaces knows achieving theoretical bandwidth is not possible. In fact, depending on several different factors, it can be difficult to even come close. Here are several of the top contributing factors that can negatively impact your effective bandwidth.
In VU+ HBM, there is a hardened AXI Switch which enables access from any of the 32 AXI channels to any of the HBM pseudo channels and addressable memory.
There are many advantages to having a hardened switch, such as flexible addressing and a reduction in design complexity and routing congestion; WP485 does a good job of highlighting many of these advantages if you're interested. To enable flexible addressing across the entire HBM stacks, the hardened AXI switch is built from switch boxes, each spanning 4 masters x 4 slaves.
This facilitates the flexible addressing, but there is a limitation that can impact memory bandwidth: only four horizontal paths are available between switch boxes. Depending on which AXI channel is accessing which addressable memory location in the HBM stack, arbitration for these shared paths can greatly limit your achievable bandwidth.
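One practical consequence is that keeping each AXI master's traffic within the pseudo channel directly beneath it avoids contending for those shared horizontal paths. Below is a minimal sketch of the address math, assuming an 8 GB (2 x 4 GB) configuration where each of the 32 pseudo channels covers 256 MB, and a hypothetical HBM_BASE for wherever the HBM is mapped in your design's address space.

```c
#include <stdint.h>

/* Assumptions for illustration only: 8 GB total, 32 pseudo channels of
 * 256 MB each, HBM mapped at HBM_BASE in the design's AXI address map. */
#define HBM_BASE      0x0000000000000000ULL   /* hypothetical base address */
#define PC_SIZE_BYTES (256ULL * 1024 * 1024)  /* 256 MB per pseudo channel */

/* Base address of pseudo channel 'pc' (0..31). Keeping AXI master i's
 * traffic inside pseudo channel i means its requests never arbitrate
 * for the switch's shared horizontal paths.                            */
static inline uint64_t hbm_pc_base(unsigned pc)
{
    return HBM_BASE + (uint64_t)pc * PC_SIZE_BYTES;
}
```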
Now that we know some of the contributing factors to poor memory bandwidth, let's discuss some options available to mitigate them.
Consider changing your command and addressing patterns. Since random accesses and short bursts of read/write transactions result in the worst bandwidth, see if you can alter them in the user application. This will get you the biggest bang for your buck.
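To make the contrast concrete, here is an illustrative sketch (not tied to any particular device or driver); the buffer size is arbitrary and the two routines simply represent the two extremes of access behavior.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define WORDS (1u << 20)   /* arbitrary test size: 1M words */

/* Worst case for DRAM: short, scattered accesses that keep opening new
 * rows, so the activate/precharge overhead is never amortized.         */
static void random_word_accesses(volatile uint32_t *hbm)
{
    for (uint32_t i = 0; i < WORDS; i++) {
        uint32_t idx = (uint32_t)rand() % WORDS;  /* random offset      */
        hbm[idx] = idx;                           /* single-word access */
    }
}

/* Much friendlier: one long, sequential transfer that the memory system
 * can coalesce into long bursts that mostly hit open rows.              */
static void sequential_block_copy(uint32_t *hbm, const uint32_t *src)
{
    memcpy(hbm, src, WORDS * sizeof(uint32_t));
}
```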
If you’re unable to change your traffic pattern, the HBM memory controller IP has several options available that may help; see PG276 for the details of each.
The figures below are taken from our VCU128 HBM Performance and Latency demo and attempt to highlight the bandwidth/throughput results from several different AXI Switch configurations.
New to Vivado is the HBM monitor which, similar to SysMon, can display the die temperature of each HBM2 stack individually. It can also display the bandwidth on a per memory controller (MC) or pseudo channel (PC) basis.
For this test, read-only traffic is sent across all MCs. Only MC0 was added to the HBM monitor, and it reports a read bandwidth of 26.92 GB/s. With a theoretical bandwidth of 30 GB/s, this is around 90% efficiency.
To profile your hardware design and HBM configuration properly, start with the default HBM settings and capture the read/write throughput as your baseline. Then regenerate new .bit files using each of the HBM MC options discussed earlier, and combinations of them, to determine which provides the highest throughput. Note that how the AXI switch is configured can also impact HBM bandwidth and throughput, so it should be considered when profiling as well.
A future update to this article will provide profiling results from using various MC options. We will also explore using the AXI Performance Monitors for profiling bandwidth to the HBM AXI channels.
If you’re using a pre-existing design and the Vitis tool, you will need to modify the hardware platform design using a custom DSA flow. This flow will be described later in the article.
To profile the HBM bandwidth, create a new design or use an existing design or application. To profile different HBM configurations, you will need access to the hardware design so you can modify the HBM IP core and then generate new bitstreams and new .xsa/.dsa files, which are used in the Vitis tool for software development.
What is Vitis technology, you ask? Vitis is a unified software tool that provides a framework for developing and delivering FPGA-accelerated data center applications using standard programming languages, and for creating software platforms targeting embedded processors.
For existing designs, refer to GitHub, the SDAccel example repositories, the U280 product page, and the VCU128 product page, which contain targeted reference designs (TRDs). If you are targeting a custom platform, or even the U280 or VCU128, and need to create a custom hardware platform design, this can also be done.
Why do I need to create a custom hardware platform for the Alveo U280 if DSAs already exist? As workload algorithms evolve, reconfigurable hardware enables Alveo to adapt faster than fixed-function accelerator card product cycles. Having the flexibility to customize and reconfigure the hardware gives Alveo a unique advantage over the competition. In the context of this tutorial, we want to customize and generate several new hardware platforms using different HBM IP core configurations to profile their impact on memory bandwidth and determine which provides the best results.
There are several ways to build a custom hardware platform, but the quickest is to use Vivado IP Integrator (IPI). I’ll walk you through one way to do this using MicroBlaze to generate the HBM memory traffic in software. This could also be done in HLS, SDAccel, or the Vitis tool with hardware-accelerated memory traffic. Using MicroBlaze as the traffic generator makes it easy to control the traffic pattern, including memory address locations, and we can use a default memory test template to modify and create loops and various patterns to help profile the HBM bandwidth effectively.
The steps to build a design in the Vitis tool or SDK are similar and will include something like this:
Add MicroBlaze, UART and any additional peripheral IP needed
8. Select workspace
9. Create new application project and Board Support Package
10. Click Next, Select Create from hardware, click “+” and point to .xsa
11. Click Next, select CPU MicroBlaze, Language C
12. Click Next, select “Memory Tests” and click Finish
13. Build and run memory test on target
The memory test template is a good starting point for generating traffic, as it will run through all AXI channels enabled in your design, and the HBM memory range and traffic patterns can be easily modified.
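As a rough illustration of that kind of modification, the sketch below sweeps a set of pseudo channels with a simple sequential write/read-back loop while you watch the bandwidth in the HBM monitor (or an AXI performance monitor). HBM_BASE, PC_SIZE, and NUM_PC are assumptions for a hypothetical address map; a 32-bit MicroBlaze can only see part of the 8 GB space at once, so only the pseudo channels mapped into its address space are swept. Xil_Out32/Xil_In32 and xil_printf come from the standalone BSP.

```c
#include "xil_io.h"      /* Xil_Out32 / Xil_In32 */
#include "xil_printf.h"

#define HBM_BASE   0x00000000U             /* hypothetical base address         */
#define PC_SIZE    (256U * 1024U * 1024U)  /* assumed 256 MB per pseudo channel */
#define NUM_PC     8U                      /* PCs mapped into the 32-bit space  */
#define TEST_WORDS 4096U                   /* words tested per pseudo channel   */

int main(void)
{
    for (u32 pc = 0; pc < NUM_PC; pc++) {
        u32 base   = HBM_BASE + pc * PC_SIZE;
        u32 errors = 0;

        /* Sequential writes followed by sequential reads: the most
         * burst-friendly pattern. Swap in strides or random offsets
         * here to profile other traffic patterns.                   */
        for (u32 i = 0; i < TEST_WORDS; i++)
            Xil_Out32(base + i * 4, i ^ 0xA5A5A5A5U);

        for (u32 i = 0; i < TEST_WORDS; i++)
            if (Xil_In32(base + i * 4) != (i ^ 0xA5A5A5A5U))
                errors++;

        xil_printf("PC %d: %d errors\r\n", (int)pc, (int)errors);
    }
    return 0;
}
```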
Note: a future update to this article will include a reference design that can be used for HBM profiling.
This article has explained why HBM is needed to keep up with growing memory bandwidth demands that traditional DDR cannot meet, and hopefully it has educated you on what can impact DRAM bandwidth, the options available to maximize your bandwidth, and how to monitor and profile your results.
Using Vitis technology to generate and accelerate HBM traffic is a quick and easy way to verify that your bandwidth requirements are met and to profile various HBM configurations to determine which is optimal for your system.
Stay tuned to this article for future updates, including reference designs, software-accelerated traffic, custom hardware DSAs to profile different MC options, and bandwidth results profiled in the HBM monitor.
Citations
UG1352 – Get Moving with Alveo
WP485 – Virtex UltraScale+ HBM FPGA: A Revolutionary Increase in Memory Performance
PG276 – AXI High Bandwidth Memory Controller v1.0
Chris Riley is a FAE based in Colorado with particular expertise in all things memory-related. He has spent his entire career at AMD troubleshooting technical issues for customers and still enjoys it (imagine that!). In his spare time, he enjoys spending time and traveling with his wife and two young kids. He is a ski bum at heart, and can spend hours talking shop and all things ski. He has also become obsessed with mountain biking which occupies any remaining free time when there’s no snow on the ground.