AI Engine: Meeting the Compute Demands of Next-Generation Applications

In many dynamic and evolving markets, such as 5G cellular, data center, automotive, and industrial, applications are pushing for ever-increasing compute acceleration while remaining power efficient. With Moore's Law and Dennard Scaling no longer following their traditional trajectories, moving to the next-generation silicon node alone can no longer deliver the lower power, lower cost, and better performance that previous generations provided.

Responding to this non-linear increase in demand from next-generation applications, such as wireless beamforming and machine learning inference, AMD has developed an innovative processing technology, the AI Engine, as part of the AMD Versal™ architecture.

AI Engine Architecture

AI Engines are architected as a 2D array of AI Engine tiles, allowing a highly scalable solution across the Versal portfolio, ranging from tens to hundreds of AI Engines in a single device and servicing the compute needs of a breadth of applications. Benefits include:

Multiple Programming Options

For high-performance DSP applications, the following methods are available for coding AI Engines (for more information, please visit AMD Vitis™ AI Engine DSP Design):

  • C-Based Flow using DSP Library Functions and API coding
  • Model-Based Design (using Vitis Model Composer in MathWorks Simulink)
  • Intrinsics 
For AI/ML applications:
  • Robust libraries for AI/ML framework developers
Deterministic
  • Dedicated instruction and data memories
  • Dedicated connectivity paired with DMA engines for scheduled data movement between AI Engine tiles
Efficiency
  • For high-performance DSP applications, AI Engines can deliver dynamic power reduction and substantial resource savings vs. a traditional programmable-logic-only implementation
AI Engine Tile

Each AI Engine tile is built around a very long instruction word (VLIW), single instruction, multiple data (SIMD) vector processor optimized for machine learning and advanced signal processing applications. The AI Engine processor can run at up to 1.3 GHz, enabling very efficient, high-throughput, low-latency functions.

In addition to the VLIW vector processor, each tile contains program memory to store the necessary instructions; local data memory for storing data, weights, activations, and coefficients; a RISC scalar processor; and several modes of interconnect to handle different types of data communication.

Heterogeneous Workloads: Signal Processing and Machine Learning Inference Acceleration​

AMD offers two types of AI Engines: AIE and AIE-ML (AI Engine for machine learning), both offering significant performance improvements over previous-generation FPGAs. AIE accelerates a more balanced set of workloads, including ML inference applications and high-performance DSP workloads such as beamforming, radar, and other functions requiring massive amounts of filtering and transforms. With enhanced AI vector extensions and the introduction of shared memory tiles within the AI Engine array, AIE-ML offers superior performance over AIE for ML inference-focused applications, while AIE can offer better performance over AIE-ML for certain types of advanced signal processing.

AI Engine Tile

AIE accelerates a balanced set of workloads, including ML inference applications and advanced signal processing workloads like beamforming, radar, FFTs, and filters.

Support for many workloads/applications
  • High-performance DSP for communications, radar, test & measurement, industrial/automotive applications
  • Video and image processing
  • Machine learning inference
Native support for real, complex, and floating-point data types
  • INT8/16/32 fixed point
  • CINT16 and CINT32 complex fixed point
  • FP32 floating point
Dedicated HW features for FFT and FIR implementations
  • 128 INT8 MACs per tile

See the AMD Versal AI Engine Architecture Manual to learn more.

OPs per AIE Tile
  • INT4: 256
  • INT8: 256
  • INT16: 64
  • CINT16: 16
  • BFLOAT16*: 16
  • FP32: 16

*BFLOAT16 implemented using FP32 vector processor.
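
Note that a multiply-accumulate (MAC) is conventionally counted as two OPs (one multiply plus one add); that convention connects the MAC counts quoted in this section to the OPs tables, e.g., 128 INT8 MACs per AIE tile × 2 = 256 INT8 OPs. The same convention applies to the AIE-ML table below.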

AI Engine-ML Tile

The AI Engine-ML architecture is optimized for machine learning, enhancing both the compute core and memory architecture. Capable of both ML and advanced signal processing, these optimized tiles de-emphasize INT32 and CINT32 support (common in radar processing) in favor of ML-focused applications.

AIE-ML will be available in two versions: AIE-ML, which doubles the compute compared to AIE, and AIE-MLv2, which doubles the compute compared to AIE-ML and adds extra bandwidth between the stream interconnects.

Extended native support for ML data types
  • BFLOAT16
  • FP8 (AIE-MLv2 only)
  • FP16 (AIE-MLv2 only)
  • MX4 (AIE-MLv2 only)
  • MX6 (AIE-MLv2 only)
  • MX9 (AIE-MLv2 only)
Increased ML compute with reduced latency
  • 256 INT8 MACs/cycle per tile in AIE-ML
  • 512 INT8 MACs/cycle per tile in AIE-MLv2
Increased array memory to localize data
  • Doubled local data memory per tile (64 kB)
  • Memory tiles (512 kB) for high-bandwidth shared memory access
OPs per AIE-ML Tile
  • INT4: 1024
  • INT8: 512
  • INT16: 128
  • CINT16: 16
  • BFLOAT16: 256
  • FP32**: 42

**SW emulation for AIE-ML FP32 support.

Part of a Heterogeneous Platform

The AI Engine, along with programmable logic and a processing system, forms a tightly integrated heterogeneous architecture in Versal adaptive SoCs that can be changed at both the hardware and software levels to dynamically adapt to the needs of a wide range of applications and workloads.

Built from the ground up to be natively software programmable, the Versal architecture features a flexible, multi-terabit per-second programmable network on chip (NoC) to seamlessly integrate all components and key interfaces, making the platform available at boot and easily programmed by software developers, data scientists, and hardware developers alike.

Applications

AI Engines for Heterogeneous Workloads—Ranging from Wireless Processing to Machine Learning in the Cloud, Network, and Edge

Data Center Compute

Image and video analysis is central to the explosion of data in the data center. The convolutional neural network (CNN) nature of these workloads requires intense amounts of computation, often reaching multiple teraOPS. AI Engines have been optimized to deliver this computational density cost-effectively and power-efficiently.

AI Engine Development Flows

AI Engines are built from the ground up to be software programmable and hardware adaptable. There are three distinct design flows that let developers unleash the performance of these compute engines, with the ability to compile in minutes and rapidly explore different microarchitectures. The three design flows consist of:

  • The Vitis™ Unified IDE for C/C++ style programming, suited for software and hardware developers
  • Vitis Model Composer for a model-based design flow that operates as a plugin within MathWorks Simulink®
  • Vitis AI for an AI/ML framework-based flow, targeting AI and data scientists

AI Engine arrays can also enable the implementation of high-performance DSP functions in a resource- and power-optimized manner. Used in conjunction with FPGA fabric resources, AI Engines enable very efficient implementations of high-performance DSP applications. Learn how to use the AMD Vitis tool flow to unlock the hardware acceleration capabilities of AI Engines for DSP applications: AMD Vitis AI Engine DSP Design
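As a concrete illustration of this library-based flow, the sketch below instantiates a symmetric single-rate FIR filter from the AIE DSP library in the Vitis_Libraries repository. The class name, I/O file names, and parameter values are illustrative assumptions, and the template-parameter order and port names follow recent DSPLib examples, so they may differ between releases.

```cpp
// Sketch only: instantiating a DSPLib FIR graph on AI Engines.
// Assumes the AIE DSP library headers from Vitis_Libraries are on the include path.
#include <adf.h>
#include <vector>
#include "fir_sr_sym_graph.hpp"

using namespace adf;

static constexpr int FIR_LEN     = 15;   // filter length (symmetric)
static constexpr int SHIFT       = 15;   // output downshift
static constexpr int ROUND_MODE  = 0;    // rounding mode
static constexpr int WINDOW_SIZE = 256;  // samples processed per invocation

class FirDemoGraph : public graph {
public:
    // cint16 data with int16 coefficients; parameters as defined above.
    xf::dsp::aie::fir::sr_sym::fir_sr_sym_graph<cint16, int16, FIR_LEN, SHIFT,
                                                ROUND_MODE, WINDOW_SIZE> fir;
    input_plio  in;
    output_plio out;

    // A symmetric FIR is constructed from the first half of the tap values.
    FirDemoGraph(const std::vector<int16>& taps) : fir(taps) {
        in  = input_plio::create("FirIn",  plio_32_bits, "data/input.txt");
        out = output_plio::create("FirOut", plio_32_bits, "data/output.txt");
        connect(in.out[0], fir.in[0]);    // PLIO into the library graph
        connect(fir.out[0], out.in[0]);   // filtered samples back out
    }
};
```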

AI Engine Libraries for Software/Hardware Developers and Data Scientists

With the Vitis Acceleration library, AMD provides pre-built kernels that enable:

  • Shorter development cycles
  • Portability across AI Engine architectures—e.g., AIE to AIE-ML
  • Faster learning and adoption of AI Engine technology
  • Designers to focus on their own proprietary algorithms

Software and hardware developers directly program the vector processor-based AI Engines and can call on pre-built libraries with C/C++ code where appropriate.

AI data scientists stay in their familiar framework environments, such as PyTorch or TensorFlow, and call pre-built ML overlays by way of Vitis AI without having to directly program the AI Engines.

The libraries are open source and available on GitHub: https://github.com/Xilinx/Vitis_Libraries.

Data Flow Programming for the Software/Hardware Developer

The AI Engine architecture is based on data flow technology. Processing elements come in arrays of tens to hundreds of tiles, and an application forms a single program running across these compute units. Manually embedding directives to specify parallelism across the tiles would be tedious and nearly impossible. To overcome this difficulty, AI Engine design is performed in two stages: single-kernel development, followed by Adaptive Data Flow (ADF) graph creation, which connects multiple kernels into an overall application.

The Vitis Unified IDE provides a single cockpit that enables AI Engine kernel development in C/C++ and ADF graph design. Specifically, designers can:

  • Develop kernels in C/C++ and describe specific compute functions using Vitis libraries
  • Connect kernels via ADF graphs using Vitis AI Engine tools

A single kernel runs on a single AI Engine tile by default. However, multiple kernels can run on the same AI Engine tile, sharing the processing time where the application allows.
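
As an illustrative sketch (not AMD's reference code), a single kernel written against the AIE vector API might look like the following; the kernel name, buffer size, and data type are assumptions for the example.

```cpp
// Sketch only: a minimal AI Engine kernel using the Vitis AIE APIs.
#include <adf.h>
#include <aie_api/aie.hpp>

// Doubles 256 int32 samples per invocation, operating on 8 SIMD lanes at a time.
void scale_kernel(adf::input_buffer<int32, adf::extents<256>>& in,
                  adf::output_buffer<int32, adf::extents<256>>& out)
{
    auto inIt  = aie::begin_vector<8>(in);   // vector iterator over the input buffer
    auto outIt = aie::begin_vector<8>(out);
    for (unsigned i = 0; i < 256 / 8; ++i) {
        aie::vector<int32, 8> v = *inIt++;   // load 8 samples
        *outIt++ = aie::add(v, v);           // one SIMD add doubles all 8 lanes
    }
}
```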

A conceptual example is shown below:

  • AI Engine kernels are developed in C/C++
  • Kernels in programmable logic (PL) are written in RTL or Vitis HLS (high-level synthesis)
  • The data flow between kernels in both the PL and AI Engines is described via an ADF graph, as sketched in the code below
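
A minimal ADF graph wrapping the hypothetical scale_kernel from the previous sketch could look like this; the PLIO names and data files are placeholders, not from the source.

```cpp
// Sketch only: an ADF graph connecting PLIO to a single AI Engine kernel.
#include <adf.h>
using namespace adf;

// Kernel declared in the previous sketch (assumed to live in scale_kernel.cc).
void scale_kernel(input_buffer<int32, extents<256>>& in,
                  output_buffer<int32, extents<256>>& out);

class SimpleGraph : public graph {
public:
    kernel k;
    input_plio  dataIn;
    output_plio dataOut;

    SimpleGraph() {
        k = kernel::create(scale_kernel);
        source(k) = "scale_kernel.cc";    // file containing the kernel body
        runtime<ratio>(k) = 0.5;          // budget half a tile so another kernel can share it
        dataIn  = input_plio::create("DataIn",  plio_32_bits, "data/input.txt");
        dataOut = output_plio::create("DataOut", plio_32_bits, "data/output.txt");
        connect(dataIn.out[0], k.in[0]);  // PL/PLIO into the kernel
        connect(k.out[0], dataOut.in[0]); // and back out
    }
};
```

The runtime<ratio> constraint is what allows multiple kernels to share a single tile, as noted above.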
Integrating the AI Engine Design into a Complete System

Within the Vitis Unified IDE, the AI Engine design can be included into a larger complete system that combines all aspects of the design into an integrated flow where simulation, hardware emulation, debug, and deployment are possible.

  • Dedicated compilers target different heterogeneous engines of the Versal platform, including the processing system (Arm® subsystem), programmable logic, and both DSP and AI Engines.
  • A system compiler then links these individual blocks of code together and creates all the interconnections for optimizing the data movement between them and any custom memory hierarchies. The tool suite also integrates the x86 toolchain for PCIe®-based systems.
  • To deploy your application, Xilinx Runtime software (XRT) provides platform- and OS-independent APIs for managing device configuration, memory and host-to-device data transfers, and accelerator execution (see the sketch after this list).
  • Once you have assembled your first prototype, you can simulate your application using a fast transaction-level simulator or a cycle-accurate simulator and use a performance analyzer to optimize your application for best partitioning and performance.
  • When you are happy with the results, you can deploy on the Versal platform.
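
The deployment step can be sketched with XRT's native C++ API; in this hedged example, the .xclbin file and graph name are placeholders.

```cpp
// Sketch only: a minimal XRT host program that loads a device image and runs
// an AI Engine graph. Assumes XRT's native C++ API (xrt_coreutil).
#include <xrt/xrt_device.h>
#include <experimental/xrt_graph.h>

int main() {
    // Open the first device and load the compiled image (PL kernels + AIE graph).
    auto device = xrt::device(0);
    auto uuid   = device.load_xclbin("my_app.xclbin");  // placeholder file name

    // Attach to the ADF graph by the name given in the graph source.
    auto graph = xrt::graph(device, uuid, "mygraph");   // placeholder graph name
    graph.run(16);   // run 16 graph iterations
    graph.end();     // wait for completion and release the graph
    return 0;
}
```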

Portfolio

Versal™ AI Core Series

The AMD Versal AI Core Series delivers breakthrough AI inference and wireless acceleration with AI Engines that deliver outstanding compute performance. Featuring the highest compute in the Versal portfolio, applications for Versal AI Core adaptive SoCs include data center compute, wireless beamforming, video and image processing, and wireless test equipment.

Versal™ AI Edge Series

The AMD Versal AI Edge Series delivers high performance, low latency AI inference for intelligence in automated driving, predictive factory and healthcare systems, multi-mission payloads in aerospace and defense, and a breadth of other applications. More than just AI, the Versal AI Edge Series accelerates the whole application from sensor to AI to real-time control, all while meeting critical safety and security requirements.

Versal Premium Series

Engineered for the most demanding compute and data movement applications in wired communications, data center compute, test and measurement, and aerospace and defense, the AMD Versal Premium Series integrates AI Engines with programmable logic, DSP Engines, and hard IP blocks for Ethernet and High-Speed Crypto, delivering outstanding adaptive signal processing capacity.

Versal AI Edge Series VEK280 Evaluation Kit

The VEK280 Evaluation Kit, equipped with the Versal AI Edge VE2802 adaptive SoC, offers AIE-ML and DSP hardware acceleration engines, along with multiple high-speed connectivity options. This kit is optimized for ML inference applications in markets such as automotive, vision, aerospace and defense, industrial, scientific, and medical.

Versal AI Core Series VCK190 Evaluation Kit

The VCK190 Evaluation Kit enables designers to develop solutions using AI and DSP Engines capable of delivering over 100X greater compute performance compared to current server-class CPUs. With a breadth of connectivity options and standardized development flows, the Versal AI Core Series VC1902 device provides the Versal portfolio's highest AI inference and signal processing throughput for cloud, network, and edge applications.

Get Started

The AMD Vitis unified software platform provides comprehensive core development kits and libraries that use hardware-acceleration technology.

Visit the Vitis GitHub and AI Engine Development repositories to access a variety of AI Engine tutorials and learn more about the technology features and design methodology.

AI Engine tools, both compiler and simulator, are integrated within the Vitis IDE and require an additional dedicated license. Contact your local AMD sales representative for more information on how to access the AI Engine tools and license or visit the Contact Sales form. 

AMD Vitis Model Composer is a model-based design tool that enables rapid design exploration within the Simulink® and MATLAB® environments. It facilitates AI Engine ADF graph development and testing at the system level, allowing users to incorporate RTL and HLS blocks with AI Engine kernels and/or graphs in the same simulation. Leveraging the signal generation and visualization features within the Simulink and MATLAB tools enables DSP engineers to design and debug in a familiar environment. To learn how to use Versal AI Engines with Vitis Model Composer, visit the AI Engine resource page.

Based on the Versal AI Core Series, the VCK190 kit enables designers to develop solutions using AI Engines and DSP Engines. The evaluation kit has everything you need to jump-start your designs.

Also available is the PCIe®-based VCK5000 development card, featuring the Versal AI Core device with AI Engines, built for high-throughput AI inference in the data center.

For AIE-ML development, the VEK280 Evaluation Kit, based on the Versal AI Edge Series, enables developers to target DSP and ML applications.

AMD training and learning resources provide the practical skills and fundamental knowledge you need to be fully productive in your next Versal adaptive SoC development project.

From solution planning to system integration and validation, AMD provides tailored views of the extensive list of Versal adaptive SoC documentation to maximize the productivity of user designs. Visit the Versal adaptive SoC design process hubs to get the latest content for your design needs and explore AI Engine capabilities and design methodologies.

Resources

Stay Informed

Join the adaptive SoC and FPGA notification list to receive the latest news and updates.