AI Engine: Meeting the Compute Demands of Next-Generation Applications

In many dynamic and evolving markets, such as 5G cellular, data center, automotive, and industrial, applications are pushing for ever-increasing compute acceleration while remaining power efficient. With Moore's Law and Dennard Scaling no longer following their traditional trajectories, moving to the next-generation silicon node alone can no longer deliver the lower power, lower cost, and better performance that previous generations provided.

Responding to this non-linear increase in demand from next-generation applications, such as wireless beamforming and machine learning inference, AMD has developed an innovative processing technology, the AI Engine, as part of the AMD Versal™ architecture.

AI Engine Architecture

AI Engines are architected as a 2D array of AI Engine tiles, allowing a highly scalable solution across the Versal portfolio, ranging from tens to hundreds of AI Engines in a single device and servicing the compute needs of a breadth of applications. Benefits include:

Multiple Programming Options

For high-performance DSP applications, the following methods are available for coding AI Engines (for more information, please visit AMD Vitis™ AI Engine DSP Design):

  • C-Based Flow using DSP Library Functions and API coding
  • Model-Based Design (using Vitis Model Composer in MathWorks Simulink)
  • Intrinsics 
For AI/ML applications:
  • Robust libraries for AI/ML framework developers
Deterministic
  • Dedicated instruction and data memories
  • Dedicated connectivity paired with DMA engines for scheduled data movement between AI Engine tiles
Efficiency
  • For high-performance DSP applications, AI Engines can deliver dynamic power reduction and substantial resource savings vs. a traditional programmable-logic-only implementation
AI Engine Tile

Each AI Engine tile is built around a very long instruction word (VLIW), single instruction, multiple data (SIMD) vector processor optimized for machine learning and advanced signal processing applications. The AI Engine processor can run at up to 1.3 GHz, enabling very efficient, high-throughput, low-latency functions.

In addition to the VLIW vector processor, each tile contains program memory to store the necessary instructions; local data memory for storing data, weights, activations, and coefficients; a RISC scalar processor; and several modes of interconnect to handle different types of data communication.

Heterogeneous Workloads: Signal Processing and Machine Learning Inference Acceleration​

AMD offers two types of AI Engines: AIE and AIE-ML (AI Engine for machine learning), both offering significant performance improvements over previous-generation FPGAs. AIE accelerates a more balanced set of workloads, including ML inference applications and high-performance DSP workloads such as beamforming, radar, and other functions requiring massive amounts of filtering and transforms. With enhanced AI vector extensions and the introduction of shared memory tiles within the AI Engine array, AIE-ML offers superior performance over AIE for ML inference-focused applications, while AIE can offer better performance over AIE-ML for certain types of advanced signal processing.

AI Engine Tile

AIE accelerates a balanced set of workloads, including ML inference applications and advanced signal processing workloads like beamforming, radar, FFTs, and filters.

Support for many workloads/applications
  • High-performance DSP for communications, radar, test & measurement, industrial/automotive applications
  • Video and image processing
  • Machine learning inference
Native support for real, complex, and floating-point data types
  • INT8/16/32 fixed point
  • CINT16 and CINT32 complex fixed point
  • FP32 floating point
Dedicated HW features for FFT and FIR implementations
  • 128 INT8 MACs per tile

See the AMD Versal AI Engine Architecture Manual to learn more.

OPs per AIE Tile
  • INT4: 256
  • INT8: 256
  • INT16: 64
  • CINT16: 16
  • BFLOAT16*: 16
  • FP32: 16

*BFLOAT16 implemented using FP32 vector processor.
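
Note that a multiply-accumulate (MAC) is conventionally counted as two OPs (one multiply plus one add); that convention connects the MAC counts quoted in this section to the OPs tables, e.g., 128 INT8 MACs per AIE tile × 2 = 256 INT8 OPs. The same convention applies to the AIE-ML table below.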

AI Engine-ML Tile

The AI Engine-ML architecture is optimized for machine learning, enhancing both the compute core and memory architecture. Capable of both ML and advanced signal processing, these optimized tiles de-emphasize INT32 and CINT32 support (common in radar processing) in favor of ML-focused applications.

AIE-ML will be available in two versions: AIE-ML, which doubles the compute compared to AIE, and AIE-MLv2, which doubles the compute compared to AIE-ML and adds extra bandwidth between the stream interconnects.

Extended native support for ML data types
  • BFLOAT16
  • FP8 (AIE-MLv2 only)
  • FP16 (AIE-MLv2 only)
  • MX4 (AIE-MLv2 only)
  • MX6 (AIE-MLv2 only)
  • MX9 (AIE-MLv2 only)
Increased ML compute with reduced latency
  • 256 INT8 MACs/cycle per tile in AIE-ML
  • 512 INT8 MACs/cycle per tile in AIE-MLv2
Increased array memory to localize data
  • Doubled local data memory per tile (64 kB)
  • Memory tiles (512 kB) for high-bandwidth shared memory access
OPs per AIE-ML Tile
  • INT4: 1024
  • INT8: 512
  • INT16: 128
  • CINT16: 16
  • BFLOAT16: 256
  • FP32**: 42

**SW emulation for AIE-ML FP32 support.

Part of a Heterogeneous Platform

The AI Engine, along with programmable logic and a processing system, forms a tightly integrated heterogeneous architecture in Versal adaptive SoCs that can be changed at both the hardware and software levels to dynamically adapt to the needs of a wide range of applications and workloads.

Built from the ground up to be natively software programmable, the Versal architecture features a flexible, multi-terabit per-second programmable network on chip (NoC) to seamlessly integrate all components and key interfaces, making the platform available at boot and easily programmed by software developers, data scientists, and hardware developers alike.

Applications

AI Engines for Heterogeneous Workloads—Ranging from Wireless Processing to Machine Learning in the Cloud, Network, and Edge

Data Center Compute

Image and video analysis is central to the explosion of data in the data center. The convolutional neural network (CNN) nature of these workloads requires intense amounts of computation, often reaching multiple teraOPS. AI Engines have been optimized to deliver this computational density cost-effectively and power-efficiently.

AI Engine Development Flows

AI Engines are built from the ground up to be software programmable and hardware adaptable. There are three distinct design flows that let developers unleash the performance of these compute engines, with the ability to compile in minutes and rapidly explore different microarchitectures. The three design flows consist of:

  • The Vitis™ Unified IDE for C/C++ style programming, suited for software and hardware developers
  • Vitis Model Composer for a model-based design flow that operates as a plugin within MathWorks Simulink®
  • Vitis AI for an AI/ML framework-based flow, targeting AI and data scientists

AI Engine arrays can also enable the implementation of high-performance DSP functions in a resource- and power-optimized manner. Used in conjunction with FPGA fabric resources, AI Engines enable very efficient implementations of high-performance DSP applications. Learn how to use the AMD Vitis tool flow to unlock the hardware acceleration capabilities of AI Engines for DSP applications: AMD Vitis AI Engine DSP Design
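As a concrete illustration of this library-based flow, the sketch below instantiates a symmetric single-rate FIR filter from the AIE DSP library in the Vitis_Libraries repository. The class name, I/O file names, and parameter values are illustrative assumptions, and the template-parameter order and port names follow recent DSPLib examples, so they may differ between releases.

```cpp
// Sketch only: instantiating a DSPLib FIR graph on AI Engines.
// Assumes the AIE DSP library headers from Vitis_Libraries are on the include path.
#include <adf.h>
#include <vector>
#include "fir_sr_sym_graph.hpp"

using namespace adf;

static constexpr int FIR_LEN     = 15;   // filter length (symmetric)
static constexpr int SHIFT       = 15;   // output downshift
static constexpr int ROUND_MODE  = 0;    // rounding mode
static constexpr int WINDOW_SIZE = 256;  // samples processed per invocation

class FirDemoGraph : public graph {
public:
    // cint16 data with int16 coefficients; parameters as defined above.
    xf::dsp::aie::fir::sr_sym::fir_sr_sym_graph<cint16, int16, FIR_LEN, SHIFT,
                                                ROUND_MODE, WINDOW_SIZE> fir;
    input_plio  in;
    output_plio out;

    // A symmetric FIR is constructed from the first half of the tap values.
    FirDemoGraph(const std::vector<int16>& taps) : fir(taps) {
        in  = input_plio::create("FirIn",  plio_32_bits, "data/input.txt");
        out = output_plio::create("FirOut", plio_32_bits, "data/output.txt");
        connect(in.out[0], fir.in[0]);    // PLIO into the library graph
        connect(fir.out[0], out.in[0]);   // filtered samples back out
    }
};
```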

AI Engine Libraries for Software/Hardware Developers and Data Scientists

With the Vitis Acceleration library, AMD provides pre-built kernels that enable:

  • Shorter development cycles
  • Portability across AI Engine architectures—e.g., AIE to AIE-ML
  • Faster learning and adoption of AI Engine technology
  • Designers to focus on their own proprietary algorithms

Software and hardware developers directly program the vector processor-based AI Engines and can call on pre-built libraries with C/C++ code where appropriate.

AI data scientists stay in their familiar framework environments, such as PyTorch or TensorFlow, and call pre-built ML overlays by way of Vitis AI without having to directly program the AI Engines.

The libraries are open source and available on GitHub: https://github.com/Xilinx/Vitis_Libraries.

Data Flow Programming for the Software/Hardware Developer

The AI Engine architecture is based on data flow technology. Processing elements come in arrays of tens to hundreds of tiles, and an application forms a single program running across these compute units. Manually embedding directives to specify parallelism across the tiles would be tedious and nearly impossible. To overcome this difficulty, AI Engine design is performed in two stages: single-kernel development, followed by Adaptive Data Flow (ADF) graph creation, which connects multiple kernels into an overall application.

The Vitis Unified IDE provides a single cockpit that enables AI Engine kernel development in C/C++ and ADF graph design. Specifically, designers can:

  • Develop kernels in C/C++ and describe specific compute functions using Vitis libraries
  • Connect kernels via ADF graphs using Vitis AI Engine tools

A single kernel runs on a single AI Engine tile by default. However, multiple kernels can run on the same AI Engine tile, sharing the processing time where the application allows.
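
As an illustrative sketch (not AMD's reference code), a single kernel written against the AIE vector API might look like the following; the kernel name, buffer size, and data type are assumptions for the example.

```cpp
// Sketch only: a minimal AI Engine kernel using the Vitis AIE APIs.
#include <adf.h>
#include <aie_api/aie.hpp>

// Doubles 256 int32 samples per invocation, operating on 8 SIMD lanes at a time.
void scale_kernel(adf::input_buffer<int32, adf::extents<256>>& in,
                  adf::output_buffer<int32, adf::extents<256>>& out)
{
    auto inIt  = aie::begin_vector<8>(in);   // vector iterator over the input buffer
    auto outIt = aie::begin_vector<8>(out);
    for (unsigned i = 0; i < 256 / 8; ++i) {
        aie::vector<int32, 8> v = *inIt++;   // load 8 samples
        *outIt++ = aie::add(v, v);           // one SIMD add doubles all 8 lanes
    }
}
```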

A conceptual example is shown below:

  • AI Engine kernels are developed in C/C++
  • Kernels in programmable logic (PL) are written in RTL or Vitis HLS (high-level synthesis)
  • The data flow between kernels in both the PL and AI Engines is described via an ADF graph, as sketched in the code below
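
A minimal ADF graph wrapping the hypothetical scale_kernel from the previous sketch could look like this; the PLIO names and data files are placeholders, not from the source.

```cpp
// Sketch only: an ADF graph connecting PLIO to a single AI Engine kernel.
#include <adf.h>
using namespace adf;

// Kernel declared in the previous sketch (assumed to live in scale_kernel.cc).
void scale_kernel(input_buffer<int32, extents<256>>& in,
                  output_buffer<int32, extents<256>>& out);

class SimpleGraph : public graph {
public:
    kernel k;
    input_plio  dataIn;
    output_plio dataOut;

    SimpleGraph() {
        k = kernel::create(scale_kernel);
        source(k) = "scale_kernel.cc";    // file containing the kernel body
        runtime<ratio>(k) = 0.5;          // budget half a tile so another kernel can share it
        dataIn  = input_plio::create("DataIn",  plio_32_bits, "data/input.txt");
        dataOut = output_plio::create("DataOut", plio_32_bits, "data/output.txt");
        connect(dataIn.out[0], k.in[0]);  // PL/PLIO into the kernel
        connect(k.out[0], dataOut.in[0]); // and back out
    }
};
```

The runtime<ratio> constraint is what allows multiple kernels to share a single tile, as noted above.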
Integrating the AI Engine Design into a Complete System

Within the Vitis Unified IDE, the AI Engine design can be included into a larger complete system that combines all aspects of the design into an integrated flow where simulation, hardware emulation, debug, and deployment are possible.

  • Dedicated compilers target different heterogeneous engines of the Versal platform, including the processing system (Arm® subsystem), programmable logic, and both DSP and AI Engines.
  • A system compiler then links these individual blocks of code together and creates all the interconnections for optimizing the data movement between them and any custom memory hierarchies. The tool suite also integrates the x86 toolchain for PCIe®-based systems.
  • To deploy your application, Xilinx Runtime software (XRT) provides platform- and OS-independent APIs for managing device configuration, memory and host-to-device data transfers, and accelerator execution (see the sketch after this list).
  • Once you have assembled your first prototype, you can simulate your application using a fast transaction-level simulator or a cycle-accurate simulator and use a performance analyzer to optimize your application for best partitioning and performance.
  • When you are happy with the results, you can deploy on the Versal platform.
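
The deployment step can be sketched with XRT's native C++ API; in this hedged example, the .xclbin file and graph name are placeholders.

```cpp
// Sketch only: a minimal XRT host program that loads a device image and runs
// an AI Engine graph. Assumes XRT's native C++ API (xrt_coreutil).
#include <xrt/xrt_device.h>
#include <experimental/xrt_graph.h>

int main() {
    // Open the first device and load the compiled image (PL kernels + AIE graph).
    auto device = xrt::device(0);
    auto uuid   = device.load_xclbin("my_app.xclbin");  // placeholder file name

    // Attach to the ADF graph by the name given in the graph source.
    auto graph = xrt::graph(device, uuid, "mygraph");   // placeholder graph name
    graph.run(16);   // run 16 graph iterations
    graph.end();     // wait for completion and release the graph
    return 0;
}
```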

Portfolio

Versal™ AI Core Series

The AMD Versal AI Core Series delivers breakthrough AI inference and wireless acceleration with AI Engines that deliver outstanding compute performance. Featuring the highest compute in the Versal portfolio, applications for Versal AI Core adaptive SoCs include data center compute, wireless beamforming, video and image processing, and wireless test equipment.

Versal™ AI Edge Series

The AMD Versal AI Edge Series delivers high performance, low latency AI inference for intelligence in automated driving, predictive factory and healthcare systems, multi-mission payloads in aerospace and defense, and a breadth of other applications. More than just AI, the Versal AI Edge Series accelerates the whole application from sensor to AI to real-time control, all while meeting critical safety and security requirements.

Versal Premium Series

Engineered for the most demanding compute and data movement applications in wired communications, data center compute, test and measurement, and aerospace and defense, the AMD Versal Premium Series integrates AI Engines with programmable logic, DSP Engines, and hard IP blocks for Ethernet and High-Speed Crypto, delivering outstanding adaptive signal processing capacity.

Versal AI Edge Series VEK280 Evaluation Kit

The VEK280 Evaluation Kit, equipped with the Versal AI Edge VE2802 adaptive SoC, offers AIE-ML and DSP hardware acceleration engines, along with multiple high-speed connectivity options. This kit is optimized for ML inference applications in markets such as automotive, vision, aerospace and defense, industrial, scientific, and medical.

Versal AI Core Series VCK190 Evaluation Kit

The VCK190 Evaluation Kit enables designers to develop solutions using AI and DSP Engines capable of delivering over 100X greater compute performance compared to current server-class CPUs. With a breadth of connectivity options and standardized development flows, the Versal AI Core Series VC1902 device provides the Versal portfolio's highest AI inference and signal processing throughput for cloud, network, and edge applications.

Get Started

The AMD Vitis unified software platform provides comprehensive core development kits and libraries that use hardware-acceleration technology.

Visit the Vitis GitHub and AI Engine Development repositories to access a variety of AI Engine tutorials and learn more about the technology features and design methodology.

AI Engine tools, both compiler and simulator, are integrated within the Vitis IDE and require an additional dedicated license. Contact your local AMD sales representative for more information on how to access the AI Engine tools and license or visit the Contact Sales form. 

AMD Vitis Model Composer is a model-based design tool that enables rapid design exploration within the Simulink® and MATLAB® environments. It facilitates AI Engine ADF graph development and testing at the system level, allowing users to incorporate RTL and HLS blocks with AI Engine kernels and/or graphs in the same simulation. Leveraging the signal generation and visualization features within the Simulink and MATLAB tools enables DSP engineers to design and debug in a familiar environment. To learn how to use Versal AI Engines with Vitis Model Composer, visit the AI Engine resource page.

Based on the Versal AI Core Series, the VCK190 kit enables designers to develop solutions using AI Engines and DSP Engines. The evaluation kit has everything you need to jump-start your designs.

Also available is the PCIe®-based VCK5000 development card, featuring the Versal AI Core device with AI Engines, built for high-throughput AI inference in the data center.

For AIE-ML development, the VEK280 Evaluation Kit, based on the Versal AI Edge Series, enables developers to target DSP and ML applications.

AMD training and learning resources provide the practical skills and fundamental knowledge you need to be fully productive in your next Versal adaptive SoC development project.

From solution planning to system integration and validation, AMD provides tailored views of the extensive list of Versal adaptive SoC documentation to maximize the productivity of user designs. Visit the Versal adaptive SoC design process hubs to get the latest content for your design needs and explore AI Engine capabilities and design methodologies.

Resources

Stay Informed

Join the adaptive SoC and FPGA notification list to receive the latest news and updates.