Deploying and Running the Model
Deploying and Running Models on Alveo U200/250
Vitis AI provides unified C++ and Python APIs for Edge and Cloud to deploy models on FPGAs.
Programming with VART
Vitis AI provides a C++ DpuRunner class with the following interfaces:
std::pair<uint32_t, int> execute_async(
const std::vector<TensorBuffer*>& input,
const std::vector<TensorBuffer*>& output);
- Submit input tensors for execution and output tensors to store results. The host pointer is passed using the TensorBuffer object. This function returns a job ID and the status of the function call.
int wait(int jobid, int timeout);
- The job ID returned by execute_async is passed to wait() to block until the job is complete and the results are ready.
TensorFormat get_tensor_format()
- Query the DpuRunner for the tensor format it expects. Returns DpuRunner::TensorFormat::NCHW or DpuRunner::TensorFormat::NHWC.
std::vector<Tensor*> get_input_tensors()
- Query the DpuRunner for the shape and name of the input tensors it expects for its loaded Vitis AI model.
std::vector<Tensor*> get_output_tensors()
- Query the DpuRunner for the shape and name of the output tensors it produces for its loaded Vitis AI model.
To create a DpuRunner object, call the following:
std::unique_ptr<Runner> create_runner(const xir::Subgraph* subgraph, const std::string& mode = "");
The input to create_runner is an XIR subgraph generated by the Vitis AI compiler.
C++ Example
// get dpu subgraph by parsing model file
auto runner = vart::Runner::create_runner(subgraph, "run");
// populate input/output tensors
auto job_data = runner->execute_async(inputs, outputs);
runner->wait(job_data.first, -1);
// process outputs
Vitis AI also provides a Python ctypes Runner class that mirrors the C++ class, using the C DpuRunner implementation:
class Runner:
def __init__(self, path)
def get_input_tensors(self)
def get_output_tensors(self)
def get_tensor_format(self)
def execute_async(self, inputs, outputs)
# differences from the C++ API:
# 1. inputs and outputs are numpy arrays with C memory layout
# the numpy arrays should be reused as their internal buffer
# pointers are passed to the runtime. These buffer pointers
# may be memory-mapped to the FPGA DDR for performance.
# 2. returns job_id, throws exception on error
def wait(self, job_id)
Python Example
dpu_runner = runner.Runner(subgraph, "run")
# populate input/output tensors
jid = dpu_runner.execute_async(fpgaInput, fpgaOutput)
dpu_runner.wait(jid)
# process fpgaOutput
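The Python example above can be expanded into a complete inference pass. The sketch below is a minimal illustration only: the helper name `run_dpu_batch` is hypothetical, and it assumes VART-style tensor objects that expose a `dims` attribute, as in the Vitis AI examples.

```python
import numpy as np

def run_dpu_batch(dpu_runner, batch):
    """Hypothetical helper: run one batch through a DPU runner and
    return the raw output buffers."""
    out_tensors = dpu_runner.get_output_tensors()
    # Inputs/outputs must be C-contiguous numpy arrays, because their
    # internal buffer pointers are handed to the runtime (see above).
    fpga_input = [np.ascontiguousarray(batch, dtype=np.float32)]
    fpga_output = [np.zeros(tuple(t.dims), dtype=np.float32, order="C")
                   for t in out_tensors]
    job_id = dpu_runner.execute_async(fpga_input, fpga_output)
    dpu_runner.wait(job_id)
    return fpga_output
```

Because the buffer pointers may be memory-mapped to the FPGA DDR, reusing the same `fpga_input`/`fpga_output` arrays across calls avoids repeated allocation.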
DPU Debug with VART
This chapter demonstrates how to verify the DPU inference result with VART tools. TensorFlow ResNet50, Caffe ResNet50, and PyTorch ResNet50 networks are used as examples. Following are the four steps for debugging the DPU with VART:
- Generate a quantized inference model and reference result
- Generate a DPU xmodel
- Generate a DPU inference result
- Crosscheck the reference result and the DPU inference result
Before you start to debug the DPU result, ensure that you have set up the environment according to the instructions in the Getting Started section.
TensorFlow Workflow
To generate the quantized inference model and reference result, follow these steps:
- Generate the quantized inference model by running the following command to quantize the model. The quantized model, quantize_eval_model.pb, is generated in the quantize_model folder.
  vai_q_tensorflow quantize \
      --input_frozen_graph ./float/resnet_v1_50_inference.pb \
      --input_fn input_fn.calib_input \
      --output_dir quantize_model \
      --input_nodes input \
      --output_nodes resnet_v1_50/predictions/Reshape_1 \
      --input_shapes ?,224,224,3 \
      --calib_iter 100
- Generate the reference result by running the following command to generate reference data.
  vai_q_tensorflow dump --input_frozen_graph \
      quantize_model/quantize_eval_model.pb \
      --input_fn input_fn.dump_input \
      --output_dir=dump_gpu
  The following figure shows part of the reference data.
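The `input_fn.calib_input` referenced above is a user-supplied Python function that vai_q_tensorflow calls with the calibration iteration number; it must return a dict mapping input node names to numpy arrays. The sketch below uses random data as a stand-in for real preprocessing; a real calibration function should load and preprocess images from the calibration set, and the batch size here is an assumption.

```python
# input_fn.py -- sketch of a calibration input function for
# vai_q_tensorflow (--input_fn input_fn.calib_input)
import numpy as np

BATCH_SIZE = 10  # images fed per calibration iteration (assumption)

def calib_input(iter_num):
    """Return a dict mapping input node names to numpy arrays.
    Random data stands in for real preprocessed images here."""
    images = np.random.rand(BATCH_SIZE, 224, 224, 3).astype(np.float32)
    # The key must match the graph's input node name (--input_nodes input).
    return {"input": images}
```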
- Generate the DPU xmodel by running the following command to generate the DPU xmodel file.
  vai_c_tensorflow --frozen_pb quantize_model/quantize_eval_model.pb \
      --arch /opt/vitis_ai/compiler/arch/DPUCAHX8H/U50/arch.json \
      --output_dir compile_model \
      --net_name resnet50_tf
- Generate the DPU inference result by running the following command to generate the DPU inference result and compare it with the reference data automatically.
  env XLNX_ENABLE_DUMP=1 XLNX_ENABLE_DEBUG_MODE=1 XLNX_GOLDEN_DIR=./dump_gpu/dump_results_0 \
      xilinx_test_dpu_runner ./compile_model/resnet_v1_50_tf.xmodel \
      ./dump_gpu/dump_results_0/input_aquant.bin \
      2>result.log 1>&2
  For xilinx_test_dpu_runner, the usage is as follows:
  xilinx_test_dpu_runner <model_file> <input_data>
  After the above command runs, the DPU inference result and the comparison result result.log are generated. The DPU inference results are located in the dump folder.
- Crosscheck the reference result and the DPU inference result.
  - View comparison results for all layers.
    grep --color=always 'XLNX_GOLDEN_DIR.*layer_name' result.log
  - View only the failed layers.
    grep --color=always 'XLNX_GOLDEN_DIR.*fail ! layer_name' result.log
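The grep filters above can also be scripted. The sketch below is a hypothetical helper that extracts failing-layer lines from result.log, assuming the comparison lines contain `XLNX_GOLDEN_DIR` and failures are marked `fail ! layer_name`, as in the grep patterns.

```python
import re

def failed_layers(log_text):
    """Return the comparison lines that report a failed layer."""
    pattern = re.compile(r"XLNX_GOLDEN_DIR.*fail ! layer_name")
    return [line for line in log_text.splitlines() if pattern.search(line)]
```

For example, `failed_layers(open("result.log").read())` lists only the layers that need further investigation.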
If the crosscheck fails, use the following methods to determine the layer at which the crosscheck first fails.
- Check the input of the DPU and GPU; make sure they use the same input data.
- Use the xir tool to generate a picture displaying the network's structure.
  Usage: xir svg <xmodel> <svg>
  Note: In the Vitis AI docker environment, execute the following command to install the required library.
  sudo apt-get install graphviz
  When you open the picture you created, you can see many little boxes around the ops. Each box represents a layer on the DPU. You can use the last op's name to find its corresponding layer in the GPU dump results. The following figure shows part of the structure.
- Submit the files to Xilinx.
  If a certain layer proves to be wrong on the DPU, prepare the quantized model, such as quantize_eval_model.pb, as one package for further analysis and send it to Xilinx with a detailed description.
Caffe Workflow
To generate the quantized inference model and reference result, follow these steps:
- Generate the quantized inference model by running the following command to quantize the model.
  vai_q_caffe quantize -model float/test_quantize.prototxt \
      -weights float/trainval.caffemodel \
      -output_dir quantize_model \
      -keep_fixed_neuron \
      2>&1 | tee ./log/quantize.log
  The following files are generated in the quantize_model folder:
  - deploy.caffemodel
  - deploy.prototxt
  - quantize_train_test.caffemodel
  - quantize_train_test.prototxt
- Generate the reference result by running the following command to generate reference data.
  DECENT_DEBUG=5 vai_q_caffe test -model quantize_model/dump.prototxt \
      -weights quantize_model/quantize_train_test.caffemodel \
      -test_iter 1 \
      2>&1 | tee ./log/dump.log
  This creates the dump_gpu folder and files as shown in the following figure.
- Generate the DPU xmodel by running the following command to generate the DPU xmodel file.
  vai_c_caffe --prototxt quantize_model/deploy.prototxt \
      --caffemodel quantize_model/deploy.caffemodel \
      --arch /opt/vitis_ai/compiler/arch/DPUCAHX8H/U50/arch.json \
      --output_dir compile_model \
      --net_name resnet50
- Generate the DPU inference result by running the following command to generate the DPU inference result.
  env XLNX_ENABLE_DUMP=1 XLNX_ENABLE_DEBUG_MODE=1 \
      xilinx_test_dpu_runner ./compile_model/resnet50.xmodel \
      ./dump_gpu/data.bin 2>result.log 1>&2
  For xilinx_test_dpu_runner, the usage is as follows:
  xilinx_test_dpu_runner <model_file> <input_data>
  After the above command runs, the DPU inference result and the comparison result result.log are generated. The DPU inference results are under the dump folder.
- Crosscheck the reference result and the DPU inference result.
The crosscheck mechanism first confirms that the input(s) to a layer are identical to the reference, and then confirms that the output(s) are identical too. This can be done with commands such as diff, vimdiff, and cmp. If two files are identical, diff and cmp return nothing in the command line.
- Check the input of the DPU and GPU; make sure they use the same input data.
- Use the xir tool to generate a picture displaying the network's structure.
  Usage: xir svg <xmodel> <svg>
  Note: In the Vitis AI docker environment, execute the following command to install the required library.
  sudo apt-get install graphviz
  The following figure is part of the ResNet50 model structure generated by xir_cat.
- View the xmodel structure image and find the name of the last layer of the model.
  Note: Check the last layer first. If the crosscheck of the last layer succeeds, then the crosscheck of all layers will pass and there is no need to crosscheck the other layers.
  For this model, the name of the last layer is `subgraph_fc1000_fixed_(fix2float)`.
- Search for the keyword fc1000 under dump_gpu and dump. You will find the reference result file fc1000.bin under dump_gpu and the DPU inference result 0.fc1000_inserted_fix_2.bin under dump/subgraph_fc1000/output/.
- Diff the two files.
  If the last layer's crosscheck fails, do the crosscheck from the first layer onward until you find the layer where the crosscheck fails.
  Note: For layers that have multiple inputs or outputs (e.g., res2a_branch1), check input correctness first and then check the output.
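The diff/cmp comparison of dump files can also be done in Python. The sketch below compares two raw dump files byte for byte; the helper name is hypothetical, and the int8 interpretation is an assumption (the dump files hold fixed-point tensors whose exact layout depends on the layer).

```python
import numpy as np

def crosscheck_bins(golden_path, dump_path):
    """Return (matches, mismatch_count) for two raw dump files,
    comparing them byte for byte as int8 data."""
    golden = np.fromfile(golden_path, dtype=np.int8)
    dump = np.fromfile(dump_path, dtype=np.int8)
    if golden.size != dump.size:
        # a size mismatch usually means the wrong pair of files
        return False, abs(golden.size - dump.size)
    mismatches = int(np.count_nonzero(golden != dump))
    return mismatches == 0, mismatches
```

For example, `crosscheck_bins("dump_gpu/fc1000.bin", "dump/subgraph_fc1000/output/0.fc1000_inserted_fix_2.bin")` reports whether the last layer matches the reference.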
- Submit the files to Xilinx if the DPU crosscheck fails.
  If a certain layer proves to be wrong on the DPU, prepare the following files as one package for further analysis and send it to Xilinx with a detailed description:
  - Float model and prototxt file
  - Quantized model, such as deploy.caffemodel, deploy.prototxt, quantize_train_test.caffemodel, and quantize_train_test.prototxt
PyTorch Workflow
To generate the quantized inference model and reference result, follow these steps:
- Generate the quantized inference model by running the following command to quantize the model.
  python resnet18_quant.py --quant_mode calib --subset_len 200
- Generate the reference result by running the following command to generate reference data.
  python resnet18_quant.py --quant_mode test
- Generate the DPU xmodel by running the following command to generate the DPU xmodel file.
  vai_c_xir -x /PATH/TO/quantized.xmodel -a /PATH/TO/arch.json -o /OUTPUTPATH -n netname
- Generate the DPU inference result.
  This step is the same as in the Caffe workflow.
- Crosscheck the reference result and the DPU inference result.
  This step is the same as in the Caffe workflow.
Multi-FPGA Programming
Most modern servers have multiple Xilinx® Alveo™ cards, and you may want to take advantage of them by scaling up and scaling out deep-learning inference. Vitis AI provides support for multi-FPGA servers using the following building blocks.
Xbutler
The Xbutler tool manages and controls Xilinx FPGA resources on a machine. With the Vitis AI 1.0 release, installing Xbutler is mandatory for running a deep-learning solution. Xbutler is implemented as a server-client paradigm and is an add-on library on top of Xilinx XRT that facilitates multi-FPGA resource management; it is not a replacement for Xilinx XRT. The feature list for Xbutler is as follows:
- Enables multi-FPGA heterogeneous support
- C++/Python API and CLI for the clients to allocate, use, and release resources
- Enables resource allocation at FPGA, compute unit (CU), and service granularity
- Auto-release resource
- Multi-client support: Enables multi-client/users/processes request
- XCLBIN-to-DSA auto-association
- Resource sharing amongst clients/users
- Containerized support
- User defined function
- Logging support
Multi-FPGA, Multi-Graph Deployment with Vitis AI
Vitis AI provides different applications built using the Unified Runner APIs to deploy multiple models on single/multiple FPGAs. Detailed description and examples are available in the Vitis AI GitHub (Multi-Tenant Multi FPGA Deployment).
Xstream API
A typical end-to-end workflow involves heterogeneous compute nodes: FPGAs for accelerated services such as ML, video, and database acceleration, and CPUs for I/O with the outside world and for compute not implemented on the FPGA. Vitis AI provides a set of APIs and functions that enable the composition of streaming applications in Python. The Xstream APIs build on top of the features provided by Xbutler. The components of the Xstream API are as follows.
- Xstream
  Xstream ($VAI_PYTHON_DIR/vai/dpuv1/rt/xstream.py) provides a standard mechanism for streaming data between multiple processes and controlling execution flow and dependencies.
- Xstream Channel
  Channels are defined by an alphanumeric string. Xstream Nodes may publish payloads to channels and subscribe to channels to receive payloads. The default pattern is PUB-SUB: all subscribers of a channel receive all payloads published to that channel. Payloads are queued up on the subscriber side in FIFO order until the subscriber consumes them off the queue.
- Xstream Payloads
  Payloads contain two items: a blob of binary data and metadata. The binary blob and metadata are transmitted using Redis as an object store. The binary blob is meant for large data; the metadata is meant for smaller data such as IDs, arguments, and options. The object IDs are transmitted through ZMQ, which is used for stream flow control. The ID field is required in the metadata. An empty payload is used to signal the end of transmission.
- Xstream Node
  Each Xstream Node is a stream processor. It is a separate process that can subscribe to zero or more input channels and output to zero or more output channels. A node may perform computation on payloads received on its input channel(s). The computation can be implemented on a CPU, FPGA, or GPU. To define a new node, add a new Python file in vai/dpuv1/rt/xsnodes; see ping.py as an example. Every node should loop forever upon construction. On each iteration of the loop, it should consume payloads from its input channel(s) and publish payloads to its output channel(s). If an empty payload is received, the node should forward the empty payload to its output channels by calling xstream.end() and exit.
- Xstream Graph
  Use $VAI_PYTHON_DIR/vai/dpuv1/rt/xsnodes/grapher.py to construct a graph consisting of one or more nodes. When Graph.serve() is called, the graph spawns each node as a separate process and connects their input/output channels. The graph manages the life and death of all its nodes. See neptune/services/ping.py for a graph example. For example:
  graph = grapher.Graph("my_graph")
  graph.node("prep", pre.ImagenetPreProcess, args)
  graph.node("fpga", fpga.FpgaProcess, args)
  graph.node("post", post.ImagenetPostProcess, args)
  graph.edge("START", None, "prep")
  graph.edge("fpga", "prep", "fpga")
  graph.edge("post", "fpga", "post")
  graph.edge("DONE", "post", None)
  graph.serve(background=True)
  ...
  graph.stop()
- Xstream Runner
  The runner is a convenience class that pushes a payload to the input channel of a graph. The payload is submitted with a unique ID. The runner then waits for the output payload of the graph matching the submitted ID. The purpose of this runner is to provide the look-and-feel of a blocking function call. A complete standalone example of Xstream is here: ${VAI_ALVEO_ROOT}/examples/deployment_modes/xs_classify.py.
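The node loop described above (consume from input channels, publish to output channels, forward the empty end-of-stream payload, then exit) can be illustrated with plain Python queues standing in for Xstream channels. This is a conceptual sketch only; a real node would use the channel API in xstream.py and call xstream.end().

```python
import queue

END = None  # stand-in for the empty payload that signals end of transmission

def node_loop(in_channel, out_channel, process):
    """Sketch of an Xstream-style node loop: consume payloads from the
    input channel, publish processed payloads to the output channel,
    and forward the end-of-stream marker before exiting."""
    while True:
        payload = in_channel.get()
        if payload is END:
            out_channel.put(END)  # a real node calls xstream.end() here
            break
        out_channel.put(process(payload))

# usage sketch: a node that doubles each payload's "data" field
inp, out = queue.Queue(), queue.Queue()
for meta in ({"id": 1, "data": 2}, {"id": 2, "data": 5}):
    inp.put(meta)
inp.put(END)
node_loop(inp, out, lambda p: {**p, "data": p["data"] * 2})
```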
AI Kernel Scheduler
Real-world deep learning applications involve multi-stage data processing pipelines. These include many compute-intensive pre-processing operations, such as loading data from disk, decoding, resizing, color space conversion, scaling, and cropping; multiple ML networks of different kinds, such as CNNs; and various post-processing operations, such as NMS.
The AI Kernel Scheduler (AKS) is an application that automatically and efficiently pipelines such graphs without much effort from the users. It provides various kinds of kernels for every stage of the complex graphs, which are plug-and-play and highly configurable: for example, pre-processing kernels such as image decode and resize, CNN kernels such as the Vitis AI DPU kernel, and post-processing kernels such as SoftMax and NMS. You can create graphs using these kernels and execute jobs seamlessly to get the maximum performance.
For more details and examples, see the Vitis AI GitHub (AI Kernel Scheduler).
Neptune
Neptune provides a web server with a modular collection of nodes defined in Python. These nodes can be strung together in a graph to create a service. You can interact with the server to start and stop these services, and you can extend Neptune by adding your own nodes and services. Neptune builds on top of the Xstream API. In the following picture, the user is running three different machine learning models on 16 videos from YouTube in real time. Through a single Neptune server, time and space multiplexing of the FPGA resources is enabled. Detailed documentation and examples can be found here: ${VAI_ALVEO_ROOT}/neptune. Neptune is in the early access phase in this Vitis AI release.
For more details, see Vitis AI GitHub (Neptune).
Apache TVM and Microsoft ONNX Runtime
In addition to VART and related APIs, Vitis AI has integrated with the Apache TVM and Microsoft ONNX Runtime frameworks for improved model support and automatic partitioning. This work incorporates community-driven machine learning framework interfaces that are not available through the standard Vitis AI compiler and quantizers. In addition, it incorporates highly optimized CPU code for x86 and Arm CPUs, for layers that may not yet be available on Xilinx DPUs.
TVM is currently supported on the following:
- DPUCADX8G
- DPUCZDX8G
ONNX Runtime is currently supported on the following:
- DPUCADX8G
Apache TVM
Apache TVM is an open source deep learning compiler stack focusing on building efficient implementations for a wide variety of hardware architectures. It includes model parsing from TensorFlow, TensorFlow Lite (TFLite), Keras, PyTorch, MxNet, ONNX, Darknet, and others. Through the Vitis AI integration with TVM, Vitis AI is able to run models from these frameworks. TVM incorporates two phases. The first is a model compilation/quantization phase which produces the CPU/FPGA binary for your desired target CPU and DPU. Then by installing the TVM Runtime on your Cloud or Edge device, the TVM APIs in Python or C++ can be called to execute the model.
To read more about Apache TVM, see https://tvm.apache.org.
Vitis AI provides tutorials and installation guides on Vitis AI and TVM integration on the Vitis AI GitHub repository: https://github.com/Xilinx/Vitis-AI/tree/master/external/tvm.
Microsoft ONNX Runtime
Microsoft ONNX Runtime is an open source inference accelerator focused on ONNX models, which can be exported from a wide variety of training frameworks. It is the platform Vitis AI has integrated with to provide first-class ONNX model support. It incorporates easy-to-use runtime APIs in Python and C++, and can support models without the separate compilation phase that TVM requires. Included in ONNX Runtime is a partitioner that can automatically partition between the CPU and FPGA, further enhancing ease of model deployment. Finally, it also incorporates the Vitis AI quantizer in a way that does not require a separate quantization setup.
To read more about Microsoft ONNX Runtime, see https://microsoft.github.io/onnxruntime/.
Vitis AI provides tutorials and installation guides on Vitis AI and ONNXRuntime integration on the Vitis AI GitHub repository: https://github.com/Xilinx/Vitis-AI/tree/master/external/onnxruntime.