Accelerating Subgraph with ML Frameworks
Partitioning is the process of splitting the inference execution of a model between the FPGA and the host. Partitioning is necessary to execute models that contain layers unsupported by the FPGA. It can also be useful for debugging and for exploring different computation-graph partitioning and execution schemes to meet a target objective.
Partitioning Functional API Call in TensorFlow
- Create/initialize the partition class:

  from vai.dpuv1.rt.xdnn_rt_tf import TFxdnnRT
  xdnnTF = TFxdnnRT(args)

- Load the partitioned graph:

  graph = xdnnTF.load_partitioned_graph()

- Apply preprocessing and postprocessing as if the original graph were loaded.
Partitioner API
- networkfile
- tf.Graph, tf.GraphDef, or path to the network file
- loadmode
- Saving protocol of the network file. Supported formats: pb (default), chkpt, txt, savedmodel
- quant_cfgfile
- DPUCADX8G quantization file
- batch_sz
- Inference batch size (default: 1)
- startnode
- List of start nodes for the FPGA partition (optional; defaults to all placeholders)
- finalnode
- List of final nodes for the FPGA partition (optional; defaults to all sink nodes)
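As an illustration, the arguments above might be assembled and passed to the Partitioner as follows. This is a sketch, not the authoritative usage: the file paths are placeholders, and the import is guarded so the snippet degrades gracefully when the Vitis AI runtime is not installed.

```python
from types import SimpleNamespace

# Placeholder argument values mirroring the Partitioner API list above
args = SimpleNamespace(
    networkfile="model/frozen_graph.pb",  # tf.Graph, tf.GraphDef, or path
    loadmode="pb",                        # pb | chkpt | txt | savedmodel
    quant_cfgfile="work/quantizer.json",  # or "IGNORE" if no FPGA execution
    batch_sz=1,                           # inference batch size
    startnode=[],                         # empty: all placeholders
    finalnode=[],                         # empty: all sink nodes
)

try:
    from vai.dpuv1.rt.xdnn_rt_tf import TFxdnnRT
    xdnnTF = TFxdnnRT(args)  # partitions and compiles the graph
except ImportError:
    xdnnTF = None            # Vitis AI runtime not available
```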
Partitioning Steps
- Loading the original graph
The Partitioner can handle a frozen tf.Graph, tf.GraphDef, or a path to the network file/folder. If a pb file is provided, the graph should be properly frozen. Other options include models stored using tf.train.Saver and tf.saved_model.
- Partitioning
In this step, the subgraph specified by the startnode and finalnode sets is analyzed for FPGA acceleration. This is done in multiple phases:
- All graph nodes get partitioned into (FPGA) supported and unsupported sets using one of two methods. The default method (compilerFunc='SPECULATIVE') uses a rough estimate of the hardware operation tree. The second method (compilerFunc='DEFINITIVE') utilizes the hardware compiler. The latter is more accurate and can handle complex optimization schemes based on the specified options; however, it takes considerably more time to complete.
- Adjacent supported and unsupported nodes get merged into (fine grained) connected components.
- Supported partitions get merged into maximally connected components, while maintaining the DAG property.
- Each supported partition gets (re)compiled using hardware compiler to create runtime code, quantization info, and relevant model parameters.
- Each supported partition subgraph is stored for visualization and debug purposes.
- Each supported subgraph gets replaced by a tf.py_func node (with naming convention fpga_func_<partition_id>) that contains all necessary Python function calls to accelerate that subgraph on the FPGA.
- Freezing the modified graph
The modified graph gets frozen and stored with an "-fpga" suffix.
- Run natively in TensorFlow
The modified graph can be loaded using the load_partitioned_graph method of the Partitioner class. The modified graph replaces the default TensorFlow graph and can be used similarly to the original graph.
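The classify-and-merge phases above can be sketched on a toy linear chain of operations. This is pure Python and purely illustrative: the node names and the supported set are invented, and the real partitioner additionally maintains the DAG property on general (non-linear) graphs.

```python
def partition(nodes, supported):
    """Group consecutive nodes of a linear chain into maximal runs that are
    either all FPGA-supported or all unsupported (toy version of the merge
    phases; a real partitioner must also preserve the DAG property)."""
    parts = []
    for n in nodes:
        flag = n in supported
        if parts and parts[-1][0] == flag:
            parts[-1][1].append(n)   # merge with adjacent same-kind run
        else:
            parts.append((flag, [n]))  # start a new run
    return parts

# Invented example: two supported runs separated by an unsupported custom op.
nodes = ["input", "conv1", "bias1", "relu1", "custom_op", "conv2", "softmax"]
supported = {"conv1", "bias1", "relu1", "conv2"}
fpga_parts = [run for ok, run in partition(nodes, supported) if ok]
# each entry in fpga_parts would be replaced by one fpga_func_<partition_id> node
```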
Practical Notes
The compiler optimizations can be modified by passing the applicable compiler arguments, either through positional arguments or option arguments, to the Partitioner class TFxdnnRT. If the model is not properly frozen, the compiler might fail to optimize some operations, such as batchnorm.
The startnode and finalnode sets should each be a vertex separator. This means that the removal of startnode or finalnode should separate the graph into two distinct connected components (except when startnode is a subset of the graph placeholders).
Wherever possible, do not specify cut nodes between layers that are executed as a single macro layer; e.g., for Conv(x) -> BiasAdd(x), placing Conv(x) in a different FPGA partition than BiasAdd(x) may result in suboptimal performance (throughput, latency, and accuracy).
The partitioner initialization requires quant_cfgfile to exist to be able to create executable code for the FPGA. In case FPGA execution is not intended, this requirement can be circumvented by setting quant_cfgfile="IGNORE".
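The vertex-separator condition mentioned above can be checked mechanically. The following is a minimal sketch on an invented four-node chain: removing the candidate cut set must leave no path between the two sides of the graph.

```python
from collections import defaultdict

def reachable(edges, start, removed):
    """Nodes reachable from `start`, ignoring edge direction, after deleting
    the nodes in `removed` (a candidate separator set)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n in seen or n in removed:
            continue
        seen.add(n)
        stack.extend(adj[n])
    return seen

# Chain a -> b -> c -> d: {"b"} separates a from c/d; {"d"} separates nothing.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
print("c" in reachable(edges, "a", removed={"b"}))  # False: {"b"} is a separator
print("c" in reachable(edges, "a", removed={"d"}))  # True: {"d"} is not
```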
Partitioning Support in Caffe
Xilinx has enhanced the Caffe package to automatically partition a Caffe graph. This function separates the FPGA-executable layers in the network and generates a new prototxt, which is used for inference. The subgraph cutter creates a custom Python layer to be accelerated on the FPGA. The following code snippet shows the cutting step:
from vai.dpuv1.rt.scripts.framework.caffe.xfdnn_subgraph \
    import CaffeCutter as xfdnnCutter

def Cut(prototxt):
    cutter = xfdnnCutter(
        inproto="quantize_results/deploy.prototxt",
        trainproto=prototxt,
        outproto="xfdnn_auto_cut_deploy.prototxt",
        outtrainproto="xfdnn_auto_cut_train_val.prototxt",
        cutAfter="data",
        xclbin=XCLBIN,
        netcfg="work/compiler.json",
        quantizecfg="work/quantizer.json",
        weights="work/deploy.caffemodel_data.h5"
    )
    cutter.cut()

# cutting and generating a partitioned graph: xfdnn_auto_cut_deploy.prototxt
Cut(prototxt)
The xfdnn_auto_cut_deploy.prototxt generated in the previous step has complete information to run inference. For example:
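As a sketch of the inference side, the auto-cut prototxt can be loaded like any other Caffe model. The weight-file name here is an assumption (not taken from the source), and loading is guarded so the snippet returns None when pycaffe or the model files are absent.

```python
def load_partitioned_net(proto="xfdnn_auto_cut_deploy.prototxt",
                         weights="work/deploy.caffemodel"):
    """Return a caffe.Net built from the auto-cut prototxt, or None when
    pycaffe or the model files are unavailable (file names are assumptions)."""
    try:
        import caffe
        return caffe.Net(proto, weights, caffe.TEST)
    except Exception:
        return None

net = load_partitioned_net()
# when available, net.forward() runs inference; the FPGA partition executes
# through the custom python layer inserted by the subgraph cutter
```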
- Notebook execution
- There are two example notebooks (image detection and image classification) that can be accessed from $VAI_ALVEO_ROOT/notebooks to understand these steps in detail.
- Script execution
- There is a Python script that can be used to run the models with default settings. It can be run using the following commands:
- Prepare Phase
- python $VAI_ALVEO_ROOT/examples/caffe/run.py --prototxt <example prototxt> --caffemodel <example caffemodel> --prepare
- prototxt
- Path to the prototxt of the model
- caffemodel
- Path to the caffemodel of the model
- output_dir
- Path to save the quantization, compiler, and subgraph_cut files
- qtest_iter
- Number of iterations to test the quantization
- qcalib_iter
- Number of calibration iterations used for quantization
- Validate Phase
- python $VAI_ALVEO_ROOT/examples/caffe/run.py --validate
- output_dir
- If output_dir was given in the prepare phase, pass the same argument and value to use the files generated in that phase.
- numBatches
- Number of batches used to test the inference.
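Putting the two phases together, a session might look like the following. The flag spellings mirror the parameter list above, the model paths and iteration counts are placeholders, and run.py's own help output remains the authoritative reference for its options.

```shell
# Prepare phase: quantize, compile, and cut the model (placeholder paths)
python $VAI_ALVEO_ROOT/examples/caffe/run.py \
    --prototxt models/deploy.prototxt \
    --caffemodel models/model.caffemodel \
    --output_dir work \
    --qtest_iter 1 \
    --qcalib_iter 8 \
    --prepare

# Validate phase: run inference using the files generated above
python $VAI_ALVEO_ROOT/examples/caffe/run.py \
    --validate \
    --output_dir work \
    --numBatches 4
```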