Libraries and Samples
The Vitis™ AI Library contains the following types of neural network libraries based on the Caffe framework:
- Classification
- Face detection
- SSD detection
- Pose detection
- Semantic segmentation
- Road line detection
- YOLOV3 detection
- YOLOV2 detection
- Openpose detection
- RefineDet detection
- ReID detection
- Multitask
- Face recognition
- Plate detection
- Plate recognition
- Medical segmentation
The Vitis™ AI Library also contains the following types of neural network libraries based on the TensorFlow framework:
- Classification
- SSD detection
- YOLOv3 detection
- Medical detection
The Vitis™ AI Library also supports the following types of neural network libraries based on the PyTorch framework:
- Classification
- ReID detection
- Face recognition
- Semantic segmentation
- Point cloud
- Medical segmentation
- 3D segmentation
The related libraries are open source and can be modified as needed. The source code is available on GitHub.
The Vitis™ AI Library provides image test samples and video test samples for all the above networks. In addition, the kit provides the corresponding performance test programs. For video-based testing, raw video is recommended for evaluation, because software decoding on the Arm® CPU can have inconsistent decoding times, which may affect the accuracy of the evaluation.
Model Library
After the model package is installed on the target, all the models are stored under /usr/share/vitis_ai_library/models/. Each model is stored in a separate folder, which contains the following files by default:
- [model_name].xmodel
- [model_name].prototxt
Take the "inception_v1" model as an example: inception_v1.xmodel is the model data, and inception_v1.prototxt holds the parameters of the model.
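The layout described above can be sketched as a small path helper. This is only an illustration: it assumes that each model folder is named after the model itself, and the names `MODEL_ROOT` and `model_files` are hypothetical, not part of the Vitis AI Library.

```python
from pathlib import PurePosixPath

# Default install location stated in the text above.
MODEL_ROOT = PurePosixPath("/usr/share/vitis_ai_library/models")


def model_files(model_name, root=MODEL_ROOT):
    """Return the expected .xmodel and .prototxt paths for one model.

    Assumption: the per-model folder carries the same name as the model.
    """
    folder = PurePosixPath(root) / model_name
    return {
        "xmodel": folder / f"{model_name}.xmodel",      # model data
        "prototxt": folder / f"{model_name}.prototxt",  # model parameters
    }


if __name__ == "__main__":
    for kind, path in model_files("inception_v1").items():
        print(kind, path)
```

On the target, the returned paths could be checked for existence before a sample is run.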
Model Type
Classification
The Classification library is used to classify images. Such neural networks are trained on ImageNet for ILSVRC and can identify objects from its 1000 classes. The Vitis AI Library integrates networks including, but not limited to, ResNet18, ResNet50, Inception_v1, Inception_v2, Inception_v3, Inception_v4, Vgg, mobilenet_v1, mobilenet_v2, and Squeezenet into Xilinx libraries. The input is a picture with an object and the output is the top-K most probable categories.
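To make the top-K output concrete, the following minimal sketch selects the K highest-scoring classes from a list of per-class scores. It does not use the Vitis AI Library API; the function name `top_k` and the plain list of scores are assumptions made for illustration.

```python
import heapq


def top_k(scores, k=5):
    """Return (class_index, score) pairs for the k highest scores, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


if __name__ == "__main__":
    # Four made-up class scores; a real ImageNet model would produce 1000.
    example_scores = [0.10, 0.60, 0.05, 0.25]
    for index, score in top_k(example_scores, k=2):
        print(index, score)
```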
The following table lists the classification models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | inception_resnet_v2_tf | TensorFlow |
| 2 | inception_v1_tf | |
| 3 | inception_v3_tf | |
| 4 | inception_v4_2016_09_09_tf | |
| 5 | mobilenet_v1_0_25_128_tf | |
| 6 | mobilenet_v1_0_5_160_tf | |
| 7 | mobilenet_v1_1_0_224_tf | |
| 8 | mobilenet_v2_1_0_224_tf | |
| 9 | mobilenet_v2_1_4_224_tf | |
| 10 | resnet_v1_101_tf | |
| 11 | resnet_v1_152_tf | |
| 12 | resnet_v1_50_tf | |
| 13 | vgg_16_tf | |
| 14 | vgg_19_tf | |
| 15 | mobilenet_edge_1_0_tf | |
| 16 | mobilenet_edge_0_75_tf | |
| 17 | inception_v2_tf | |
| 18 | MLPerf_resnet50_v1.5_tf | |
| 19 | resnet50_tf2 | |
| 20 | mobilenet_1_0_224_tf2 | |
| 21 | inception_v3_tf2 | |
| 22 | resnet_v2_50_tf | |
| 23 | resnet_v2_101_tf | |
| 24 | resnet_v2_152_tf | |
| 25 | resnet50 | Caffe |
| 26 | resnet18 | |
| 27 | inception_v1 | |
| 28 | inception_v2 | |
| 29 | inception_v3 | |
| 30 | inception_v4 | |
| 31 | mobilenet_v2 | |
| 32 | squeezenet | |
| 33 | resnet50_pt | PyTorch |
| 34 | squeezenet_pt | |
| 35 | inception_v3_pt | |
Face Detection
The Face Detection library uses the DenseBox neural network to detect human faces. The input is a picture with the faces you want to detect and the output is a vector of the result structure containing the information of each detection box. The following image shows the result of face detection.
The following table lists the face detection models supported by the AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | densebox_320_320 | Caffe |
| 2 | densebox_640_360 | |
Face Landmark Detection
The Face Landmark network is used to detect five key points on a human face: the left eye, the right eye, the nose, the left corner of the lips, and the right corner of the lips. This network is used to correct the face direction before face feature extraction: if a face is not directly facing the camera (for example, tilted 20 degrees left or right), it is adjusted to face the camera directly. The input image should be a face which is detected by the face detection network. The output of the network is the five key points, which are normalized. The following image shows the result of face landmark detection.
The following table lists the face landmark models supported by the AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | face_landmark | Caffe |
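Because the five key points are normalized, they usually need to be scaled before they can be drawn on an image. The sketch below assumes that "normalized" means each coordinate is a fraction of the input face image's width or height, and that the points arrive in the order listed in the description; both are assumptions, and the names `LANDMARK_NAMES` and `to_pixels` are hypothetical.

```python
# Order follows the description above; the actual output order is an assumption.
LANDMARK_NAMES = (
    "left_eye",
    "right_eye",
    "nose",
    "left_lip_corner",
    "right_lip_corner",
)


def to_pixels(points, width, height):
    """Scale five normalized (x, y) points to integer pixel coordinates.

    Assumption: x and y are fractions of the image width and height.
    """
    if len(points) != len(LANDMARK_NAMES):
        raise ValueError("expected exactly five key points")
    return {
        name: (round(x * width), round(y * height))
        for name, (x, y) in zip(LANDMARK_NAMES, points)
    }


if __name__ == "__main__":
    sample = [(0.25, 0.25), (0.75, 0.25), (0.5, 0.5), (0.25, 0.75), (0.75, 0.75)]
    print(to_pixels(sample, width=96, height=72))
```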
SSD Detection
The SSD Detection library is used with the SSD neural network, which detects objects. The input is a picture with some objects you want to detect. The output is a vector of the result structure containing the information of each detection box. The following image shows the result of SSD detection.
The following table lists the SSD detection models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | ssd_mobilenet_v1_coco_tf | TensorFlow |
| 2 | ssd_mobilenet_v2_coco_tf | |
| 3 | ssd_resnet_50_fpn_coco_tf | |
| 4 | mlperf_ssd_resnet34_tf | |
| 5 | ssdlite_mobilenet_v2_coco_tf | |
| 6 | ssd_inception_v2_coco_tf | |
| 7 | ssd_pedestrian_pruned_0_97 | Caffe |
| 8 | ssd_traffic_pruned_0_9 | |
| 9 | ssd_adas_pruned_0_95 | |
| 10 | ssd_mobilenet_v2 | |
Pose Detection
The Pose Detection library is used to detect the posture of the human body. This library includes a neural network which can identify 14 key points on the human body. The input is a picture of a person that is detected by the pedestrian detection neural network (you can use our SSD detection library for this step). The output is a structure containing the coordinates of each point. The following image shows the result of pose detection.
The following table lists the pose detection models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | sp_net | Caffe |
Semantic Segmentation
Semantic segmentation assigns a semantic category to each pixel in the input image, that is, it identifies pixels as part of an object, say, a car, a road, a tree, a horse, etc. Libsegmentation is a segmentation library which can be used in ADAS applications. It offers simple interfaces for a developer to deploy segmentation tasks on a Xilinx® FPGA.
The following is an example of semantic segmentation, where "blue gray" denotes the sky, "green" denotes trees, "red" denotes people, "dark blue" denotes cars, "plum" denotes the road, and "gray" denotes structures.
The following table lists the semantic segmentation models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | fpn | Caffe |
| 2 | FPN-resnet18_Endov | |
| 3 | semantic_seg_citys_tf2 | TensorFlow |
| 4 | mobilenet_v2_cityscapes_tf | |
| 5 | SemanticFPN_cityscapes_pt | PyTorch |
| 6 | ENet_cityscapes_pt | |
| 7 | unet_chaos-CT_pt | |
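The color legend described for the segmentation example can be sketched as a lookup from per-pixel class index to display color. The class indices and RGB values below are hypothetical (only the color names come from the text), and `colorize` is an illustrative helper, not a library function.

```python
# Hypothetical class indices with approximate RGB values for the named colors.
PALETTE = {
    0: (221, 160, 221),  # road: plum
    1: (128, 128, 128),  # structures: gray
    2: (0, 128, 0),      # trees: green
    3: (102, 153, 204),  # sky: blue gray
    4: (255, 0, 0),      # people: red
    5: (0, 0, 139),      # cars: dark blue
}
UNKNOWN_COLOR = (0, 0, 0)  # fallback for class indices outside the palette


def colorize(mask):
    """Map a 2D grid of class indices to a 2D grid of RGB tuples."""
    return [[PALETTE.get(cls, UNKNOWN_COLOR) for cls in row] for row in mask]


if __name__ == "__main__":
    tiny_mask = [[3, 3, 2], [0, 5, 4]]
    for row in colorize(tiny_mask):
        print(row)
```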
Road Line Detection
The Road Line Detection library is used to draw lane lines in ADAS applications. Each lane line is represented by a number that identifies its category, and a vector<Point> holds the points used to draw it. The test code uses a color map so that different types of lane lines are drawn in different colors. The points are stored in the vector container, and the OpenCV polyline interface cv::polylines() is used to draw the lane line. The following image shows the result of road line detection.
The following table lists the road line detection models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | vpgnet_pruned_0_99 | Caffe |
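The color-map step described for road line detection can be sketched as follows. The category numbers, the BGR colors, and the helper `plan_polylines` are all hypothetical; the sketch only prepares the point list and color for each lane and leaves the actual drawing to OpenCV.

```python
# Hypothetical mapping from lane category number to a BGR color.
LANE_COLORS = {
    0: (0, 255, 255),
    1: (255, 0, 0),
    2: (0, 255, 0),
    3: (0, 0, 255),
}
DEFAULT_COLOR = (255, 255, 255)  # used for categories missing from the map


def plan_polylines(lanes):
    """Pair each lane's points with the color for its category.

    lanes: iterable of (category, points), where points is a sequence of (x, y).
    Lanes with fewer than two points are skipped because they cannot form a line.
    """
    plan = []
    for category, points in lanes:
        if len(points) < 2:
            continue
        plan.append((list(points), LANE_COLORS.get(category, DEFAULT_COLOR)))
    return plan


if __name__ == "__main__":
    detected = [(1, [(10, 200), (40, 120), (60, 60)]), (7, [(5, 5)])]
    for points, color in plan_polylines(detected):
        print(color, points)
```

With OpenCV's Python bindings, each planned entry could then be drawn by converting its points to an integer NumPy array and passing it to cv2.polylines with isClosed set to False.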
YOLOv3 Detection
YOLO is a neural network which is used to detect objects. The current version is v3. The input is a picture with one or more objects and the output is a vector of the result structure containing the detection information. The following image shows the result of YOLOv3 detection.
The following table lists the YOLOv3 detection models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | yolov3_voc_tf | TensorFlow |
| 2 | yolov3_adas_pruned_0_9 | Caffe |
| 3 | yolov3_voc | |
| 4 | yolov3_bdd | |
| 5 | yolov4_leaky_spp_m | |
| 6 | tiny_yolov3_vmss | |
YOLOv2 Detection
| No | Model Name | Framework |
|---|---|---|
| 1 | yolov2_voc | Caffe |
| 2 | yolov2_voc_pruned_0_66 | |
| 3 | yolov2_voc_pruned_0_71 | |
| 4 | yolov2_voc_pruned_0_77 | |
Openpose Detection
The key points are indexed as follows:
0: head, 1: neck, 2: L_shoulder, 3: L_elbow, 4: L_wrist, 5: R_shoulder,
6: R_elbow, 7: R_wrist, 8: L_hip, 9: L_knee, 10: L_ankle, 11: R_hip,
12: R_knee, 13: R_ankle
The input of the network is 368x368. The following image shows the result of openpose detection.
The following table lists the Openpose detection models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | openpose_pruned_0_3 | Caffe |
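The key point indices listed for Openpose can be captured in a lookup table so that results are easier to read. This is a minimal sketch: the tuple `OPENPOSE_KEYPOINTS` mirrors the index list in this section, while `label_keypoints` and its input format (a list of 14 coordinate pairs in index order) are assumptions, not the library's result structure.

```python
# Index-to-name table copied from the key point list in this section.
OPENPOSE_KEYPOINTS = (
    "head", "neck",
    "L_shoulder", "L_elbow", "L_wrist",
    "R_shoulder", "R_elbow", "R_wrist",
    "L_hip", "L_knee", "L_ankle",
    "R_hip", "R_knee", "R_ankle",
)


def label_keypoints(points):
    """Attach a name to each of the 14 points, assuming index order."""
    if len(points) != len(OPENPOSE_KEYPOINTS):
        raise ValueError("expected exactly 14 key points")
    return dict(zip(OPENPOSE_KEYPOINTS, points))


if __name__ == "__main__":
    dummy = [(i * 10, i * 5) for i in range(14)]
    print(label_keypoints(dummy)["L_wrist"])
```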
RefineDet Detection
RefineDet is a neural network that is used to detect human bodies. The input is a picture with some individuals that you would like to detect. The output is a vector of the result structure that contains the information of each box. The following image shows the result of RefineDet detection:
The following table lists the RefineDet detection models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | refinedet_pruned_0_8 | Caffe |
| 2 | refinedet_pruned_0_92 | |
| 3 | refinedet_pruned_0_96 | |
| 4 | refinedet_baseline | |
| 5 | refinedet_VOC_tf | TensorFlow |
ReID Detection
The task of person re-identification is to identify a person of interest at any time or place. This is done by extracting image features and comparing them. Images of the same person should have similar features and a small feature distance, while images of different persons should have a large feature distance. Given a query image and a set of candidate images, the candidate with the smallest feature distance is identified as the same person as in the query image. The following table lists the ReID detection models supported by the Vitis AI Library.
| Number | Model Name | Framework |
|---|---|---|
| 1 | reid | Caffe |
| 2 | personreid-res18_pt | PyTorch |
| 3 | personreid-res50_pt | |
| 4 | facereid-large_pt | |
| 5 | facereid-small_pt | |
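The matching rule described for ReID (pick the candidate with the smallest feature distance) can be sketched as below. The text does not say which distance the library uses, so Euclidean distance is an assumption here, and `euclidean` and `best_match` are hypothetical helper names.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(query, gallery):
    """Return (candidate_id, distance) for the closest candidate feature.

    gallery: dict mapping a candidate id to its feature vector.
    """
    if not gallery:
        raise ValueError("gallery is empty")
    distances = ((cid, euclidean(query, feat)) for cid, feat in gallery.items())
    return min(distances, key=lambda pair: pair[1])


if __name__ == "__main__":
    candidates = {"person_a": [4.0, 4.0], "person_b": [0.0, 4.0]}
    print(best_match([0.0, 0.0], candidates))
```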
Multi-task
The multi-task library is appropriate for a model that has multiple sub-tasks. The multi-task model in the Vitis AI Library has two sub-tasks: semantic segmentation and SSD detection. The following table lists the multi-task models supported by the Vitis AI Library.
| Number | Model Name | Framework |
|---|---|---|
| 1 | multi_task | Caffe |
| 2 | MT-resnet18_mixed_pt | PyTorch |
Face Recognition
The face feature models are used for face recognition. They extract the features of a person's face, and the output of each model is a vector of 512 features. If you have two different images and you want to know whether they show the same person, use these models to extract the features of the two images, and then use calculation functions and mapping functions to get the similarity of the two images.
The following table lists the face recognition models supported by the Vitis AI Library.
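One possible form of the similarity calculation is sketched below. The text does not name the calculation or mapping functions, so both choices here are assumptions: cosine similarity as the calculation, and a linear map from [-1, 1] to [0, 1] as the mapping. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


def mapped_similarity(a, b):
    """Map cosine similarity from [-1, 1] to [0, 1] (assumed mapping)."""
    return (cosine_similarity(a, b) + 1.0) / 2.0


if __name__ == "__main__":
    # Two short made-up vectors; real features would have 512 values each.
    print(mapped_similarity([1.0, 0.0, 1.0], [1.0, 0.5, 1.0]))
```

A decision threshold on the mapped value would then separate "same person" from "different person"; a suitable threshold depends on the model and is not given in the text.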
| No | Model Name | Framework |
|---|---|---|
| 1 | facerec_resnet20 | Caffe |
| 2 | facerec_resnet64 | |
| 3 | facerec-resnet20_mixed_pt | PyTorch |
Plate Detection
The Plate Detection library uses the DenseBox neural network to detect license plates. The input is a picture of the vehicle that is detected by the SSD and the output is a structure containing the plate location information. The following image shows the result of the plate detection.
The following table lists the plate detection models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | plate_detect | Caffe |
Plate Recognition
The Plate Recognition library uses a classification network to recognize license plate numbers (Chinese license plates only). The input is a picture of the license plate that is detected by the plate detection network. The output is a structure containing the license plate number information. The following image shows the result of the plate recognition.
The following table lists the plate recognition models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | plate_num | Caffe |
Medical Segmentation
Endoscopy is a common clinical procedure for the early detection of cancers in hollow-organs such as nasopharyngeal cancer, esophageal adenocarcinoma, gastric cancer, colorectal cancer, and bladder cancer. Accurate and temporally consistent localization and segmentation of diseased region-of-interests enable precise quantification and mapping of lesions from clinical endoscopy videos, which is critical for monitoring and surgical planning.
The medical segmentation model is used to classify diseased regions of interest in the input image. Regions can be classified into several categories, including BE, cancer, HGD, polyp, and suspicious.
Libmedicalsegmentation is a segmentation library which can be used in segmentation of multi-class diseases in endoscopy. It offers simple interfaces for developers to deploy segmentation tasks on Xilinx FPGAs. The following is an example of medical segmentation, where the goal is to mark the diseased region.
The following is an example of semantic segmentation, where the goal is to predict class labels for each pixel in the image.
The following table lists the medical segmentation models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | FPN_Res18_Medical_segmentation | Caffe |
Medical Detection
The RefineDet model is based on vgg16. It is used for medical detection and can detect five types of diseases, namely, BE, cancer, HGD, polyp, and suspicious from an input endoscopy image like the Endoscopy Disease Detection and Segmentation database (EDD2020).
The following table lists the medical detection models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | RefineDet-Medical_EDD_tf | TensorFlow |
Medical Cell Segmentation
The nucleus is an organelle present within all eukaryotic cells, including human cells. Aberrant nuclear shape can be used to identify cancer cells, for example, in pap smear tests for the diagnosis of cervical cancer. Medical cell segmentation models offer nuclear segmentation in digital microscopic tissue images, which can enable extraction of high-quality features for nuclear morphometric and other analyses in computational pathology. The following images show the results of cell segmentation.
The following table lists the Medical Cell Segmentation models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | medical_seg_cell_tf2 | TensorFlow |
Retinaface
The retinaface network is used to detect human faces and face landmarks. The input is a picture with some faces you would like to detect and the output contains the positions, scores, and landmarks of the faces.
The following table lists the retinaface detection models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | retinaface | Caffe |
Face Quality
The Face Quality library uses the face quality network to compute a quality score for a face. If the face is clear and frontal, the score is high; a blurry or side-facing face gets a low score. The score ranges from 0 to 1. The library also provides face landmark positions. The input is a face which is detected by the face detection network and the output contains the quality score and five landmark key points.
The following table lists the face quality models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | face-quality | Caffe |
| 2 | face-quality_pt | PyTorch |
Hourglass
The key points are indexed as follows:
0 - r ankle, 1 - r knee, 2 - r hip, 3 - l hip, 4 - l knee, 5 - l ankle,
6 - pelvis, 7 - thorax, 8 - upper neck, 9 - head top, 10 - r wrist,
11 - r elbow, 12 - r shoulder, 13 - l shoulder, 14 - l elbow, 15 - l wrist
This network can detect the posture of only one person in the input image. The input of the network is 256x256. The following image shows the result of hourglass detection.
The following table lists the hourglass models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | hourglass-pe_mpii | Caffe |
Pointpillars
Object detection in point clouds is an important aspect of many robotics applications such as autonomous driving. The pointpillars model is a novel deep network and encoder that can be trained end-to-end on LiDAR point clouds. It offers the best architecture for 3D object detection from LiDAR. The following image shows the result of a pointpillar test.
The following table lists the pointpillars models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | pointpillars_kitti_12000_0_pt | PyTorch |
| 2 | pointpillars_kitti_12000_1_pt | PyTorch |
3D Segmentation
The 3D segmentation library can support the SalsaNext model, which is used for the uncertainty-aware semantic segmentation of a full 3D LiDAR point cloud in real-time. SalsaNext is the next version of SalsaNet which has an encoder-decoder architecture, where the encoder unit has a set of ResNet blocks and the decoder unit combines upsampled features from the residual blocks.
The following table lists the 3D segmentation models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | salsanext_pt | PyTorch |
Covid19 Segmentation
The Covid19 segmentation library can support the COVID-Net model which is a deep convolutional neural network design tailored for the detection of COVID-19 cases from chest X-ray (CXR) images.
The following table lists the Covid19 segmentation models supported by the Vitis AI Library.
| No | Model Name | Framework |
|---|---|---|
| 1 | FPN-resnet18_covid19-seg_pt | PyTorch |
Model Samples
Currently, there are 27 model samples, located in ~/Vitis-AI/demo/Vitis-AI-Library/samples. Each sample provides the following four kinds of test programs:
- test_jpeg_[model type]
- test_video_[model type]
- test_performance_[model type]
- test_accuracy_[model type]
Take YOLOv3 as an example.
- Before you run the YOLOv3 detection example, you can choose one of the following yolov3 models to run:
- yolov3_bdd
- yolov3_voc
- yolov3_voc_tf
- Ensure that the following test programs exist:
- test_jpeg_yolov3
- test_video_yolov3
- test_performance_yolov3
- test_accuracy_yolov3_bdd
- test_accuracy_yolov3_adas_pruned_0_9
- test_accuracy_yolov3_voc
- test_accuracy_yolov3_voc_tf
If the executable program does not exist, you have to cross compile it on the host and then copy the executable program to the target.
- To test the image data, execute the following command:
#./test_jpeg_yolov3 yolov3_bdd sample_yolov3.jpg
The result is printed on the terminal. Also, you can view the output image: sample_yolov3_result.jpg.
- To test the video data, execute the following command:
#./test_video_yolov3 yolov3_bdd video_input.mp4 -t 8
- To test the model performance, execute the following command:
#./test_performance_yolov3 yolov3_bdd test_performance_yolov3.list -t 8
The result is printed on the terminal.
- To test the model accuracy, prepare your own image dataset, image list file, and the ground truth of the images. Then execute the following command:
#./test_accuracy_yolov3_bdd [image_list_file] [output_file]
After the output_file is generated, a script file is needed to automatically compare the results. Finally, the accuracy result can be obtained.
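When several models are tested in a row, the command lines shown above can be assembled by a small script. The sketch below only builds the argument lists and does not run anything. The helper names are hypothetical, and the `-t` value is passed through unchanged; treating it as a thread count is an assumption, because the text does not explain the flag.

```python
import shlex

# Test kinds whose program names follow the test_[kind]_[model type] pattern.
KINDS = ("jpeg", "video", "performance")


def build_command(kind, model_type, model_name, input_path, threads=None):
    """Build the argument list for one sample test program.

    threads: value for the -t flag (assumed to be a thread count); omitted if None.
    """
    if kind not in KINDS:
        raise ValueError(f"unknown test kind: {kind}")
    cmd = [f"./test_{kind}_{model_type}", model_name, input_path]
    if threads is not None:
        cmd += ["-t", str(threads)]
    return cmd


def as_shell_line(cmd):
    """Render an argument list as a single shell-quoted command line."""
    return shlex.join(cmd)


if __name__ == "__main__":
    print(as_shell_line(build_command("jpeg", "yolov3", "yolov3_bdd", "sample_yolov3.jpg")))
    print(as_shell_line(build_command("video", "yolov3", "yolov3_bdd", "video_input.mp4", threads=8)))
```

On the target board, an argument list built this way could be handed to subprocess.run from the directory that contains the test programs.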