feat(ml): ML on Rockchip NPUs (#15241)

2025-11-14 17:36:12 +00:00 · 2025-03-18 00:04:08 +08:00 · 2025-03-18 00:04:08 +08:00 · 14c3b99c0f
commit 14c3b99c0f
parent 1e184a70f1
43 changed files with 2417 additions and 4726 deletions
--- a/docs/docs/features/ml-hardware-acceleration.md
+++ b/docs/docs/features/ml-hardware-acceleration.md
@ -12,6 +12,7 @@ You do not need to redo any machine learning jobs after enabling hardware accele
 - ARM NN (Mali)
 - CUDA (NVIDIA GPUs with [compute capability](https://developer.nvidia.com/cuda-gpus) 5.2 or higher)
 - OpenVINO (Intel GPUs such as Iris Xe and Arc)
+- RKNN (Rockchip)

 ## Limitations

@ -19,6 +20,7 @@ You do not need to redo any machine learning jobs after enabling hardware accele
 - Only Linux and Windows (through WSL2) servers are supported.
 - ARM NN is only supported on devices with Mali GPUs. Other Arm devices are not supported.
 - Some models may not be compatible with certain backends. CUDA is the most reliable.
+- Search latency isn't improved by ARM NN due to model compatibility issues preventing its use. However, smart search jobs do make use of ARM NN.

 ## Prerequisites

@ -33,6 +35,7 @@ You do not need to redo any machine learning jobs after enabling hardware accele
  - The `hwaccel.ml.yml` file assumes the path to it is `/usr/lib/libmali.so`, so update accordingly if it is elsewhere
  - The `hwaccel.ml.yml` file assumes an additional file `/lib/firmware/mali_csffw.bin`, so update accordingly if your device's driver does not require this file
 - Optional: Configure your `.env` file, see [environment variables](/docs/install/environment-variables) for ARM NN specific settings
+  - In particular, the `MACHINE_LEARNING_ANN_FP16_TURBO` can significantly improve performance at the cost of very slightly lower accuracy

 #### CUDA

@ -47,6 +50,16 @@ You do not need to redo any machine learning jobs after enabling hardware accele
 - Ensure the server's kernel version is new enough to use the device for hardware accceleration.
 - Expect higher RAM usage when using OpenVINO compared to CPU processing.

+#### RKNN
+
+- You must have a supported Rockchip SoC: only RK3566, RK3568, RK3576 and RK3588 are supported at this moment.
+- Make sure you have the appropriate linux kernel driver installed
+  - This is usually pre-installed on the device vendor's Linux images
+- RKNPU driver V0.9.8 or later must be available in the host server
+  - You may confirm this by running `cat /sys/kernel/debug/rknpu/version` to check the version
+- Optional: Configure your `.env` file, see [environment variables](/docs/install/environment-variables) for RKNN specific settings
+  - In particular, setting `MACHINE_LEARNING_RKNN_THREADS` to 2 or 3 can _dramatically_ improve performance for RK3576 and RK3588 compared to the default of 1, at the expense of multiplying the amount of RAM each model uses by that amount.
+
 ## Setup

 1. If you do not already have it, download the latest [`hwaccel.ml.yml`][hw-file] file and ensure it's in the same folder as the `docker-compose.yml`.
@ -127,3 +140,12 @@ Note that you should increase job concurrencies to increase overall utilization
 - If you encounter an error when a model is running, try a different model to see if the issue is model-specific.
 - You may want to increase concurrency past the default for higher utilization. However, keep in mind that this will also increase VRAM consumption.
 - Larger models benefit more from hardware acceleration, if you have the VRAM for them.
+- Compared to ARM NN, RKNPU has:
+  - Wider model support (including for search, which ARM NN does not accelerate)
+  - Less heat generation
+  - Very slightly lower accuracy (RKNPU always uses FP16, while ARM NN by default uses higher precision FP32 unless `MACHINE_LEARNING_ANN_FP16_TURBO` is enabled)
+  - Varying speed (tested on RK3588):
+    - If `MACHINE_LEARNING_RKNN_THREADS` is at the default of 1, RKNPU will have substantially lower throughput for ML jobs than ARM NN in most cases, but similar latency (such as when searching)
+    - If `MACHINE_LEARNING_RKNN_THREADS` is set to 3, it will be somewhat faster than ARM NN at FP32, but somewhat slower than ARM NN if `MACHINE_LEARNING_ANN_FP16_TURBO` is enabled
+    - When other tasks also use the GPU (like transcoding), RKNPU has a significant advantage over ARM NN as it uses the otherwise idle NPU instead of competing for GPU usage
+  - Lower RAM usage if `MACHINE_LEARNING_RKNN_THREADS` is at the default of 1, but significantly higher if greater than 1 (which is necessary for it to fully utilize the NPU and hence be comparable in speed to ARM NN)
--- a/docs/docs/install/environment-variables.md
+++ b/docs/docs/install/environment-variables.md
@ -170,6 +170,8 @@ Redis (Sentinel) URL example JSON before encoding:
 | `MACHINE_LEARNING_MAX_BATCH_SIZE__FACIAL_RECOGNITION`       | Set the maximum number of faces that will be processed at once by the facial recognition model      |  None (`1` if using OpenVINO)   | machine learning |
 | `MACHINE_LEARNING_PING_TIMEOUT`                             | How long (ms) to wait for a PING response when checking if an ML server is available                |             `2000`              | server           |
 | `MACHINE_LEARNING_AVAILABILITY_BACKOFF_TIME`                | How long to ignore ML servers that are offline before trying again                                  |             `30000`             | server           |
+| `MACHINE_LEARNING_RKNN`                                     | Enable RKNN hardware acceleration if supported                                                      |             `True`              | machine learning |
+| `MACHINE_LEARNING_RKNN_THREADS`                             | How many threads of RKNN runtime should be spinned up while inferencing.                            |               `1`               | machine learning |

 \*1: It is recommended to begin with this parameter when changing the concurrency levels of the machine learning service and then tune the other ones.