Inference Guide
Ready-to-deploy recipes for validated open-weight LLMs on Alauda AI. Each model in this guide has been deployed end-to-end on a real cluster and benchmarked, so you get a known-good deployment manifest, the runtime image that serves it, and the throughput you can expect.
The models here were validated on Huawei Ascend 910B4 NPU with the community
vLLM-Ascend engine, deployed through Alauda AI's InferNex surface — a KServe
LLMInferenceService reconciled by the InferNex-Bridge into a load-aware router
(hermes-router / EPP) in front of the vLLM-Ascend instances. All were run through the
same InferNex aggregation surface and the same two benchmark scenarios (the two
Qwen models share an identical 2 × TP=4 topology and are directly comparable; the
larger DeepSeek MoE uses 1 × TP=8). For the runtime model (KServe, ModelCar storage,
scheduling) see Model Deployment & Inference.
Validated models
All three were validated on Ascend 910B4 (32 GB/card), driven through KServe
LLMInferenceService with load-aware routing (InferNex-Bridge + hermes-router). The
two Qwen models run an 8-card aggregation — 2 instances × TP=4; DeepSeek-V4-Flash
(~151 GB W4A8) fills all 8 cards as 1 instance × TP=8, and was additionally validated
through the MaaS gateway (API-key) ingress alongside the internal KServe ingress.
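A back-of-the-envelope check of why the DeepSeek checkpoint takes a single TP=8 instance, using only the figures above (~151 GB of weights, 32 GB per card). This is a rough sketch: it assumes weights shard evenly across tensor-parallel ranks and ignores KV cache, activations, and runtime overhead, so real per-card usage will be higher. `per_card_gb` is a hypothetical helper name.

```shell
# Hypothetical helper: weight footprint per card, assuming an even TP split.
# Ignores KV cache, activations, and runtime overhead.
per_card_gb() {
  awk -v w="$1" -v tp="$2" 'BEGIN { printf "%.3f\n", w / tp }'
}

per_card_gb 151 8   # prints 18.875 (under the 32 GB card capacity)
per_card_gb 151 4   # prints 37.750 (over the 32 GB card capacity)
```

Under these assumptions a TP=4 split would not fit the weights alone on a 32 GB card, which is consistent with the 1 × TP=8 layout.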
Runtime images
The Ascend CANN images are arm64. Always match the runtime image's CANN version to the host NPU driver on your nodes. Only the engines actually used in this guide are listed; other engines (MindIE, SGLang, …) were not benchmarked at this size.
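One way to gather the facts needed for that match before picking an image is sketched below. It assumes `kubectl` access to the cluster and shell access to an NPU host where Huawei's `npu-smi` tool is installed; the function names are hypothetical. Which CANN releases pair with which driver versions is not encoded here and should be taken from Huawei's compatibility documentation.

```shell
# List each node's CPU architecture via the well-known kubernetes.io/arch
# label; nodes that run these images should report arm64.
show_node_arch() {
  kubectl get nodes -L kubernetes.io/arch
}

# Run on the NPU host itself: npu-smi prints the installed driver version
# together with per-card status.
show_npu_driver() {
  npu-smi info
}
```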
Benchmark scenarios
All three models were measured with aiperf against the same two scenarios, modelled on
real serving patterns. Output is pinned to 128 tokens and load is closed-loop,
concurrency 4 (4 in-flight requests, fixed). Each scenario ran 240 requests.
Both scenarios run on a single-node 8-card deployment (Qwen models: 2 instances × TP=4; DeepSeek-V4-Flash: 1 instance × TP=8). Latency (TTFT / ITL / E2E) is the per-instance operating point under steady 2-in-flight load; total throughput (TPS) is the aggregate across instances and scales with the instance count. TPS counts total tokens (input + output); the decode-only output rate is reported separately and is much smaller under these long-input workloads. DeepSeek-V4-Flash additionally ran each scenario through the MaaS gateway (API-key) ingress as well as the internal KServe ingress; see its page.
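The fixed load parameters above (concurrency 4, 240 requests, 128 output tokens) could be reproduced with a wrapper along these lines. This is a sketch, not the exact command used for the published numbers: the flag names are assumed from aiperf's `profile` subcommand and should be confirmed with `aiperf profile --help`, the chat endpoint type is an assumption, and the URL and model name are placeholders. Per-scenario input-shape flags are deliberately left out because they depend on each scenario's definition.

```shell
# Hypothetical wrapper around one closed-loop run.
# Flag names are assumed from aiperf's "profile" subcommand; verify them
# against `aiperf profile --help` for your installed version.
run_scenario() {
  local url="$1" model="$2"
  aiperf profile \
    --model "$model" \
    --url "$url" \
    --endpoint-type chat \
    --streaming \
    --concurrency 4 \
    --request-count 240 \
    --output-tokens-mean 128 \
    --output-tokens-stddev 0
}

# Example with placeholder values:
# run_scenario http://<ingress-host>:<port> <served-model-name>
```

Streaming is enabled because TTFT and ITL can only be measured on a streamed response.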
Deploy a validated model
Each model page links self-contained YAMLs under
assets/
that hold the real InferNex deployment — a KServe LLMInferenceService
(infernex.io/runtime: true) plus the two LLMInferenceServiceConfig objects
(engine template + hermes-router/EPP template) that the InferNex-Bridge reconciles
into the running instances.
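Applying one of those YAMLs and checking the three objects might look like the sketch below. The manifest path and namespace are placeholders, `deploy_model` is a hypothetical helper, and the sketch assumes the YAML already sets each object's namespace (so no `-n` is passed to `kubectl apply`).

```shell
# Hypothetical helper: apply one model's manifest, then list the objects
# that InferNex-Bridge reconciles.
# Assumes the YAML already sets each object's namespace.
deploy_model() {
  local manifest="$1" ns="$2"
  kubectl apply -f "$manifest" || return 1
  # The two config templates live in the kserve namespace.
  kubectl get llminferenceserviceconfig -n kserve
  # The service itself lives in the deployment namespace.
  kubectl get llminferenceservice -n "$ns"
}

# Example with placeholder values:
# deploy_model assets/<model>.yaml <your-namespace>
```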
Caveats
- These manifests deploy through InferNex (LLMInferenceService + InferNex-Bridge + hermes-router). The two LLMInferenceServiceConfig objects live in the kserve namespace; the LLMInferenceService lives in your deployment namespace.
- Resource keys are for Ascend 910B4 (huawei.com/Ascend910). Adjust the resource key, image, and version fields for your actual NPU model.
- The ModelCar images are public on Docker Hub under alaudadockerhub, so the manifests pull them with no credentials. Mirror them to your own registry and repoint model.uri if you prefer; the modelcar pull secret in the manifest is only needed for a private registry.
- The benchmark numbers were measured closed-loop (concurrency 4) on 8 cards. Treat them as the per-instance operating point under steady load, not a saturation ceiling.
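For the mirroring caveat, one possible approach is a registry-to-registry copy with `skopeo`. The sketch below assumes `skopeo` is installed and already logged in to the destination registry; the image names are placeholders and `mirror_modelcar` is a hypothetical helper.

```shell
# Hypothetical helper: copy one ModelCar image into a private registry.
# --all copies every architecture listed in a multi-arch manifest.
mirror_modelcar() {
  local src="$1" dst="$2"
  skopeo copy --all "docker://$src" "docker://$dst"
}

# Example with placeholder values:
# mirror_modelcar docker.io/alaudadockerhub/<image>:<tag> registry.example.com/<repo>/<image>:<tag>
```

After the copy, point model.uri in the manifest at the mirrored reference. Verifying the signature against the source image before mirroring avoids depending on whether the copy tool carries signatures along.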
Verify the ModelCar signature
The ModelCar images are signed with Cosign. Verify an
image against the published public key (cosign.pub)
before deploying:
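A minimal sketch of that check, assuming `cosign` is installed and cosign.pub has been downloaded to the current directory. The image reference is a placeholder and `verify_modelcar` is a hypothetical helper name.

```shell
# Hypothetical helper: verify one ModelCar image against the published key.
# Assumes cosign.pub is in the current directory.
verify_modelcar() {
  local image="$1"
  cosign verify \
    --key cosign.pub \
    --insecure-ignore-tlog=true \
    "$image"
}

# Example with a placeholder reference (pin by digest where possible):
# verify_modelcar docker.io/alaudadockerhub/<image>@sha256:<digest>
```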
The three signed images and their digests:
--insecure-ignore-tlog=true is required because these were signed with
--tlog-upload=false (no public transparency-log entry); verification relies on the
public key alone.