Qwen3-32B

Standard Qwen3 dense model, 32B parameters, served in BF16 (~62 GB across four cards). Validated on Ascend 910B4 (32 GB/card) with vLLM-Ascend v0.18.0 through Alauda AI's InferNex surface. The validated 8-card topology is TP=4 × 2 replicas, benchmarked in two deployment specsagg-base (load-balancing) and agg-mc-kv (cross-instance KV cache store + KV-cache-aware routing).

Model identity

FieldValue
PublisherQwen (Alibaba)
ArchitectureQwen3 dense (standard attention, no MTP)
Parameters32B
Native dtypeBF16 (~62 GB)
Model sourcehttps://huggingface.co/Qwen/Qwen3-32B

Validated hardware × stack

PlatformEngineVersion / configStatus
Ascend 910B4 32 GB × 8 (TP=4 × 2 replicas)vLLM-Ascendv0.18.0-openeuler (V1 engine)✅ closed-loop, 2-scenario perf, two deployment specs (agg-base + agg-mc-kv)

Model configuration

The vLLM-Ascend engine flags that define the serving envelope (identical for both deployment specs):

ParameterValue
Tensor parallelism (tensor-parallel-size)4
Replicas (instances)2 (= 8 cards)
max-model-len40960
max-num-batched-tokens40960
gpu-memory-utilization0.8
block-size128
max_tokens (output, benchmark-pinned)128
Prefix cachingenabled (--enable-prefix-caching)

Deployment specs

Both specs run on the same 8 cards (2 × TP=4) through InferNex (KServe LLMInferenceService + InferNex-Bridge + hermes-router). agg-mc-kv adds a cross-instance KV cache store and KV-cache-aware routing on top of agg-base:

Componentagg-baseagg-mc-kv
hermes-router (EPP)✅ started✅ started
Routing strategyrandom (load-balancing)kv-cache-aware
cache-indexer
mooncake KV store (AscendStoreConnector)
tokenizer sidecar
Cross-instance KV reusenoyes

Deploy

Self-contained InferNex manifests (each bundles the engine + hermes-router LLMInferenceServiceConfig and the LLMInferenceService, 2 replicas × TP=4):

SpecFile
agg-base (load-balancing)qwen3-32b-agg-base-llmisvc.yaml
agg-mc-kv (KV store + KV-cache-aware routing)qwen3-32b-agg-mc-kv-llmisvc.yaml
base=https://raw.githubusercontent.com/alauda/aml-docs/master/docs/en/inference_guide/assets/qwen3-32b
# edit namespace / model.uri registry / image tag first, then:
kubectl apply -f $base/qwen3-32b-agg-base-llmisvc.yaml

Benchmark results

Closed-loop aiperf 0.7.0, TP=4 × 2 replicas (8 × 910B4), concurrency 4. Two scenarios — ① 8k system-prompt reuse and ② 17.5k multi-turn — each 240 requests, output pinned to 128 tokens. TTFT / E2E in ms, ITL in ms, TPS = total tokens/s (input + output).

Scenario ① — fixed-length system-prompt reuse (ISL ~8k / OSL 128)

Deployment specTTFT avg (ms)ITL avg (ms)E2E avg (ms)TPS (in+out)
agg-base132871.4103913120
agg-mc-kv44568.791653558

Scenario ② — multi-turn dialogue (ISL ~17.5k / OSL 128)

Deployment specTTFT avg (ms)ITL avg (ms)E2E avg (ms)TPS (in+out)
agg-base461295.9167914270
agg-mc-kv166377.3114816498
NOTE

How to read these. All runs completed 240/240 with zero errors at a steady 2 in-flight requests per instance. agg-base numbers are a single representative run; agg-mc-kv is the mean of 3 runs.

The value of agg-mc-kv shows up mainly in TTFT (first-token latency): the cross-instance KV store lets a request reuse prefix KV that another instance already computed, skipping a re-prefill. The effect is largest on the multi-turn scenario (TTFT 4612 → 1663 ms, E2E 16.8 → 11.5 s, TPS +52%), where each turn shares a long growing prefix. Within those 3 runs scenario ② has a wide cold/warm spread: the first (cold) run pays the full prefill of the 16k base (TTFT ~3.5 s), while the warm runs reach TTFT ~0.76 s and E2E ~9.9 s — the mean above blends both, and warm steady state is the operating point you see in sustained serving. Decode-step latency (ITL) is roughly the same between specs — it is not what the KV store accelerates.

At this 2-instance scale most of the agg-mc-kv gain comes from the KV store; isolating the routing contribution would need a separate A/B, and the store's value grows with more instances (more cross-instance reuse opportunities). These are half-scale numbers (8 cards); aggregate throughput scales with the instance count.