1. Upstream call chain for low-precision inference (vLLM -> vLLM-Ascend)

In vLLM, low-precision (quantized) inference hooks in through three methods of quant_method:

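  • create_weights: registers empty Parameters (weight / scale / offset / scale_second ...) and attaches loading metadata such as input_dim, output_dim, and weight_loader.
  • process_weights_after_loading: during weight loading, each checkpoint tensor is copied by weight_loader into the Parameter registered in create_weights. process_weights_after_loading runs once loading has finished and post-processes the weights: transposition, NZ format conversion, scale merging, and int4 packing.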
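  • apply: invokes the NPU quantized matmul operator, for example torch_npu.npu_weight_quant_batchmatmul(...), which takes care of the low-precision computation and dequantization. The exact operators depend on the scheme: the W8A8 dynamic path in the sequence diagram below quantizes activations with npu_dynamic_quant and then calls npu_quant_matmul.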
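The division of labour between the three hooks can be sketched without any framework. The following is a minimal toy, not vLLM or vLLM-Ascend code: ToyLayer, ToyInt8LinearMethod, and load_checkpoint are hypothetical names, plain Python lists stand in for tensors, and the only post-processing shown is a transpose.

```python
# Toy sketch of the three quant_method hooks. All names are hypothetical.

class ToyLayer:
    """Stand-in for a Linear layer; parameters live in a plain dict."""

    def __init__(self, in_features, out_features, quant_method):
        self.in_features = in_features
        self.out_features = out_features
        self.params = {}
        self.quant_method = quant_method
        # The layer asks its quant_method to register parameters in __init__.
        quant_method.create_weights(self)


class ToyInt8LinearMethod:
    def create_weights(self, layer):
        # Register empty placeholders; checkpoint layout is [out, in].
        layer.params["weight"] = [
            [0] * layer.in_features for _ in range(layer.out_features)
        ]
        layer.params["weight_scale"] = [1.0] * layer.out_features

    def process_weights_after_loading(self, layer):
        # One-time layout change after loading: transpose to [in, out].
        layer.params["weight"] = [list(col) for col in zip(*layer.params["weight"])]

    def apply(self, layer, x):
        w_t = layer.params["weight"]            # [in, out] integer weights
        scale = layer.params["weight_scale"]    # per-output-channel scale
        out = []
        for row in x:
            acc = [
                sum(row[k] * w_t[k][n] for k in range(layer.in_features))
                for n in range(layer.out_features)
            ]
            # Dequantize the accumulated result with the weight scale.
            out.append([a * s for a, s in zip(acc, scale)])
        return out


def load_checkpoint(layer, checkpoint):
    """Stand-in for weight_loader: copy values into the registered params."""
    for name, value in checkpoint.items():
        if name not in layer.params:
            raise KeyError(f"{name} was not registered by create_weights")
        layer.params[name] = [list(v) if isinstance(v, list) else v for v in value]


layer = ToyLayer(2, 3, ToyInt8LinearMethod())            # create_weights
load_checkpoint(layer, {"weight": [[1, 2], [3, 4], [5, 6]],
                        "weight_scale": [0.5, 0.25, 0.125]})
layer.quant_method.process_weights_after_loading(layer)  # transpose
y = layer.quant_method.apply(layer, [[1.0, 1.0]])        # forward-time call
```

The point of the split is that everything that can be done once (layout changes, packing) lives in process_weights_after_loading, so apply only performs the per-request computation.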

Of these:

  • create_weights and process_weights_after_loading both run while the LLM object is being constructed.

Concretely, the Executor calls load_model during __init__, which reaches the model runner's load_model (GPUModelRunner in vLLM, NPUModelRunner on Ascend); the model_loader then issues both calls.

EngineCore created -> Executor created -> Worker created -> runner created -> model structure initialized -> create_weights -> checkpoint weights loaded -> process_weights_after_loading
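The ordering at the tail of this chain can be made concrete with a small trace. This is a simplified sketch under the assumption that the loader builds the model, loads the checkpoint, and then visits every layer for post-processing; TracingQuantMethod, Layer, Model, and this load_model are hypothetical stand-ins, not the vLLM classes.

```python
# Records the order in which the loading-time hooks fire. Names are hypothetical.
trace = []


class TracingQuantMethod:
    def create_weights(self, layer):
        trace.append(f"create_weights:{layer.name}")

    def process_weights_after_loading(self, layer):
        trace.append(f"process_weights_after_loading:{layer.name}")


class Layer:
    def __init__(self, name, quant_method):
        self.name = name
        self.quant_method = quant_method
        # Fires while the model structure is being built.
        quant_method.create_weights(self)


class Model:
    def __init__(self, quant_method):
        self.layers = [Layer("qkv_proj", quant_method),
                       Layer("o_proj", quant_method)]

    def load_weights(self, checkpoint):
        # A real model would route each tensor through param.weight_loader.
        trace.append("load_weights")


def load_model(quant_method, checkpoint):
    model = Model(quant_method)        # 1. build structure -> create_weights
    model.load_weights(checkpoint)     # 2. copy checkpoint tensors
    for layer in model.layers:         # 3. post-process each quantized layer
        layer.quant_method.process_weights_after_loading(layer)
    return model


model = load_model(TracingQuantMethod(), checkpoint={})
```

After the call, trace lists every create_weights entry first, then load_weights, then every process_weights_after_loading entry, which matches the order in the chain above.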

  • apply runs during the model's forward pass: the model's forward cascades into each layer's forward, which in turn calls quant_method.apply.

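One coding detail is worth noting. The model invokes a layer by calling the instance directly, i.e. layer(inputs), which triggers the class's __call__ method. The __call__ implemented by the parent class nn.Module calls forward, which dispatches to the forward overridden in the subclass.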
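The dispatch can be illustrated with a stripped-down stand-in for nn.Module. This sketch omits everything the real __call__ does around forward (hooks, for instance); Module, QuantLinear, and DoublingQuantMethod are hypothetical names used only for illustration.

```python
class Module:
    """Minimal stand-in for the nn.Module call protocol (hooks omitted)."""

    def __call__(self, *args, **kwargs):
        # Calling the instance forwards to whichever forward the subclass defines.
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError


class DoublingQuantMethod:
    """Placeholder quant_method whose apply just doubles the input."""

    def apply(self, layer, x, bias=None):
        return [v * 2 for v in x]


class QuantLinear(Module):
    def __init__(self, quant_method):
        self.quant_method = quant_method

    def forward(self, x):
        # The layer's forward is where quant_method.apply is reached.
        return self.quant_method.apply(self, x)


layer = QuantLinear(DoublingQuantMethod())
out = layer([1, 2, 3])  # instance call -> __call__ -> forward -> apply
```

Writing layer(x) rather than layer.forward(x) is the conventional form in PyTorch code, since the instance call is what runs any registered hooks.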

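The sequence diagram below shows the upstream call chains through which the three quant_method methods are reached during LLM initialization and inference: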

@startuml
hide footbox
autonumber
skinparam maxMessageSize 80
skinparam sequenceMessageAlign center

actor User as user

participant "Startup chain\nLLM / Engine / Executor / Worker" as pre
participant "NPUModelRunner" as runner
participant "BaseModelLoader" as loader
participant "Concrete model class\nDeepseekV2ForCausalLM" as model
participant "Concrete layers\nQKVParallelLinear\nColumnParallelLinear" as layer
participant "AscendModelSlimConfig\nquant_config" as qconfig
participant "AscendLinearMethod\nquant_method" as qmethod
participant "AscendW8A8DynamicLinearMethod\nscheme" as scheme
participant "torch_npu ops" as npu

== LLM initialization / model loading ==

user -> pre : LLM(model=..., quantization=...)
activate pre

pre -> pre : Build EngineArgs / VllmConfig\nCreate EngineCore / Executor / Worker\nNPUPlatform registers the Ascend quant_config\nWorker.init_device() creates NPUModelRunner\nWorker.load_model()

pre -> runner : load_model()
activate runner

runner -> loader : load_model(vllm_config, model_config)
activate loader

loader -> model : initialize_model(...)\ninstantiate DeepseekV2ForCausalLM
activate model

model -> layer : __init__()\nself.qkv_proj = QKVParallelLinear(...)
activate layer

layer -> qconfig : get_quant_method(layer, prefix)
activate qconfig

qconfig -> qmethod : new AscendLinearMethod(scheme)
activate qmethod

qmethod -> scheme : bind the concrete quantization scheme\ne.g. W8A8_DYNAMIC / W4A8_DYNAMIC
activate scheme
scheme --> qmethod
deactivate scheme

qmethod --> qconfig : quant_method
deactivate qmethod

qconfig --> layer : quant_method
deactivate qconfig

layer -[#red]> qmethod : <b><color:red>quant_method.create_weights(...)</color></b>
activate qmethod
qmethod -> scheme : create_weights(...)
activate scheme
scheme --> layer : register weight / scale / offset / bias Parameters
deactivate scheme
qmethod --> layer
deactivate qmethod

layer --> model : layer initialized
deactivate layer

model --> loader : model object
deactivate model

loader -> loader : load_weights(model, model_config)\nread checkpoint tensors\nmodel.load_weights(...)\nparam.weight_loader(...)

loader -[#red]> qmethod : <b><color:red>quant_method.process_weights_after_loading(layer)</color></b>
activate qmethod
qmethod -> scheme : process_weights_after_loading(layer)
activate scheme
scheme --> layer : transpose / pack / convert to NPU-friendly format
deactivate scheme
qmethod --> loader
deactivate qmethod

loader --> runner : model ready
deactivate loader

runner --> pre : load_model done
deactivate runner

pre --> user : LLM initialization complete
deactivate pre


== generate inference / each decoding step ==

user -> pre : llm.generate(prompts, sampling_params)
activate pre

pre -> pre : add_request()\nscheduler.schedule()\nmodel_executor.execute_model()

pre -> runner : execute_model(scheduler_output)
activate runner

runner -> model : model(input_ids, positions, ...)
activate model

model -> layer : self.qkv_proj(hidden_states)
note right of layer
This is nn.Module.__call__,
which finally dispatches to
ColumnParallelLinear.forward(...)
end note

activate layer

layer -[#red]> qmethod : <b><color:red>quant_method.apply(layer, x, bias)</color></b>
activate qmethod

qmethod -> scheme : apply(layer, x, bias)
activate scheme

scheme -> npu : npu_dynamic_quant(x)
activate npu
npu --> scheme : quantized activation + scale
deactivate npu

scheme -> npu : npu_quant_matmul(x_q, weight, scale, ...)
activate npu
npu --> scheme : output
deactivate npu

scheme --> qmethod : output_parallel
deactivate scheme

qmethod --> layer : output_parallel
deactivate qmethod

layer --> model : qkv
deactivate layer

model -> model : attention / MLP / residual / norm ...

model -> layer : lm_head or other Linear layers
activate layer
layer -[#red]> qmethod : <b><color:red>quant_method.apply(...)</color></b>
activate qmethod
qmethod -> scheme : apply(...)
scheme --> qmethod : logits-related output
qmethod --> layer
deactivate qmethod
layer --> model
deactivate layer

model --> runner : hidden_states / logits
deactivate model

runner --> pre : model_output
deactivate runner

pre -> pre : compute_logits\nsample_tokens / sampler\nassemble RequestOutput

pre --> user : completion output
deactivate pre

@enduml
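The two operator calls in the apply path of the diagram (quantize the activation, then run a quantized matmul) can be approximated numerically in plain Python. This is a sketch of the arithmetic only, assuming symmetric per-token int8 quantization of activations and per-output-channel weight scales; dynamic_quant_per_token and quant_matmul are hypothetical helpers and do not reproduce the torch_npu operator signatures or kernels.

```python
def dynamic_quant_per_token(x):
    """Symmetric per-row int8 quantization: returns (int rows, per-row scales)."""
    x_q, scales = [], []
    for row in x:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the int8 symmetric range.
        x_q.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return x_q, scales


def quant_matmul(x_q, x_scales, w_q, w_scales):
    """Integer matmul, then dequantize with activation and weight scales.

    w_q is laid out [in, out]; w_scales holds one scale per output channel.
    """
    out = []
    for row, x_scale in zip(x_q, x_scales):
        acc = [
            sum(row[k] * w_q[k][n] for k in range(len(row)))
            for n in range(len(w_scales))
        ]
        out.append([a * x_scale * w_scale for a, w_scale in zip(acc, w_scales)])
    return out


x = [[127.0, -64.0]]             # one token, two input features
w_q = [[1, 2], [3, 4]]           # already-quantized weights, [in, out]
w_scales = [0.5, 0.25]

x_q, x_scales = dynamic_quant_per_token(x)
y = quant_matmul(x_q, x_scales, w_q, w_scales)
```

The input was chosen so that the activation scale is exactly 1.0 and no rounding error appears; with general inputs the result differs from the floating-point matmul by the quantization error.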

2. Low-precision implementation call chain

To be filled in.

2.1 LinearMethod

2.2 KVCacheMethod

2.3 MoEMethod

3. Enabling fused operators

To be filled in.