vllm.model_executor.model_loader.reload ¶
Layerwise weight reloading utilities for vLLM.
This module provides functionality to reload model weights layer-by-layer, which is useful for weight updates without full model reconstruction.
Limitations:

1. Composition with CPU offloading has not been implemented.
2. Reloading Attention/MLA weights (`q_scale`, `k_scale`, `v_scale`) has not been implemented.
3. Tied parameters will only reflect processing from one of the parent layers (for example, only processing from `embed_tokens` will have an effect).
4. This design assumes that the number of weights loaded from disk matches the number of weights created at model init time. This does not hold for quant methods that (1) pad weights or (2) load qkv weights into the same parameter. Both cases are non-issues for today's quant methods, but future quantizations may cause reloading to fail.
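The tied-parameter limitation (point 3) follows from weight tying sharing a single storage between two parent layers. A minimal sketch, using plain Python objects as stand-ins for vLLM layers and tensors (`Layer`, `process`, and the layer names are illustrative, not the real classes):

```python
# Toy illustration of tied parameters: two "layers" share one weight storage,
# so in-place processing applied via one parent is visible through both, and
# running it from both parents would double-apply the transform.

class Layer:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight  # may be tied (shared) with another layer

shared = [1.0, 2.0, 3.0]                 # one storage, tied between two layers
embed_tokens = Layer("embed_tokens", shared)
lm_head = Layer("lm_head", shared)       # lm_head.weight is embed_tokens.weight

def process(layer, scale):
    # In-place processing, e.g. a quantization transform.
    for i, v in enumerate(layer.weight):
        layer.weight[i] = v * scale

# Layerwise reload therefore applies processing from only one parent:
process(embed_tokens, 0.5)
assert lm_head.weight == [0.5, 1.0, 1.5]  # change visible through both parents
```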
Modules:
| Name | Description |
|---|---|
| layerwise | |
| meta | |
| sanitize | |
| torchao_decorator | |
| types | |
| utils | |
__all__ module-attribute ¶
__all__ = [
"record_metadata_for_reloading",
"initialize_layerwise_reload",
"finalize_layerwise_reload",
"set_torchao_reload_attrs",
"support_quantized_model_reload_from_hp_weights",
]
finalize_layerwise_reload ¶
finalize_layerwise_reload(
model: Module, model_config: ModelConfig
)
Remove the outermost layer of weight loading wrappers.
This function should be applied after initialize_layerwise_reload to unwrap the layerwise weight loaders.
Also processes Attention/MLA layers, which must be processed after all other layers.
Source code in vllm/model_executor/model_loader/reload/layerwise.py
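A sketch of what "remove the outermost layer of weight loading wrappers" means, using simple closures as stand-ins for vLLM's wrapped weight loaders (`base_loader`, `wrap_for_layerwise_reload`, and `finalize` are illustrative names, not the real API):

```python
# Toy wrap/unwrap demonstration: initialize wraps each weight loader and
# keeps a reference to the original so finalize can restore it.

def base_loader(param, value):
    param["data"] = value

def wrap_for_layerwise_reload(loader):
    def wrapped(param, value):
        # A real wrapper would cache the weight here for deferred processing.
        loader(param, value)
    wrapped.__wrapped__ = loader  # remember the inner loader for unwrapping
    return wrapped

def finalize(loader):
    # Remove the outermost wrapper, restoring the pre-reload loader.
    return getattr(loader, "__wrapped__", loader)

loader = wrap_for_layerwise_reload(base_loader)
assert finalize(loader) is base_loader
```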
initialize_layerwise_reload ¶
initialize_layerwise_reload(model: Module)
Set up layerwise weight loading with deferred processing.
Must be called after record_metadata_for_reloading. This function:

1. Saves current kernel tensors for later copying
2. Restores layer parameters/buffers from metadata (on meta device)
3. Wraps weight loaders to defer processing until all weights are loaded

When all weights for a layer are loaded, the wrapped loaders will:

1. Materialize the layer onto the target device
2. Load all cached weights
3. Run quantization processing if applicable
4. Copy processed values back to original tensor storage
Source code in vllm/model_executor/model_loader/reload/layerwise.py
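The deferred-processing steps above can be sketched with a toy class (an assumption-laden stand-in for vLLM's real loader machinery; `ToyLayer` and its methods are invented for illustration):

```python
# Wrapped loaders cache incoming weights; once every weight for a layer has
# arrived, the layer is "materialized", processed, and copied back into the
# original storage.

class ToyLayer:
    def __init__(self, weight_names):
        self.expected = set(weight_names)
        self.cache = {}        # weights received so far
        self.storage = {}      # stand-in for the saved kernel tensors
        self.processed = False

    def load_weight(self, name, value):
        # Wrapped loader: defer processing until the layer is complete.
        self.cache[name] = value
        if self.expected == set(self.cache):
            self._materialize_and_process()

    def _materialize_and_process(self):
        # 1. materialize, 2. load cached weights, 3. process (here: a toy
        # negation in place of quantization), 4. copy back into storage.
        for name, value in self.cache.items():
            self.storage[name] = -value
        self.processed = True

layer = ToyLayer(["w", "b"])
layer.load_weight("w", 3.0)
assert not layer.processed     # still waiting on "b"
layer.load_weight("b", 1.0)
assert layer.processed and layer.storage == {"w": -3.0, "b": -1.0}
```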
record_metadata_for_reloading ¶
record_metadata_for_reloading(model: Module)
Record layer metadata needed for later reloading.
Stores parameter and buffer metadata as meta tensors for restoration. Must be called before initialize_layerwise_reload.
Source code in vllm/model_executor/model_loader/reload/layerwise.py
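A hedged sketch of the record-then-restore idea, using plain dicts and tuples instead of torch tensors and the meta device (`record_metadata` and `restore_placeholders` are illustrative names, not the real API):

```python
# Before reload, keep only each parameter's shape and dtype; later, re-create
# data-free placeholders that the layerwise loaders can materialize once real
# weights arrive.

def record_metadata(params):
    # params: name -> (shape, dtype, data); data is dropped, metadata kept.
    return {name: (shape, dtype) for name, (shape, dtype, _data) in params.items()}

def restore_placeholders(meta):
    # Stand-ins for meta-device tensors: correct shape/dtype, no storage yet.
    return {name: {"shape": s, "dtype": d, "data": None} for name, (s, d) in meta.items()}

params = {"weight": ((4, 4), "bf16", [0.0] * 16)}
meta = record_metadata(params)
placeholders = restore_placeholders(meta)
assert meta == {"weight": ((4, 4), "bf16")}
assert placeholders["weight"]["data"] is None
```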
set_torchao_reload_attrs ¶
set_torchao_reload_attrs(
model: Module, model_config: ModelConfig
)
support_quantized_model_reload_from_hp_weights ¶
support_quantized_model_reload_from_hp_weights(
original_load_weights: FunctionType,
)
Decorator for AutoWeightsLoader.load_weights to support reloading high-precision (bfloat16/float16/float32) weights for an already quantized model. This involves restoring the weights to high precision and then quantizing them online.
Only applies to torchao-quantized models. Assumes that all model weights are loaded within a single weights iterator (cannot perform batched updates).
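The decorator's shape can be sketched as follows, with a toy symmetric int8-style quantization standing in for torchao's real kernels (the function names, the dict-based "model", and the quantization scheme are all assumptions for illustration):

```python
# Accept high-precision weights for an already quantized model, then
# re-quantize online after the original loader has run.

def support_reload_from_hp_weights(original_load_weights):
    def wrapper(model, weights_iter):
        # 1. Load the incoming high-precision weights with the original loader.
        original_load_weights(model, weights_iter)
        # 2. Online re-quantization: replace each HP weight with a
        #    (quantized values, scale) pair. All weights must arrive in the
        #    single iterator above -- batched updates are not supported.
        for name, w in list(model.items()):
            scale = max(abs(v) for v in w) / 127 or 1.0
            model[name] = ([round(v / scale) for v in w], scale)
    return wrapper

def plain_load_weights(model, weights_iter):
    for name, w in weights_iter:
        model[name] = w

load_weights = support_reload_from_hp_weights(plain_load_weights)
model = {}
load_weights(model, iter([("w", [0.5, -1.0])]))
q, scale = model["w"]
assert q == [64, -127]
```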