vllm.model_executor.model_loader.reload.layerwise ¶
LAYERWISE_INFO module-attribute ¶
LAYERWISE_INFO: WeakKeyDictionary[
Module, LayerReloadingInfo
] = WeakKeyDictionary()
__all__ module-attribute ¶
__all__ = [
"get_layerwise_info",
"record_metadata_for_reloading",
"initialize_layerwise_reload",
"finalize_layerwise_reload",
]
_get_original_loader ¶
Return the weight loader with any layerwise wrappers removed.
Source code in vllm/model_executor/model_loader/reload/layerwise.py
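As a rough illustration of what removing wrappers can look like, the sketch below peels nested wrappers off a callable by following the `__wrapped__` attribute that `functools.wraps` sets. This page does not say how vLLM marks its layerwise wrappers, so `wrap_loader`, `get_original`, and `base_loader` are hypothetical names, and the `__wrapped__` convention is an assumption rather than the library's mechanism.

```python
import functools
from typing import Callable


def wrap_loader(loader: Callable) -> Callable:
    """Hypothetical wrapper; functools.wraps stores the inner callable as __wrapped__."""

    @functools.wraps(loader)
    def wrapper(*args, **kwargs):
        return loader(*args, **kwargs)

    return wrapper


def get_original(loader: Callable) -> Callable:
    # Follow the __wrapped__ chain until reaching a callable that has none.
    while hasattr(loader, "__wrapped__"):
        loader = loader.__wrapped__
    return loader


def base_loader(param, weight):
    return ("loaded", param, weight)


# Two layers of wrapping are removed in one call.
twice_wrapped = wrap_loader(wrap_loader(base_loader))
original = get_original(twice_wrapped)
```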
_layerwise_process ¶
_layerwise_process(layer: Module, info: LayerReloadingInfo)
Finalize layer loading after all weights have been cached.
This function:
1. Materializes the layer onto the target device
2. Loads all cached weights
3. Runs quantization processing if applicable
4. Copies processed values back to original tensor storage
_place_kernel_tensors ¶
_place_kernel_tensors(
layer: Module, info: LayerReloadingInfo
)
finalize_layerwise_reload ¶
finalize_layerwise_reload(
model: Module, model_config: ModelConfig
)
Remove the outermost layer of weight loading wrappers.
This function should be applied after initialize_layerwise_reload to unwrap the layerwise weight loaders.
It also processes Attention/MLA layers, which must be processed after all other layers.
get_layerwise_info ¶
get_layerwise_info(layer: Module) -> LayerReloadingInfo
Get information related to restoring and layerwise processing. If no previous information exists, a new entry is constructed.
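The get-or-create behavior, combined with the weak-keyed `LAYERWISE_INFO` registry, can be sketched with the standard library alone. `Layer`, `Info`, `REGISTRY`, and `get_info` are hypothetical stand-ins for `Module`, `LayerReloadingInfo`, `LAYERWISE_INFO`, and `get_layerwise_info`; the real fields of `LayerReloadingInfo` are not shown on this page.

```python
import gc
from dataclasses import dataclass, field
from weakref import WeakKeyDictionary


class Layer:
    """Stand-in for torch.nn.Module (hypothetical; hashed by identity)."""


@dataclass
class Info:
    """Stand-in for LayerReloadingInfo; the field below is illustrative only."""

    cached: dict = field(default_factory=dict)


REGISTRY: "WeakKeyDictionary[Layer, Info]" = WeakKeyDictionary()


def get_info(layer: Layer) -> Info:
    # Construct an entry on first access, then return the same object afterwards.
    if layer not in REGISTRY:
        REGISTRY[layer] = Info()
    return REGISTRY[layer]


layer = Layer()
same_entry = get_info(layer) is get_info(layer)
entries_while_alive = len(REGISTRY)

# Dropping the last strong reference lets the weak-keyed entry disappear,
# so the registry does not keep discarded layers alive.
del layer
gc.collect()
entries_after_delete = len(REGISTRY)
```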
initialize_layerwise_reload ¶
initialize_layerwise_reload(model: Module)
Set up layerwise weight loading with deferred processing.
Must be called after record_metadata_for_reloading. This function:
1. Saves current kernel tensors for later copying
2. Restores layer parameters/buffers from metadata (on meta device)
3. Wraps weight loaders to defer processing until all weights are loaded

When all weights for a layer are loaded, the wrapped loaders will:
1. Materialize the layer onto the target device
2. Load all cached weights
3. Run quantization processing if applicable
4. Copy processed values back to original tensor storage
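The deferral idea can be sketched in plain Python: a wrapper caches each incoming weight and triggers loading plus post-processing only once every expected weight for the layer has arrived. This is a minimal sketch of the pattern, not vLLM's implementation; `DeferredLayerLoader` and its parameters are hypothetical, and device materialization and the copy back into the original storage are not modeled.

```python
from typing import Callable


class DeferredLayerLoader:
    """Hypothetical helper: cache weights for one layer, process once complete."""

    def __init__(
        self,
        expected: set[str],
        load: Callable[[str, object], None],
        process: Callable[[], None],
    ) -> None:
        self.expected = set(expected)
        self.load = load
        self.process = process
        self.cache: dict[str, object] = {}
        self.done = False

    def __call__(self, name: str, weight: object) -> None:
        # Defer: remember the weight instead of loading it right away.
        self.cache[name] = weight
        if not self.done and self.cache.keys() >= self.expected:
            # Every expected weight has arrived: load them all, then process once.
            for cached_name, cached_weight in self.cache.items():
                self.load(cached_name, cached_weight)
            self.process()
            self.done = True


events: list[str] = []
loader = DeferredLayerLoader(
    expected={"weight", "bias"},
    load=lambda name, w: events.append(f"load:{name}"),
    process=lambda: events.append("process"),
)
loader("weight", [1.0, 2.0])
events_after_first = list(events)  # nothing has been loaded yet
loader("bias", [0.5])  # completes the layer and triggers load + process
```

Processing per layer, rather than after the whole model is loaded, means only one layer's unprocessed weights need to be held at a time in this sketch.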
make_online_process_loader ¶
Create a wrapped weight loader that defers processing.
record_metadata_for_reloading ¶
record_metadata_for_reloading(model: Module)
Record layer metadata needed for later reloading.
Stores parameter and buffer metadata as meta tensors for restoration. Must be called before initialize_layerwise_reload.
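Putting the documented ordering constraints together, a reload might be driven roughly as follows. This is a sketch under assumptions: `reload_layerwise` and its `load_weights` callback are hypothetical, `record_metadata_for_reloading(model)` is assumed to have been called earlier, and the comments only restate what this page says each function does.

```python
from typing import Callable


def reload_layerwise(model, model_config, load_weights: Callable[[], None]) -> None:
    """Sketch of the documented call order for one reload.

    Assumes record_metadata_for_reloading(model) already ran, and that
    `load_weights` (hypothetical, caller-supplied) feeds the new weights
    through the model's weight loaders.
    """
    # Imported lazily so this sketch can be defined without vLLM installed.
    from vllm.model_executor.model_loader.reload.layerwise import (
        finalize_layerwise_reload,
        initialize_layerwise_reload,
    )

    # Wrap the weight loaders and restore parameters/buffers from metadata.
    initialize_layerwise_reload(model)
    # The wrapped loaders process each layer once all of its weights arrive.
    load_weights()
    # Unwrap the loaders and process Attention/MLA layers last.
    finalize_layerwise_reload(model, model_config)
```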