vllm.v1.worker.utils ¶
AttentionGroup dataclass ¶
Source code in vllm/v1/worker/utils.py
metadata_builders class-attribute instance-attribute ¶
metadata_builders: list[AttentionMetadataBuilder] = field(
default_factory=lambda: []
)
__init__ ¶
__init__(
backend: type[AttentionBackend],
layer_names: list[str],
kv_cache_spec: KVCacheSpec,
kv_cache_group_id: int,
metadata_builders: list[AttentionMetadataBuilder] = (
lambda: []
)(),
) -> None
create_metadata_builders ¶
create_metadata_builders(
vllm_config,
device,
kernel_block_size: int | None,
num_metadata_builders: int = 1,
)
Source code in vllm/v1/worker/utils.py
get_metadata_builder ¶
get_metadata_builder(
ubatch_id: int = 0,
) -> AttentionMetadataBuilder
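Taken together, the fields and methods above suggest a usage pattern along the following lines. This is an illustrative sketch, not documented behavior: the helper name `build_attention_group` and the way the arguments are threaded through are assumptions.

```python
from vllm.v1.worker.utils import AttentionGroup

def build_attention_group(backend, layer_names, kv_cache_spec,
                          vllm_config, device):
    # One AttentionGroup per (backend, KV cache group) pairing; the metadata
    # builders are created afterwards rather than passed to the constructor.
    group = AttentionGroup(
        backend=backend,            # a type[AttentionBackend]
        layer_names=layer_names,    # layers served by this backend
        kv_cache_spec=kv_cache_spec,
        kv_cache_group_id=0,
    )
    # One metadata builder per micro-batch (a single ubatch here).
    group.create_metadata_builders(
        vllm_config, device, kernel_block_size=None, num_metadata_builders=1
    )
    # During execution, builders are looked up by micro-batch id.
    builder = group.get_metadata_builder(ubatch_id=0)
    return group, builder
```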
add_kv_sharing_layers_to_kv_cache_groups ¶
add_kv_sharing_layers_to_kv_cache_groups(
shared_kv_cache_layers: dict[str, str],
kv_cache_groups: list[KVCacheGroupSpec],
runner_only_attn_layers: set[str] | None = None,
) -> None
Sets up KV cache sharing by reusing the allocated KV caches in kv_caches for layers that do not allocate their own KV cache, based on the mapping in shared_kv_cache_layers. These layers are also added to the corresponding KV cache group, which ensures that attention metadata is assigned to them later.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| shared_kv_cache_layers | dict[str, str] | Layer pairings for cross-layer KV sharing. If an Attention layer appears as a key in this dict, it reuses the KV cache allocated for the layer named by the corresponding value instead of allocating its own. | required |
| kv_cache_groups | list[KVCacheGroupSpec] | The KV cache groups of the model. | required |
Source code in vllm/v1/worker/utils.py
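A minimal sketch of how this helper might be called, assuming hypothetical layer names and an already-built list of KVCacheGroupSpec objects; only the call itself comes from the documented signature.

```python
from vllm.v1.worker.utils import add_kv_sharing_layers_to_kv_cache_groups

def apply_kv_sharing(kv_cache_groups):
    # Hypothetical pairing: layer 2 reuses the KV cache allocated for layer 0.
    shared_kv_cache_layers = {
        "model.layers.2.self_attn": "model.layers.0.self_attn",
    }
    # After this call, the sharing layer is appended to the KV cache group
    # containing its target layer, so it receives attention metadata later
    # even though it allocates no KV cache of its own.
    add_kv_sharing_layers_to_kv_cache_groups(
        shared_kv_cache_layers, kv_cache_groups
    )
    return kv_cache_groups
```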
bind_kv_cache ¶
bind_kv_cache(
kv_caches: dict[str, Tensor],
forward_context: dict[str, Attention],
runner_kv_caches: list[Tensor],
num_attn_module: int = 1,
) -> None
Bind the allocated KV cache to both ModelRunner and forward context so that the KV cache can be used in the forward pass.
This function:

1. Fills the ModelRunner's KV cache list (runner_kv_caches) with kv_caches.
2. Associates each attention layer in the forward_context with its corresponding KV cache in kv_caches.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
kv_caches | dict[str, Tensor] | The allocated kv_caches with layer names as keys. | required |
forward_context | dict[str, Attention] | The global forward context containing all Attention layers with layer names as keys. | required |
runner_kv_caches | list[Tensor] | The kv_cache declared by ModelRunner. | required |
Source code in vllm/v1/worker/utils.py
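A hedged sketch of calling bind_kv_cache from a model runner; the surrounding helper and the assumption that kv_caches and forward_context share layer-name keys are illustrative, while the call itself matches the signature above.

```python
import torch

from vllm.v1.worker.utils import bind_kv_cache

def bind_allocated_caches(kv_caches: dict[str, torch.Tensor],
                          forward_context) -> list[torch.Tensor]:
    # kv_caches comes from the cache allocator, keyed by layer name;
    # forward_context maps the same layer names to Attention modules.
    runner_kv_caches: list[torch.Tensor] = []
    # Step 1 fills runner_kv_caches in layer order; step 2 attaches each
    # tensor to its Attention layer so the forward pass can use it.
    bind_kv_cache(kv_caches, forward_context, runner_kv_caches)
    return runner_kv_caches
```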
is_residual_scattered_for_sp ¶
is_residual_scattered_for_sp(
vllm_config: VllmConfig, num_input_tokens: int
) -> bool
Check if the residual tensor is scattered for sequence parallelism.
The residual tensor is scattered across tensor-parallel ranks when both sequence parallelism and tensor parallelism are enabled.
This follows the same logic as SequenceParallelismPass.is_applicable_for_range():

- In full-graph compilation mode (no splitting ops, or using the inductor graph partition), SP is always applied.
- Otherwise, SP is only applied for the specific shapes listed in compile_sizes.
Source code in vllm/v1/worker/utils.py
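As a usage illustration (the helper below and the even per-rank split are assumptions, not part of the API), the returned flag tells the runner whether each tensor-parallel rank holds only a slice of the residual or the full tensor:

```python
from vllm.v1.worker.utils import is_residual_scattered_for_sp

def residual_rows_on_this_rank(vllm_config, num_input_tokens: int) -> int:
    """Hypothetical helper: how many residual rows this TP rank holds."""
    tp_size = vllm_config.parallel_config.tensor_parallel_size
    if is_residual_scattered_for_sp(vllm_config, num_input_tokens):
        # Scattered: each rank keeps roughly num_input_tokens / tp_size rows
        # (assuming the token count is padded to a multiple of tp_size).
        return num_input_tokens // tp_size
    # Not scattered: every rank keeps the full residual tensor.
    return num_input_tokens
```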
request_memory ¶
request_memory(
init_snapshot: MemorySnapshot, cache_config: CacheConfig
) -> int
Calculate the amount of memory required by vLLM, then validate that the amount of currently free memory is sufficient to satisfy it.
Source code in vllm/v1/worker/utils.py
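A minimal sketch of how the result might be consumed, assuming an init_snapshot taken before model loading and a populated CacheConfig; the wrapper and the byte-unit assumption are illustrative only.

```python
from vllm.v1.worker.utils import request_memory

def plan_gpu_memory(init_snapshot, cache_config) -> int:
    # init_snapshot: a MemorySnapshot captured before model loading;
    # cache_config: the CacheConfig carrying gpu_memory_utilization.
    # Per the description above, this validates that enough free memory is
    # available and returns the calculated requirement (assumed to be bytes).
    requested = request_memory(init_snapshot, cache_config)
    print(f"vLLM requests {requested / (1 << 30):.2f} GiB of GPU memory")
    return requested
```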
sanity_check_mm_encoder_outputs ¶
sanity_check_mm_encoder_outputs(
mm_embeddings: MultiModalEmbeddings,
expected_num_items: int,
) -> None
Perform sanity checks for the result of vllm.model_executor.models.SupportsMultiModal.embed_multimodal.
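A short usage sketch; the wrapper name and the comment about what a failed check means are assumptions, while the call itself matches the signature above.

```python
from vllm.v1.worker.utils import sanity_check_mm_encoder_outputs

def check_encoder_outputs(mm_embeddings, num_scheduled_mm_items: int) -> None:
    # mm_embeddings: the MultiModalEmbeddings returned by the model's
    # multimodal embedding path; num_scheduled_mm_items: how many multimodal
    # items (e.g. images) were scheduled for this batch.
    # Expected to fail if the returned embeddings are inconsistent with the
    # number of scheduled items (an assumption; the exact checks live in
    # vllm/v1/worker/utils.py).
    sanity_check_mm_encoder_outputs(
        mm_embeddings, expected_num_items=num_scheduled_mm_items
    )
```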