vllm.model_executor.layers.fused_moe.shared_fused_moe ¶
SharedFusedMoE ¶
Bases: FusedMoE
A FusedMoE operation that also computes the results of shared experts. If an all2all communicator is being used, the shared expert computation can be interleaved with the fused all2all dispatch communication step.
Source code in vllm/model_executor/layers/fused_moe/shared_fused_moe.py
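The interleaving idea can be sketched roughly as follows. This is a conceptual illustration only, not the vLLM implementation; the class and module names (`TinySharedMoE`, `shared_mlp`, `routed_moe`) are hypothetical stand-ins, and the real class overlaps the shared-expert computation with the all2all dispatch rather than running the two paths back to back.

```python
import torch
import torch.nn as nn


class TinySharedMoE(nn.Module):
    """Conceptual sketch: a MoE layer that also evaluates shared experts."""

    def __init__(self, shared_mlp: nn.Module, routed_moe: nn.Module):
        super().__init__()
        self.shared_mlp = shared_mlp  # dense experts applied to every token
        self.routed_moe = routed_moe  # sparse, token-routed experts

    def forward(
        self, hidden_states: torch.Tensor
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # In vLLM, when an all2all communicator is active, the shared-expert
        # computation can run while routed tokens are being dispatched.
        # Here the two paths simply run sequentially for clarity.
        shared_out = self.shared_mlp(hidden_states)
        routed_out = self.routed_moe(hidden_states)
        return shared_out, routed_out
```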
use_overlapped instance-attribute ¶
use_overlapped = (
use_overlapped
and not (
enable_eplb
and backend != "allgather_reducescatter"
or use_fi_all2allv_kernels
)
and _shared_experts is not None
)
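For readability, the gating condition above can be restated as a standalone function. This is a hedged paraphrase of the attribute expression (the argument names are assumptions mirroring the fields it reads), making the operator precedence explicit: overlap is disabled when EPLB is enabled on any backend other than "allgather_reducescatter", or when FlashInfer all2allv kernels are in use, or when there are no shared experts.

```python
def resolve_use_overlapped(
    use_overlapped: bool,
    enable_eplb: bool,
    backend: str,
    use_fi_all2allv_kernels: bool,
    has_shared_experts: bool,
) -> bool:
    # `and` binds tighter than `or`, so the incompatibility check groups as:
    # (enable_eplb and backend != "allgather_reducescatter") or use_fi_all2allv_kernels
    incompatible = (
        enable_eplb and backend != "allgather_reducescatter"
    ) or use_fi_all2allv_kernels
    return use_overlapped and not incompatible and has_shared_experts
```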
__init__ ¶
__init__(
shared_experts: Module | None,
gate: Module | None = None,
use_overlapped: bool = True,
routed_input_transform: Module | None = None,
**kwargs,
)
Source code in vllm/model_executor/layers/fused_moe/shared_fused_moe.py
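A construction sketch is shown below. The shared-experts module here is an arbitrary stand-in MLP, and the keyword arguments forwarded to the FusedMoE base class (num_experts, top_k, hidden_size, intermediate_size) are assumptions about a typical configuration, not a verified signature.

```python
from torch import nn

from vllm.model_executor.layers.fused_moe.shared_fused_moe import SharedFusedMoE

# Stand-in for a model's shared-expert MLP; any nn.Module works here.
shared_mlp = nn.Sequential(
    nn.Linear(4096, 11008, bias=False),
    nn.SiLU(),
    nn.Linear(11008, 4096, bias=False),
)

moe = SharedFusedMoE(
    shared_experts=shared_mlp,
    gate=None,            # optional router gate module
    use_overlapped=True,  # may still be disabled by the checks described above
    # Assumed FusedMoE kwargs for illustration only:
    num_experts=64,
    top_k=6,
    hidden_size=4096,
    intermediate_size=1408,
)
```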
apply_routed_input_transform ¶
Apply transform for routed experts (e.g., latent projection).
This is called by FusedMoE.forward_native. The original hidden_states is saved separately so shared experts get [S, hidden_size] while routed experts get the transformed [S, moe_latent_size].
TODO: For latent MoE bandwidth optimization, fc2_latent_proj could be moved inside SharedFusedMoE to all-reduce on the smaller latent dimension.
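The split described above can be illustrated with a small sketch: the routed path sees a latent-projected tensor while shared experts keep the full hidden size. The projection module and the moe_latent_size value are illustrative assumptions, not the vLLM API.

```python
import torch
import torch.nn as nn

hidden_size, moe_latent_size, S = 4096, 1024, 8
hidden_states = torch.randn(S, hidden_size)

# Hypothetical routed_input_transform: a down-projection to the latent dimension.
routed_input_transform = nn.Linear(hidden_size, moe_latent_size, bias=False)

shared_input = hidden_states                          # shared experts: [S, hidden_size]
routed_input = routed_input_transform(hidden_states)  # routed experts: [S, moe_latent_size]

assert shared_input.shape == (S, hidden_size)
assert routed_input.shape == (S, moe_latent_size)
```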