<?xml version='1.0' encoding='utf-8'?>
<!-- This template is for creating an Internet Draft using xml2rfc,
    which is available here: http://xml.resource.org. -->
<?xml-model href="rfc7991bis.rnc"?>  <!-- Required for schema validation and schema-aware editing -->
<!-- <?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?> -->
<!-- This third-party XSLT can be enabled for direct transformations in XML processors, including most browsers -->

<rfc
      xmlns:xi="http://www.w3.org/2001/XInclude"
      category="info"
      docName="draft-zhang-rtgwg-llmmoe-multicast-02"
      ipr="trust200902"
      obsoletes=""
      updates=""
      submissionType="IETF"
      xml:lang="en"
      tocInclude="true"
      tocDepth="4"
      symRefs="true"
      sortRefs="true"
      version="3">
  <!-- xml2rfc v2v3 conversion 2.38.1 -->
  <!-- category values: std, bcp, info, exp, and historic
    ipr values: trust200902, noModificationTrust200902, noDerivativesTrust200902,
       or pre5378Trust200902
    you can add the attributes updates="NNNN" and obsoletes="NNNN" 
    they will automatically be output with "(if approved)" -->

 <!-- ***** FRONT MATTER ***** -->

 <front>
    <!-- The abbreviated title is used in the page header - it is only necessary if the 
        full title is longer than 39 characters -->

   <title abbrev="LLM MoE Multicast">Multicast Use Case in LLM MoE</title>
    <seriesInfo name="Internet-Draft" value="draft-zhang-rtgwg-llmmoe-multicast-02"/>
    <!-- add 'role="editor"' below for the editors if appropriate -->


   <author fullname="Zheng Zhang" initials="Z" surname="Zhang">
      <organization>ZTE Corporation</organization>
      <address>
        <postal>
          <street/>
          <!-- Reorder these if your country does things differently -->

         <city></city>
          <region/>
          <code/>
          <country>China</country>
        </postal>
        <phone></phone>
        <email>zhang.zheng@zte.com.cn</email>
        <!-- uri and facsimile elements may also be added -->
     </address>
    </author>
    
    <author fullname="Wei Duan" initials="W" surname="Duan">
      <organization>ZTE Corporation</organization>
      <address>
        <postal>
          <street/>
          <!-- Reorder these if your country does things differently -->

         <city></city>
          <region/>
          <code/>
          <country>China</country>
        </postal>
        <phone></phone>
        <email>duan.wei1@zte.com.cn</email>
        <!-- uri and facsimile elements may also be added -->
     </address>
    </author>
	
	<author fullname="Xiaohu Xu" initials="X" surname="Xu">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <street/>
          <!-- Reorder these if your country does things differently -->

         <city></city>
          <region/>
          <code/>
          <country>China</country>
        </postal>
        <phone></phone>
        <email>xuxiaohu_ietf@hotmail.com</email>
        <!-- uri and facsimile elements may also be added -->
     </address>
    </author>
	
	<author fullname="Yisong Liu" initials="Y" surname="Liu">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <street/>
          <!-- Reorder these if your country does things differently -->

         <city></city>
          <region/>
          <code/>
          <country>China</country>
        </postal>
        <phone></phone>
        <email>liuyisong.ietf@gmail.com</email>
        <!-- uri and facsimile elements may also be added -->
     </address>
    </author>
    
    <date year="2026"/>
    <!-- If the month and year are both specified and are the current ones, xml2rfc will fill 
        in the current day for you. If only the current year is specified, xml2rfc will fill 
     in the current day and month for you. If the year is not the current one, it is 
     necessary to specify at least a month (xml2rfc assumes day="1" if not specified for the 
     purpose of calculating the expiry date).  With drafts it is normally sufficient to 
     specify just the year. -->

   <!-- Meta-data Declarations -->

   <area>Routing</area>
    <workgroup>RTGWG</workgroup>
    <!-- WG name at the upperleft corner of the doc,
        IETF is fine for individual submissions.  
     If this element is not present, the default is "Network Working Group",
        which is used by the RFC Editor as a nod to the history of the IETF. -->

   <keyword>LLM MoE Multicast</keyword>
    <!-- Keywords will be incorporated into HTML output
        files in a meta tag but they have no effect on text or nroff
        output. If you submit your draft to the RFC Editor, the
        keywords will be used for the search engine. -->

   <abstract>
      <t>Large Language Models (LLMs) have been widely used in recent years. 
      The Mixture of Experts (MoE) architecture is one of the features of LLMs that enables efficient inference and cost-effective training. 
      With the MoE architecture, there are potential multicast use cases, such as token dispatching. 
      This document analyzes these use cases.</t>
    </abstract>
  </front>
  <middle>
    <section numbered="true" toc="default">
      <name>Introduction</name>
      <t>In recent years, Large Language Models (LLMs) have been widely used. 
      The Mixture of Experts (MoE) architecture is one of the techniques LLMs use to achieve efficient inference and economical training. 
      Many LLMs currently adopt the MoE architecture, such as DeepSeek-V2/V3, Google Gemini 1.5 Pro, xAI Grok-1, Mixtral 8x22B, Qwen3, etc. 
      During inference, an MoE model activates only a small number of parameters to compute each output token, 
      which significantly reduces the amount of computation required by the processor: 
      the fewer parameters that are activated, the less computation the processor needs to perform. 
      In the MoE architecture, one token needs to be sent to multiple experts, which is a typical multicast use case.</t>
      
      <t>In many MoE LLMs, at least two experts are activated during computation: a routed expert and a shared expert. 
      In DeepSeek-V3, each token activates eight routed experts and one shared expert.</t>
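      
      <t>The following is a minimal, illustrative Python sketch of such top-k expert selection. 
      It is not taken from any particular model; the constants and expert names are assumptions 
      chosen to resemble the DeepSeek-V3 numbers above. It shows why each token naturally has 
      a small group of destination experts, i.e., a multicast group.</t>
      <sourcecode type="python"><![CDATA[
# Illustrative sketch only: top-k routed-expert selection plus one shared
# expert. The constants below are assumptions, not a normative configuration.
import random

NUM_ROUTED_EXPERTS = 256    # total routed experts in the model
TOP_K = 8                   # routed experts activated per token
SHARED_EXPERT = "shared-0"  # always-activated shared expert

def select_experts(gate_scores):
    """Return the TOP_K best-scoring routed experts plus the shared expert."""
    ranked = sorted(range(NUM_ROUTED_EXPERTS),
                    key=lambda e: gate_scores[e], reverse=True)
    routed = ["routed-%d" % e for e in ranked[:TOP_K]]
    return routed + [SHARED_EXPERT]

# In a real model the gating network produces the scores; random values here.
scores = [random.random() for _ in range(NUM_ROUTED_EXPERTS)]
print(select_experts(scores))   # 9 destinations for this token
]]></sourcecode>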
      
      <t>When all activated experts are located on a single node with multiple GPUs installed, only intra-node communication is required. 
      When activated experts are located on different nodes, inter-node communication is required, 
      crossing Leaf switches and even Spine switches. 
      Because inter-node bandwidth is much lower than intra-node bandwidth, this cross-switch communication is the costlier case, yet it is inevitable.</t>
      
      <t>This draft analyzes the multicast use case of LLM MoE in data centers.</t>

      <section numbered="true" toc="default">
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
       "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
       document are to be interpreted as described in <xref target="RFC2119" format="default"/>.</t>
      </section>
    </section>
    
    <section numbered="true" toc="default">
      <name>Use case - token dispatching</name>
      <figure anchor="Fig0">
        <artwork align="left" name="Figure 0" type="" alt=""><![CDATA[
            +-----------+               +-----------+
            |  Spine 1  |               |  Spine x  |
            +-+------+--+               +-+------+--+
              |      |                    |      |
              |      |                    |      +---------+
     +--------+      +--------------------|---------+      |
     |      +-----------------------------+         |      |
     |      |                                       |      |
   +-+------+-+     +----------+                  +-+------+-+
   |  Leaf 1  |     |  Leaf 2  |    ......        |  Leaf n  |
   +-+--+---+-+     +----------+                  +--+--+--+-+
     |  |   |                                        |  |  |
     |  |   +-------------------------------------+  |  |  |
     |  +----------------+                        |  |  |  |
     |                   |              +---------|--|--+  |
   +-+            +------|--------------|---------|--+     +-----+
   |              |      |              |         |              |
 +-+--+----+---+--+-+  +-+--+----+---+--+-+     +-+--+----+---+--+-+
 |GPU1|GPU2|...|GPU8|  |GPU1|GPU2|...|GPU8| ... |GPU1|GPU2|...|GPU8|
 +----+----+---+----+  +----+----+---+----+     +----+----+---+----+
        node 1                 node 2                   node m     
           ]]></artwork>
      </figure>
      
      <t>During the prefill and decoding phases, tokens need to be sent to all selected experts, 
      including routed experts and shared experts. Token dispatching can be intra-node or inter-node. 
      Different LLMs use different numbers of experts. 
      For example, Mixtral uses 8 experts and activates 2 experts at a time; 
      Llama 4 uses 16 experts (Scout) or 128 experts (Maverick) and activates 2 experts at a time; 
      DeepSeek-V3 uses 256 experts and activates 9 experts at a time. 
      The more routed experts there are, the more dispatching there is between nodes. 
      In order to balance the load across experts, it is difficult to confine the selected experts to a single node, 
      even if only two experts (one routed expert and one shared expert) are activated.</t>
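      
      <t>The following short Python sketch is purely illustrative; the expert-to-node layout 
      is an assumption (256 experts spread 8 per node over 32 nodes). It estimates how often 
      8 uniformly routed experts would all land on a single node. The probability is essentially 
      zero, which is why inter-node dispatching cannot be avoided in practice.</t>
      <sourcecode type="python"><![CDATA[
# Illustrative sketch only: estimate the chance that all routed experts of one
# token fall on a single node, under an assumed uniform expert placement.
import random

NUM_EXPERTS = 256       # assumed total routed experts
EXPERTS_PER_NODE = 8    # assumed experts hosted per node (32 nodes in total)
TOP_K = 8               # routed experts activated per token
TRIALS = 100_000

single_node = 0
for _ in range(TRIALS):
    experts = random.sample(range(NUM_EXPERTS), TOP_K)
    nodes = {e // EXPERTS_PER_NODE for e in experts}
    single_node += (len(nodes) == 1)

print("P(all experts on one node) ~= %.2e" % (single_node / TRIALS))
]]></sourcecode>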
      
      <t>Token dispatching can be optimized. 
      For example, DeepSeek-V3 first selects a group of nodes and then selects the experts from those nodes. 
      With this node-restricted routing, a maximum of four nodes is selected for each token, 
      which reduces the inter-node cost of token dispatching. 
      In addition, in order to make the most of the high intra-node bandwidth, after a switch or GPU in a node receives the tokens, 
      it needs to distribute the tokens to the experts in the same node. 
      This optimization reduces the inter-node distribution, but it cannot avoid multicast between nodes.</t>
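      
      <t>A minimal Python sketch of such node-restricted dispatching is shown below. 
      It is illustrative only: the expert placement and the routing rule are assumptions 
      loosely modeled on the description above (score the nodes, keep at most four, 
      then pick the routed experts from those nodes and send one inter-node copy per node).</t>
      <sourcecode type="python"><![CDATA[
# Illustrative sketch only: node-restricted routing for one token.
# All constants and the placement rule are assumptions for illustration.
import random
from collections import defaultdict

NUM_EXPERTS = 256      # assumed routed experts
EXPERTS_PER_NODE = 8   # assumed experts hosted per node
TOP_K = 8              # routed experts activated per token
MAX_NODES = 4          # node-restricted routing: at most 4 destination nodes

def node_of(expert):
    return expert // EXPERTS_PER_NODE

def dispatch(gate_scores):
    # Group candidate experts by node and keep the MAX_NODES best nodes,
    # scoring each node here by its single best expert (an assumption).
    per_node = defaultdict(list)
    for e, s in enumerate(gate_scores):
        per_node[node_of(e)].append((s, e))
    best_nodes = sorted(per_node, key=lambda n: max(per_node[n])[0],
                        reverse=True)[:MAX_NODES]
    # Pick the TOP_K routed experts only from the selected nodes.
    candidates = [(s, e) for n in best_nodes for (s, e) in per_node[n]]
    chosen = [e for _, e in sorted(candidates, reverse=True)[:TOP_K]]
    # One inter-node copy per destination node, then local replication.
    groups = defaultdict(list)
    for e in chosen:
        groups[node_of(e)].append(e)
    return groups

scores = [random.random() for _ in range(NUM_EXPERTS)]
for node, experts in dispatch(scores).items():
    print("node %d: replicate locally to experts %s" % (node, experts))
]]></sourcecode>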
	  
      <t>It is worth noting that even within a single inference or training run, the experts selected for a token are not fixed. 
      Each token may be sent to a different combination of experts. 
      For example, in DeepSeek-V3, each token is sent to 9 experts, and the combination may differ from token to token.</t>
      
      <t>Therefore, multicast may be used intra-node or inter-node. 
      The existing multicast implementations differ between the intra-node and inter-node scenarios, 
      which makes multicast management more difficult.</t>
      
      <section numbered="true" toc="default">
        <name>Intra-node multicast</name>
        <t>When tokens need to be sent to multiple GPUs in the same node, 
        the GPU, or the switch connected to the GPUs, may send the tokens in a multicast manner. 
        This requires the switch or GPU to support a multicast function. 
        Such a function can reduce the computational burden on the source GPU and reduce the bandwidth consumed between nodes.</t>
      </section>
      
      <section numbered="true" toc="default">
        <name>Inter-node multicast</name>
        <t>When tokens need to be sent to multiple nodes, Leaf switches and even Spine switches need to forward the tokens. 
        Because inter-node bandwidth is limited, the more packets there are, the greater the risk of congestion. 
        Using multicast can reduce the burden on the source GPU and reduce the risk of congestion.</t>
      </section>
	  
	  <section numbered="true" toc="default">
        <name>Dynamic Requirements</name>
        <t>Multicast destination selection is highly dynamic; 
        for example, in the token dispatching process described above, each token may select a different expert combination. 
        The selection takes place in a very short time, leaving no time for multicast technologies such as PIM to establish a multicast tree. 
        Therefore, a multicast technology that can meet such dynamic requirements is needed.</t>
      </section>
	  
	  <section numbered="true" toc="default">
        <name>Reliability requirements</name>
        <t>The transmission of all types of data (including tokens) used in LLM computation requires extremely high reliability. 
        Packet loss, long delays, jitter, and retransmissions during data transmission can all impact LLM computation. 
        Reliability is therefore a paramount requirement when applying multicast technology to this data transmission.</t>
        <t>If reliability is insufficient, even when most of the data reaches its destinations quickly, 
        a single destination failing to receive the data in time, due to packet loss, long latency, or excessive jitter, 
        may force the LLM computation to be restarted, significantly reducing computational efficiency.</t>
        <t>Compared to unicast, providing reliability for multicast is more complex. 
        In LLM applications in particular, packet loss, long latency, excessive jitter, 
        and retransmissions on individual multicast branches need to be avoided to minimize the impact on LLM computation.</t>
      </section>
    </section>
    
    <section numbered="true" toc="default">
      <name>Multicast technologies analysis</name>
      <t>Protocol Independent Multicast - Sparse Mode (PIM-SM) <xref target="RFC7761" format="default"/> is a traditional multicast technology. 
      It relies on PIM signaling to build the multicast tree. 
      When the receivers change, the multicast tree may need to be rebuilt. 
      When PIM is used for intra-node or inter-node multicast, the stability of the multicast tree becomes a concern, 
      and PIM may not be applicable when the expert combination changes frequently. 
      Even in the intra-node scenario, the number of potential multicast trees may be large despite the limited number of GPUs in a single node.</t>
      
      <t>BIER (Bit Index Explicit Replication) <xref target="RFC8279" format="default"/> is an architecture 
      that provides optimal multicast forwarding through a "multicast domain", 
      without requiring intermediate routers to maintain any per-flow state or to engage in an explicit tree-building protocol. 
      BIER is more flexible than PIM. 
      Experts can be numbered and can act as ingress or egress BFRs in BIER. 
      The BIER header encapsulation can be the one defined in <xref target="RFC8296" format="default"/>, 
      <xref target="I-D.ietf-bier-bierin6" format="default"/>, or <xref target="I-D.zzhang-bier-unmasked-bier" format="default"/>. 
      With the BIER function, Leaf and Spine switches, and even GPUs or the switches connected to GPUs, 
      can pre-build expert-based forwarding tables, and tokens can be sent to any selected expert.</t>
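      
      <t>As an illustration of this idea (not a BIER implementation, and the BFR-id assignment 
      is an assumption), the following Python sketch builds the BitString that an ingress could 
      place in the BIER header for one token, with each selected expert numbered as an egress BFR.</t>
      <sourcecode type="python"><![CDATA[
# Illustrative sketch only: build a BIER BitString from the selected experts,
# assuming each expert has been assigned a BFR-id from 1 to 256.
BITSTRING_LEN = 256  # bits, matching a 256-expert deployment

def bier_bitstring(expert_bfr_ids):
    """Set bit (BFR-id - 1) for each destination; return the BitString as bytes."""
    bits = 0
    for bfr_id in expert_bfr_ids:
        bits |= 1 << (bfr_id - 1)
    return bits.to_bytes(BITSTRING_LEN // 8, "big")

# One token selected experts with BFR-ids 3, 17, 42, ...: a single header
# identifies all destinations, with no per-flow tree to establish first.
print(bier_bitstring([3, 17, 42, 77, 129, 200, 211, 250, 256]).hex())
]]></sourcecode>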
      
      <t>Other multicast methods, such as PIM Dense Mode (PIM-DM) and ingress replication, 
      may consume more bandwidth and may not be a good choice for multicast scenarios such as LLM token dispatching.</t>
      
      <t>Considering dynamic requirements such as token dispatching, technologies like PIM, 
      which require the establishment of a multicast tree, are inadequate. 
      BIER, on the other hand, allows the source GPU (acting like a BFIR in BIER) to directly specify the destination expert group (acting like BFERs in BIER) 
      and encapsulate it in the packet, eliminating the multicast tree establishment time. 
      Therefore, BIER is the most suitable multicast technology. 
      Of course, before data transmission, control-plane negotiation between the source GPU and the experts is required.</t>
	  
      <t>While the network layer can provide multicast capabilities for these scenarios, 
      the multicast approach needs to work in conjunction with the LLM software. 
      It may need to cooperate with the collective communication implementation and the NIC (Network Interface Card).</t>
    </section>
    
    <section anchor="IANA" numbered="true" toc="default">
      <name>IANA Considerations</name>
      <t>There are no IANA considerations introduced by this draft.</t>
    </section>
    <section anchor="Security" numbered="true" toc="default">
      <name>Security Considerations</name>
      <t>There are no security issues introduced by this draft.</t>
    </section>
  </middle>
  <!--  *****BACK MATTER ***** -->

 <back>

   <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        <?rfc include="reference.RFC.2119.xml"?>    
        <?rfc include="reference.RFC.7761.xml"?>
        <?rfc include="reference.RFC.8279.xml"?>
        <?rfc include="reference.RFC.8296.xml"?>
      </references>
      <references>
        <name>Informative References</name>
        <?rfc include="reference.I-D.ietf-bier-bierin6.xml"?>
        <?rfc include="reference.I-D.zzhang-bier-unmasked-bier.xml"?>
      </references>
    </references>
 </back>
</rfc>
