Hi!

I have reviewed this document as part of the Operational directorate's
ongoing effort to review all IETF documents being processed by the IESG.
These comments were written primarily for the benefit of the operational
area directors. Document editors and WG chairs should treat these
comments just like any other last call comments.

This document is on the Informational track, providing mechanisms for
optimization of LAG/ECMP load-balancing.

Summary: Not ready, there are issues to solve

The document offers a thorough analysis and presents a taxonomy and
lexicon to talk about LAG/ECMP load-balancing. It also presents various
options, with the pros and cons of each mechanism. This is all very
useful and helpful.

I do have a number of questions and comments included below, categorized
as Major and Minor. I apologize in advance if I misunderstood anything
in the document that led to these questions and observations, and I hope
these are useful.

Major:

Section 4 describes the trade-offs and limitations of local-only
optimizations. However, this document describes an active (stateful)
mechanism, as opposed to a hash-based passive (stateless) mechanism.
There should be a section on Operational Considerations of stateful
LAG/ECMP load-balancing, given that monitoring flows degrades forwarding
performance, requires state maintenance, etc.

4.2. Operational Overview
...
   Step 2) The egress component links are periodically scanned for link
   utilization and the imbalance for the LAG/ECMP group is monitored. If
   the imbalance exceeds a certain imbalance threshold, then
   re-balancing is triggered. Measurement of the imbalance is discussed
   further in 5.1. Additional criteria may also be used to determine
   whether or not to trigger rebalancing, such as the maximum
   utilization of any of the component links, in addition to the
   imbalance.
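
One possible reading of Step 2, as a minimal Python sketch. It assumes
the imbalance is the largest deviation of any component link's
utilization from the speed-weighted mean. The function names, the
optional utilization cap, and the way the two criteria are combined are
illustrative assumptions, not taken from the draft. Everything here is
computed from the egress links of a single router.

```python
def imbalance(utilizations, speeds):
    """Largest deviation of a link's utilization from the weighted mean.

    utilizations: per-link utilization as a fraction (0.0 to 1.0)
    speeds: per-link speed, in any consistent unit
    """
    mean = sum(u * b for u, b in zip(utilizations, speeds)) / sum(speeds)
    return max(abs(u - mean) for u in utilizations)


def needs_rebalance(utilizations, speeds, imbalance_threshold,
                    max_util_threshold=None):
    """Decide whether one scan of the group should trigger rebalancing.

    Assumed policy: the imbalance must exceed its threshold and, if a
    utilization cap is configured, the busiest link must exceed the cap.
    """
    if imbalance(utilizations, speeds) <= imbalance_threshold:
        return False
    if max_util_threshold is not None:
        return max(utilizations) > max_util_threshold
    return True


# Three equal-speed links, one much busier than the others.
links = [0.9, 0.3, 0.3]
speeds = [10, 10, 10]
print(needs_rebalance(links, speeds, imbalance_threshold=0.2))
```

The sketch only covers the trigger decision. It says nothing about how
the thresholds should be chosen.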

If the egress component links of an ECMP are measured, but those are in
different routers, how is this a local-only method, and how is the loop
closed and "rebalancing required" notified? Take for example:

      +--B
   A==+
      +--C

If B and C measure imbalance, how do they know they belong to the same
ECMP?

The doc says:

   All of the steps identified above can be done locally within the
   router itself or could involve the use of a central management
   entity.

But I am not sure how some of these are done locally only, and the
"central management entity" also seems underspecified.

5.1. Configuration Parameters for Flow Rebalancing
...

Also, this paragraph and the document as a whole define a number of
variables like the "imbalance threshold", the "max utilization of any
component links", etc. From an operational perspective: how are those
values set? What are their defaults? What are appropriate ranges and
values? Section 5 describes the parameters nicely, but does not give
guidance on default values and ranges.

4.3. Large Flow Recognition

4.3.1. Flow Identification

   A flow (large flow or small flow) can be defined as a sequence of
   packets for which ordered delivery should be maintained. Flows are
   typically identified using one or more fields from the packet
   header, for example:

   . Layer 2: source MAC address, destination MAC address, VLAN ID.

   . IP header: IP Protocol, IP source address, IP destination address,
     flow label (IPv6 only), TCP/UDP source port, TCP/UDP destination
     port.

Are these only applicable to TCP and UDP traffic? I think there needs to
be a more exhaustive list of transports for this to be useful.

   For tunneling protocols like Generic Routing Encapsulation (GRE)
   [RFC 2784], Virtual eXtensible Local Area Network (VXLAN) [VXLAN],
   Network Virtualization using Generic Routing Encapsulation (NVGRE)
   [NVGRE], Stateless Transport Tunneling (STT) [STT], etc., flow
   identification is possible based on inner and/or outer headers.

Please add L2TPv3 as a key tunneling protocol.

Also, for tunneling protocols, there is a lot more than that. Yes, inner
or outer headers. BUT there is typically also the tunnel header itself.
For example, GRE Key, L2TPv3 Session ID, etc. Sometimes, these summarize
a flow decision.

You might also want to look at (and reference) RFC 5640, "Load-Balancing
for Mesh Softwires".

4.3.2. Criteria and Techniques for Large Flow Recognition

   From a bandwidth and time duration perspective, in order to
   recognize large flows we define an observation interval and observe
   the bandwidth of the flow over that interval. A flow that exceeds a
   certain minimum bandwidth threshold over that observation interval
   would be considered a large flow.

From an operational standpoint, it appears these techniques are
under-specified. As for these thresholds, time intervals, etc.: how are
those configured? What are the defaults? What are appropriate ranges?

Sections 4.3 and 4.4 present different techniques for sampling and
re-balancing, respectively. The analyses are very useful. It would be
really helpful to have a table summarizing all the different options and
their associated pros and cons, and perhaps some applicability-based
recommendations.

5.2. System Configuration and Identification Parameters
...

How are those parameters (besides an IP address) defined? What is a "LAG
ID"? A UTF-8 string? A 64-bit unsigned integer?

5.3. Information for Alternative Placement of Large Flows

See comment above regarding transport protocols and tunnels.

5.6. Monitoring information

5.6.1. Interface (link) utilization

   The incoming bytes (ifInOctets), outgoing bytes (ifOutOctets) and
   interface speed (ifSpeed) can be measured from the Interface table
   (iftable) MIB [RFC 1213].

Why are these algorithms using MIBs only?

Minor:

I think it is confusing to refer to "short-lived large flows" as "small
flows". In fact, I think it is potentially very confusing. I'd recommend
creating a new term.
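
The terminology issue above follows from how Section 4.3.2 defines a
large flow: by its bandwidth averaged over an observation interval. A
minimal Python sketch of that rule is given here, assuming per-flow byte
counters collected over one interval. The function name, the flow keys,
and the numbers are hypothetical and only illustrate the point.

```python
def classify_large_flows(byte_counts, interval_s, min_bandwidth_bps):
    """Return the flow keys whose average rate exceeds the threshold.

    byte_counts: dict mapping a flow key to bytes seen in the interval
    interval_s: observation interval length, in seconds
    min_bandwidth_bps: minimum average bandwidth for a "large" flow
    """
    large = set()
    for key, nbytes in byte_counts.items():
        average_bps = nbytes * 8 / interval_s
        if average_bps > min_bandwidth_bps:
            large.add(key)
    return large


# Flow "a" sends at 1 Gb/s for the whole 10 s interval.
# Flow "b" sends at 1 Gb/s for only 1 s of it, then goes quiet.
counts = {"a": 1_250_000_000, "b": 125_000_000}
print(classify_large_flows(counts, interval_s=10, min_bandwidth_bps=500e6))
```

Under these assumed numbers, flow "b" averages 100 Mb/s over the
interval and is not classified as large, even though it briefly ran at
the same rate as flow "a". Whether a short burst counts as large
therefore depends on the chosen interval and threshold.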

The introduction gives a number of figures (5% link bandwidth, 10s/100s
of flows, etc.), but from an operational standpoint it is not clear how
those potentially vary or whether they are tied to a specific set of use
cases. Further, it is not clear how those can influence the different
algorithms. Maybe the answer is to put caps on them, or something else,
but it would help to be more prescriptive about applicability.

1.2. Terminology

   ECMP table: A table that is used as the nexthop of an ECMP route
   that comprises the set of component links and the weights associated
   with each of those component links. The weights are used to
   determine which values of the hash function map to a given component
   link.

It is not clear what the "weights" are if this is ECMP and not UCMP (U
for Unequal). Also, "a table used as the next hop" is confusing.

   LAG table: A table that is used as the output port which is a LAG
   that comprises the set of component links and the weights associated
   with each of those component links.

What is the input? Or, what is the LAG table associated with (i.e., not
a route)?

   Figure 2: Unevenly Utilized Component Links

I am not sure how realistic the example in Section 3, Figure 2 is, if
only two flows congest a member link...

4. Mechanisms for Optimizing LAG/ECMP Component Link Utilization

   The suggested mechanisms in this draft are about a local
   optimization solution; they are local in the sense that both the
   identification of large flows and re-balancing of the load can be
   accomplished completely within individual nodes in the network
   without the need for interaction with other nodes.

It is not clear to me how a local-only node can deal with node
polarization in ECMP networks. A small explanation of this could help.

   . Component Link Weight: The relative weight to be applied to
     traffic for a given component link when using hash-based
     techniques for load distribution.

Is this for ECMP or UCMP?

11. References

11.1. Normative References

11.2. Informative References

I would have expected many of these references to be Normative (i.e.,
needed to understand the document). Yes, the doc is Informational, but
the distinction between Normative and Informative still applies.

Hope this helps.

Thanks,

-- Carlos.