Reviewer: Toerless Eckert

Summary: The purpose of the document is to extend the BGP message signaling and local router procedures for failover of "Designated Forwarders" for pseudowires using calculated future timestamps and expecting clock synchronization across the forwarders, so that after receipt of the BGP message, the switchover can be handled autonomously by every node as synchronously as desired and allowed for by the clock synchronization method used.

Review result: On The Right Track

I am the assigned IOTDIR reviewer.

I found the document well written and easy to read, except for some typos, other nits and some logical description gaps. (Unfortunately?) I find the approach of the draft very useful, and i always wished we had been able to build this in other IETF protocol domains (IP multicast), so i happen to have a range of technical concerns and suggestions primarily around the completeness of the document's methods and detail specifications, which i hope will be helpful to improve the quality of the text and the usefulness of the solution.

The following is a list of G.i general comments followed by the commented idnits version of the draft.

Thank you very much for the work!
Toerless Eckert

General comments:

G.1 minor: Why IOTdir review ?

I am a bit puzzled why this draft was given to IOTdir for early review. Neither the draft nor the RFCs it references mention IoT. And the mentioned pseudowire use-cases are all around DataCenter. So i wonder what specific IoT feedback the authors/WG are looking for. If there actually is a specific type of use-case for IoT with this technology, then it would be great to mention it.

G.2 minor/suggestion: HRW has known problems

HRW was popularized and (in)validated in deployments of PIM-SM since 1995 and hence rfc2362 way before HRW1998 was written, but of course not credited in RFC8584.
I would nevertheless like to point out that the IP Multicast community in the IETF had some run-ins with operators over the decades who were disappointed by its non-equal distribution in face of specific typical sets of parameters such as consecutive or close to each other router-IDs. Of course, the parameters used in EVPN are different, and i have not tried to validate if or how such deployment specific anomalies would or could equally apply to the EVPN version, but i would strongly suggest to be aware that HRW is by far not a well randomizing algorithm, especially for the order of the input parameters. HRW is now probably 30 years old, and maybe EVPN may want to look into newer, and supposedly better algorithms such as MurmurHash (which was a recommendation from a math geek colleague even 15 years ago - and other proposals in the IETF are picking up on it too).

G.3 minor/question: Please consider adding ordered shutdown support

If my understanding of RFC7432/RFC8584 and this draft is correct, the interruption in case of an ordered shutdown of a DF is as large as that of an unexpected shutdown/service interruption (without the detection of interruption of course). I think this is not necessary. I think it would be great if this draft could add support for the synchronized switchover in case of ordered shutdown of a DF, because such procedures likely constitute a large number of outages in daily operations of larger networks.

For example, the new extended community could have a flag indicating such an ordered shutdown so that the indicated SCT will trigger synchronized failover to the BDF (Backup DF). And only after the failover has happened would the primary DF send out the NLRI withdraw route and finish the shutdown operation.

G.4 major: analysis of actual failover behavior

The mechanism of this draft seems to aspire through synchronized switchover to achieve a switchover interruption in the order of 10 msec (the skew default value).
I am worried that in the face of a large number of failovers (because of a large number of VLAN/ES services), the interruption becomes larger and that it will be inconsistent across different services.

The way i imagine the failover to operate (from similar failovers in other technologies like multicast), a router may fairly quickly be able to generate the SCT carrying routes, so there can be a burst of SCT routes all with the same SCT. When those SCT then actually expire both on the sending and receiving router, the speed at which they are added/deleted in hardware-forwarding will depend on the performance of updating hardware forwarding registers, which may be inconsistent across different routers. It is also not clear to me if the BGP infrastructure or other factors can or can not introduce any reordering. But if for example we have a thousand routes that need to be updated, and one router can update 1000 routes/sec and the other can update 2000 routes/sec, then the faster one will be done after half a second, the slower one after one second - no reordering assumed.

So it would be very helpful to have some idea about the maximum imaginable scalability required and likely min/max performances to vet the impact of this candidate issue.

There is of course a way to overcome this issue, which is to generate SCT that take the performance of (de)installation of hardware forwarding entries into account, for example by assuming some floor performance and generating SCT for such a burst of service routes with timestamps increasing such that when they will be executed, they will stay under such a performance floor. Aka: Have a difference of e.g.: 4 msec between each route, in result creating no more than 250 SCT updates/second.

In any case, it would be great if the great target goal of this draft - less than 10 msec interruption - would not be invalidated by such real-world performance impacts if it actually is easy to overcome it with a bit of additional text in the draft.
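To make the staggering idea above concrete, here is a minimal Python sketch. The function name `staggered_scts`, the millisecond integer timescale, the 3 second base delay and the 4 msec spacing are assumptions chosen for illustration only; none of them is specified by the draft.

```python
from typing import List


def staggered_scts(now_ms: int, n_routes: int,
                   base_delay_ms: int = 3000,
                   spacing_ms: int = 4) -> List[int]:
    """Return one SCT (in milliseconds) per ES route in a burst.

    Hypothetical helper: route i is scheduled spacing_ms after route
    i-1, so no more than 1000 / spacing_ms carvings are due in any
    one second (250 per second with the assumed 4 msec spacing).
    """
    return [now_ms + base_delay_ms + i * spacing_ms
            for i in range(n_routes)]


# Example burst: 1000 routes generated at t = 100 s.
scts = staggered_scts(now_ms=100_000, n_routes=1000)
# Time between the first and the last scheduled carving, in msec.
spread_ms = scts[-1] - scts[0]
```

With these assumed values, the last route of a 1000 route burst is carved roughly 4 seconds after the first one, so the sketch trades a longer overall convergence time for a bounded update rate on each router.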
G.5 major: Behavior upon non-synchronization.

I think the draft should do more due-diligence in its text for various conditions of non-correct time synchronization between devices. Let us first agree on the conditions and general direction, and then i am happy to propose text if it makes sense to the WG.

a) A router can and then should validate the state of synchronization of its clock (in NTP for example this is typically possible via some management API, not sure if there is already a YANG model). When restarting, the router may detect that its clock is not synchronized to a necessary degree of accuracy yet. Minimum required synchronization accuracy should be configurable, default maybe 3 msec. In this case the router would wait until the synchronization is sufficient up to a maximum time period (configurable, default maybe 30 seconds). If synchronization is not sufficient then, revert to behave as a non-draft compliant router - and upgrade later on if and when synchronization is successful.

b) A router which is aware that it is correctly synchronized is receiving an SCT update from another router which did not correctly recognize its own synchronization failure (e.g.: does not have the API to validate its local clock being synchronized). This condition might warrant a flag bit in the route updates, if feasible. To discover and work around this condition, routers will perform a plausibility check on received SCT timestamps, e.g.: validate that the received timestamp is within a reasonable window around the local (synchronized) clock at the time of reception of the SCT carrying route: at least one second from current clock, at most the configured interval (default 3 seconds), plus extensions, such as some seconds if concern G.4 is taken into account.
If the received SCT is out of bounds, then the receiving router would raise some error condition and perform some fallback failover, e.g.: within 3 seconds from reception (to avoid that failover would happen at an inappropriately long time in the future, or immediately when the SCT is in the past).

G.6 minor: some suggested NTP operational text

The following is proposed text for some NTP clock synchronization operational considerations sections including only G.5 suggestion a). But also other aspects crucial for successful deployment.

----

While the use of a synchronized clock between the participating routers makes the solution itself very simple and accurate, it does introduce a new potentially large and complex dependency against the clock synchronization mechanism used.

Because of the use of NTP timestamps, it is not possible to build really lightweight and autonomously operating clock synchronization systems. Instead, one will likely need to create an operational dependency against a clock source with automated inclusion of complexities, specifically the leap seconds, which includes satellite clock sources (Beidou, Galileo, GLONASS or GPS), or terrestrial ones (DCF77, WWVB, MSF or JJY).

If this dependency is operationally already established for other purposes, then the mechanism of this document does not provide incremental requirements except maybe for the required accuracy. Otherwise the requirements to operate the clock synchronization need to be analyzed.

For the mechanism of this document to provide the desired benefit, synchronization to within a few milliseconds (5) or less is required, so that the skew is sufficient to separate the break DF times from the make DF times. This should in general not be a problem to achieve with minimal NTPv4 installations that are aware of common pitfalls as follows.
When a router restarts, initial synchronization to other NTP server(s) is sped up if the router has a local battery backed RTC clock from which it can derive a starting time as well as the capability to step the clock to quickly synchronize to the other NTP server(s). If either is not possible, synchronization may take more than a few seconds after reboot and it may be desirable to delay bringing up DF functionality until the desired accuracy of clock synchronization is achieved.

Synchronization across WAN links can be subject to asymmetric latency, which can be as high as some msec, such as for pseudowires across transcontinental connectivity between backup DCs. Clock synchronization protocols can not automatically figure out such asymmetric propagation latencies. If deployments with such asymmetric latencies are required, the clock synchronization protocol needs to have options to learn about such asymmetries, such as through configuration.

G.7 minor: make before break instead of break before make

I think that it would make sense to define skew as configurable and explicitly point to the option of making it positive so as to achieve "make before break" functionality, e.g.: making the recovering router become DF slightly before the withdrawing router. I can think of several types of customer services that can better deal with duplicates than with even short term losses. And unless i am overlooking some looping issues in the broadcast domains (which i likely may), the only reason to do break before make is IMHO services where the simultaneous sending will result in overload. But whenever a service has a low rate of actual user traffic, most applications will prefer a few duplicates over a few lost packets.

--
The following is idnits output to have line numbers. issues/discussions from the review have no line numbers.
------
draft-ietf-bess-evpn-fast-df-recovery-09.txt: Showing Errors (**), Flaws (~~), Warnings (==), and Comments (--).
Errors MUST be fixed before draft submission. Flaws SHOULD be fixed before draft submission. Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Running in submission checking mode -- *not* checking nits according to https://www.ietf.org/id-info/checklist . ---------------------------------------------------------------------------- No nits found. -------------------------------------------------------------------------------- 2 BESS Working Group P. Brissette, Ed. 3 Internet-Draft A. Sajassi 4 Updates: 8584 (if approved) LA. Burdet 5 Intended status: Standards Track Cisco 6 Expires: 9 January 2025 J. Drake 7 Independent 8 J. Rabadan 9 Nokia 10 8 July 2024 12 Fast Recovery for EVPN Designated Forwarder Election 13 draft-ietf-bess-evpn-fast-df-recovery-09 15 Abstract 17 The Ethernet Virtual Private Network (EVPN) solution provides 18 Designated Forwarder (DF) election procedures for multihomed Ethernet 19 Segments. These procedures have been enhanced further by applying 20 Highest Random Weight (HRW) algorithm for Designated Forwarder 21 election in order to avoid unnecessary DF status changes upon a 22 failure. This document improves these procedures by providing a fast 23 Designated Forwarder election upon recovery of the failed link or 24 node associated with the multihomed Ethernet Segment. This document 25 updates Section 2.1 of [RFC8584] by optionally introducing delays 26 between some of the events therein. 
28 The solution is independent of the number of EVPN Instances (EVIs) 29 associated with that Ethernet Segment and it is performed via a 30 simple signaling between the recovered node and each of the other 31 nodes in the multihoming group. 33 Status of This Memo 35 This Internet-Draft is submitted in full conformance with the 36 provisions of BCP 78 and BCP 79. 38 Internet-Drafts are working documents of the Internet Engineering 39 Task Force (IETF). Note that other groups may also distribute 40 working documents as Internet-Drafts. The list of current Internet- 41 Drafts is at https://datatracker.ietf.org/drafts/current/. 43 Internet-Drafts are draft documents valid for a maximum of six months 44 and may be updated, replaced, or obsoleted by other documents at any 45 time. It is inappropriate to use Internet-Drafts as reference 46 material or to cite them other than as "work in progress." 48 This Internet-Draft will expire on 9 January 2025. 50 Copyright Notice 52 Copyright (c) 2024 IETF Trust and the persons identified as the 53 document authors. All rights reserved. 55 This document is subject to BCP 78 and the IETF Trust's Legal 56 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 57 license-info) in effect on the date of publication of this document. 58 Please review these documents carefully, as they describe your rights 59 and restrictions with respect to this document. Code Components 60 extracted from this document must include Revised BSD License text as 61 described in Section 4.e of the Trust Legal Provisions and are 62 provided without warranty as described in the Revised BSD License. 64 Table of Contents 66 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 67 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3 68 1.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 3 69 1.3. Challenges with Existing Mechanism . . . . . . . . . . . 3 70 1.4. Design Principles for a Solution . . . . . . . . . . . . 
5 71 2. DF Election Synchronization Solution . . . . . . . . . . . . 5 72 2.1. BGP Encoding . . . . . . . . . . . . . . . . . . . . . . 6 73 2.2. Updates to RFC8584 . . . . . . . . . . . . . . . . . . . 7 74 3. Synchronization Scenarios . . . . . . . . . . . . . . . . . . 8 75 3.1. Concurrent Recoveries . . . . . . . . . . . . . . . . . . 10 76 4. Backwards Compatibility . . . . . . . . . . . . . . . . . . . 11 77 5. Security Considerations . . . . . . . . . . . . . . . . . . . 11 78 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 12 79 7. Normative References . . . . . . . . . . . . . . . . . . . . 12 80 Appendix A. Contributors . . . . . . . . . . . . . . . . . . . . 13 81 Appendix B. Acknowledgements . . . . . . . . . . . . . . . . . . 13 82 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 14 84 1. Introduction 86 The Ethernet Virtual Private Network (EVPN) solution [RFC7432] is 87 becoming pervasive in data center (DC) applications for Network 88 Virtualization Overlay (NVO) and DC interconnect (DCI) services, and 89 in service provider (SP) applications for next generation virtual 90 private LAN services. nit: If there is any IoT use, please mention nit: "pervasive" is a bold statement. I do not know enough to support or doubt it, but if there was any reference you could add to support the claim, then it would make it stronger. Else maybe tone it down ("widely used")... 92 [RFC7432] describes Designated Frowarder (DF) election procedures for ^ typo 93 multihomed Ethernet Segments. These procedures are enhanced further 94 in [RFC8584] by applying the Highest Random Weight (HRW) algorithm nit: please add the HRW1998 reference as used in RFC8584 as reference for the term HRW and include it here. 95 for DF election in order to avoid unnecessary DF status changes upon 96 a link or node failure associated with the multihomed Ethernet 97 Segment. 
This document makes further improvements to the DF election nit: insert paragraph break before "This" (background -> contribution). 98 procedures in [RFC8584] by providing an option for a fast DF election 99 upon recovery of the failed link or node associated with the 100 multihomed Ethernet Segment. This DF election is achieved 101 independent of the number of EVPN Instances (EVIs) associated with 102 that Ethernet Segment and it is performed via straightforward 103 signaling between the recovered node and each of the other nodes in 104 the multihomed group. 105 This document updates the DF Election Finite State Machine (FSM) 106 described in Section 2.1 of [RFC8584], by optionally introducing 107 delays between some events, as further detailed in Section 2.2. The 108 solution is based on a simple one-way signaling mechanism. 110 1.1. Requirements Language 112 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 113 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 114 "OPTIONAL" in this document are to be interpreted as described in BCP 115 14 [RFC2119] [RFC8174] when, and only when, they appear in all 116 capitals, as shown here. 118 1.2. Terminology 120 PE: Provider Edge device. 122 Designated Forwarder (DF): A PE that is currently forwarding 123 (encapsulating/decapsulating) traffic for a given VLAN in and out 124 of a site. 126 EVI: An EVPN instance spanning the Provider Edge (PE) devices 127 participating in that EVPN. 129 1.3. Challenges with Existing Mechanism 131 In EVPN technology, multiple Provider Edge (PE) devices have the 132 ability to encapsulate and decapsulate data belonging to the same 133 VLAN. Under certain conditions, this may cause Layer2 duplicates and 134 potential loops if there is a momentary overlap in forwarding roles 135 between two or more PE devices, consequently leading to broadcast 136 storms. 
138 EVPN [RFC7432] currently specifies timer-based synchronization among
139 PE devices within a redundancy group. This approach can lead to
140 duplications and potential loops due to multiple Designated
141 Forwarders (DFs) if the timer interval is too short, or to packet
142 drops if the timer interval is too long.

144 Split-horizon filtering, as described in Section 8.3 of [RFC7432],
145 can prevent loops but does not address duplicates. However, if there
146 are overlapping Designated Forwarders (DFs) of two different sites
147 simultaneously for the same VLAN, the site identifier will differ
148 when the packet re-enters the Ethernet Segment. Consequently, the
149 split-horizon check will fail, resulting in Layer 2 loops.

minor: i can not find a description of this setup and problem in [RFC7432], and the description in the paragraph above is quite terse so that i am not sure that i would make up from scratch a fitting example. I think it would thus be useful to provide a topology with an appropriate example of this condition and explain the problem based on that topology example.

151 The updated Designated Forwarder (DF) procedures outlined in
152 [RFC8584] use the well-known Highest Random Weight (HRW) algorithm to
153 prevent the reshuffling of VLANs among PE devices within the
154 redundancy group during failure or recovery events. This approach
155 minimizes the impact on VLANs not assigned to the failed or recovered
156 ports and eliminates the occurrence of loops or duplicates during
157 such events.

159 However, upon PE insertion or a port being newly added to a
160 multihomed Ethernet Segment, HRW also cannot help as a transfer of DF
161 role to the new port must occur while the old DF is still active.
163 +---------+ 164 +-------------+ | | 165 | | | | 166 / | PE1 |----| | +-------------+ 167 / | | | MPLS/ | | |---CE3 168 / +-------------+ | VxLAN/ | | PE3 | 169 CE1 - | Cloud | | | 170 \ +-------------+ | |---| | 171 \ | | | | +-------------+ 172 \ | PE2 |----| | 173 | | | | 174 +-------------+ | | 175 +---------+ 177 Figure 1: CE1 multihomed to PE1 and PE2. 179 In Figure 1, when PE2 is inserted in the Ethernet Segment or its 180 CE1-facing interface recovered, PE1 will transfer the DF role of some 181 VLANs to PE2 to achieve load balancing. However, because there is no 182 handshake mechanism between PE1 and PE2, overlapping of DF roles for 183 a given VLAN is possible which leads to duplication of traffic as 184 well as Layer 2 loops. 186 Current EVPN specifications [RFC7432] and [RFC8584] rely on a timer- 187 based approach for transferring the DF role to the newly inserted 188 device. This can cause the following issues: 190 * Loops/Duplicates if the timer value is too short 191 * Prolonged Traffic Blackholing if the timer value is too long 193 1.4. Design Principles for a Solution 195 The clock-synchronization solution for fast DF recovery presented in 196 this document follows several design principles and presents 197 multiples advantages, namely: 199 * Complex handshake signaling mechanisms and state machines are 200 avoided in favor of a simple uni-directional signaling approach. 202 * The fast DF recovery solution maintains backwards-compatibility 203 (see Section 4) by ensuring that PEs any unrecognized new BGP 204 Extended Community. 206 * Existing DF Election algorithms remain supported. 208 * The fast DF recovery solution is independent of any BGP delays in 209 propagation of Ethernet Segment routes (Route Type 4) minor: This claim is unclear to me. There is an overall maximum for the propagation latency plus processing time of "just" a few seconds with the default SCT calculation, right ? 
And that is communicated "in conjunction with" the Ethernet Segment routes according to your below explanation. So there is a maximum propagation limit. And likely some serialization, timing dependencies.... ??!! 211 * The fast DF recovery solution is agnostic of the actual time 212 synchronization mechanism used, and normalizes to NTP for EVPN 213 signalling only. XXX 215 2. DF Election Synchronization Solution 217 The fast DF recovery solution relies on the concept of common clock 218 alignment between partner PEs participating in a common Ethernet 219 Segment i.e. PE1 and PE2 in Figure 1. The main idea is to have all 220 peering PEs of that Ethernet Segment perform DF election, and apply 221 the result at the same pre-announced time. 223 The DF Election procedure, as described in [RFC7432] and as 224 optionally signalled in [RFC8584], is applied. All PEs attached to a 225 given Ethernet Segment are clock-synchronized using a networking 226 protocol for clock synchronization (e.g., NTP, PTP). When a new PE 227 is inserted in an Ethernet Segment or a failed PE device of the 228 Ethernet Segment recovers, that PE communicates to peering partners 229 the current time plus the value of the timer for partner discovery 230 from step 2 in Section 8.5 of [RFC7432]. This constitutes an "end 231 time" or "absolute time" as seen from the local PE. That absolute 232 time is called the "Service Carving Time" (SCT). 234 A new BGP Extended Community, the Service Carving Timestamp is 235 advertised along with the Ethernet Segment route (RT-4) to 236 communicate the Service Carving Time to other partners. 238 Upon receipt of the new BGP Extended Community, partner PEs can 239 determine the service carving time of the newly insterted PE. To 240 eliminate any potential for duplicate traffic or loops, the concept 241 of skew is introduced: a small time offset to ensure a controlled and 242 orderly transition when multiple Provider Edge (PE) devices are 243 involved. 
The receiving partner PEs add a skew (default = -10ms) to 244 the Service Carving Time to enforce this mechanism. The previously 245 inserted PE(s) must perform service carving first, followed shortly 246 by the newly insterted PE, after the specified skew delay. 248 To summarize, all peering PEs perform service carving almost 249 simultaneously at the time announced by the newly added/recovered PE. 250 The newly inserted PE initiates the SCT, and triggers service carving 251 immediately on its local timer expiry. The previously inserted PE(s) 252 receiving Ethernet Segment route (RT-4) with a SCT BGP extended 253 community, perform service carving shortly before Service Carving 254 Time. 256 2.1. BGP Encoding 258 A new BGP extended community is defined to communicate the Service 259 Carving Timestamp for each Ethernet Segment. 261 A new transitive extended community where the Type field is 0x06, and 262 the Sub-Type is 0x0F is advertised along with the Ethernet Segment 263 route. The expected Service Carving Time is encoded as an 8-octet 264 value as follows: 266 1 2 3 267 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 268 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 269 | Type = 0x06 | Sub-Type(0x0F)| Timestamp Seconds ~ 270 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 271 ~ Timestamp Seconds | Timestamp Fractional Seconds | 272 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 274 Figure 2: Service Carving Time 276 The timestamp exchanged uses the NTP prime epoch of January 1, 1900 277 [RFC5905] and the 64-bit NTP Timestamp Format. The NTP Era value is 278 not exchanged and Era 0 is assumed as of the writing of this 279 document. A DF Election operation occurring exactly at the Era 280 transition boundary some time in 2036 is outside of the scope of this 281 document. 
major: This description effectively only supports the protocol until the end of Era 0, because not only does it not describe what to do during switchover to Era N+1, but it also does not describe how to operate without encoding the Era. This makes the protocol useful (without another RFC) for less than 12 years. That is IMHO insufficient.

One simple solution would be to describe that the Era is not included in the encoding, but that a plausibility check is made on received timestamps. If a timestamp is completely out of range with the receiving router's current Era, but within range with Era-1 or Era+1, then the timestamp is accordingly adjusted to use that Era.

In another solution option, you can encode the Era by carving space from the SCT encoding as follows: IMHO, it is unnecessary to encode the fractional seconds with 16 bits. The accuracy of the signalled timestamp does NOT impact the synchronized accuracy of the execution of DF switchover. It only impacts the granularity of timestamps that can be generated. If you would signal only the top 8 bits of the fractional seconds, then you could still trigger a synchronized switchover at intervals of 4 msec, which IMHO is more than necessary. And the switchover could still be synchronized to an arbitrarily better accuracy, such as 1 usec, if just the clock synchronization between the routers is that good. Practically speaking, NTP clock synchronization may often be just 1 msec accurate anyhow. Even if you consider my thoughts from above concern G.4, and want to assign different timestamps for every Ethernet Segment (especially with a large number of ethernet segments), then an interval of 4 msec would likely be more than sufficient granularity. So with just 8 bit fractional second encoding, you have 8 bits spare in the encoding that you can use for Era and other features (in the future).
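The plausibility-check option above could be sketched as follows in Python. This is only an illustration under stated assumptions: the function name `decode_sct` and its parameters are hypothetical and not taken from the draft, and the receiver is assumed to know its own full NTP time including the Era.

```python
ERA_SECONDS = 1 << 32  # length of one NTP era in seconds (32-bit seconds field)


def decode_sct(seconds32: int, frac16: int, local_ntp: float) -> float:
    """Rebuild an absolute NTP time from a received SCT without an Era field.

    seconds32: the 32-bit Timestamp Seconds field.
    frac16:    the 16-bit Timestamp Fractional Seconds field.
    local_ntp: receiver's current time in seconds since the NTP prime
               epoch (1 January 1900), Era included (assumed available).
    """
    local_era = int(local_ntp) // ERA_SECONDS
    offset = seconds32 + frac16 / 65536.0
    # Interpret the timestamp in Era-1, Era and Era+1 of the receiver
    # and keep the interpretation closest to the local clock.
    candidates = [(local_era + d) * ERA_SECONDS + offset for d in (-1, 0, 1)]
    return min(candidates, key=lambda t: abs(t - local_ntp))
```

A receiver that is one second before the end of Era 0 and gets an SCT with Timestamp Seconds = 1 would, in this sketch, interpret it as a time two seconds ahead in Era 1 rather than as a time far in the past. A window check as suggested in G.5 b) would still be applied to the result.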
282 The 64-bit NTP Timestamp Format consists of a 32-bit part for Seconds 283 and a 32-bit part for Fraction, which are encoded in the Service 284 Carving Time as follows: 286 * Timestamp Seconds: 32-bit NTP seconds are encoded in this field. 288 * Timestamp Fractional Seconds: the high order 16 bits of the NTP 289 'Fraction' field are encoded in this field. 291 When rebuilding a 64-bit NTP Timestamp Format using the values from a 292 received SCT BGP extended community, the lower order 16 bits of the 293 Fractional field are set to 0. The use of a 16-bit fractional 294 seconds yields adequate precision of 15 microseconds (2^-16 s). 296 This document introduces a new flag called "T" (for Time 297 Synchronization) to the bitmap field of the DF Election Extended 298 Community defined in [RFC8584]. 300 1 2 3 301 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 302 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 303 | Type = 0x06 | Sub-Type(0x06)| RSV | DF Alg | |A| |T| ~ 304 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 305 ~ Bitmap | Reserved = 0 | 306 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 308 Figure 3: DF Election Extended Community 310 * Bit 3: Time Synchronization (corresponds to Bit 27 of the DF 311 Election Extended Community). When set to 1, it indicates the 312 desire to use Time Synchronization capability with the rest of the 313 PEs in the Ethernet Segment. nit: "Bit 3" is a confusing definition because the "DF Election Extended Community" field is only mentioned in the prior paragraph and not shown with this name in the picture. I would suggest to replace picture 3 with Figure 4 from rfc8584 - which does show "Bitmap", and then follow it with Figure 5 from rfc8584 with "T" added, and then follow with the "Bit 3" bullet point. 315 This capability is utilized in conjunction with the agreed-upon DF 316 Election Type. 
For instance, if all the PE devices in the Ethernet 317 Segment indicate possessing Time Synchronization capability and ^^^^^^^^^^ nit: "the desire to use the" (to be consistent with the definition of T in line 312. 318 request the DF Election Type to be Highest Random Weight (HRW), then 319 the HRW algorithm is edused in conjunction with this capability. A ^^^^^^ nit: deduced ? 320 PE which does not support the procedures set out in this document, or 321 receives a route from another PE in which th capability is not set ^ nit: "e" 322 MUST NOT delay Designated Forwarder election as this could lead to 323 duplicate traffic in some instances (overlapping Designated 324 Forwarders). 326 2.2. Updates to RFC8584 328 This document introduces an additional delay to the events and 329 transitions defined for the default DF election algorithm FSM in 330 Section 2.1 of [RFC8584] without changing the FSM state or event 331 definitions themselves. 333 Upon receiving a RECV_ES message, the peering PE's Finite State nit: RFC8584 uses the term "RCVD_ES" for an event, and does not use the term "RECV_ES" for a message. Unless there is good reason to introduce new (inconsistent/duplicate) terminology, pls. change to terminology RCVD_ES event. Also further below (line 350). 334 Machine (FSM) transitions from the DF_DONE (indicating the DF 335 election process was complete) state to the DF_CALC (indicating that 336 a new DF calculation is needed) state . Due to the Service Carving 337 Time (SCT) included in the Ethernet-Segment update, the completion of 338 the DF_CALC state and the subsequent transition back to the DF_DONE 339 state are delayed. This delay ensures proper synchronization and 340 prevents conflicts. Consequently, the accompanying forwarding 341 updates to the Designated Forwarder (DF) and Non-Designated Forwarder 342 (NDF) states are also deferred. 
344    The corresponding actions when transitions are performed or states
345    are entered/exited are modified as follows:

nit: Suggest to rewrite to the following, to be more precise:

   Item 9. in RFC8584, Section 2.1, List "Corresponding actions when
   transitions are performed or states are entered/exited" is changed
   as follows:

347    9. DF_CALC on CALCULATED: Mark the election result for the VLAN or
348       VLAN Bundle.

350       9.1 If an SCT timestamp is present during the RECV_ES event of
351           Action 11, wait until the time indicated by the SCT before
352           proceeding to step 9.2.

354       9.2 Assume the role of DF or NDF for the local PE concerning the
355           VLAN or VLAN Bundle, and transition to the DF_DONE state.

357    This revised approach ensures proper timing and synchronization in
358    the DF election process, avoiding conflicts and ensuring accurate
359    forwarding updates

minor: a) Given that this is the normative text, i am worried that the
"skew" variable is not mentioned. Please insert accordingly. b) 9.1 does
not seem to cover the SCT delay that needs to be performed (equally,
except for skew) by the newly inserted PE. 9.1 only mentions the
condition of RECV_ES, which to me does not sound like the newly inserted
PE.

minor: I am somewhat irritated that neither RFC8584 nor this draft has
any text in the state machine section to indicate when/how ES routes are
generated. This would help IMHO especially in this new draft, because it
is the time when the timestamp is taken, the SCT is calculated and
inserted into the ES route, and i guess that this also starts the
process leading to the CALCULATED event on the newly inserted router.

361 3. Synchronization Scenarios

363    Consider Figure 1 as an example, where initially PE2 has failed and
364    PE1 has taken over. This scenario illustrates the problem with the
365    DF-Election mechanism described in Section 8.5 of [RFC7432],
366    specifically in the context of the timer value configured for all PEs
367    on the Ethernet Segment.
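
Reviewer illustration (not draft text): one possible reading of the
timing that item 9.1 would need to capture, combining the skew rules
quoted in lines 419-422 with the past-SCT handling in lines 515-520.
The function and parameter names are hypothetical, and whether the
draft intends exactly this behavior is the open question raised in the
comments above.

```python
TRANSITIONS = ("DF_TO_NDF", "NDF_TO_DF")


def carving_delay(now: float, sct: float, skew: float,
                  transition: str, received_sct: bool) -> float:
    """Seconds to wait before applying a per-VLAN role transition.

    now, sct, skew -- times in seconds on a common (synchronized) clock
    transition     -- "DF_TO_NDF" or "NDF_TO_DF"
    received_sct   -- True for a PE that received the SCT from a peer,
                      False for the recovering PE that advertised it
    """
    if transition not in TRANSITIONS:
        raise ValueError("unknown transition: %r" % (transition,))

    target = sct
    # Only the PE that received the SCT gives up the DF role early,
    # so that both PEs are NDF for the skew duration (lines 419-420).
    if received_sct and transition == "DF_TO_NDF":
        target = sct - skew

    # An SCT already in the past means no timer is started (lines 515-520).
    return max(0.0, target - now)
```

For the example in lines 395-411 (SCT of t=103) and a skew of 10 msec,
this reading has PE1 move DF to NDF at t=102.99 and NDF to DF at t=103,
while the recovering PE2 performs both transitions at t=103.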
369    Procedure based on Section 8.5 of [RFC7432] with the default 3 second
370    timer in step 2:

372    1. Initial state: PE1 is in a steady-state and PE2 is recovering

374    2. Recovery: PE2 recovers at an absolute time of t=99.

376    3. Advertisement: PE2 advertises RT-4, sent at t=100, to partner
377       PE1.

379    4. Timer Start: PE2 starts a 3 second timer to allow the reception
380       of RT-4 from other PE nodes.

382    5. Immediate carving: PE1 performs service carving immediately upon
383       RT-4 reception, i.e. t=100 plus some BGP propagation delay.

385    6. Delayed Carving: PE2 performs service carving at time t=103

387    [RFC7432] favors traffic drops over duplicate traffic. With the
388    above procedure, traffic drops will occur as part of each PE recovery
389    sequence since PE1 transitions some VLANs to Non-Designated Forwarder
390    (NDF) immediately upon RT-4 reception.
391    The timer value (default = 3 seconds) directly affects the duration
392    of the packet drops. A shorter (or zero) timer may result in
393    duplicate traffic or traffic loops.

395    Procedure based on the Service Carving Time (SCT) approach:

397    1. Initial state: PE1 is in a steady state, and PE2 is recovering

399    2. Recovery: PE2 recovers at an absolute time of t=99.

401    3. Advertisement: PE2 advertises RT-4, sent at t=100, with a target
402       SCT value of t=103 to partner PE1.

404    4. Timer Start: PE2 starts a 3 second timer to allow the reception
405       of RT-4 from other PE nodes.

minor: IMHO, this is not a 3 second timer, but a timer with a deadline
of t=103, which is at most 3 seconds, depending on whether step 4
happens exactly at t=100 or somewhat later. Practically, it would always
be later. IMHO, it would be good to emphasize this crucial benefit of
the new mechanism. It may be necessary to insert some additional
processing delay into the Section 8.5 example vs. this example to show
this difference (delay between steps 3 and 4).

407    5.
          Service Carving Timer: PE1 starts the service carving timer, with
408       the remaining time until t=103

410    6. Simultaneous Carving: Both PE1 and PE2 carve at an absolute time
411       of t=103

413    To maintain the preference for minimal loss over duplicate traffic,
414    PE1 should carve slightly before PE2 (with skew). The recovering PE2
415    performs both DF to NDF and NDF to DF transitions per VLAN at the
416    timer's expiry. The original PE1, which received the SCT, applies
417    the following:

419    * DF to NDF Transition(s): at t=SCT minus skew, where both PEs are
420      NDF for the skew duration.

422    * NDF to DF Transition(s): at t=SCT

minor: In line 238, the draft says "Upon receipt of the new BGP Extended
Community" ... skew is being applied. The text above (line 419) instead
defines application of skew upon determination of the state transition.
It may be that in all cases where the BGP Extended Community is
received, there is always at most a DF to NDF transition (but no NDF to
DF transition, i.e. staying at NDF), but it still is not ideal to have
two inconsistent definitions of when skew is being applied. Technically
i think the DF to NDF transition case is more sound than the "receipt of
the BGP extended community", aka: fix text around line 238 ?!

424    This split-behavior ensures a smooth DF role transition with minimal
425    loss.

427    Using the SCT approach, the negative effect of the timer to allow the
428    reception of RT-4 from other PE nodes is mitigated. Furthermore, the
429    BGP Ethernet Segment route (RT-4) transmission delay (from PE2 to
430    PE1) becomes a non-issue. The SCT approach shortens the 3-second
431    timer window to the order of milliseconds.

433 3.1. Concurrent Recoveries

435    In the eventuality 2 or more PEs in a peering Ethernet Segment group
436    are recovering concurrently or roughly the same time, each will
437    advertise a Service Carving Timestamp. This SCT value would
438    correspond to what each recovering PE considers the "end time" for DF
439    Election.
       A similar situation arises in sequentially recovering PEs,
440    when a second PE recovers approximately at the time of the first PE's
441    advertised SCT expiry, and with its own new SCT-2 outside of the
442    initial SCT window.

444    In the case of multiple concurrent DF elections, each initiated by
445    one of the recovering PEs, the SCTs must be ordered chronologically.
446    All PEs shall execute only a single DF Election at the service
447    carving time corresponding to the largest (latest) received timestamp
448    value. This DF Election will involve all active PEs in a unified DF
449    Election update.

nit: I think the wording in lines 444-449 is misleading/incomplete. The
latest SCT timestamp is not the top criterion, but if i understand the
intent correctly, each "later" PEi also needs to be considered to be a
better (best) DF than the prior PE, right? Aka, in your example below
(line 451ff):

   - PE1 is DF.
   - When PE1 receives RT-4 from PE2, PE1 will redo DF calculation and
     consider PE2 to be the DF winner.
   - When PE1 later receives RT-4 from PE3, PE1 will redo DF calculation
     and now consider PE3 to be the DF winner.

And only because PE3 is the DF winner will PE1 now also cancel the SCT
for PE2. If, on the other hand, the DF HRW for PE3 were lower than that
of PE2, then PE1 would of course redo the DF election, but given that
PE3 does not change the result, this AFAIK should also mean that the SCT
from PE3 should have no impact. Yes/No? In any case it would be useful
to improve the description to make this clearer, especially if i
misunderstood it.

451    Example:

453    1. Initial State: PE1 is in a steady state, with services elected at
454       PE1.

456    2. Recovery of PE2: PE2 recovers at time t=100 and advertises RT-4
457       with a target SCT value of t=103 to its partners (PE1)

459    3. Timer Initiation by PE2: PE2 starts a 3 second timer to allow the
460       reception of RT-4 from other PE nodes.

462    4.
          Timer Initiation by PE1: PE1 starts the service carving timer,
463       with the remaining time until t=103.

465    5. Recovery of PE3: PE3 recovers at time t=102 and advertises RT-4
466       with a target SCT value of t=105 to its partners (PE1, PE2).

468    6. Timer Initiation by PE3: PE3 starts a 3 second timer to allow the
469       reception of RT-4 from other PE nodes

471    7. Timer Update by PE2: PE2 cancels the running timer and starts the
472       service carving timer with the remaining time until t=105.

474    8. Timer Update by PE1: PE1 updates its service carving timer, with
475       the remaining time until t=105.

477    9. Service Carving: PE1, PE2, and PE3 perform service carving at the
478       absolute time of t=105.

480    In the eventuality a PE in a Ethernet Segment group recovers during
481    the discovery window specified in Section 8.5 of [RFC7432], and does
482    not support or advertise the T-bit, then all PEs in the current
483    peering sequence SHALL immediately revert to the default [RFC7432]
484    behavior.

486 4. Backwards Compatibility

488    For the DF election procedures to achieve global convergence and
489    unanimity within a redundancy group, it is essential that all
490    participating PEs agree on the DF election algorithm to be employed.
491    However, it is possible that some PEs may continue to use the
492    existing modulo-based DF election algorithm from [RFC7432] and not
493    utilize the new Service Carving Time (SCT) BGP extended community.
494    PEs that operate using the baseline DF election mechanism will simply
495    discard the new SCT BGP extended community as unrecognized.
496    [RFC7432] and do not rely on the new SCT BGP extended community.

498    A PE can indicate its willingness to support clock-synchronized
499    carving by signaling the new 'T' DF Election Capability and including
500    the new SCT BGP extended community along with the Ethernet Segment
501    Route (Type-4).
       If one or more PEs attached to the Ethernet Segment
502    do not signal T=1, then all PEs in the Ethernet Segment SHALL revert
503    to the timer-based approach as specified in [RFC7432]. This
504    reversion is particularly crucial in preventing VLAN shuffling when
505    more than two PEs are involved.

507 5. Security Considerations

509    The mechanisms in this document use EVPN control plane as defined in
510    [RFC7432]. Security considerations described in [RFC7432] are
511    equally applicable.

513    For the new SCT Extended Community, attack vectors may be setting the
514    value to zero, to a value in the past or to large times in the
515    future. The procedures in this document address implicitly what
516    occurs with a carving time in the past, as this would be a naturally
517    occurring event with a large BGP propagation delay: the receiving PE
518    SHALL treat the DF Election at the peer as having occurred already,
519    and proceed without starting any timer to futher delay service
520    carving. For timestamp values in the future, a rogue PE may be
521    advertising a value inconsistent with its local behavior. This is no
522    different than a rogue PE setting all its DF Election results
523    inconstently to its peers using (or ignoring adherence to) the
524    procedures from [RFC7432], and the result would similarly be
525    duplicate or dropped traffic. It is left to implementations to
526    decide what consists an "unreasonably large" SCT value.

528    This document uses MPLS and IP-based tunnel technologies to support
529    data plane transport. Security considerations described in [RFC7432]
530    and in [RFC8365] are equally applicable.

532 6. IANA Considerations

534    IANA maintains the "EVPN Extended Community Sub-Types" registry set
535    up by [RFC7153].
       IANA is requested to confirm the First Come First
536    Served assignment as follows:

538       Sub-Type Value   Name                        Reference       Date
539       --------------   -------------------------   -------------   ----
540       0x0F             Service Carving Timestamp   This document   TBD

542    IANA should replace the field TBD with the date of publicaton of this
543    document as an RFC.

545    IANA maintains the "DF Election Capabilities" registry set up by
546    [RFC8584]. IANA is requested to make the following assignment from
547    this registry:

549       Bit    Name                   Reference       Date
550       ----   ----------------       -------------   ----
551       3      Time Synchronization   This document   TBD

553    IANA should replace the field TBD with the date of publicaton of this
554    document as an RFC.

556 7. Normative References

558    [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
559               Requirement Levels", BCP 14, RFC 2119,
560               DOI 10.17487/RFC2119, March 1997,
561               .

563    [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
564               "Network Time Protocol Version 4: Protocol and Algorithms
565               Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010,
566               .

568    [RFC7153]  Rosen, E. and Y. Rekhter, "IANA Registries for BGP
569               Extended Communities", RFC 7153, DOI 10.17487/RFC7153,
570               March 2014, .

572    [RFC7432]  Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
573               Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
574               Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
575               2015, .

577    [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
578               2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
579               May 2017, .

581    [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
582               Uttaro, J., and W. Henderickx, "A Network Virtualization
583               Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,
584               DOI 10.17487/RFC8365, March 2018,
585               .

587    [RFC8584]  Rabadan, J., Ed., Mohanty, S., Ed., Sajassi, A., Drake,
588               J., Nagaraj, K., and S.
                  Sathappan, "Framework for Ethernet
589               VPN Designated Forwarder Election Extensibility",
590               RFC 8584, DOI 10.17487/RFC8584, April 2019,
591               .

593 Appendix A. Contributors

595    In addition to the authors listed on the front page, the following
596    co-authors have also contributed substantially to this document:

598    Gaurav Badoni
599    Cisco

601    Email: gbadoni@cisco.com

603    Dhananjaya Rao
604    Cisco

606    Email: dhrao@cisco.com

608 Appendix B. Acknowledgements

610    Authors would like to acknowledge helpful comments and contributions
611    of Satya Mohanty and Bharath Vasudevan. Also thank you to Anoop
612    Ghanwani and Gunter van de Velde for their thorough review with
613    valuable comments and corrections.

615 Authors' Addresses

617    Patrice Brissette (editor)
618    Cisco
619    Email: pbrisset@cisco.com

621    Ali Sajassi
622    Cisco
623    Email: sajassi@cisco.com

625    Luc Andre Burdet
626    Cisco
627    Email: lburdet@cisco.com

629    John Drake
630    Independent
631    Email: je_drake@yahoo.com

633    Jorge Rabadan
634    Nokia
635    Email: jorge.rabadan@nokia.com

EOF