I am the assigned Gen-ART reviewer for this draft. The General Area Review Team (Gen-ART) reviews all IETF documents being processed by the IESG for the IETF Chair. Please treat these comments just like any other last call comments. For more information, please see the FAQ at . Document: draft-ietf-pals-endpoint-fast-protection-04 Reviewer: Dale R. Worley Review Date: 2016-12-06 IETF LC End Date: 2016-12-06 IESG Telechat date: (not set) Before reviewing this document, I knew nothing about pseudowire routing, so my review does not fully assess the technical aspects of the document. I suspect that almost all of these items are editorial. I hope that they are simply places where the description can be made clearer to the naive reader by making it more exact. Some of them may be places where I don't fully understand the technology. It's possible that some of them are technical issues. Summary: This draft is basically ready for publication, but has nits that should be fixed before publication. These items seem to have technical content: In section 4.2 and 4.3, it is not clear to me whether all of the PWs being carried by a particular tunnel must have the same protector. Section 4.2 says that it is so, but section 4.3 suggests that the PWs can be divided into subsets which have different protectors, without mentioning any correspondence between the subsets and the tunnel-groups of PWs. What is the mechanics of context identifiers in the context of protecting egress ACs? It seems like this should all be analogous to the described cases (protecting PEs), but as this is a specification, how the analogy works should be made explicit. In section 6.4, does there need to be a specification for the "Reserved" fields of the Protection FEC Element TLV? Usually, there is a specification "must be sent with zeros/must be ignored on receipt". (Or is there a global understanding of this in LDP?) In section 6.4, there is a definition of "PW Information". Are there previous definitions of "PW information" in PW/LDP/MPLS usage? It looks a lot like what's defined in RFC 4447 sections 5.2 and 5.3. If they are semantically the same, the same format should be used and simply referred to here. If the format is different because they aren't semantically the same ... perhaps there should be a note explaining how/why. Nits/editorial comments: There are quite a number of places where "a" or "the" seems to be omitted before nouns. I'll let the RFC Editor identify/correct those. There is a general issue regarding what aspects of local repair are configured by some external means (e.g., the protectors that PLRs use, the context identifiers) and what aspects are established automagically by the defined mechanisms (the bypass tunnels). The document would have been clearer to me if these were separated explicitly, but I suspect that it is common usage in routing specifications for the reader to sort it out. Abstract IMO it would help if the Abstract mentioned that only IP/MPLS transport tunnels are protected. I say this because the only part of this technology that I'd previously heard of was MPLS; such a specification in the abstract would tell a large body of potential readers that the document is *not* relevant to their situation. 1. Introduction The mechanism is applicable to LDP signaled PWs. It is relevant to networks with redundant PWs and multi-homed CEs. It is designed on the basis of MPLS upstream label assignment and context-specific label switching [RFC5331]. It might be more informative to say "Fast failure protection for pseudowire endpoints" rather than "The mechanism". Which of the factors listed are prerequisites for the use of this mechanism and which are factors which this mechanism additionally supports? Whichever of these are prerequisites for the solution should be mentioned as such in the Abstract, including particularly that the PW must be carried by IP/MPLS transport tunnels (as described in section 4.1 paragraph 2). Fast protection refers to its ability to restore traffic in the order of tens of milliseconds. Compared with global repair and control plane repair, this mechanism can provide faster service restoration. What is the time scale of "global repair and control plane repair"? Given that you give the time scale of "fast protection", it would be informative to have comparable values for other repair techniques. However, it is intended to complement those mechanisms, rather than replacing them. It might be useful to explain why global/control plane repair is still needed. (Then again, that might be obvious to anyone in the field.) 3.1. Single-Segment PW For either mode, when considering the traffic flowing in a given direction over an active path, this document views the ACs, PEs and PWs to serve primary or backup roles. In particular, the ACs, PEs and PW along this active path are primary, while those along the other path are backup. Note that in the active-active mode, the backup path is an active path by itself, carrying its own share of traffic while protecting the other active path. The wording here doesn't seem to be quite right. I think you need to phrase it more like this: For either mode, when considering the traffic flowing in a given direction over an active path, this document views the ACs, PEs and PWs to serve primary or backup roles or both. In particular, given an active path, the ACs, PEs and PW along that active path have primary roles, while those along the other path have backup roles. Note that in the active-active mode, each AC, PE, and PW has a primary role (due to being on an active path) and also a backup role protecting the other path (which is also active). -- For clarity, primary egress AC, primary egress PE, backup egress AC, and backup egress PE may simply be referred to as primary AC, primary PE, backup AC, and backup PE, respectively, when the context of a discussion is egress endpoint. This is correct, but it seems to overlook the use of "primary PE" to mean "the PE that is being protected (when we are considering a particular PLR and protector)". That usage is common in the rest of the document, but difficult for the naive reader to abstract from the above text. 4.1. Applicability In a network where transport tunnels may provide ECMP to primary PEs, care should be taken to prevent misordered packet delivery during local repair. Should "ECMP" get a reference? Or is it so common that any reader of this document can be assumed to know already? In a network where transport tunnels may provide ECMP to primary PEs, care should be taken to prevent misordered packet delivery during local repair. Perhaps this should be qualified with "if the PW or some flows within the PW are sensitive to packet misordering", as is mentioned later in the paragraph. Naively, it seems that with ECMP it is impossible to prevent packet reordering. (E.g., it's hard to imagine using ECMP with ATM circuits unless the destination performed SAR separately for each path.) However, the scenario developed in the paragraph could cause a dramatic increase in the "distance" that packets are reordered, and that increase could be a problem. Perhaps the text could emphasize that the problem isn't so much the absolute presence of reordering but its magnitude. 4.2. Local Repair and Protector I find this title peculiarly hard to parse. I think the problem is the ambiguity whether "local" modifies "protector" or not, as well as the fact that "local repair" is a concept whereas "protector" is a singular device. Perhaps "Local Repair and Protectors" ("protectors" is a class of devices, with similar abstractness as "local repair") or "Local Repair, Bypass Tunnels, and Protectors" (since bypass tunnels seem to be a similarly critical part of the solution). In anticipation of the failure, the PLR MUST also pre-establish a bypass tunnel to a "protector", and pre-install a bypass route in the data plane. It might be clearer to say "a bypass route for the bypass tunnel in the data plane". It would be useful to mention here that the protector is either a backup PE which, in a sense, functions as an alternative terminus for the PW segment, or a router which will forward traffic to such a backup PE. This raises the point that, in a way, the bypass tunnel is an alternative continuation of the protected transport tunnel. E.g., it has as the destination endpoint the same virtual interface address (the context identifier). Or does the PLR, when repairing, act as a proper downstream PE for the primary PW segment, and a proper upstream PE for the PW segment in the bypass tunnel? (I'm likely not using the terminology correctly here, but I suspect that the terminology is not used in a strictly correct manner in section 4.2.) Upon detecting the failure, the PLR invokes the bypass route in the data plane, and reroutes PW traffic to the protector through the bypass tunnel. Does "invokes the bypass route" have a clear meaning, i.e., does this use of "invoke" have a proper definition? I think you can elide that phrase and just say "the PLR reroutes PW traffic to the protector ...", with better clarity. In this document, the PLR simply computes and establishes a node protection bypass tunnel in the same fashion as the normal IP/MPLS node protection, except that with the notion of context identifier, the bypass tunnel will be established from the PLR to the protector (Section 4.6). Is there a reference for "the normal IP/MPLS node protection"? Without the context identifier, what would happen which contrasts to "from the PLR to the protector"? On the other hand, this imposes a requirement on the protector that it MUST be able to forward the packets based on a PW label that is assigned by the primary PE, and ensure that the traffic MUST eventually reach the target CE. More strictly, the protector must forward the packets *along the backup path*. Of course, that's implied by the rest of the document, but it might be worth including that fact in this statement. It also implies that all of the PEs along the backup path must be able to forward the packets based on the PW label of the primary path. That's also implied, but I don't think it's stated anywhere. 4.3. Context Identifier Likewise, the PWs terminated on a primary PE may be protected by multiple protectors, each for a subset of the PWs. I think this is actually "PW segments" rather than "PWs" (since a PW only terminates at the egress PE). There seems to be a general issue in the document of using "PW" in places where the strictly correct term is "PW segment". But compare with section 4.2, "This requires that the protector given by the bypass tunnel MUST be intended for all the PWs carried by the [primary] transport tunnel." That means that all PWs in one tunnel must be within one of these subsets of PWs. Is that really correct? It seems to me that the context identifier is used as the address of the virtual interface that is downstream end of the protected tunnel. If so, it would have helped me make a mental model of the process by stating that here in section 4.3 -- 4.3 states that a context identifier is an address, but does not specify what it is the address of. What is the mechanics of context identifiers in the context of protecting egress ACs? Since there is a bypass tunnel, there must be an address of the virtual interface within the protector that is the terminus of the bypass tunnel. It seems like this address should be the context identifier used by the PLR, but in this situation there is no downstream "primary PE" from which to make a {primary PE, protector} pair. I think that what works is to use the destination CE in place of the primary PE. That's straightforward, since the pairs are never directly represented in the protocol, so most of the mechanics should be unchanged. In this situation, the PLR can't learn the context identifier from the primary PE, but I suppose it can invent/be configured with the context identifier itself, since it doesn't need to establish a tunnel to the primary PE. It seems like this should all be analogous to the described cases, but as this is a specification, how the analogy works should be made explicit. 4.3.1. Semantics A distinct context identifier MUST be assigned to the primary PE and each protector. This sentence is not correct as written since it states that context identifiers are assigned to primary PEs and also are assigned to protectors. I think you meant to say A distinct context identifier MUST be assigned to each pair of primary PE and protector that it uses. or A distinct context identifier MUST be assigned to each {primary PE, protector} pair. 4.3.2. FEC In an MPLS network, a context identifier represents a FEC (Forwarding Equivalence Class) for transport tunnels and bypass tunnels destined for it. This doesn't seem to match the definition of "context identifier" given previously, as a single context identifier can be used by several PW segments ending at a PE, which PWs carry packets of different FECs. Is this perhaps "context label"? Or are there multiple meanings of "context identifier" which should be explicitly disambiguated? Or am I misreading this? 4.5. Transport Tunnel In egress PE node protection and S-PE node protection, when a node failure is detected on any ECMP branch, the penultimate hop router SHOULD act as a PLR to reroute all the traffic of the ECMP set to the protector. I think "node failure" should be "link failure" here -- if there is *node* failure (of the PE), then all of the ECMP paths will not work, and the PLR would have to reroute all the traffic anyway. 4.6. Bypass Tunnel A bypass tunnel SHOULD NOT need to be further protected against a transit link failure, transit node failure, or egress node failure. I think this is an incorrect use of "SHOULD NOT", as it is not a qualifier of a requirement on a device. I think that you mean A bypass tunnel SHOULD NOT be protected against a transit link failure, transit node failure, or egress node failure. But you might want to explain it with There is little or no benefit from protecting a bypass tunnel, so a bypass tunnel SHOULD NOT ... 4.7. Examples of Forwarding State In the forwarding plane, it is indicated by the bypass tunnel(s) destined for the context identifier. What is "it"? I think it refers to "each label space", but in that case I think "they" would be the preferred usage. Better, replace "it" with a noun phrase. 5. Revertive Behavior Possible triggers of global repair include PW status notification, VCCV, BFD, end-to- end OAM between CEs, etc. I see that "VCCV" is the topic of RFC 5085, but it is not the name of a page in Wikipedia, which suggests that giving a reference for it would be useful. Particularly in the case of egress PE and S-PE node failures, if the ingress PE or the protector loses communication with the (S-)PE for an extensive period of time, LDP session may go down. "the (S-)PE" isn't unique in this context, as there is an ingress PE in the discourse. I think you mean "the primary (S-)PE" (i.e., the protected PE). 6. LDP Extensions To facilitate the procedures, Usually "facilitate" only applies to making easier something that could be done already. In this case, the new TLV is necessary to enable these procedures, so I suggest s/facilitate/support/. The procedures in this section are only applicable, if the protector advertises the Egress Protection Capability TLV, the primary PE supports the advertisement of the Protection FEC Element TLV, and in the centralized protector model, the backup PE also supports the advertisement of the Protection FEC Element TLV. My understanding is that "the procedures in this section" means "establishing endpoint fast failure protection via LDP". That seems to be sufficiently important that it is probably worth expanding this sentence into a bullet list: The procedures in this section are only applicable if: o the protector advertises the Egress Protection Capability TLV, o the primary PE supports the advertisement of the Protection FEC Element TLV, and o in the centralized protector model, the backup PE also supports the advertisement of the Protection FEC Element TLV. Subtly, this seems to be the list of preconditions under which the PLR can establish the needed tunnels and perform the local repair -- this list does not list that the PLR must implement the needed behaviors, which is obviously a precondition of local repair. Or perhaps it is a list of LDP capabilities that are required to set up local repair via LDP. In either case, it suggests the first clause could be revised to change "the procedures ..." with a phrase that is more specific as to what, precisely, is enabled by the conjunction of these three items. 6.1. Egress Protection Capability TLV The TLV Code Point is TBD. It needs to be assigned by IANA. It seems like this should be flagged with a note so the RFC Editor can put in the assigned value. The "Capability Data" is encoded with the context identifier of the {primary PE, protector}. This should be something like: The "Capability Data" is encoded with the context identifiers for the {primary PE, protector} pairs for which the advertiser is the protector. 6.4. Protection FEC Element TLV Does there need to be a specification for the "Reserved" fields? Usually, there is a specification "must be sent with zeros/must be ignored on receipt". (Or is there a global specification of this in LDP?) ~ PW Information ~ Are there previous definitions of "PW information" in PW/LDP/MPLS usage? It looks a lot like what's defined in RFC 4447 sections 5.2 and 5.3. If they are semantically the same, the same format should be used and simply referred to here. If they aren't semantically the same ... perhaps there should be a note explaining why. 7. IANA Considerations The assignment for the Egress Protection Capability TLV should be described more definitively. My impression is that it should be assigned out of the "TLV Type Name Space" in the LDP parameters page, but that doesn't seem to be stated explicitly. Also, there seem to be three blocks of values (0x0001-0x07FF, 0x0800-0x08FF, 0x0900-0x3DFF) that are all marked as "IETF consensus", but which presumably differ in some manner. Which block should the value be in? Perhaps for clarity, an appropriate skeleton protocol assignment table row should be shown. Value Hex Name Label Advertisement Discipline ------------------------------------------------------------------- 131 0x83 Protection FEC Element DU Section 6.2 calls the new TLV in the Label Mapping message "a Protection FEC Element TLV", but section 7 calls it an "LDP Protection FEC Type Name Space value". The latter phrase consists of 7 successive nouns and is (IMO) unparsable by a reader who doesn't already know what it means. I suggest changing it to "the Protection FEC Element value in the LDP FEC Type Name Space" (which aligns with the titles used in the IANA protocol assignment web page). It appears that the values in the FEC Type Name Space are 8 bits but one place in section 7 gives the value as "0x083", which should presumably be changed to "0x83". 8. Security Considerations In all scenarios, the role of protector is entirely managed by network operator, and backup PEs can be used anyway to host PWs and LDP sessions. Perhaps "anyway" could be clarified as: ... and, regardless of local repair, backup PEs can be used to host PWs and LDP sessions. -- In general, [RFC5920] describes the security framework for MPLS networks. [RFC3209] describes the security considerations for RSVP LSPs. [RFC5036] describes the security considerations for the base LDP specification. [RFC5561] describes the security considerations which apply when using the LDP capability mechanism. All these security framework and considerations apply to this document as well. Is there a reference for security considerations for "IP tunnels" (as mentioned in section 4.3.1 and 4.6)? Dale