Hello,

I have been selected as the Routing Directorate reviewer for this draft. The Routing Directorate seeks to review all routing or routing-related drafts as they pass through IETF last call and IESG review, and sometimes on special request. The purpose of the review is to provide assistance to the Routing ADs. For more information about the Routing Directorate, please see

http://trac.tools.ietf.org/area/rtg/trac/wiki/RtgDir


Although these comments are primarily for the use of the Routing ADs, it would be helpful if you could consider them along with any other IETF Last Call comments that you receive, and strive to resolve them through discussion or by updating the draft.

Thanks,

--John


Document: draft-ietf-grow-ix-bgp-route-server-operations-03.txt
Reviewer: John Scudder
Review Date: 2014-09-18
IETF LC End Date: 2014-09-22 
Intended Status: Informational


Summary: 

	• I have some minor concerns about this document that I think should be resolved before publication.


Comments:

This is overall a good document and worth publishing, although I have found a number of minor issues I would like the authors to address before the document progresses. I initially flagged the first two issues as "major" but on consideration I've moved them to the "minor" list. With the noted exceptions, I think the document is very good in terms of its readability and fitness for publication without major editing.


Major Issues:

- None identified.


Minor Issues:

- Throughout the document, various terms are used to describe what RFC 4271 calls a "route". The definition given in RFC 4271 is:

   Route
      A unit of information that pairs a set of destinations with the
      attributes of a path to those destinations.  The set of
      destinations are systems whose IP addresses are contained in one
      IP address prefix carried in the Network Layer Reachability
      Information (NLRI) field of an UPDATE message.  The path is the
      information reported in the path attributes field of the same
      UPDATE message.

That is, one NLRI plus its path attributes, as carried in an UPDATE, is a "route". I would suggest adopting this term, or "BGP route" if you prefer, instead of terms such as "NLRI UPDATE message", "NLRI message", "prefix UPDATE message", and even just plain "NLRI" and "message". Also some, but not all, of the uses of "prefix". I think doing so will make the document clearer, more readable, and more technically accurate. A simple search for the terms I've called out should show most of them so I won't enumerate them here unless you ask me to (feel free, if you want). 

- Reference [RS-ARCH] is a dead link. I found a live copy at

http://www.cs.usc.edu/assets/003/83191.pdf

It might be worth checking with the authors of RS-ARCH to ask what a good archival reference is.

- S. 4.2 talks about scaling. I'm trying to make sense of the analysis:

   Regardless of any Loc-RIB optimization technique is implemented, the
   route server's control plane bandwidth requirements will scale
   according to O(P * N), where P is the total number of unique paths
   received by the route server and N is the total number of route
   server clients.  

So far so good. (Except nit: there seems to be a word missing, such as "whether" as in "Regardless of whether any Loc-RIB...")

   In the case where P_avg (the arithmetic mean number
   of unique paths received per route server client) remains roughly
   constant even as the number of connected clients increases, this
   relationship can be rewritten as O((P_avg * N) * N) or O(N^2).  

I don't see where the second factor of N comes from. You're basically expanding the P in the first expression as P_avg * N -- but why? I think this would only apply if add-path all-paths was chosen as the path hiding mitigation strategy -- but this is not touched on in route-server-operations, only in ix-bgp-route-server, and besides that the beginning of the paragraph implies you're analyzing the multiple Loc-RIB strategy, so I doubt all-paths is what you were thinking of. If you're not doing all-paths, the O(N^2) analysis is wrong AFAICT. To see this, consider that the inbound routes require O(P_avg * N) which is just O(N), and the number of routes you're going to advertise to each client is bounded by the size of the Internet routing table, which is a constant for purposes of this analysis, so the outbound side is also O(N). In and out are summed, not multiplied, so the whole thing works out to be O(N), not O(N^2).
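
To make the counting concrete, here is a rough Python sketch of the argument. The function name and the numbers are my own placeholders, not anything from the draft, and it assumes the RS advertises at most one route per prefix to each client (that is, no add-path all-paths), so the per-client output is capped by the table size.

```python
def control_plane_routes(n_clients, p_avg, table_size):
    """Upper bound on routes crossing the route server's control plane.

    Assumptions (mine, for illustration): every client sends p_avg routes,
    and the RS sends each client at most table_size routes.
    """
    inbound = p_avg * n_clients         # O(N) when p_avg stays constant
    outbound = table_size * n_clients   # O(N): table_size is a constant bound
    return inbound + outbound           # in and out are summed, not multiplied


if __name__ == "__main__":
    for n in (100, 200, 400):
        total = control_plane_routes(n, p_avg=1_000, table_size=500_000)
        # The per-client figure in the last column does not grow with n.
        print(n, total, total // n)
```

Doubling the number of clients doubles the total, which is what linear scaling looks like; a quadratic bound would quadruple it.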

So I think this needs to either be corrected, or the assumptions need to be better explained. Moving on:

   This
   quadratic upper bound on the network traffic requirements indicates
   that the route server model will not scale to arbitrarily large
   sizes.

If you continue to think this sentence is warranted, I think it should be better quantified. Of course nothing can scale to *arbitrarily* large sizes, but that still leaves a lot to the imagination. I would think it would be beneficial for an IX operator reading this document to be able to have some idea of how practical the limitation is. Since the analysis in question is looking at control traffic bandwidth consumption, it wouldn't be too onerous to throw some simple assumptions up against it -- for example, "if we suppose an RS receives on average 100,000 routes from each client with a rate of change of 10 routes/second, sends on average 1,000,000 routes to each client with a rate of change of 100 routes/second, and that each route consumes on average 50 bytes in a BGP UPDATE message, simple arithmetic shows that a GigE connection to that RS will be fully saturated by the time the number of clients reaches 25,000." (Which does not seem like a very practical limitation; the RS will hit a CPU or memory bottleneck first.)
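
For what it's worth, the arithmetic behind that example can be sketched as follows. This is only one reading of the numbers: it assumes a full-duplex link, counts steady-state churn only (not the initial table transfer), and ignores TCP/IP and BGP header overhead. The constant and function names are mine.

```python
ROUTE_BYTES = 50           # average bytes per route in an UPDATE
OUT_CHURN = 100            # routes/second sent to each client
IN_CHURN = 10              # routes/second received from each client
LINK_BPS = 1_000_000_000   # GigE capacity per direction (full duplex assumed)


def saturating_clients(churn_routes_per_sec,
                       route_bytes=ROUTE_BYTES,
                       link_bps=LINK_BPS):
    """Number of clients at which one direction of the link is full."""
    per_client_bps = churn_routes_per_sec * route_bytes * 8
    return link_bps // per_client_bps


if __name__ == "__main__":
    print(saturating_clients(OUT_CHURN))  # 25000 (outbound fills first)
    print(saturating_clients(IN_CHURN))   # 250000
```

Under these assumptions the outbound direction is the one that saturates, at 25,000 clients.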

Anyway, maybe you will decide on reconsideration of the big-O analysis that this bit is not needed at all, which would be OK with me.

- S. 4.2.2.1,

   If the route server
   operator has prior knowledge of interconnection relationships between
   route server clients, then the operator may configure separate Loc-
   RIBs only for route server clients with unique outbound routing
   policies.

It wasn't obvious to me what "outbound" applies to -- the client? The RS? -- and for that matter why an inbound policy (on the RS) might not apply. Possibly this could be remedied by simply dropping the adjective "outbound".

- S. 4.2.1.2,

   destination splitting would require significant co-ordination
   between the route server operator and each route server client

It's not clear to me why it would "require significant co-ordination", depending on what resource you're trying to conserve. Two examples of how you could avoid coordination while still getting benefit: You could have clients send all their routes to all the RSes, but have RSes filter out the prefixes they don't care about. This gives the RS most of the CPU benefit it would have gotten had the client done the filtering (prefix filtering is cheap), almost all the memory benefit (the filtered routes need not be retained in the Adj-RIB-In), and around half the control traffic bandwidth benefit. The client incurs cost to send duplicate routes that are going to be discarded by the RS, but the client is presumably not the bottleneck resource. Or better still, the RS could use ORF towards the clients to control what routes the clients will send.
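
As a minimal sketch of the first option (RS-side filtering with no client coordination), assuming a hypothetical two-way split of the IPv4 space between two route servers named rs1 and rs2:

```python
import ipaddress

# Hypothetical split: each route server is responsible for half of IPv4 space.
RS_SHARES = {
    "rs1": ipaddress.ip_network("0.0.0.0/1"),
    "rs2": ipaddress.ip_network("128.0.0.0/1"),
}


def accept(rs_name, prefix):
    """Return True if this route server should keep a route for prefix.

    Clients send every route to every RS; each RS drops the routes outside
    its share before they would be stored in the Adj-RIB-In.
    """
    net = ipaddress.ip_network(prefix)
    return net.subnet_of(RS_SHARES[rs_name])


if __name__ == "__main__":
    for prefix in ("10.0.0.0/8", "192.0.2.0/24"):
        print(prefix, [rs for rs in RS_SHARES if accept(rs, prefix)])
```

The split itself is configured only on the route servers, so the clients need not know it exists.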

- S. 4.6.1,

OLD:
   Prefixes sent to the route server are tagged with specific [RFC1997]
   or [RFC4360] BGP community attributes

I don't think the naked references scan well as adjectives in this context. I suggest

NEW:
   Prefixes sent to the route server are tagged with specific standard [RFC1997]
   or extended [RFC4360] BGP community attributes

- Also in S. 4.6.1,

OLD:
   As both standard and extended BGP communities values are restricted
   to 6 octets

Actually standard communities are restricted to less than that (4 octets). Perhaps reword as

NEW:
   As both standard and extended BGP communities values are restricted
   to 6 octets or fewer

- Also in S. 4.6.1,

   route server operator should take care to ensure
   that the predefined BGP community values mechanism used on their
   route server is compatible with [RFC4893] 4-octet autonomous system
   numbers.

I suspect an RS operator reading this might be left scratching his or her head and asking "what does it mean for me to be compatible with RFC4893 in this context"? It would be kind to offer them some guidance, since after all this is a guidance document.
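
My guess at the kind of problem you have in mind, sketched under that assumption: an RFC 1997 community is 4 octets, conventionally split into two 16-bit halves (ASN:value), so a scheme that puts a peer's AS number in one half breaks for 4-octet AS numbers. The function name is hypothetical.

```python
def encode_standard_community(high, low):
    """Pack a standard community as two 16-bit halves (ASN:value convention)."""
    for part in (high, low):
        if not 0 <= part <= 0xFFFF:
            raise ValueError(f"{part} does not fit in 16 bits")
    return (high << 16) | low


if __name__ == "__main__":
    # A 2-octet peer AS fits in the low half.
    print(hex(encode_standard_community(64500, 64501)))
    # A 4-octet peer AS (private-use range) does not.
    try:
        encode_standard_community(64500, 4200000000)
    except ValueError as err:
        print("cannot encode:", err)
```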

- S. 4.7: Where you say "non-commutative" I think you mean "non-transitive".

- S. 4.7:

   Problems of this form can be dealt with using [RFC5881] bidirectional
   forwarding detection.

It's not clear to me how certain non-transitive forwarding failures can be dealt with using BFD. To take an example, suppose clients A, B and C peer with RS. The IX fabric has a failure such that A and B can both reach RS, but not each other. C has connectivity to everyone. Prefix X is advertised to RS by both B and C. For whatever reason, RS selects X via B to advertise to A. Even if A runs BFD towards B, at best A can determine that the route from RS can't be used. A isn't able to fail over to C's route as it would in the full-mesh case, since it's not aware of it. Depending on A's other connectivity, this may result in sub-optimal routing towards X, or complete loss of connectivity to X.
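
The scenario can be modelled in a few lines. This is a toy sketch with names of my own choosing; the liveness check stands in for BFD and nothing here reflects a real implementation.

```python
# Pairs that can exchange traffic across the IX fabric. A and B can both
# reach RS but not each other; C can reach everyone.
REACHABLE = {
    frozenset(pair)
    for pair in [("A", "RS"), ("B", "RS"), ("C", "RS"), ("A", "C"), ("B", "C")]
}


def can_reach(x, y):
    """Stand-in for a BFD-style liveness check between two routers."""
    return frozenset((x, y)) in REACHABLE


def usable_next_hops(client, known_next_hops):
    """Next hops the client knows about and can actually forward to."""
    return [nh for nh in known_next_hops if can_reach(client, nh)]


rs_candidates = ["B", "C"]  # both advertise prefix X to the RS
rs_best = "B"               # the single route the RS picks for A

via_rs = usable_next_hops("A", [rs_best])         # A only learns B's route
full_mesh = usable_next_hops("A", rs_candidates)  # A would learn both

if __name__ == "__main__":
    print("via route server:", via_rs)
    print("full mesh:", full_mesh)
```

With the route server, A detects the failure but has nothing left to fail over to; in the full mesh it still has C's route.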

It's beyond the scope of the draft to solve this problem, but the text could be made more accurate. A minimal fix would be

   Problems of this form can be partially mitigated using [RFC5881] bidirectional
   forwarding detection.

although you might want to go on a bit longer to explain what problems can't be mitigated.

- S. 4.8:

   This problem is not specific to route servers and it can also be
   implemented using bilateral peering sessions.  However, the potential
   damage is amplified by route servers because a single BGP session can
   be used to affect many networks simultaneously.

This is true, but there is a more severe way RSes aggravate the problem: In a full mesh, a router can (and usually does) directly enforce a "no third-party next hops" policy against its peers. An RS peer by definition cannot enforce this policy against the RS, so the RS is the only place it can be enforced.

- S. 4.8:

   Route server operators SHOULD check that the BGP NEXT_HOP attribute
   for NLRIs received from a route server client matches the interface
   address of the client.  If the route server receives an NLRI where
   these addresses are different

so far so good (modulo my first comment about the use of "NLRI", of course), but:

   and where the announcing route server
   client is in a different autonomous system to the route server client
   which uses the next hop address, 

Is the RS sincerely expected to enforce the above? I suppose it could be implemented automatically although imperfectly, by noticing that multiple clients are in the same neighbor AS and noticing when they use each other as third-party next hops, but AFAIK people generally don't try to figure this out, they just do what you've said in the preceding sentence -- make sure the NH matches the interface address. If you really do propose that the RS should allow third-party next hops but only from clients in a common AS, I think you should talk about it specifically and in more detail. If you didn't really mean that, then I suggest you drop the clause. 
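
For comparison, the simple check from the preceding sentence of the draft (next hop must equal the client's interface address, with no same-AS exception) is about this much logic. It is a sketch with a hypothetical function name, not any implementation's actual policy code.

```python
import ipaddress


def next_hop_ok(client_interface_addr, next_hop):
    """Accept a route only if NEXT_HOP is the announcing client's own address.

    Addresses are parsed so that equivalent textual forms compare equal.
    """
    return (ipaddress.ip_address(next_hop)
            == ipaddress.ip_address(client_interface_addr))


if __name__ == "__main__":
    print(next_hop_ok("192.0.2.10", "192.0.2.10"))  # own address
    print(next_hop_ok("192.0.2.10", "192.0.2.20"))  # third-party next hop
```

Allowing third-party next hops within a common AS would need extra state (which clients share an AS, and which addresses belong to them), which is why I think it deserves its own discussion if it is really intended.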

- S. 5:

   On route server installations which do not employ path hiding
   mitigation techniques, the path hiding problem outlined in section
   Section 4.1 can be used in certain circumstances to proactively block
   third party prefix announcements from other route server clients.

I don't understand what this means. Specifically, I don't know what it means to "proactively block third party prefix announcements" or for that matter, even what you mean by "third party prefix announcements" in this context. (As a term of art, I normally understand "third party announcement" in a BGP context to mean announcing a third-party next hop as you discuss in S. 4.8.) I also don't know what the "certain circumstances" are; quite likely these should be given at least a little color if not entirely spelled out.

Also, a nit -- the xref expansion has put "section section" into your text.

- S. 7:

   BIRD, OpenBGPD and Quagga, whose open source BGP implementations
   include route server capabilities 

Great, cool, but:

   which are compliant with this
   document.

I'm not sure what it actually means to be "compliant" with a document that "describes operational considerations". Perhaps just drop the phrase?


Nits:

- In S. 2, 
OLD:
	BGP sessions between each participant router
NEW:
	BGP sessions between each pair of participant routers

- In S. 4.2.1.1, 

OLD:
   In
   this situation, the multiple Loc-RIB views required by each client
   are merged into a single view.

As written, this implies that each client requires multiple Loc-RIB views, which I don't think is what was intended. I suggest:

NEW:
   In
   this situation, multiple Loc-RIB views
   are merged into a single view.

- I personally am strongly put off by the neologism "granular" to mean "fine-grained" and suggest the latter instead. I realize it's not an unusual usage so by all means disregard if you feel strongly about it.

- S. 4.6.2:

OLD:
   server operators to implement construct per-client routing policies.
NEW:
   server operators to construct per-client routing policies.