This is the ARTART review of draft-ietf-cats-usecases-requirements-10. It has no special standing and is offered as input to further discussion of the subject.

While I have never looked at ALTO, I spent 5+ years as an employee of AWS, where a central everyday concern was the design and operation of distributed systems, so I feel I have some exposure to the issues being addressed.

I feel that this document is not suitable for publication as an RFC. Quoting from the Shepherd Report:

   The WG milestones only explicitly say to adopt this document (not to
   publish as an RFC). However, the charter does not preclude this. The
   working group discussed this point and had strong consensus that
   publication as an Informational RFC would be helpful for future
   protocol work.

This document contains a lot of RFC 2119 language, which I don't think belongs in an Informational RFC. After my review, I am left dubious of the claim that this "would be helpful for future protocol work". Perhaps it would be better left as a draft for guiding the work of the WG?

I found this draft difficult (and very time-consuming) to read and am not convinced that it offers practical value. Perhaps it is aimed at a class of system or protocol designer who is working on problems different from those I faced, in which case my experience is not relevant and the comments below are not helpful. If so, sorry.

The draft is extremely verbose, 11K words in length. I found it difficult to read and understand both because of its length and because the language is often general and nontechnical. (The quality of the language also needs work; there are many grammatical errors.) It would benefit from the attention of an editor with the goal of reducing its size and increasing its clarity. For example, I think the entirety of Section 1 could be replaced by the following without loss of value:

   It is often desirable to distribute compute workloads across multiple
   compute resources. These resources can include servers and load
   balancers in data centers and compute capacity deployed in CDN POPs.
   Routing requests for service to such nodes, with the goal of
   providing good response to variable loads, presents multiple complex
   problems.

2, 3.1: "Edge computing" could mean two different things: resources at CDN POPs, or resources at infrastructure locations specialized in mediating access between internal servers and the Internet, offering functions such as load balancing and firewalling. The draft uses the term "edge" in a very generalized way.

I am unconvinced that some of the scenarios offered are realistic:

4.1: "Cloud VR/AR introduces the concept of cloud computing to the rendering of audiovisual assets in such applications. Here, the edge cloud helps encode/decode and render content." I'm surprised. Rendering AR/VR requires considerable compute cycles and typically would be accomplished either on client hardware (mobile phone, AR/VR headset) or on a data-center server, with the results being cached at the edge. But rendering on edge devices? I don't think so. I haven't worked on AR in a few years, so maybe I'm out of date, but this is still surprising.

4.2: Repeated discussions of the same problem, which could be summarized as "try to use the nearest edge PoP to reduce latency, unless it's overloaded, in which case fall back to somewhere else, while reporting the problem".

4.5.2: "Distributed AI training" - is this really a thing? It's not my understanding of how model building/training is done in practice. This and the other use cases would benefit from citations to real-world research.

5.2, R5: "The Resource Model MUST be implementable in an interoperable manner." The use of RFC 2119 language on such a vague, general statement feels like misuse to me. This comment applies to a high proportion of the requirement assertions.

R6: "The Resource Model MUST be executable in a scalable manner. That is, an agent implementing the Resource Model MUST be able to execute it at the required time scale and at an affordable cost (e.g., memory footprint, energy, etc.)" The absence of any discussion of scaling metrics, for example p99 latencies, is striking. Note that Section 5.3 is about metrics, but it provides no examples, nor does it enumerate any specific metrics.

R7: "The Resource Model MUST be useful." Once again, the RFC 2119 language feels inapplicable.

R18: "CATS systems MUST maintain instance affinity for stateful sessions and transactions." This may be true in some service scenarios, but in large-scale distributed systems it can cause all sorts of problems. I personally was severely bitten by a misguided attempt to provide instance affinity in a large-scale cloud application; see https://www.tbray.org/ongoing/When/201x/2019/09/25/On-Sharding (and have a look at some of the other issues discussed there, which feel like they ought to be relevant to this subject matter).

There is no discussion of shuffle sharding, which is overwhelmingly seen as a best practice for making systems resilient in the face of inevitable server failures. In fact, there is little discussion of resilience in the face of server failures at all, which feels like one of the big, hard problems in operating real-world distributed systems.

The Security Considerations section seems short. One of the functions required of every system is authentication of its users, and not all classes of servers can perform this task; how does authentication figure in the CATS ecosystem?
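To make the shuffle sharding point above concrete, here is a minimal sketch of the idea (all names are hypothetical and nothing here comes from the draft): each tenant is deterministically assigned a small pseudo-random subset of the server fleet, so a misbehaving tenant or a failed server degrades only the tenants whose subsets include it, not the whole fleet.

```python
import hashlib

def shard_for(tenant_id: str, servers: list[str], shard_size: int) -> list[str]:
    """Deterministically assign a tenant a small subset of servers.

    Because each tenant's requests are confined to its own subset, a
    "poison pill" tenant or a single server failure affects only the
    tenants whose subsets include that server.
    """
    def rank(server: str) -> bytes:
        # Hash (tenant, server) so each tenant sees its own ordering
        # of the fleet; the same inputs always produce the same shard.
        return hashlib.sha256(f"{tenant_id}|{server}".encode()).digest()
    return sorted(servers, key=rank)[:shard_size]

servers = [f"srv-{i}" for i in range(8)]
shard = shard_for("tenant-a", servers, 2)
# The assignment is stable: the same tenant always maps to the same servers.
assert shard == shard_for("tenant-a", servers, 2)
```

In practice the subset sizes and overlap probabilities are tuned so that any two tenants are unlikely to share their entire subset, which is what bounds the blast radius of a failure.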