This document has been reviewed as part of the transport area review team's ongoing effort to review key IETF documents. These comments were written primarily for the transport area directors, but are copied to the document's authors and WG to allow them to address any issues raised, and also to the IETF discussion list for information. When done at the time of IETF Last Call, the authors should consider this review as part of the last-call comments they receive. Please always CC tsv-art@ietf.org if you reply to or forward this review.

I did not find any particular issues related to transport protocols. However, this document left me wondering about a number of things, so I am not sure I understood the context and requirements properly. With a better understanding, my view on transport protocol issues might change.

While reading the document I encountered a number of issues, noted below, which I believe should be addressed to make it more understandable. From that point of view, I do not think this document is ready to be published.

Reading this document, it seems CATS is trying to solve issues from the application layer down to the routing layer, but it gives no comprehensive hint of where the solution should converge. In many places it describes an application requirement and then tries to justify network responsibilities, which confused me a number of times. It also does not clearly describe the ingestion points at the network, and it assumes the CATS service provider has full control over service instance deployment.

# Introduction

It says:

    Offloading compute intensive processing to the user devices is not acceptable, since it would place pressure on local resources such as the battery and incur some data privacy issues if the needed data for computation is not provided locally.

Is that even an option for CATS or other network resource deployments? If not, why is it mentioned here?

What is an "edge site"? What is the "edge of a network"? I could not find any definition or description of these in this document, so I cannot fully understand their meaning in this context.

The introduction gives the impression that CATS is only for edge computing. Is that the intention?

# Definition of terms

The difference between "Network edge" here and "provider network" in the CATS framework draft is not clear to me. What are the differences? And why is "Network edge" defined here but not used in the rest of the document?

What kind of service identity are we talking about? Is this a service identity that can be obtained by a TLS client, an ALTO endpoint identifier, or something else?

# Problem statement

It says:

    , a number of representative cities have deployed multi-edge sites and the typical applications, and there are more edge sites to be deployed in the future.

I find this unnecessary, especially since no information is provided to verify the claim.

Section 3.1 mentions an expired ALTO draft and the ALTO protocol, but it does not say whether ALTO already helps pick the best node in the network to reach, what is left to be fixed, and what differentiates an ALTO-based solution from CATS.

As per the charter, CATS is for network nodes to pick the right place to deploy the services, but I fail to see that reflected in this problem statement.

What is the difference between an "edge node" and an "edge site"? And how do they relate to "service site" and "service instance" as defined in the CATS framework? If they are supposed to mean the same thing, why aren't the CATS framework terms used here?
It says:

    If the resources are insufficient to support new instances, the operator can be informed to increase the hardware resources.

OK, fine. Does CATS do that? Do we need a protocol to inform the operator? Who tells them about the need to invest in hardware? And why are we talking about this in the problem statement at all?

We have a section in the problem statement that talks about multi-deployment of edge sites and services, but it ends by saying that "where to locate service instances and when to create new ones in order to provide the right levels of resource to support user demands" is out of scope of CATS. So what are the actual problems here?

## Section 3.2

It says:

    Traffic is steered to an edge site that is "closest" or to one of a few "close" sites using load-balancing

Who is doing this steering, and is the load-balancing static? Is there support for mid-session steering and load-balancing? If these are dynamic and only done at the beginning of a client session, then it should already be possible to pick the right edge site rather than merely the closest one. Also, "we assume": who is "we" - the authors, the WG, or the IETF?

Please describe an "edge router".

It says:

    selection of one of candidate service instances is done using traffic steering methods, where the steering decision may take into account pre-planned policies (assignment of certain clients to certain service instances), realize shortest-path to the 'closest' service instance, or utilize more complex and possibly dynamic metric information, such as load of service instances, latency experienced or similar, for a more dynamic selection of a suitable service instance.

Why can't anycast routing be used here? Or is that the idea?

It says:

    It is important to note that clients may move. This means that the service instance that was "best" at one moment might no longer be best when a new service request is issued.

OK, but will CATS solve the issue of mid-session mobility? The service instances will have state, and that state needs to be migrated to the new site, so this is not a plug-and-play solution unless the service is completely stateless. Does this mean the services CATS can support need to be stateless? That looks like a requirement on the service instances. I ask because, as far as I know, the big use cases in this document - AR/VR and vehicles - maintain a lot of state at the server and at the client, and they even need stickiness. Just moving to a lightly loaded service instance might not be the ideal solution. So what is the main problem CATS is going to solve in this context?

# Section 4.1

The meaning of "dynamically steer traffic" is not clear to me. Does it mean steering at the start of a service session, mid-session steering, or both? Can this be clarified?

# Section 4.2

The first paragraph describes the need to increase transmission capacity, video processing and network bandwidth. I fail to see the relation to CATS.

It says:

    The notion of sending the request to the "nearest" edge node is important for being able to collate the video information of "nearby" vehicles, using relative location information among the vehicles. Furthermore, data privacy may lead to a requirement to process the data by an edge node (or an adjacent vehicle as a cluster node) as close to the source as possible to limit the data's spread across many network components in the network
So here the video should not be processed anywhere else; the placement is essentially fixed by the "nearest edge node" policy. Do we need CATS for this?

The third paragraph starts talking about "closest", while the previous paragraph discussed "nearest". Do they mean the same thing? If so, please use the same terminology.

As I understand it, these scenarios can be satisfied by the ALTO protocol, where the network provides information to the clients so they can pick the right service instance or even change it. So it is not clear to me why this is a CATS use case.

I also did not find any discussion of moving speed. Vehicles usually move fast, which means that by the time CATS realizes the compute or bandwidth is loaded, the vehicle may have moved to another base station that might need a new PE/edge site altogether. What are the considerations on the requirements regarding this?

# Section 4.3

This section is clear about the use case of having decentralized storage, but the previous use cases are not clear about how they relate to CATS.

# Requirements

R1: is this to inform the clients accessing the service instances, or is this for the CATS system to decide where to send a client request? The discussion of applications throughout the document so far made me wonder. Also, what is "real-time system state"?

R2: see the comments for R1. What is the periodicity of the "up-to-date status"? Without a proper understanding of that, it is hard to understand the requirement.

R3: are the service instances in the participating edges administered, implemented and deployed by several entities in the service provider's network? If not, why is this a requirement? Even if we agree on a certain metric, say "CPU load": if my instance has 5 fully loaded CPUs and 5 idle ones, then my instance is 50% loaded - will that be an understandable metric for another party? (A small sketch after the R12-R17 comment below illustrates this.)

R4: who are "we" again? And I am not sure I understand the requirements (there are actually two requirements here). What are the requirements on the CATS system?

R5: this seems like a requirement on the resource model, not on CATS.

R6: not clear at all. What is an "agent" here?

R7: not executable unless someone or something decides on the usefulness. My resource model may be very useful to me but garbage to anyone else. I picked my last car just because it has the best sound system; my friend did not find that useful at all.

R8: I am not sure this is a system requirement; it reads more like a requirement on how the working group should decide, if it ever defines metrics.

R9: again, there is more than one requirement here, and I simply do not get the reason for a SHOULD on applicability to non-CATS networks. By the way, is CATS a system or a network?

R10 and R11 are good requirements, as they are easily understandable and executable. However, I am not sure how a fast-changing compute load can be handled while avoiding path oscillation. How do these two work together? (See the small sketch at the end of this review.)

R12 to R17: these could simply be collapsed into two requirements - one for metric collection and one for metric distribution.
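To make the R3 concern concrete, here is an entirely hypothetical illustration: two instances can export the same aggregate CPU metric while having very different remaining capacity, so the bare percentage tells the consumer of the metric very little. The instance names and numbers below are invented for illustration only and are not taken from the draft.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    cores: int
    core_speed_ghz: float
    utilization: float  # 0.0 - 1.0, the aggregate "metric" that would be exported

    @property
    def headroom_ghz(self) -> float:
        # Remaining compute capacity, which the utilization figure alone does not convey.
        return self.cores * self.core_speed_ghz * (1.0 - self.utilization)

small = Instance("edge-a", cores=4,  core_speed_ghz=2.0, utilization=0.5)
large = Instance("edge-b", cores=64, core_speed_ghz=3.0, utilization=0.5)

for inst in (small, large):
    print(f"{inst.name}: {inst.utilization:.0%} utilized, "
          f"{inst.headroom_ghz:.1f} GHz of compute headroom left")
# Output:
#   edge-a: 50% utilized, 4.0 GHz of compute headroom left
#   edge-b: 50% utilized, 96.0 GHz of compute headroom left
```

Both instances report "50% CPU load", yet their remaining capacity differs by more than an order of magnitude, which is why I question whether such a metric is understandable across entities without a shared definition.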
This section uses "service provider" alongside GPU utilization. Is this a CATS service provider or a cloud service provider? If it is the CATS one, I suggest referencing the CATS framework definition.

R18: it says

    the affinity to a particular service instance may span more than one request, as in the AR/VR use case, where the previous client input is needed to render subsequent frames.

To me this is stickiness rather than affinity. It means that when there is strict stickiness, the CATS system MUST NOT migrate the service instance, but rather try to pick a site where the stickiness can be preserved. How does CATS know whether the session/transaction is stateful?

R19: what is the difference between R18 and this one?

R21: is this a requirement on CATS or on the application clients?

# Appendix A

This might be helpful to some, but I am not sure why it is kept here while the text itself says it "is a temporary and procedural section which might be deleted or merged in future updates".

# Appendix B

I find it strange that Appendix B introduces some normative requirements while not being part of the main body. I suggest either removing the normative language or moving these requirements into the main body of the document as use cases (and rechartering the working group to work on them, if they are so interesting and important that they cannot be removed altogether).
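Finally, to illustrate the oscillation concern I raised under R10 and R11: one way a steering point could avoid flapping when the compute metric changes quickly is to apply some dampening or hysteresis before re-steering. The sketch below is purely illustrative; the instance names, thresholds, and the choice to re-steer only new flows are my own assumptions and are not anything the draft specifies.

```python
# Hypothetical dampening logic; thresholds and names are assumptions for illustration.
HYSTERESIS_MARGIN = 0.15   # a candidate must be at least 15% less loaded than the current choice
REQUIRED_SAMPLES = 3       # ... for this many consecutive metric updates before re-steering

class SteeringState:
    def __init__(self, current_instance: str):
        self.current = current_instance
        self.better_streak = 0

    def update(self, load_by_instance: dict[str, float]) -> str:
        """load_by_instance maps instance name -> compute load (lower is better)."""
        best = min(load_by_instance, key=load_by_instance.get)
        clearly_better = (
            best != self.current
            and load_by_instance[best]
            < load_by_instance[self.current] * (1.0 - HYSTERESIS_MARGIN)
        )
        self.better_streak = self.better_streak + 1 if clearly_better else 0
        if self.better_streak >= REQUIRED_SAMPLES:
            # Re-steer only traffic for new sessions; existing flows keep their instance.
            self.current = best
            self.better_streak = 0
        return self.current

state = SteeringState("edge-a")
for sample in [{"edge-a": 0.9, "edge-b": 0.5}] * 4:
    print(state.update(sample))
# Prints: edge-a, edge-a, edge-b, edge-b -- a single better sample never flips the path.
```

If something along these lines is what R10 and R11 are meant to allow, stating so (even informatively) would make these requirements much easier to evaluate together.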