Computing-Aware Traffic Steering                                Y. Kehan
Internet-Draft                                              China Mobile
Intended status: Informational                                    H. Shi
Expires: 14 July 2025                                              C. Li
                                                     Huawei Technologies
                                                         L. M. Contreras
                                                              Telefonica
                                                           J. Ros-Giralt
                                                   Qualcomm Europe, Inc.
                                                         10 January 2025

                        CATS Metrics Definition
                  draft-ietf-cats-metric-definition-00

Abstract

This document defines a set of computing metrics used for Computing-Aware Traffic Steering (CATS).

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 14 July 2025.

Copyright Notice

Copyright (c) 2025 IETF Trust and the persons identified as the document authors. All rights reserved.

Kehan, et al.             Expires 14 July 2025                  [Page 1]
Internet-Draft                CATS Metrics                  January 2025

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.

Table of Contents

   1. Introduction
   2. Conventions and Definitions
   3. Definition of Metrics
     3.1. Level 0: Raw Metrics
     3.2. Level 1: Normalized Metrics in Categories
     3.3. Level 2: Fully Normalized Metric
   4. Representation of Metrics
     4.1. Level 0 Metric Representation
       4.1.1. Compute Raw Metrics
       4.1.2. Storage Raw Metrics
       4.1.3. Network Raw Metrics
       4.1.4. Delay Raw Metrics
       4.1.5. Considerations on the Sources of Metrics and the Statistics
     4.2. Level 1 Metric Representation
       4.2.1. Normalized Compute Metrics
       4.2.2. Normalized Storage Metrics
       4.2.3. Normalized Network Metrics
       4.2.4. Normalized Delay
       4.2.5. Considerations on the Sources of Metrics and the Statistics
     4.3. Level 2 Metric Representation
   5. Comparison of three layers of metric
   6. Security Considerations
   7. IANA Considerations
   8. References
     8.1. Normative References
     8.2. Informative References
   Authors' Addresses

1. Introduction

Service providers are deploying computing capabilities across the network for hosting applications such as distributed AI workloads, AR/VR and driverless vehicles, among others.
In these deployments, multiple service instances are replicated across various sites to ensure sufficient capacity for maintaining the required Quality of Experience (QoE) expected by the application. To support the selection of these instances, a framework called Computing-Aware Traffic Steering (CATS) is introduced in [I-D.ietf-cats-framework].

CATS is a traffic engineering approach that optimizes the steering of traffic to a given service instance by considering the dynamic nature of computing and network resources. To achieve this, CATS components (C-PS, C-Forwarders, etc.) require performance metrics for both communication and compute resources. Since these resources are deployed by multiple providers, standardized metrics are essential to ensure interoperability and enable precise traffic steering decisions, thereby optimizing resource utilization and enhancing overall system performance.

Various considerations for metric definition are proposed in [I-D.du-cats-computing-modeling-description], which are useful for defining computing metrics.

This document categorizes the relevant compute and network metrics for CATS into three levels based on their complexity and granularity, following the considerations outlined in [I-D.du-cats-computing-modeling-description].

2. Conventions and Definitions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

This document uses the following terms defined in [I-D.ietf-cats-framework]:

*  Computing-Aware Traffic Steering (CATS)

*  Service

*  Service contact instance

3.
Definition of Metrics

Introducing a definition of metrics requires balancing the following trade-off: if the metrics are too fine-grained, they become unscalable due to the excessive number of metrics that must be communicated through the metrics distribution protocol. (See [I-D.rcr-opsawg-operational-compute-metrics] for a discussion of metrics distribution protocols.) Conversely, if the metrics are too coarse-grained, they may lack the information necessary to make informed decisions. To ensure scalability while providing sufficient detail for effective decision-making, we propose a definition of metrics that incorporates three levels of abstraction:

*  *Level 0 (L0): Raw metrics.* These metrics are presented without abstraction, with each metric using its own unit and format as defined by the underlying resource.

*  *Level 1 (L1): Normalized metrics in categories.* These metrics are derived by aggregating L0 metrics into multiple categories, such as network, computing, and storage. Each category is summarized with a single L1 metric by normalizing it into a value within a defined range of scores.

*  *Level 2 (L2): Fully normalized metric.* These metrics are derived by aggregating lower level metrics (L0 or L1) into a single L2 metric, which is then normalized into a value within a defined range of scores.

3.1. Level 0: Raw Metrics

Level 0 metrics encompass detailed, raw metrics, including but not limited to:

*  CPU: Base frequency, boosted frequency, number of cores, core utilization, memory bandwidth, memory size, memory utilization, power consumption.

*  GPU: Frequency, number of render units, memory bandwidth, memory size, memory utilization, core utilization, power consumption.

*  NPU: Computing power, utilization, power consumption.

*  Network: Bandwidth, capacity, throughput, transmit bytes, receive bytes, host bus utilization.

*  Storage: Available space, read speed, write speed.

*  Delay: Time taken to process a request.
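As a rough illustration of how the raw metrics listed above might be held by a collector, the following Python sketch models each L0 metric as a record that carries its own name, value, and unit. The class name RawMetric, the field names, and the sample values are assumptions made for this example only; this document does not define them.

```python
from dataclasses import dataclass


# Hypothetical L0 record: every raw metric keeps its own unit and format.
@dataclass(frozen=True)
class RawMetric:
    resource: str   # e.g. "CPU", "GPU", "NPU", "Network", "Storage", "Delay"
    name: str       # metric name within that resource
    value: float    # measured or nominal value
    unit: str       # unit as defined by the underlying resource


def group_by_resource(metrics):
    """Group raw metrics by the resource they describe."""
    groups = {}
    for metric in metrics:
        groups.setdefault(metric.resource, []).append(metric)
    return groups


# Illustrative sample values only; these are not measurements.
samples = [
    RawMetric("CPU", "core_utilization", 0.62, "ratio"),
    RawMetric("CPU", "number_of_cores", 32, "count"),
    RawMetric("Network", "throughput", 4.5, "Gb/s"),
    RawMetric("Storage", "available_space", 512, "GB"),
    RawMetric("Delay", "request_processing_time", 850, "us"),
]
groups = group_by_resource(samples)
```

Because every record carries a different unit, a consumer has to understand each metric individually, which is the scalability cost of L0 that the higher levels try to avoid.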
L0 metrics can be encoded into an Application Programming Interface (API), such as a RESTful API, and can be solution-specific. Different resources can have their own metrics, each conveying unique information about their status. These metrics generally have units, such as bits per second (bps) or floating-point operations per second (FLOPS).

Regarding network-related information, the IPPM WG has defined various types of metrics in [performance-metrics]. Additionally, in [RFC9439], the ALTO WG has introduced an extended set of metrics related to packet performance and throughput/bandwidth. For compute metrics, [I-D.rcr-opsawg-operational-compute-metrics] lists a set of cloud resource metrics.

3.2. Level 1: Normalized Metrics in Categories

L1 metrics are organized into distinct categories, such as computing, networking, storage, and delay. Each L0 metric is classified into one of these categories. Within each category, a single L1 metric is computed using an _aggregation function_ and normalized to a unitless score that represents the performance of the underlying resources according to that category. Potential categories include:

*  *Computing:* A normalized value derived from computing-related L0 metrics, such as CPU, GPU, and NPU metrics.

*  *Networking:* A normalized value derived from network-related L0 metrics.

*  *Storage:* A normalized value derived from storage-related L0 metrics.

*  *Delay:* A normalized value derived from computing, networking, and storage metrics, reflecting the end-to-end processing delay of a request.

Editor note: detailed categories can be updated according to the CATS WG discussion.

The L0 metrics, such as those defined in [performance-metrics], [RFC9439], and [I-D.rcr-opsawg-operational-compute-metrics], can be categorized into the aforementioned categories.
Each category will employ its own aggregation function (e.g., a weighted sum) to generate the normalized value. This approach allows the protocol to focus solely on the metric categories and their normalized values, thereby avoiding the need to process solution-specific detailed metrics.

3.3. Level 2: Fully Normalized Metric

The L2 metric is a single score value derived from the lower level metrics (L0 or L1) using an aggregation function. Different implementations may employ different aggregation functions to characterize the overall performance of the underlying compute and communication resources. The L2 metric reduces the complexity of collecting and distributing numerous lower-level metrics by consolidating them into a single, unified score.

TODO: Some implementations may support configuring Ingress CATS-Forwarders with the metric normalizing method so that they can decode the effect of the L1 or L0 metrics.

Figure 1 shows the logic of metrics in Level 0, Level 1, and Level 2.

                                       +--------+
                            L2 Metric: |   M2   |
                                       +---^----+
                                           |
                         +-----------------+---------------+
                         |                 |               |
                     +---+----+        +---+----+     +----+---+
         L1 Metrics: |  M1-1  |        |  M1-2  |     |  M1-3  | (...)
                     +---^----+        +---^----+     +----^---+
                         |                 |               |
              +--------+-+-------+       +-+-------+       |
              |        |         |       |         |       |
           +--+---+ +--+---+ +---+--+ +--+---+ +---+--+ +--+---+
L0 Metrics:| M0-1 | | M0-2 | | M0-3 | | M0-4 | | M0-5 | | M0-6 | (...)
           +------+ +------+ +------+ +------+ +------+ +------+

                Figure 1: Logic of CATS Metrics in levels

4. Representation of Metrics

This section describes the detailed representation of metrics. [RFC9439] provides a useful model for representing network metrics that are exposed to applications. This document further describes the representation of CATS metrics.
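The hierarchy in Figure 1 can be sketched as follows, assuming a weighted sum as the aggregation function at both steps and a score range of 0 to 100. The score range, the weights, the scaling bounds, the sample readings, and all function names are hypothetical; each implementation may choose its own aggregation function.

```python
def normalize(value, worst, best, max_score=100):
    """Map a raw value onto 0..max_score, where `best` scores highest."""
    ratio = (value - worst) / (best - worst)
    ratio = min(1.0, max(0.0, ratio))  # clamp out-of-range readings
    return round(ratio * max_score)


def weighted_sum(scores, weights):
    """Weighted average of scores; weights are rescaled to sum to 1."""
    total = sum(weights)
    return round(sum(s * w for s, w in zip(scores, weights)) / total)


# L0 -> L1: one unitless score per category (hypothetical inputs and bounds).
compute_l1 = weighted_sum(
    [
        normalize(0.30, worst=1.0, best=0.0),  # CPU utilization, lower is better
        normalize(0.50, worst=1.0, best=0.0),  # GPU utilization, lower is better
    ],
    weights=[1, 1],
)
network_l1 = normalize(8.0, worst=0.0, best=10.0)  # available bandwidth in Gb/s
delay_l1 = normalize(200, worst=1000, best=0)      # processing delay in microseconds

# L1 -> L2: a single score for the service instance.
l2_score = weighted_sum([compute_l1, network_l1, delay_l1], weights=[2, 1, 1])
```

With these sample inputs the category scores are 60, 80, and 80, and the resulting L2 score is 70. Only the final score, or the three category scores, would need to be distributed, rather than every raw reading.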
At each level, every metric is represented with a set of common fields, including metric type, unit, and precision. The metric type is a label that lets network devices recognize what the metric is. The unit and precision are usually associated with the metric. The number of bits that a metric occupies in a protocol must also be specified.

Beyond these basic fields, the source of each metric must also be declared, since there are multiple levels of metrics and their sources differ. As defined in [RFC9439], there are three cost-sources: nominal, sla, and estimation. This document further divides the estimation type into three sub-types: direct measurement, aggregation, and normalization, since different levels of metrics are acquired from different sources. Directly measured metrics have physical meanings and units and undergo no processing. Aggregated metrics may or may not be physically meaningful, but they retain the meaning of the directly measured metrics they are derived from. Normalized metrics may or may not have physical meanings, but they have no units; they are simply numbers used for routing decisions.

For finer granularity, this document follows the definitions of metric statistics in [RFC9439].

4.1. Level 0 Metric Representation

Raw metrics have exact physical meanings and units. They are directly measured by the providers of the underlying computing resources. Many metrics at this level have already been defined by the IT industry and other standards organizations [DMTF], so this document only shows some examples of the different categories of metrics for reference.

4.1.1. Compute Raw Metrics

The metric type of a compute resource is named "compute_type: CPU" or "compute_type: GPU". The frequency unit is GHz, and the compute capability unit is FLOPS. The format should support integer and FP8. The metric occupies 4 octets.
Example:

   Basic fields:
         Metric type: "compute_type: CPU"
         Format: integer, FP8
         Bits occupation: 4 octets
   Special fields:
         Frequency unit: GHz
         Compute capabilities unit: FLOPS
   Source:
         Direct measurement
   Statistics:
         Mean

           Figure 2: An Example for Compute Raw Metrics

4.1.2. Storage Raw Metrics

The metric type of a storage resource such as an SSD is named "storage_type: SSD". The storage space unit is megabytes (MB). The format is integer. The metric occupies 2 octets. The unit of read or write speed is MB per second.

Example:

   Basic fields:
         Metric type: "storage_type: SSD"
         Format: integer
         Unit: GB
         Bits occupation: 2 octets
   Source:
         nominal
   Statistics:
         cur

           Figure 3: An Example for Storage Raw Metrics

4.1.3. Network Raw Metrics

The metric type of a network resource such as bandwidth is named "network_type: Bandwidth". The unit is gigabits per second (Gb/s). The format is integer. The metric occupies 2 octets. The unit of TXBytes and RXBytes is MB per second.

Example:

   Basic fields:
         Metric type: "network_type: Bandwidth"
         Format: integer
         Unit: Gb/s
         Bits occupation: 2 octets
   Source:
         nominal
   Statistics:
         cur

           Figure 4: An Example for Network Raw Metrics

4.1.4. Delay Raw Metrics

Delay is a synthesized metric that is influenced by computing, storage access, and network transmission. It is named "delay_raw". The format should support integer and FP8. Its unit is the microsecond. The metric occupies 4 octets.

Example:

   Basic fields:
         Metric type: "delay_raw"
         Format: integer, FP8
         Unit: Microsecond (us)
         Bits occupation: 4 octets
   Source:
         aggregation
   Statistics:
         max

           Figure 5: An Example for Delay Raw Metrics

4.1.5. Considerations on the Sources of Metrics and the Statistics

The sources of L0 metrics can be nominal, directly measured, or aggregated. Nominal L0 metrics are provided initially by resource providers.
Dynamic L0 metrics are measured and updated while the service is running. L0 metrics can also be aggregated when there are multiple service instances.

The statistics of L0 metrics follow the definitions in Section 3.2 of [RFC9439].

4.2. Level 1 Metric Representation

Normalized metrics in categories have physical meanings but no units. They are numbers produced by some form of abstraction, yet each still identifies its category, which is useful when a use case requires more attention to a specific type of metric.

4.2.1. Normalized Compute Metrics

The metric type of normalized compute metrics is "compute_norm", and its format is integer. It has no unit. It occupies one octet.

Example:

   Basic fields:
         Metric type: "compute_norm"
         Format: integer
         Bits occupation: an octet
         Score: 1
   Source:
         normalization

        Figure 6: An Example for Normalized Compute Metrics

4.2.2. Normalized Storage Metrics

The metric type of normalized storage metrics is "storage_norm", and its format is integer. It has no unit. It occupies one octet.

Example:

   Basic fields:
         Metric type: "storage_norm"
         Format: integer
         Bits occupation: an octet
         Score: 1
   Source:
         normalization

        Figure 7: An Example for Normalized Storage Metrics

4.2.3. Normalized Network Metrics

The metric type of normalized network metrics is "network_norm", and its format is integer. It has no unit. It occupies one octet.

Example:

   Basic fields:
         Metric type: "network_norm"
         Format: integer
         Bits occupation: an octet
         Score: 1
   Source:
         normalization

        Figure 8: An Example for Normalized Network Metrics

4.2.4. Normalized Delay

The metric type of normalized delay metrics is "delay_norm", and its format is integer. It has no unit. It occupies one octet.

Example:

   Basic fields:
         Metric type: "delay_norm"
         Format: integer
         Bits occupation: an octet
         Score: 1
   Source:
         normalization

        Figure 9: An Example for Normalized Delay Metrics
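To illustrate the one-octet score used by the normalized metrics above, the following Python sketch packs a metric type and its score into two octets. The numeric type codes and the two-octet layout are purely hypothetical; this document does not define an on-the-wire encoding.

```python
import struct

# Hypothetical numeric codes for the L1 metric types named above.
TYPE_CODES = {
    "compute_norm": 1,
    "storage_norm": 2,
    "network_norm": 3,
    "delay_norm": 4,
}
TYPE_NAMES = {code: name for name, code in TYPE_CODES.items()}


def encode_l1(metric_type, score):
    """Pack (type code, score) as two unsigned octets in network order."""
    if not 0 <= score <= 255:
        raise ValueError("score must fit in one octet")
    return struct.pack("!BB", TYPE_CODES[metric_type], score)


def decode_l1(data):
    """Inverse of encode_l1: return (metric type name, score)."""
    code, score = struct.unpack("!BB", data)
    return TYPE_NAMES[code], score


# Mirrors the example in Figure 6: type "compute_norm", score 1.
wire = encode_l1("compute_norm", 1)
```

A one-octet integer score limits the range to 0..255, so any normalization function feeding this field would need to map its output into that range.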
4.2.5. Considerations on the Sources of Metrics and the Statistics

The source of L1 metrics is normalization. Based on L0 metrics, service providers design their own algorithms to normalize metrics, for example by assigning a different cost value to each raw metric and summing the results. L1 metrics do not need further statistical values.

4.3. Level 2 Metric Representation

A fully normalized metric is a single value that has no physical meaning or unit. Each provider may have its own method of deriving the value, but all providers must follow the definition in this section to represent the fully normalized value.

The metric type is "norm_fi". The format of the value is non-negative integer. It has no unit. It occupies one octet.

Example:

   Basic fields:
         Metric type: "norm_fi"
         Format: non-negative integer
         Bits occupation: an octet
         Score: 1
   Source:
         normalization

        Figure 10: An Example for Fully Normalized Metric

The fully normalized value also supports aggregation when multiple service instances provide fully normalized values. When providing fully normalized values, service instances do not need to provide further statistics.

5. Comparison of three layers of metric

From L0 to L1 to L2, the computing metrics are progressively consolidated. Different levels of abstraction can meet the requirements of different services. Table 1 shows the comparison among metric levels.
   +=========+=============+===============+===========+==========+
   | Level   | Encoding    | Extensibility | Stability | Accuracy |
   |         | Complexity  |               |           |          |
   +=========+=============+===============+===========+==========+
   | Level 0 | Complicated | Bad           | Bad       | Good     |
   +---------+-------------+---------------+-----------+----------+
   | Level 1 | Medium      | Medium        | Medium    | Medium   |
   +---------+-------------+---------------+-----------+----------+
   | Level 2 | Simple      | Good          | Good      | Medium   |
   +---------+-------------+---------------+-----------+----------+

              Table 1: Comparison among Metrics Levels

Because Level 0 metrics are raw, different services may define their own metrics, resulting in hundreds or thousands of metrics in total. This brings significant complexity to protocol encoding and standardization, so metrics of this kind are typically used case by case in customized IT systems. Level 1 metrics are grouped into several categories, each normalized into a single value, so they can be encoded into a protocol and standardized. With Level 2, all metrics are normalized into one single metric, which is even easier to encode in a protocol and to standardize. Therefore, from the encoding complexity aspect, Level 2 and Level 1 metrics are suggested.

Similarly, when considering extensibility, new services can define their own new L0 metrics, which requires the protocol to be extended as needed. Too many metric types create a lot of overhead for the protocol, resulting in poor extensibility. Level 1 introduces only a few metric categories, which is acceptable for protocol extension. Level 2 needs only one single metric, so it places the least burden on the protocol. Therefore, from the extensibility aspect, Level 2 and Level 1 metrics are suggested.
Regarding stability, each new Level 0 raw metric may require a new protocol extension, which makes the protocol format unstable. Therefore, this document does not recommend standardizing Level 0 metrics in the protocol. Level 1 metrics require only a few categories, and the Level 2 metric introduces only one metric to the protocol, so they are preferred from the stability aspect.

In conclusion, for computing-aware traffic steering, it is recommended to use the L2 metric due to its simplicity. If advanced scheduling is needed, L1 metrics can be used. L0 metrics are the most comprehensive and dynamic; therefore, transferring them to network devices is discouraged due to their high overhead.

Editor notes: this draft can be updated according to the discussion of metric definition in CATS WG.

6. Security Considerations

TBD

7. IANA Considerations

TBD

8. References

8.1. Normative References

[I-D.ietf-cats-framework]
   Li, C., Du, Z., Boucadair, M., Contreras, L. M., and J. Drake, "A Framework for Computing-Aware Traffic Steering (CATS)", Work in Progress, Internet-Draft, draft-ietf-cats-framework-04, 17 October 2024.

[RFC2119]
   Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.

[RFC8174]
   Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017.

8.2. Informative References

[DMTF]
   "DMTF", n.d.

[I-D.du-cats-computing-modeling-description]
   Du, Z., Yao, K., Li, C., Huang, D., and Z. Fu, "Computing Information Description in Computing-Aware Traffic Steering", Work in Progress, Internet-Draft, draft-du-cats-computing-modeling-description-03, 6 July 2024.

[I-D.rcr-opsawg-operational-compute-metrics]
   Randriamasy, S., Contreras, L. M., Ros-Giralt, J., and R.
   Schott, "Joint Exposure of Network and Compute Information for Infrastructure-Aware Service Deployment", Work in Progress, Internet-Draft, draft-rcr-opsawg-operational-compute-metrics-08, 21 October 2024.

[performance-metrics]
   "performance-metrics", n.d.

[RFC9439]
   Wu, Q., Yang, Y., Lee, Y., Dhody, D., Randriamasy, S., and L. Contreras, "Application-Layer Traffic Optimization (ALTO) Performance Cost Metrics", RFC 9439, DOI 10.17487/RFC9439, August 2023.

Authors' Addresses

   Kehan Yao
   China Mobile
   China
   Email: yaokehan@chinamobile.com

   Hang Shi
   Huawei Technologies
   China
   Email: shihang9@huawei.com

   Cheng Li
   Huawei Technologies
   China
   Email: c.l@huawei.com

   L. M. Contreras
   Telefonica
   Email: luismiguel.contrerasmurillo@telefonica.com

   Jordi Ros-Giralt
   Qualcomm Europe, Inc.
   Email: jros@qti.qualcomm.com