cats                                                      N. Zhang, Ed.
Internet-Draft                                               M. Han, Ed.
Intended status: Standards Track                              X. Yi, Ed.
Expires: 17 September 2026                                  China Unicom
                                                           16 March 2026

       A token-aware traffic steering solution for agent service
                  draft-zhang-cats-token-aware-ts-00

Abstract

   This document proposes a token-aware traffic steering mechanism.  By
   parsing the estimated token length, task type, and semantic urgency
   of requests, and by incorporating network link status, model
   capabilities, and compute resource states into routing decisions,
   this mechanism achieves jointly optimized scheduling of resources.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 17 September 2026.

Copyright Notice

   Copyright (c) 2026 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Revised BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as described
   in the Revised BSD License.

Zhang, et al.
Expires 17 September 2026                                       [Page 1]

Internet-Draft   A token-aware traffic steering solution      March 2026

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Conventions and Definitions . . . . . . . . . . . . . . . . .   3
   3.  Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . .   3
     3.1.  Low-latency interactive control . . . . . . . . . . . . .   3
     3.2.  Complex inference and content generation  . . . . . . . .   3
     3.3.  Resource contention under mixed load  . . . . . . . . . .   4
   4.  Requirements  . . . . . . . . . . . . . . . . . . . . . . . .   4
     4.1.  Token features awareness capability . . . . . . . . . . .   4
     4.2.  Real-time multi-dimensional state monitoring  . . . . . .   4
     4.3.  Dynamic joint scheduling decision . . . . . . . . . . . .   4
     4.4.  Standardized Interfaces and Protocol Extensions . . . . .   5
   5.  Token-aware traffic steering deployment solution  . . . . . .   5
   6.  Deployment Effect . . . . . . . . . . . . . . . . . . . . . .   6
   7.  Security Considerations . . . . . . . . . . . . . . . . . . .   6
   8.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   6
   9.  References  . . . . . . . . . . . . . . . . . . . . . . . . .   6
     9.1.  Normative References  . . . . . . . . . . . . . . . . . .   6
     9.2.  Informative References  . . . . . . . . . . . . . . . . .   7
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .   7

1.  Introduction

   In the era of thriving AI applications, agents have become a critical
   link connecting users and cloud services, and their response speed
   directly determines user experience.  Collaboration between
   terminals and the cloud can improve resource utilization efficiency
   and promote the adoption of intelligent services.  However,
   challenges arise when agents invoke large model services in the
   cloud:

   1.  Data processing in the cloud (especially inference computation)
       accounts for about 85% of the end-to-end latency.
       Simply relying on network path optimization or bandwidth
       guarantees cannot fundamentally resolve the latency caused by
       compute queuing or model overload.

   2.  Requests initiated by agents differ significantly.  Short-
       context tasks (such as command control) and long-context tasks
       (such as complex inference) have different requirements for
       latency sensitivity and computing resources.  Traditional "best-
       effort" scheduling cannot meet these differentiated service
       demands.

   3.  The network lacks awareness of the token characteristics of
       upper-layer applications, leading to resource mismatches and
       user experience bottlenecks.

   To solve these problems, the network should act as a collaborative
   scheduling medium connecting agents and cloud computing.  By
   perceiving the token features of requests, the network can perform
   globally optimal scheduling that combines network status, model
   capabilities, and computing load.  While Compute-Aware Traffic
   Steering (CATS) has proposed joint routing and scheduling based on
   network and computing status [I-D.ietf-cats-framework], existing
   CATS solutions lack awareness of model capabilities and token
   features.  This draft proposes a token-aware traffic steering method
   to improve the end-to-end service experience for agents.

2.  Conventions and Definitions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   Abbreviations and definitions used in this document:

   *  Token: The fundamental unit in which Large Language Models (LLMs)
      process text.

   *  TTFT: Time to First Token, a key metric for measuring the
      response speed of inference services.

   *  SLA: Service Level Agreement.

3.  Use Cases
3.1.  Low-latency interactive control

   When agents respond to immediate user commands (e.g., "turn on the
   light", "play music"), they are expected to provide an extremely
   fast interactive experience.  Such tasks usually involve few tokens
   and simple logic; if they are blocked behind longer tasks, response
   delays become severe.  Therefore, the network needs to identify
   these short-token features and schedule such requests onto low-
   latency paths and edge computing nodes to ensure millisecond-level
   response times.

3.2.  Complex inference and content generation

   When agents process complex tasks such as code generation or long-
   document summarization, they are expected to provide deep
   intellectual assistance to users.  Such tasks involve hundreds or
   even thousands of tokens of computation, requiring large GPU memory
   and stable computing resources; insufficient resources may cause
   task timeouts or interruptions.  Therefore, long-token tasks need to
   be precisely identified and guided to central computing nodes with
   ample capacity to ensure service continuity.

3.3.  Resource contention under mixed load

   During peak business hours, when a large number of simple requests
   and a small number of complex inference tasks concurrently flood
   into the cloud cluster, the system is expected to maximize the
   utilization of computing resources.  However, traditional First-In-
   First-Out queuing often leaves low-latency short tasks blocked
   behind long-computation tasks, significantly degrading overall user
   experience.  Therefore, the network needs to identify task priority
   based on token features and dynamically schedule short tasks to
   lightly loaded nodes or dedicated channels, achieving isolated
   operation and differentiated assurance for tasks of varying
   complexity.

4.  Requirements

   To achieve the aforementioned goals, the system needs to meet the
   following key requirements.
4.1.  Token features awareness capability

   Terminal agents can extract key token features, including estimated
   token length, task type, and semantic urgency.  Terminal agents or
   agent gateways support mapping these features to a Network ID and
   embedding it into network packets, enabling identification by
   network devices.

4.2.  Real-time multi-dimensional state monitoring

   The network can monitor real-time network state such as link
   latency, jitter, available bandwidth, and packet loss rate.  The
   computing side can obtain real-time information on each inference
   node's GPU load, memory usage, current queue length, and estimated
   inference time (TTFT).

4.3.  Dynamic joint scheduling decision

   Policy-based routing algorithms are supported, capable of
   dynamically selecting optimal network paths (e.g., a low-delay
   dedicated line vs. the public internet) and optimal computing nodes
   (e.g., an edge small model vs. a cloud large model) based on token
   features and real-time states.  Priority queue management
   capabilities are required to ensure that high-priority short tasks
   are not blocked by long tasks.

4.4.  Standardized Interfaces and Protocol Extensions

   A standard format for token feature descriptions needs to be
   defined.  Signaling interaction mechanisms need to be specified
   between the network controller and the orchestration system, and
   between the cloud management (Mgmt) system and the orchestration
   system.

5.  Token-aware traffic steering deployment solution

   Figure 1 shows the architectural diagram of this deployment
   solution.
                     +---------------------------+
                     |       Orchestration       |
                     |          System           |
                     +---------------------------+
                          ^                 ^
                          |                 |
                  +-------+-------+     +---+---+
                  |    Network    |     | Cloud |
                  |  Controller   |     | Mgmt  |
                  +-------+-------+     +-------+
                        / | \
   +-------+    +-------+----+                       +---------+
   |       |    |   Agent    |                       |  Edge   |
   | agent |----|  Gateway   |--low-latency path ----| /center |
   |       |    | (identify  |                       |Computing|
   +-------+    |   Token    |--high-latency path----|         |
                |  feature)  |                       +---------+
                +------------+

   The specific implementation process is as follows:

   a.  Token feature extraction: When a request is initiated, the agent
       or agent gateway instantly analyzes the prompt content,
       estimates the number of tokens, and identifies the task type
       (e.g., "immediate control" or "complex inference").  The
       features are then encoded as a token feature ID and encapsulated
       in the packet header using the following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Type      |    Length     |D|          Unassigned         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                       Token feature ID                        +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Length:  8-bit unsigned integer that indicates the total number of
      octets in the value field.

   D:  1-bit field used to indicate whether the current data
      transaction is directed.

   Unassigned:  15-bit field reserved for future use.  These bits MUST
      be set to 0 on transmission and MUST be ignored on receipt.

   Token feature ID:  64-bit group ID of the token feature.

   b.  Real-time state awareness: The orchestration system continuously
       collects network-wide link quality (bandwidth, delay, jitter)
       and the real-time status of each compute node (GPU utilization,
       queue length, estimated TTFT).
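   As a rough illustration of step (a), the option layout above can be
   packed and parsed as sketched below.  This is not a normative
   encoding: the Type code and the interpretation of the Length field
   as counting the flag octets plus the Token feature ID octets are
   assumptions for illustration only, since this draft does not yet
   assign them.

```python
import struct

# Hypothetical option type code; this draft does not assign one, so the
# value below is purely illustrative.
TOKEN_FEATURE_TYPE = 0x01

# Assumed interpretation: the value field counted by Length is the
# D/Unassigned flags (2 octets) plus the Token feature ID (8 octets).
VALUE_LEN = 10

def encode_token_feature(feature_id: int, directed: bool) -> bytes:
    """Pack Type (8 bits), Length (8 bits), D flag (1 bit),
    Unassigned (15 bits, set to zero), Token feature ID (64 bits)."""
    flags = 0x8000 if directed else 0x0000  # D is the most significant bit
    return struct.pack("!BBHQ", TOKEN_FEATURE_TYPE, VALUE_LEN,
                       flags, feature_id)

def decode_token_feature(option: bytes):
    """Unpack the option; the Unassigned bits are ignored on receipt."""
    opt_type, length, flags, feature_id = struct.unpack("!BBHQ", option)
    directed = bool(flags & 0x8000)
    return opt_type, length, directed, feature_id
```

   A gateway would emit such an option alongside the request packet; a
   token-aware network device would decode it and feed the token
   feature ID into its steering policy.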
   c.  Dynamic joint routing decision: The orchestration system matches
       request features with real-time states:

       *  Short/urgent tasks: routed preferentially to the nearest edge
          node with the lowest network latency.

       *  Long/heavy tasks: routed preferentially to a central cloud
          node with ample computing capacity and shorter queues.

   d.  Execution and feedback: The network forwards traffic along the
       specified path, and the computing side processes its queue
       according to priority.  After a task completes, actual latency
       data is fed back to optimize subsequent scheduling strategies.

6.  Deployment Effect

   By employing refined task-priority identification, this solution
   eliminates long-task blocking and resource contention, improving the
   accuracy of service guarantees for critical business (e.g.,
   interactive control).  By jointly monitoring network link latency
   and node TTFT, it simultaneously achieves acceleration for low-
   latency short tasks and stability guarantees for computation-
   intensive tasks.

7.  Security Considerations

   TBD

8.  IANA Considerations

   TBD

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

9.2.  Informative References

   [I-D.ietf-cats-framework]
              Li, C., Du, Z., Boucadair, M., Contreras, L. M., and J.
              Drake, "A Framework for Computing-Aware Traffic Steering
              (CATS)", Work in Progress, Internet-Draft,
              draft-ietf-cats-framework-22, 14 March 2026.

Authors' Addresses

   Naihan Zhang (editor)
   China Unicom
   Beijing
   China
   Email: zhangnh12@chinaunicom.cn

   Mengyao Han (editor)
   China Unicom
   Beijing
   China
   Email: hanmy12@chinaunicom.cn

   Xinxin Yi (editor)
   China Unicom
   Beijing
   China
   Email: yixx3@chinaunicom.cn