Workgroup: Network Working Group
Internet-Draft: draft-song-dmsc-promblem-and-requirements-00
Published: January 2025
Intended Status: Informational
Expires: 10 July 2025
Authors:
   E. Song, Alibaba Cloud
   Y. Song, Alibaba Cloud
   S. Zhang, Alibaba Cloud
   X. Li, Alibaba Cloud
   J. Zhao, Alibaba Cloud

Problem Statements of Service Mesh Infrastructure and Requirements of DMSC

Abstract

Service meshes, as an infrastructure component, have been widely adopted by major public cloud providers. Their main functions include policy routing, precise traffic allocation, and traffic throttling. Current service mesh designs and implementations take a centralized control approach, which brings various challenges for deployment and further development. This document analyzes the problems that exist in current service mesh implementations and provides requirements for a future distributed micro service communication (DMSC) infrastructure.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 10 July 2025.

Table of Contents

1.  Introduction
2.  Problem Statements of Current Service Mesh Infrastructure
  2.1.  Service Mesh is Highly Coupled with User Service
  2.2.  Service Mesh Introduces Additional Performance Overhead
  2.3.  Service Mesh Results in High Resource Consumption
  2.4.  Service Mesh Incurs Overhead in Control Plane
3.  Requirements of Distributed Micro Service Communication (DMSC)
  3.1.  Non-intrusive Service Mesh for User Applications
  3.2.  Reduce Control Plane Overhead
  3.3.  Improve Data Plane Performance
  3.4.  Implement an Application Mesh that is Not Limited to Kubernetes
4.  Security Considerations
5.  Acknowledgement
6.  IANA Considerations
7.  References
  7.1.  Normative References
  7.2.  Informative References
Authors' Addresses

1. Introduction

Service meshes, as an infrastructure component, facilitate communication between services. Major public cloud providers such as AWS, Azure, GCP, and Alibaba Cloud have all introduced service mesh-based products to simplify the building and management of microservices-based applications. In many service mesh frameworks, a key component is the sidecar proxy, which is responsible for managing pod traffic and implementing functionalities such as policy routing, precise traffic allocation, and traffic throttling. By decoupling network functionalities into the sidecar, flexible traffic management can be achieved without altering the user business logic. However, deploying sidecars in production environments reveals certain performance bottlenecks, which have also been reported in other literature [Dissecting].

This document analyzes the problems that exist in current service mesh implementations and provides requirements for a future distributed micro service communication (DMSC) infrastructure.

2. Problem Statements of Current Service Mesh Infrastructure

2.1. Service Mesh is Highly Coupled with User Service

In the model where a sidecar (such as in Istio Mesh [Istio]) is deployed within each pod, the sidecar is embedded within the user application's pod and is responsible for handling the application's communication tasks. The sidecar coexists with the user application, sharing the pod's resources [SPRIGHT] [CanalMesh]. To ensure uninterrupted communication between applications and to avoid the resource waste caused by isolated sidecars, the sidecar and the application are designed to be created, destroyed, and scaled simultaneously, sharing the same life cycle. However, this design introduces stability and security issues; for example, a memory leak in the sidecar may crash the application, and upgrading the sidecar requires restarting the pod, interrupting the application's operation.
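
The coupling described above is visible in how the sidecar is declared: it is simply another container in the application's pod specification, so it is created, destroyed, scaled, and restarted together with the application and draws from the same resource pool. The following Go sketch, built on the Kubernetes API types, is illustrative only; the container names, images, and resource figures are assumptions and are not taken from any particular mesh product.

   package main

   import (
       "fmt"

       corev1 "k8s.io/api/core/v1"
       "k8s.io/apimachinery/pkg/api/resource"
   )

   func main() {
       pod := corev1.PodSpec{
           Containers: []corev1.Container{
               {
                   // The user's business container.
                   Name:  "app",
                   Image: "registry.example.com/shop/frontend:1.0", // hypothetical image
                   Resources: corev1.ResourceRequirements{
                       Requests: corev1.ResourceList{
                           corev1.ResourceCPU:    resource.MustParse("900m"),
                           corev1.ResourceMemory: resource.MustParse("900Mi"),
                       },
                   },
               },
               {
                   // The injected sidecar proxy. Because it is listed in the
                   // same PodSpec, it shares the pod's life cycle and resource
                   // pool with the application; upgrading it requires
                   // restarting the whole pod.
                   Name:  "sidecar-proxy",
                   Image: "registry.example.com/mesh/proxy:1.0", // hypothetical image
                   Resources: corev1.ResourceRequirements{
                       Requests: corev1.ResourceList{
                           corev1.ResourceCPU:    resource.MustParse("100m"),
                           corev1.ResourceMemory: resource.MustParse("100Mi"),
                       },
                   },
               },
           },
       }
       fmt.Printf("pod declares %d containers sharing one life cycle\n",
           len(pod.Containers))
   }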

2.2. Service Mesh Introduces Additional Performance Overhead

Since traffic needs to be processed by the sidecar, the outgoing traffic from the user application is redirected to the sidecar (for example, using iptables), which introduces additional processing steps [SPRIGHT] [CanalMesh]. Specifically, at both the source and the destination, this traffic redirection introduces two additional context switches, memory copies, and protocol stack processing overhead [SPRIGHT]. Furthermore, the sidecar is required to perform complex Layer 7 (L7) tasks, such as CPU-intensive TLS encryption and decryption, which may lead to further significant performance degradation.
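
A minimal, self-contained Go sketch of this data path is shown below: the application's outbound connection is terminated by a local proxy listener, and the payload is then copied over a second connection toward the real destination, which is where the extra context switches, memory copies, and protocol stack traversals come from. The listening port and upstream address are illustrative assumptions, and a real sidecar would additionally perform L7 parsing and TLS on this path.

   package main

   import (
       "io"
       "log"
       "net"
   )

   const (
       proxyAddr    = "127.0.0.1:15001" // local port traffic is redirected to (assumed)
       upstreamAddr = "10.0.0.2:8080"   // original destination of the application (assumed)
   )

   func main() {
       ln, err := net.Listen("tcp", proxyAddr)
       if err != nil {
           log.Fatal(err)
       }
       for {
           downstream, err := ln.Accept() // one extra socket pair per connection
           if err != nil {
               log.Fatal(err)
           }
           go func(downstream net.Conn) {
               defer downstream.Close()
               // Second connection to the real target.
               upstream, err := net.Dial("tcp", upstreamAddr)
               if err != nil {
                   return
               }
               defer upstream.Close()
               // Each io.Copy is an additional user-space copy plus kernel
               // crossings that would not exist without the proxy; L7 work
               // such as TLS would be layered on top of this path.
               go io.Copy(upstream, downstream)
               io.Copy(downstream, upstream)
           }(downstream)
       }
   }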

2.3. Service Mesh Results in High Resource Consumption

Since the sidecar is deployed within the user pod, it consumes resources that would otherwise be allocated to the user application. For example, a customer with 500 nodes and 15,000 pods found that the sidecars consumed 1,500 CPU cores (10% of the total) and 5,000 GB of memory (10% of the total) [CanalMesh]. In extreme cases, the CPU and memory usage of the sidecar can even exceed that of the application itself, due to the complex Layer 7 functionalities it provides. This issue has raised concerns among customers, as the pod resources they purchased are not fully utilized for running their applications. Additionally, measurement results indicate that achieving optimal performance may even require over-provisioning resources for the sidecar.
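
For concreteness, the quoted figures can be turned into a back-of-the-envelope calculation: 1,500 cores and 5,000 GB spread over 15,000 pods amount to roughly 0.1 core and 0.33 GB of sidecar overhead per pod, and the 10% share implies a cluster capacity of about 15,000 cores and 50,000 GB. The short Go program below merely reproduces this arithmetic; it is not a measurement.

   package main

   import "fmt"

   func main() {
       const (
           pods         = 15000.0
           sidecarCores = 1500.0 // 10% of cluster CPU, per the quoted example
           sidecarMemGB = 5000.0 // 10% of cluster memory, per the quoted example
       )
       totalCores := sidecarCores / 0.10
       totalMemGB := sidecarMemGB / 0.10
       fmt.Printf("implied cluster capacity: %.0f cores, %.0f GB memory\n",
           totalCores, totalMemGB)
       fmt.Printf("per-pod sidecar overhead: %.2f cores, %.2f GB\n",
           sidecarCores/pods, sidecarMemGB/pods)
   }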

2.4. Service Mesh Incurs Overhead in Control Plane

With the growing popularity of service meshes, an increasing number of customers are choosing to use them to deploy microservices, which has rapidly increased the number of sidecars that the control plane needs to manage. Sidecars can handle many types of configurations; however, orchestrating service dependency configurations for each sidecar individually is both time-consuming and error-prone, and any misconfiguration could affect service continuity. To reduce complexity, a common practice is to download the same configuration set to all sidecars. This configuration set contains all possibly relevant configurations, ensuring that any pod can freely communicate with other pods as needed. However, pushing the complete configuration to all pods on each update significantly increases southbound bandwidth overhead: whenever a sidecar's configuration changes, the update is pushed to every sidecar, even to those it does not concern.

In scenarios involving cross-region or multi-cloud deployments within a Kubernetes cluster (such as on-premises deployments or multi-site disaster recovery), this southbound configuration bandwidth overhead may lead to configuration delays or even losses. Since cross-region/cross-cloud communication requires VPNs or dedicated lines, the communication costs are relatively high, and most customers therefore opt for a conservative bandwidth purchasing strategy. As a result, when managing cross-region or multi-cloud clusters, the controller's configuration updates to geographically distributed sidecars can deplete the customer's cross-region/cross-cloud bandwidth, resulting in delays or losses of configuration data.
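
The bandwidth problem can be approximated with a simple model: a full push delivers the complete configuration of size C to all N sidecars on every update, costing roughly N*C, whereas an incremental push would cost roughly N*d for a change of size d. The Go sketch below evaluates this model with purely illustrative numbers; they are assumptions, not measurements from any deployment.

   package main

   import "fmt"

   func main() {
       const (
           sidecars     = 15000.0 // N: sidecars managed by the control plane (assumed)
           fullConfigMB = 10.0    // C: size of the complete configuration set (assumed)
           deltaMB      = 0.05    // d: size of the actual configuration change (assumed)
       )
       fullPush := sidecars * fullConfigMB   // cost of pushing the full set to everyone
       incrementalPush := sidecars * deltaMB // cost of pushing only the change
       fmt.Printf("full push:        %.0f MB per update\n", fullPush)
       fmt.Printf("incremental push: %.0f MB per update\n", incrementalPush)
       fmt.Printf("reduction factor: %.0fx\n", fullPush/incrementalPush)
   }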

3. Requirements of Distributed Micro Service Communication (DMSC)

3.1. Non-intrusive Service Mesh for User Applications

Current mainstream service mesh solutions such as Istio and its Ambient mode exhibit a high degree of intrusiveness toward user services. This manifests as sidecars (L4 + L7 proxies) that share their life cycle with pods, L4 proxies that share resources with other pods on the same node, and L7 proxies that share resources across all nodes in the Kubernetes cluster. These components not only occupy resources that users allocate for their business operations but also introduce potential failure risks. To preserve equivalent service mesh functionality, Canal Mesh [CanalMesh] still retains lightweight proxies locally. Therefore, there is a pressing need for service meshes to further reduce their intrusiveness to users, with the ultimate goal of a completely non-intrusive service mesh.

3.2. Reduce Control Plane Overhead

The control plane of the service mesh needs to handle tasks such as orchestrating the full configuration and pushing configurations to large numbers of sidecars. When this overhead is too high, it can lead to issues such as long delays before configurations take effect and excessive consumption of dedicated-line bandwidth in cross-cloud or IDC deployments. Moreover, this overhead grows in proportion to the scale of the cluster, which severely hinders the scalable deployment of service meshes. Therefore, there is an urgent need to reduce the overhead of the service mesh control plane. One potential solution is the centralized mesh gateway configuration in Canal Mesh [CanalMesh]. Further optimizing how configurations are orchestrated and pushed (for example, transforming full pushes into incremental pushes) is also a potentially viable direction.
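
As an illustration of the incremental-push direction mentioned above, the following Go sketch diffs the desired configuration against the configuration a sidecar already holds and emits only the entries that changed. Modeling a configuration as a map from resource name to serialized body is an assumption made for brevity; it is not the data model of Istio or any other implementation.

   package main

   import "fmt"

   // Config maps a resource name (e.g. a route or cluster) to its serialized body.
   type Config map[string]string

   // Delta holds only the entries that actually need to be pushed.
   type Delta struct {
       Upsert map[string]string // new or modified entries
       Remove []string          // entries deleted from the desired state
   }

   // diff computes the incremental update between what a sidecar currently
   // holds and the desired configuration.
   func diff(current, desired Config) Delta {
       d := Delta{Upsert: map[string]string{}}
       for name, body := range desired {
           if current[name] != body {
               d.Upsert[name] = body
           }
       }
       for name := range current {
           if _, ok := desired[name]; !ok {
               d.Remove = append(d.Remove, name)
           }
       }
       return d
   }

   func main() {
       current := Config{"route/a": "v1", "route/b": "v1"}
       desired := Config{"route/a": "v1", "route/b": "v2", "route/c": "v1"}
       d := diff(current, desired)
       fmt.Printf("push %d upserts and %d removals instead of %d full entries\n",
           len(d.Upsert), len(d.Remove), len(desired))
   }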

3.3. Improve Data Plane Performance

The service mesh takes over the user's advanced network communication needs by inserting proxy nodes into the user's communication path. While this provides the convenience of allowing users to focus solely on business development, redirecting traffic through the proxy inevitably affects the data plane transmission latency and throughput. Whether the service mesh proxies are located remotely in the cloud or retained locally in a limited capacity, improving the data plane performance of the service mesh is crucial. For example, leveraging SmartNICs to offload proxy functions can help reduce the performance degradation that deploying a service mesh may bring to user applications. This represents an important direction for evolution.

3.4. Implement an Application Mesh that is Not Limited to Kubernetes

In addition to Kubernetes users, many business scenarios also wish to adopt the concept of a service mesh to reduce repetitive development of network communication functions. For example, AWS's VPC Lattice service unifies advanced network communication capabilities across forms such as VMs, bare metal, and Kubernetes, providing a broader range of service mesh functionalities. Some operators also hope to extend the concept of the service mesh into the backbone network, offering advanced network features at cloud and IDC granularity through routers [I-D.li-dmsc-architecture]. In summary, expanding the concept of the service mesh beyond Kubernetes to achieve a more generalized application mesh is a potential research direction.

4. Security Considerations

This informational document does not introduce any additional security issues to the Internet.

5. Acknowledgement

TBD

6. IANA Considerations

This document has no IANA actions.

7. References

7.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>.

7.2. Informative References

[CanalMesh]
Song, E., Song, Y., Lu, C., Pan, T., Zhang, S., Lu, J., Zhao, J., Wang, X., Wu, X., and M. Gao, "Canal Mesh: A Cloud-Scale Sidecar-Free Multitenant Service Mesh Architecture", ACM SIGCOMM 2024 Conference, pp. 860-875, 2024.
[Dissecting]
Zhu, X., She, G., Xue, B., Zhang, Y., Zhang, Y., Zou, X., Duan, X., He, P., Krishnamurthy, A., and L. Lentz, "Dissecting Overheads of Service Mesh Sidecars", ACM SoCC 2023, pp. 142-157, 2023.
[I-D.li-dmsc-architecture]
Li, X., Wang, A., Wang, W., and D. Kutscher, "Distributed Micro Service Communication architecture based on Content Semantic", Work in Progress, Internet-Draft, draft-li-dmsc-architecture-00, <https://datatracker.ietf.org/doc/html/draft-li-dmsc-architecture-00>.
[Istio]
Calcote, L. and Z. Butcher, "Istio: Up and Running: Using a Service Mesh to Connect, Secure, Control, and Observe", O'Reilly Media, 2019.
[SPRIGHT]
Qi, S., Monis, L., Zeng, Z., Wang, I., and K. Ramakrishnan, "SPRIGHT: Extracting the Server from Serverless Computing! High-performance eBPF-based Event-driven, Shared-memory Processing", ACM SIGCOMM 2022, pp. 780-794, 2022.

Authors' Addresses

Enge Song
Alibaba Cloud
Alibaba Beijing Chaoyang Science & Technology Park
Beijing
100124
China

Yang Song
Alibaba Cloud
Alibaba Beijing Chaoyang Science & Technology Park
Beijing
100124
China

Shaokai Zhang
Alibaba Cloud
Alibaba Beijing Chaoyang Science & Technology Park
Beijing
100124
China

Xing Li
Alibaba Cloud
Alibaba Beijing Chaoyang Science & Technology Park
Beijing
100124
China

Jiangu Zhao
Alibaba Cloud
Alibaba Beijing Chaoyang Science & Technology Park
Beijing
100124
China