This document has been reviewed as part of the transport area review team's ongoing effort to review key IETF documents. These comments were written primarily for the transport area directors, but are copied to the document's authors and WG to allow them to address any issues raised and also to the IETF discussion list for information. When done at the time of IETF Last Call, the authors should consider this review as part of the last-call comments they receive. Please always CC tsv-art@ietf.org if you reply to or forward this review.

High level issue:

I think this document is not clear enough on the different alternatives that are actually supported for transmitting the atlas data and the component video data. Section 4.1 gives the impression that one can combine all data needed for one V3C representation into a single video stream, i.e. sent over a single RTP SSRC. Section 9.2 instead talks about having a separate V3C stream carrying the atlas data, and then the component video streams over other RTP streams (SSRCs). For the latter there exists a plethora of possible multiplexing models with what is being defined in Sections 9.2-9.4. With the defined grouping for V3C one can clearly do both RTP-session-based multiplexing as well as bundled multiplexing. The examples in Section 9.3 appear to indicate that one needs unique media lines in SDP per complete V3C representation; does that mean one can't set up one media line per component type and simply use multiple SSRCs in each, with one complete set across the media lines making up one media representation? Or even just establish one payload type per component type and then use an RFC 5576 ssrc-group to indicate the set of SSRCs that are part of one representation? Wouldn't it make sense to define an ssrc-group semantics for V3C? (A rough sketch of what I mean is included below, after the Section 9.1 comment.)

Having read the document, I think there is a need for a dedicated section that defines which combinations are possible and what support external to RTP/RTCP each of them needs to provide the grouping. Can you confirm that you have not identified any way of using existing RTP/RTCP mechanisms to identify the set of SSRCs that are part of one representation?

Another significant issue is the one for Section 8, regarding bit-rate adaptation for this payload format and its component streams.

Section 7.1:

   Published specification: Please refer to [ISO.IEC.23090-5]

I think this needs to indicate the RFC that defines the RTP payload format, as that is the specification for which the media type is being registered.

   Restrictions on usage: N/A

I think the recommended text from RFC 8088 for this field still applies: "This media type depends on RTP framing and, hence, is only defined for transfer via RTP [RFC3550]. Transport within other framing protocols is not defined at this time."

Section 8:

Because the full media representation when using V3C depends on having both the atlas stream and the component video streams, the response to congestion control limitations is far from trivial. I think some clarification is needed here for the implementer on how it should behave when forced to reduce the aggregate bandwidth, and on how to consider inter-stream prioritization. This issue is clearly different from what scalable video codecs encounter when being bandwidth limited, where it is usually clear how to reduce the bit-rate.

Section 9:

Please add a reference to RFC 8866 in the first sentence.

Section 9.1:

I would recommend being explicit that "byte-string" uses the definition that exists in RFC 8866.
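To make the ssrc-group suggestion above concrete, below is a minimal sketch of what such signalling could look like. This is purely illustrative and not something the draft defines: the "V3C" grouping semantics would have to be specified and registered, and the payload type names (v3c for the atlas stream, H265 for the video components), the SSRC values, and the addresses and ports are all assumptions made up for this example.

   v=0
   o=- 0 0 IN IP4 203.0.113.1
   s=V3C ssrc-group sketch
   c=IN IP4 203.0.113.1
   t=0 0
   m=video 50000 RTP/AVP 96 97 98
   a=rtpmap:96 v3c/90000
   a=rtpmap:97 H265/90000
   a=rtpmap:98 H265/90000
   a=ssrc:1001 cname:atlas@example.com
   a=ssrc:2002 cname:geometry@example.com
   a=ssrc:3003 cname:attribute@example.com
   a=ssrc-group:V3C 1001 2002 3003

With something along these lines a receiver could recover the set of SSRCs that together form one V3C representation from a single media description, rather than needing one media line per complete representation.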
Section 11:

I think this format needs an additional security consideration due to the grouping, namely that for correct decoding the signalling system needs to correctly indicate the combination of the V3C atlas stream and the component streams. If an attacker is able to manipulate this information, the sender's intention will not be represented.

Secondly, regarding:

   This RTP payload format and its media decoder do not exhibit any
   significant non-uniformity in the receiver-side computational
   complexity for packet processing, and thus are unlikely to pose a
   denial-of-service threat due to the receipt of pathological data.
   Nor does the RTP payload format contain any active content.

If I manipulate the atlas information, can I significantly increase the decoding effort, for example by forcing orders of magnitude more iterations over the underlying component video stream data to create the volumetric representation?