I am the assigned Gen-ART reviewer for this draft. For background on Gen-ART, please see the FAQ at < http://wiki.tools.ietf.org/area/gen/trac/wiki/GenArtfaq>. Please resolve these comments along with any other Last Call comments you may receive. Document: draft-ietf-payload-vp8-16.txt Reviewer: Elwyn Davies Review Date: 2015/07/10 IETF LC End Date: 2015/07/13 IESG Telechat date: (if known) - Summary: Gosh, it has been a long time since the last review! Almost ready. Still needs various minor nits fixed and some clarifications. Major issues: Minor issues: s4.2: Various of the frame indices either SHOULD or, in the case of KEYIDX, MAY start at random values. The rationale(s) for these requirements are not explained - nor is the reason why it it is only a MAY for KEYIDX whereas it is SHOULD for PictureID and what might happen were the MAY/SHOULD to be ignored. Is the rationale something that should be mentioned in the security considerations? s10.1: The downref to RFC 6386 identified by idnits was duly noted in the last call announcement. I don't have a problem with the downref. Nits/editorial comments: General: Consistent usage of "prediction and mode" vs "prediction/mode" would be desirable. s1, para 2: Needs to introduce the 'macroblock' jargon defined in RFC 6386. Suggest: OLD: VP8 is based on decomposition of frames into square sub-blocks of pixels, prediction of such sub-blocks using previously constructed blocks, and... NEW: VP8 is based on decomposition of frames into square sub-blocks of pixels known as "macroblocks" (see Section 2 of [RFC6386]). Prediction of such sub-blocks using previously constructed blocks, and... END s2: There are a number of technical terms brought over from RFC 6386. These should be listed in s2. At least the following would be in the list: key frame, interframe, golden frame, altref frame, macroblock. Also the terms picture selection index (RPSI) and slice loss indication (SLI) from RFC 4585. s3, para 2: Providing a reference to explain what temporal scalability and the hierarchy selection is all about would be useful. One possibility would be: Lim, J. Yang, and B. Jeon,”Fast Coding Mode Decision for Scalable Video Coding”, 10th Int’l Conf. on Advanced Communication Technology, vol. 3, pp. 1897-1900, 2008. It would also help to add a pointer to the TL0PICIDX description in s4.2, thus: ADD: Support for temporal scalability is provided by the optional TL0PICIDX and TID/Y/KEYIDX fields described in Section 4.2. END ss3 and 4: The discussion of how the frame payload might or might not be distributed across multiple packets is not very clear. The draft has in mind a 'preferred use case' (para 1 of s4.4 says: In the preferred use case, the S bit and PID fields described in Section 4.2 should be used to indicate what the packet contains. I think it would be helpful to be a bit more positive about this in s3, para 3: OLD: This memo allows the partitions to be sent separately or in the same RTP packet. It may be beneficial for decoder error-concealment to send the partitions in different packets, even though it is not mandatory according to this specification. NEW: Accordingly, this document RECOMMENDS that the frame is packetized by the sender with each data partition in a separate packet or packets. This may be beneficial for decoder error concealment and the payload format described in Section 4 provides fields that allow the partitions to be identified even if the first partition is not available. The sender can, alternatively, aggregate the data partitions into a single data stream and, optionally, split it into several packets without consideration of the partition boundaries. The receiver can use the length information in the first partition to identify the partitions during decoding. END s4/Figure 1: s/bytes/octets/ (2 places) Figure 1, caption: s/sequel/Sections 4.2 and 4.3/ Figure 1: The format shown is correct only for the first packet in a frame. Subsequent packets will not contain the payload header and will have later octets of the frame payload. This should be mentioned in drawing and/or in the caption. s4.1, last two paras: Sequence number: The sequence numbers are monotonically increasing and set as packets are sent. The remaining RTP header fields are used as specified in [RFC3550]. Redefining the Sequence Number field seems gratuitous and potentially dangerous, given that it is *very carefully* defined in RFC 3550 and the overall intent does not seem any different from that in RFC 3550. OTOH a note about the (non?-)use of the X bit and the Payload Type field (PT) would be useful. I suggest replacing the paras above with: NEW: X bit: MUST be 0. The VP8 RTP profile does not use the RTP Header Extension capability. PT (Payload Type): In line with the policy in Section 3 of [RFC3551], applications using the VP8 RTP payload profile MUST assign a dynamic payload type number to be used in each RTP session and provide a mechanism to indicate the mapping. See Section 6.2 for the mechanism to be used with the Session Description Protocol (SDP) [RFC4566]. The remaining RTP Fixed Header Fields (V, P, CC, sequence number, SSRC and CSRC identfiers) are used as specified in Section 5.1 of [RFC5330]. END Note that this may be considered to make the reference to RFC 3551 normative. s4.2, X bit: There is potential for confusion between the X bit in the fixed header and the X bit in the VP8 Payload Descriptor. Perhaps changing this to C for control bits would avoid the problem. s4.2, M bit: Similarly, it might be better to use B (for BIG) in place of M to avoid confusion. s4.2, para 7 after Fig 2: For consistency: OLD: When the X bit is set to 1 in the first octet, the extension bit field octet MUST be provided as the second octet. If the X bit is 0, the extension bit field octet MUST NOT be present, and no extensions (I, L, T, or K) are permitted. NEW: When the X [or C] bit is set to 1 in the first octet, the Extended Control Bits field octet MUST be provided as the second octet. If the X [or C] bit is 0, the extension bit field octet MUST NOT be present, and no extensions (I, L, T, or K) are permitted. END s4.2: The issue of using either 'wrap' or 'restart' but not both for restarting number sequences has been raised in various comments by IESG members. s4.2, TL0PICIDX: I think, given the statements in s3 about not defining the hierarchy, that more explanation is needed of what the 'lowest or base layer' actually means. s4.2, TL0PICIDX: If TID is larger than 0, TL0PICIDX indicates which base layer frame the current image depends on. TL0PICIDX MUST be incremented when TID is 0. What happens if L=1 but both T=0 and K=0 so that there is no TID value present? Or indeed if T=0 but K=1 so that the TID field is there but 'MUST be ignored by the receiver' (definition of TID field)? s4.3: It would be useful to include the section number (9.1) where the uncompressed data chunk specification is found. I was somewhat surprised that the ordering is reversed in the protocol spec. s4.3, VER: The receiver should validate this field - currently only values 0-3 are valid. s4.3, SizeN: Make it clear these are integer fields: s/in Size0,/in integer fields Size0,/ s4.3 and s4.4: It would be helpful to implementers to explain what the offsets in the first partition payload are for the additional partition count and additional partition sizes. Having rummaged around in RFC 6386, I have to say that I am unclear whether the values are placed after the full chunk of data indicated by 1stPartitionSize or are part of this data. And if the latter is the case, how to work out where the partition sizes are. I guess that they ought to be at offset (1stPartitionSize+10 - key frame - or 1stPartitionSize+3 - interframe) in the VP8 Payload field as it is difficult to work out where they are without decoding the partition otherwise. Also exactly how many partition sizes to expect - I think I read in RFC 6386 that the last partition size is implicit. Help! In the course of clearing this up, it might be appropriate to move the last sentence of the first para of s4.4 and the second para of s4.4 into s4.3 as part of the explanation. This is the relevant text: Note that the length of the first partition can always be obtained from the first partition size parameter in the VP8 payload header. The VP8 bitstream format [RFC6386] specifies that if multiple DCT/WHT partitions are produced, the location of each partition start is found at the end of the first (prediction/mode) partition. In this RTP payload specification, the location offsets are considered to be part of the first partition. s4.4: I found this section very difficult to parse. It is a bit 'stream of consciousness' and would benefit from a reordering and rewrite. Suggestion (assuming the second para gets moved to s4.3): NEW SECTION 4.4: An encoded VP8 frame can be divided into two or more partitions, as described in Section 1. It is OPTIONAL for a packetizer implementing this RTP specification to pay attention to the partition boundaries within an encoded frame. If packetization of a frame is done without considering the partition boundaries, the PID field MAY be set to zero for all packets, and the S bit MUST NOT be set to one for any other packet than the first. If the preferred usage suggested in Section 3 is followed, with each packet carrying data from exactly one partition, the S bit and PID fields described in Section 4.2 SHOULD be used to indicate what the packet contains. The PID field should indicate which partition the first octet of the payload belongs to, and the S bit indicates that the packet starts on a new partition. If the packetizer does not pay attention to the partition boundaries, one packet can contain a fragment of a partition, a complete partition, or an aggregate of fragments and partitions. There is no explicit signaling of partition boundaries in the payload and the partition lengths at the end of the first partition have to be used to identify the boundaries. Partitions MUST be aggregated in decoding order. Two fragments from different partitions MAY be aggregated into the same packet along with one or more complete partitions. In all cases, the payload of a packet MUST contain data from only one video frame. Consequently the set of packets carrying the data from a particular frame will contain exactly one VP8 Payload Header (see Section 4.3) carried in the first packet of the frame. The last, or only, packet carrying data for the frame MUST have the M bit set in the RTP header. s4.5.1: Th algorithm only applies for the RECOMMENDED use case with partitions in separate packets. This should be noted. s5.2, last para: s/only parts of the frame is corrupted./only part of the frame is corrupted./ s6: s/This payload format has two required parameters./This payload format has two optional parameters./ [I think this is probably a hangover from the previous version.] s7: s/Cryptographic system may also allow/The cryptographic system may also allow/ s7: the usage of SRTP [RFC3711] is recommended. RECOMMENDED? s10.2: I suspect that RFC 3551 is normative.