I am the assigned Gen-ART reviewer for this draft. The General Area
Review Team (Gen-ART) reviews all IETF documents being processed
by the IESG for the IETF Chair.  Please treat these comments just
like any other last call comments.

For more information, please see the FAQ at

<https://wiki.ietf.org/en/group/gen/GenArtFAQ>.

Document: draft-ietf-cellar-tags-19
Reviewer: Ines Robles
Review Date: 2025-10-13
IETF LC End Date: 2025-10-13
IESG Telechat date: Not scheduled for a telechat

Summary:

This document defines the Matroska multimedia container tags, namely the tag names and their respective semantic meaning.

I have a few comments and questions below that I would appreciate being addressed before publication.

Comments:

1- Section 3.2.2 states: "Multiple items MUST NOT be stored as a list in a single TagString. If there is more than one tag value with the same name to be stored, then more than one SimpleTag MUST be used."

However, several tag definitions (for example, INSTRUMENTS in Section 4.4 and KEYWORDS in Section 4.6) explicitly describe values as being “separated by a comma.” This wording suggests that multiple items may appear within a single TagString, which seems to contradict the rule in Section 3.2.2.

Could you please clarify whether these tags are intended to be exceptions to that rule, or if the text should instead indicate that each value must be stored in a separate SimpleTag?
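
To make the two readings concrete, here is a minimal Python sketch. The dictionaries are only a stand-in for SimpleTag elements, and the helper name split_into_simple_tags is hypothetical; neither comes from the draft.

```python
def split_into_simple_tags(tag_name, comma_separated):
    """Return one SimpleTag-like dict per value instead of one comma-joined TagString."""
    values = [v.strip() for v in comma_separated.split(",")]
    return [{"TagName": tag_name, "TagString": v} for v in values if v]

# Reading A: a single TagString carrying a comma-separated list
single = [{"TagName": "KEYWORDS", "TagString": "live, acoustic, 2024"}]

# Reading B: one SimpleTag per value, as Section 3.2.2 appears to require
multiple = split_into_simple_tags("KEYWORDS", "live, acoustic, 2024")
```

A reader that only expects Reading B would treat the value in Reading A as one keyword containing commas, which is why the intended form matters for interoperability.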

2- Section 3.3: In Table 2 (“TargetTypeValue for Video”), the draft lists MOVIE / EPISODE / CONCERT and describes them as “the most common grouping level of video (e.g., an episode for a TV series).” This correctly indicates that movie is intended as a representative example.

However, in the document, several tag descriptions (e.g., DIRECTOR, ACTOR, LAW_RATING, etc.) refer specifically to “a movie.”

For precision and inclusivity, these occurrences should be generalized, since the tagging system applies to any audiovisual work, including films, television episodes, animated content, image-based sequences, podcasts, concerts, or other recorded video content.

It is therefore suggested to replace movie with a broader term such as video work, video content, or audiovisual work, as appropriate to the context.

What do you think?

3- Section 3.3 states: “Tags from a TargetTypeValue apply to the all lower TargetTypeValues.”

It is not always clear whether “lower” refers to numerically smaller values or to semantically subordinate entities. It is implicit that smaller numbers indicate lower levels in the hierarchy; however, the current wording could confuse newcomers.

What about adding a clarification such as:

“A tag defined for a given TargetTypeValue applies to all Targets with numerically smaller TargetTypeValues in the same hierarchy, that is, from higher-level groups to lower-level entities.”

What do you think?
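
As an illustration of the numeric reading, here is a minimal Python sketch. The function name effective_tags and the dictionary layout are hypothetical, and the rule that a more specific level wins on a name clash is my assumption; the draft does not state it.

```python
def effective_tags(tags_by_level, level):
    """Collect tags set at `level` or at any numerically larger TargetTypeValue."""
    effective = {}
    # Visit larger values first so that a more specific level wins on a name clash
    # (assumed override rule, not taken from the draft).
    for lvl in sorted(tags_by_level, reverse=True):
        if lvl >= level:
            effective.update(tags_by_level[lvl])
    return effective

tags_by_level = {
    50: {"ARTIST": "Example Band", "TITLE": "Example Album"},  # ALBUM level
    30: {"TITLE": "Example Track"},                            # TRACK level
}
track_view = effective_tags(tags_by_level, 30)
album_view = effective_tags(tags_by_level, 50)
```

Under this reading the track inherits ARTIST from the album, while the album does not pick up anything from the track.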

4- Section 3.3 defines TargetTypeValue and provides two tables: Table 1 for audio and Table 2 for video. Both tables list the same numeric values (e.g., 50, 40, 30, etc.) but associate them with different semantic examples. For instance, Table 1 maps 50 to Album, while Table 2 maps 50 to Movie / Episode / Concert.

It would be helpful to clarify whether these tables represent one shared TargetTypeValue numbering system that applies to all media types (where the numbers define structural hierarchy levels, and the examples simply illustrate common use cases for each media type), or two independent numbering systems (one for audio and one for video) that happen to reuse the same numeric values for different purposes.

For example, how should this be interpreted in a Matroska file that contains both audio and video streams, such as a concert film?

5- Section 3.3.1: The current description of PART_OFFSET (“... which is the number of tracks on the first CD”) correctly implies that it represents a cumulative or absolute offset, i.e., the number of lower-level items that precede the current group in the overall collection. To avoid potential misinterpretation as a relative (per-disc) offset, it might be clearer to rephrase to something like:

“PART_OFFSET, at TargetTypeValue 30 (TRACK), represents the number of lower-level items that precede the current group in the overall collection. For example, if CD 1 contains 5 tracks, then the first track of CD 2 has PART_OFFSET = 5.”

What do you think?
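
The cumulative reading can be sketched in a few lines of Python. The function name part_offsets and the three-disc track counts are made up for illustration.

```python
def part_offsets(tracks_per_disc):
    """Return, for each disc, the number of tracks on all earlier discs."""
    offsets, total = [], 0
    for count in tracks_per_disc:
        offsets.append(total)  # tracks that precede this disc in the whole collection
        total += count
    return offsets

# CD 1 has 5 tracks, CD 2 has 7, CD 3 has 4
offsets = part_offsets([5, 7, 4])
```

With these counts the cumulative reading gives CD 3 an offset of 12, whereas a relative (per-disc) reading would give 7, so the two interpretations only diverge from the third disc onward.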

6- Section 4.10: There appears to be an inconsistent treatment of numeric tags with respect to their encoding type.

For example: The EBU_R128_* tags (e.g., EBU_R128_LOUDNESS) are defined as binary and store floating-point values in <TagBinary>. The REPLAYGAIN_* tags (e.g., REPLAYGAIN_GAIN, REPLAYGAIN_PEAK) represent similar floating-point values but are defined as UTF-8 strings in <TagString>. This means that two groups of tags describing essentially the same kind of data (gain/loudness values in dB or LUFS) are stored using different data types. 

6.1- Could you please clarify whether this distinction is intentional (for example, due to backward compatibility) or whether a consistent approach is intended?

6.2- It might be helpful to include a short explanatory note in Section 4.10, such as "... ReplayGain tags retain textual representation for compatibility with legacy implementations, whereas EBU R128 tags use binary floats for higher precision ...".

6.3- Additionally, it may be useful to provide brief guidance for future tag definitions on when to prefer binary versus textual representation for numeric values. For example, recommending binary floats for precision-critical engineering data, and UTF-8 strings for human-readable or legacy-compatible values. This would help ensure consistent design choices in future extensions.
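
To show what the difference means for a parser, here is a minimal Python sketch. Both layouts are assumptions on my side: a 32-bit big-endian IEEE 754 float for the TagBinary case and a decimal number followed by a unit suffix for the TagString case. Please check them against the exact encodings in the draft.

```python
import struct

loudness_lufs = -23.0
gain_db = -7.5

# Assumed binary layout: 32-bit big-endian IEEE 754 float (not confirmed by the draft)
tag_binary = struct.pack(">f", loudness_lufs)
decoded_loudness = struct.unpack(">f", tag_binary)[0]

# Assumed textual layout: decimal number followed by a unit suffix
tag_string = f"{gain_db:.2f} dB"
decoded_gain = float(tag_string.split()[0])
```

An implementation that supports both tag families needs two separate decoding paths for what is conceptually the same kind of value, which is the motivation for questions 6.1 to 6.3.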

7- Section 5 states: "Most of the time strings are kept as-is and don't pose a security issue, apart from invalid UTF-8 values."

While the mention of “invalid UTF-8 values” is helpful, this phrasing might still understate the potential risk. Implementations that handle TagStrings without proper UTF-8 validation or size checks could encounter parsing errors, crashes, or buffer overruns if presented with malformed or excessively large input data. It may be useful to add a clarifying sentence such as:

"Implementations MUST validate TagString inputs for UTF-8 correctness and reasonable length before use, in accordance with the security considerations in [RFC 3629]"

What do you think?
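
A minimal Python sketch of such a check follows. The function name validate_tag_string and the 64 KiB limit are hypothetical; the draft does not define a maximum length, so the limit is only a placeholder for whatever an implementation considers reasonable.

```python
MAX_TAG_STRING_BYTES = 64 * 1024  # hypothetical limit; the draft does not define one

def validate_tag_string(raw: bytes, max_bytes: int = MAX_TAG_STRING_BYTES) -> str:
    """Reject oversized or malformed TagString payloads before further use."""
    if len(raw) > max_bytes:
        raise ValueError("TagString exceeds the configured size limit")
    try:
        # Strict decoding rejects invalid sequences such as overlong forms
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("TagString is not valid UTF-8") from exc
```

The size check runs before decoding so that an excessively large payload is rejected without being processed further.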

8- The draft describes how multiple SimpleTag elements may appear under the same Tag element, allowing multiple values for the same tag name.

However, how should applications interpret or prioritize these values if conflicting tags occur, for example, two TITLE tags with different TagString values within the same Targets element?

Nits:

9- choregrapher → choreographer

10- the values is stored → the value is stored

11- parts that are inside or outside a given file → ambiguous. Consider clarifying to something like: “parts either located within a given file or externally referenced by it”?

12- Due to the various nature of tag sources → Due to the varied nature of tag sources

13- each demand needs to balance if it makes sense… → each request needs to be evaluated to determine if it makes sense…

14- an host app → a host app

15- A Tag element has a single Targets element with a single TargetTypeValue element. But the Targets element… → replace “But..” with “However,...”

16- It is RECOMMENDED to start a tag name… → It is RECOMMENDED that tag names start…

17- for non official tags than are not meant to make it to the list… → for non-official tags that are not meant to be added to the list of official tags...

18- apply to the all lower TargetTypeValues → “…apply to all lower TargetTypeValues..”


Thanks for this document,

Ines.