I have reviewed this document as part of the security directorate's ongoing effort to review all IETF documents being processed by the IESG.  Document editors and WG chairs should treat these comments just like any other last call comments.

This Standards Track document describes what is essentially an enhanced data model for negotiating telepresence configurations in cases where a given party may have multiple capture devices offering multiple streams. Choice of streams may be constrained by device capabilities. For example, a camera may offer a close-up of the speaker or a wide view of the panel but not be capable of providing both at once.

Security considerations.

One context issue I am having here is understanding the relation of this document to the others it references. For example, there is a normative reference to draft-ietf-clue-protocol-06. Is that to be considered by the IESG at this point? If so, it does not have a Security Considerations section.

If the point is to publish the framework doc as an RFC so as to set the context for further discussions of the protocol, this is OK. But otherwise there is a normative reference to a document that doesn't have a security considerations section and desperately needs one.

This is a big problem, as the Security Considerations section in the framework document points forward to 'authorization mechanisms' that are presumably to be described in the protocol document.

Given this situation, these comments may be taken as input to the framework document or to the documents to be written using the framework as the architecture.

As a general matter, it would be easier to analyze security if terms such as 'confidentiality' and 'integrity' were used. This is particularly the case when the specification in question deals with audio and video. For example, the phrase "an endpoint attempting to listen to sessions in which it is not authorized to participate" is almost certainly intended to cover video as well, which is seen and not heard.

Looking at the document in this way gives us the following considerations:

Confidentiality:

   Disclosure of media streams to an unauthorized endpoint.

   Disclosure of metadata to capture devices.

   Failure to terminate access to media streams at completion of a session.

Integrity:

   Modification of media stream data.

   Introduction of spurious media streams.

Service:

   Denial of service against capture devices.

   Denial of service against output devices.

I think this approach would be helpful when it comes to writing the protocol authorization sections.
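
One way to make this concrete when drafting the authorization text is to record each consideration against its category, so that nothing is described only in terms of 'listening'. The following is a minimal sketch in Python. The names `Category`, `THREATS` and `by_category` are illustrative only and come from neither the framework nor the protocol draft.

```python
from enum import Enum


class Category(Enum):
    CONFIDENTIALITY = "confidentiality"
    INTEGRITY = "integrity"
    SERVICE = "service"


# The considerations listed in this review, each tagged with its category.
THREATS = [
    (Category.CONFIDENTIALITY, "Disclosure of media streams to an unauthorized endpoint"),
    (Category.CONFIDENTIALITY, "Disclosure of metadata to capture devices"),
    (Category.CONFIDENTIALITY, "Failure to terminate access to media streams at completion of a session"),
    (Category.INTEGRITY, "Modification of media stream data"),
    (Category.INTEGRITY, "Introduction of spurious media streams"),
    (Category.SERVICE, "Denial of service against capture devices"),
    (Category.SERVICE, "Denial of service against output devices"),
]


def by_category(category):
    """Return the threat descriptions recorded under one category."""
    return [text for cat, text in THREATS if cat is category]


if __name__ == "__main__":
    # Print the list grouped by category, in the same layout as the review.
    for category in Category:
        print(category.value)
        for text in by_category(category):
            print("   " + text)
```

A protocol document could then be checked category by category: each entry should be matched by a mechanism, or by an explicit statement that it is out of scope.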

As a general rule, the term 'endpoint' is now meaningless and should not be used. Yes, end-to-end security is a good thing. But you show me which are the 'endpoints' here. 

End to end is Alice's brain to Bob's brain. 

Between those we have mouth/face -> cameras/mics -> capture host(s) -> inter-network -> output host(s) -> displays/speakers -> eyes/ears.

An attacker may target any of those modules and any of the interfaces between them. Using the term 'endpoint' is ambiguous.
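
To illustrate how many candidate 'endpoints' there are, the chain can be written out and its interfaces enumerated. This is only a sketch: the stage names are taken from the chain described here, with Alice's and Bob's brains added at each end, and `interfaces` is a hypothetical helper.

```python
# Modules on the path from Alice's brain to Bob's brain.
STAGES = [
    "Alice's brain",
    "mouth/face",
    "cameras/mics",
    "capture host(s)",
    "inter-network",
    "output host(s)",
    "displays/speakers",
    "eyes/ears",
    "Bob's brain",
]


def interfaces(stages):
    """Return each adjacent pair of modules, i.e. every interface in the chain."""
    return list(zip(stages, stages[1:]))


if __name__ == "__main__":
    for left, right in interfaces(STAGES):
        print(f"{left} -> {right}")
    # Every module and every interface is a possible target.
    print(len(STAGES) + len(interfaces(STAGES)), "possible targets")
```

With nine modules and eight interfaces there are seventeen places an attacker might target, and 'endpoint' does not say which one is meant.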

The metadata disclosure problem can be quite insidious. Let us say we are using CLUE to collect media streams from a home security grid. I have 11 cameras on the perimeter pointing in, another 7 on the residence pointing out, and one on my desk. The one on my desk can be considered trustworthy; if someone has compromised that, I am screwed. But that isn't the case for the perimeter net, which is cobbled together from Raspberry Pis and cheapo cameras. That net is placed in a location I know is vulnerable.

Let's say we have an intrusion. The first thing I do is fire up a conference call with my security contractor. I don't want someone to be able to compromise one of my perimeter cameras in a way that tips them off to the fact that the intrusion has been detected.
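
The requirement in this example can be sketched as a rule that session metadata is only ever sent to devices the operator trusts. The following Python is an illustration under assumed names (`Device`, `build_grid`, `metadata_recipients`, and the `trusted` flag). None of these are defined by CLUE, and treating the residence cameras as untrusted is a conservative assumption, since only the desk camera is described as trustworthy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    trusted: bool  # hypothetical flag set by the operator, not a CLUE attribute


def build_grid():
    """Build the grid from the example: 11 perimeter, 7 residence, 1 desk camera."""
    devices = [Device(f"perimeter-{i}", trusted=False) for i in range(1, 12)]
    # Assumption: residence cameras are treated as untrusted to be conservative.
    devices += [Device(f"residence-{i}", trusted=False) for i in range(1, 8)]
    devices.append(Device("desk", trusted=True))
    return devices


def metadata_recipients(devices):
    """Return only the devices that may learn that a conference has started."""
    return [d for d in devices if d.trusted]


if __name__ == "__main__":
    grid = build_grid()
    for device in metadata_recipients(grid):
        print(device.name)
```

Under this rule a compromised perimeter camera still supplies its stream, but it learns nothing about whether, or with whom, a conference is in progress.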

Introduction of spurious streams might be one of the best ways to attack a conferencing system. If I can see the main speaker and the audio is a little fuzzy, an attacker can introduce an additional stream with filtering that makes it more attractive to whatever AI is managing the conference. Now the attacker can literally put words in people's mouths. Could be fun for politicians giving town halls.