I am the assigned Gen-ART reviewer for this draft. The General Area
Review Team (Gen-ART) reviews all IETF documents being processed
by the IESG for the IETF Chair.  Please treat these comments just
like any other last call comments.

For more information, please see the FAQ at

<https://trac.ietf.org/trac/gen/wiki/GenArtfaq>.

Document:  draft-ietf-dots-rfc8782-bis-05
Reviewer:  Dale R. Worley
Review Date:  2021-03-22
IETF LC End Date:  2021-03-22
IESG Telechat date:  unknown

Summary:

    This draft is on the right track but has open issues, described in
    the review.

I've provided a long list of minor editorial issues, and a short list
of technical issues.  I suspect that the technical issues have been
resolved in the practices of the community and that their apparent
status as problems stems from not getting the wording properly
aligned with practice.

Major issues:

The condition of two DOTS mitigation requests overlapping depends on
addresses (and alternatives to them) but as defined in section 4.4.1,
does NOT depend on port numbers.  However, other parts of the text
seem to presume that port numbers are involved in testing for
overlapping.  The correct choice needs to be established and the text
made consistent.

Does the requesting of a mitigation only withdraw overlapping
mitigations that were requested using the same signal channel, or is
the effect global?  If a mitigation request with trigger-mitigation =
false is activated by ending of a signal channel, does reestablishing
the channel withdraw it?  (Naively I thought it would, but that isn't
stated.)  If so, how are the former and the current signal channels
correlated, given that cuid collisions can prevent them from using the
same common identifiers?  Indeed, the text does not make it clear how
a mitigation that is triggered by the ending of a signal channel can
be withdrawn, other than by the expiration of its timer.

Minor issues:

The 4.09 response is used to report cuid conflicts, but also various
other conflicts.  Given that cuid conflicts require specific
processing, and can happen when other conflicts could also be
reported, it seems to me that for cuid conflicts the response MUST
include conflict-information.

In section 4.4.1 there is a discussion of a configuration where a
client communicates through two different gateways to the same server,
using a different certificate to communicate with each gateway.  The
text discusses a configuration where we want the two transaction
streams to be treated as one by the client and server.  It seems to me
that this is an unusual situation which can only succeed if both the
client and server have specific configuration for it.  As a
consequence, the situation doesn't need to be discussed in this
document.  Conversely, the default result of this topology is that the
client and server treat both transaction streams separately (and
perhaps neither of them is aware of the overall topology).  It seems
like this case should work correctly without any special
considerations, and so does not need to be documented specifically,
either.

The overall framework for signal channel configuration is not clear.
By default, I assume that the client sets the channel configuration,
constrained by the limits on parameter values imposed by the server,
and that these values apply to communication in both directions (when
applicable).  The text in 4.5 and 4.5.1 is consistent with this model.
However text in 4.5.2 talks about "agents" changing configuration
values, which implies it's possible for the server to set channel
configuration.  There is discussion in section 4.5.3 of a server
sending "a validity time with a configuration it sends", which makes
no sense if only the client can change the configuration -- the
configuration won't change until the client changes it.  Also "the
update of the configuration data if a change occurs at the DOTS server
side".  The model needs to be established, and the text aligned with
it.

Nits/editorial comments:

Global editorial issues:

There is a lot of special terminology, and it would help if
definitions were gathered in section 2.  Additionally, this would help
reveal where the text uses undefined synonyms of defined terms,
several cases of which I have spotted.

There are issues involving "Observe".  One is at the start of section
4.4, where the text refers to "subscribe", but that is not the term
used in CoAP; indeed, CoAP deliberately avoids that term.  Also, unless
one is familiar with CoAP, one thinks GET has no side-effects, and
thus cannot possibly establish a subscription.  There are related
issues in sections 4.4.2.1 and 4.4.2.2 that left me wondering for
which GET requests Observe was mandatory and/or permitted and what
values (0 and/or 1) were permitted.  I think it would help to start
4.4.2.1 with an overview discussion of the permitted/required uses of
Observe in DOTS GET requests.

It would help to have adjectives for a mitigation request with
trigger-mitigation = false, and for a mitigation request with
trigger-mitigation = true.

It seems that "deactivating" a mitigation request is used as an
undefined synonym of "withdrawing" it, but (on my first two reads), I
thought it meant "delete".  At this point, I suspect that the words
hide complexity which has not been made explicit:  the client
"requests" a mitigation with trigger-mitigation = false, but the loss
of the channel "activates" it.  Worse, "activation" causes the actions
that are described as being caused by "requesting" a mitigation with
trigger-mitigation = true.  A description of the states, the
transitions between them, and the verbs to describe them should be
given, perhaps in section 2.

Section 4.4.1 is 16 pages long and really should be cut into a number
of subsections.

Section 4.4.1 contains two parallel but different
definitions/discussions of conflict-information.  Not being in a
position to print the document, I can't quite make out what is going
on, but I suspect some reorganization of the section is in order to
replace the two partial definitions with one complete one.  (This
might be connected with the entries in section 9 and/or section 10.3.)
The two parallel definitions are partially excerpted below, and both
have the problem that the contextual text says that the response will
include "enough information for a DOTS client to recognize ...", but
the definition of conflict-information states that it is optional:

-----

   The response includes enough information for a DOTS client to
   recognize the source of the conflict as described below in the
   'conflict-information' subtree with only the relevant nodes listed:

   conflict-information:  Indicates that a mitigation request is
      conflicting with another mitigation request.  This optional
      attribute has the following structure:

-----

   For both 2.01
   (Created) and 4.09 (Conflict) responses, the response includes enough
   information for a DOTS client to recognize the source of the conflict
   as described below:

   conflict-information:  Indicates that a mitigation request is
      conflicting with another mitigation request(s) from other DOTS
      client(s).  This optional attribute has the following structure:

-----

Detailed editorial issues:

(Note that some of these are summarized in a clearer way above.)

1.  Introduction

The example of Figure 1 is introduced by this paragraph:

   An example of a network diagram that illustrates a deployment of DOTS
   agents is shown in Figure 1.  In this example, a DOTS server is
   operating on the access network.  A DOTS client is located on the LAN
   (Local Area Network), while a DOTS gateway is embedded in the CPE
   (Customer Premises Equipment).

But the example also includes a DOTS gateway, and it would have been
clearer to me if the statement introducing DOTS gateways had been made
before the start of the example rather than after it:

   The DOTS client can
   communicate directly with a DOTS server or indirectly via a DOTS
   gateway.

3.  Design Overview

   support for asynchronous Non-confirmable messaging

It might be worth noting here or in section 2 that "Non-confirmable"
(and "Confirmable") are CoAP technical terms.

   Absent such mutual agreement, the DOTS
   signal channel MUST run over port number 4646 as defined in
   Section 10.1, for both UDP and TCP.

It might be worth stating this port number is for both the client and
the server to use (or that 4646 is just the listening port for
servers).

   Also, the DOTS server may rely on the signal
   channel session loss to trigger mitigation for preconfigured
   mitigation requests (if any).

This doesn't carry quite the right idea.  What is really going on is
that the DOTS client may configure mitigation requests that will be
automatically acted upon by the server if the signal channel session
is lost.  This is a required facility of the server, but it may be
relied upon by the client.

   DOTS signaling can happen with DTLS over UDP and TLS over TCP.

s/can happen/can use/ or perhaps "can happen over".

   In deployments where multiple DOTS clients are enabled in a network
   (owned and operated by the same entity) ...

I think you want something like "In deployments with multiple DOTS
clients in a single network and administrative domain ...".

   o  Port Control Protocol (PCP) [RFC6887] or Session Traversal
      Utilities for NAT (STUN) [RFC8489] may be used to retrieve the
      external addresses/prefixes and/or port numbers.

Would be clearer if it is "may be used by the client to retrieve ...",
as the preceding paragraph is about the translator and here we are
talking about the client without explicitly mentioning it.

4.4.  DOTS Mitigation Methods

   GET:    DOTS clients may use the GET method to subscribe to DOTS
           server status messages or to retrieve the list of its
           mitigations maintained by a DOTS server (Section 4.4.2).

Unless one is aware of the "Observe" option of CoAP, using GET to
establish a subscription seems impossible, as it is a side-effect.
The reader could be warned by wording like:

   GET:    DOTS clients may use the GET method to retrieve the list
           of its mitigations maintained by a DOTS server (Section
           4.4.2), or (using the CoAP Observe option [RFC7641]) to
           subscribe to DOTS server status messages.

--

   Mitigation requests MUST NOT be delayed
   because of checks on probing rate (Section 4.7 of [RFC7252]).

How does this sentence connect with the preceding sentences of the
paragraph?  Also, what does "probing" refer to?  I suspect you mean
that mitigation requests can be Non-confirmable and would by default
fall under the rules of the preceding sentences, but you don't want
that.  So the sentence could be clarified as "However, mitigation
requests MUST NOT be delayed by these limitations."

4.4.1.  Request Mitigation

         with the trailing "=" removed from the encoding

Should be 'the trailing two "="', 'the trailing "="s', or similar,
since the base64 encoding of a string of 16 bytes will always end in
two "=".
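As a quick check of the padding claim, here is a minimal Python
sketch.  The 16-byte value is an arbitrary placeholder, and the
variable names are mine, not the draft's:

```python
import base64

# Arbitrary 16-byte placeholder standing in for a truncated digest.
digest = bytes(range(16))

encoded = base64.b64encode(digest).decode("ascii")

# 16 bytes is five full 3-byte groups plus one leftover byte, so the
# encoding is 24 characters long and ends in two "=" pad characters.
stripped = encoded.rstrip("=")
padding_count = len(encoded) - len(stripped)
```

The padding length depends only on the input length, so the same
count applies to the URL-safe base64 alphabet as well.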

         DOTS servers MUST return 4.09 (Conflict) error code to a DOTS
         peer to notify that the 'cuid' is already in use by another
         DOTS client.

The error code 4.09 has other defined uses in the signal channel.
Given the special and "global" action needed based on this error code,
there must be an unambiguous way for the client to identify cuid
collision.  Unfortunately, there is no "session initiation handshake"
message for which a 4.09 response would be unambiguous.  It seems like
the best choice is to look for conflict-information in the response,
since it has a conflict-cause value "CUID Collision".  But
conflict-information is optional.  I recommend making
conflict-information mandatory in this situation.  However, see my
comments at the end of the section regarding the lack of clarity
whether conflict-information is mandatory or optional.
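
To make the ambiguity concrete, here is a minimal sketch of the check
a client would have to perform.  The function name, the string keys,
and the decoded-dict representation of the response body are
hypothetical, and the numeric value 3 for the "CUID Collision" cause
is my assumption rather than something quoted above:

```python
from typing import Optional

CUID_COLLISION = 3  # assumed numeric value of the "CUID Collision" cause


def classify_4_09(code: str, body: Optional[dict]) -> str:
    """Classify a response as seen by a hypothetical DOTS client."""
    if code != "4.09":
        return "not-a-conflict"
    info = (body or {}).get("conflict-information")
    if info is None:
        # conflict-information is optional, so the client cannot tell
        # a cuid collision from any other kind of conflict here.
        return "ambiguous"
    if info.get("conflict-cause") == CUID_COLLISION:
        return "cuid-collision"
    return "other-conflict"
```

The "ambiguous" branch is the one that making conflict-information
mandatory for cuid collisions would eliminate.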

         If the 'mid' value has reached 3/4 of (2^(32) - 1) (i.e.,
         3221225471) and no attack is detected, the DOTS client MUST
         reset 'mid' to 0 to handle 'mid' rollover.

It sounds like, but does not say explicitly, that mid rollover automatically
invalidates any active high-mid mitigation requests, and thus, if the
client wants to maintain any existing requests, it must recreate them
(necessarily with small mid values).  This needs to be clarified.
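
For reference, the threshold arithmetic in the quoted text can be
checked with a short sketch.  The names are hypothetical, and reading
"has reached" as ">=" is my assumption:

```python
MID_MAX = 2**32 - 1

# 3/4 of (2^32 - 1) is 3221225471.25; the draft's figure is the floor.
ROLLOVER_THRESHOLD = (3 * MID_MAX) // 4


def should_reset_mid(current_mid: int, attack_detected: bool) -> bool:
    """True when the quoted rule requires the client to reset 'mid' to 0."""
    return current_mid >= ROLLOVER_THRESHOLD and not attack_detected
```

The sketch deliberately says nothing about what happens to requests
that were created with mid values above the threshold, which is the
point that needs clarifying.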

      The default value of the parameter is 'true' (that is, the
      mitigation starts immediately).  If 'trigger-mitigation' is not
      present in a request, this is equivalent to receiving a request
      with 'trigger-mitigation' set to 'true'.

The second sentence is completely redundant, but I suspect that a
practical need for it has been discovered.

         ... or the 'cuid' was generated from a rogue DOTS client.

Probably s/from/by/.

But it seems that there is a valid situation where duplicate cuids are
plausible, when two DOTS clients are using the same certificate to
peer with a server because that certificate is what the server
administrator provided to peer with the server.  I don't know if that
is worth mentioning here, though.

         If a DOTS client is provisioned, for example, with distinct
         certificates as a function of the peer server-domain DOTS
         gateway, distinct 'cdid' values may be supplied by a server-
         domain DOTS gateway.  The ultimate DOTS server MUST treat those
         'cdid' values as equivalent.

I'm having a hard time following this, probably because I am not
familiar with the language used to describe these situations.  I think
it means

         If a DOTS client is provisioned, for example, with distinct
         certificates to use to peer with distinct server-domain DOTS
         gateways that peer to the same DOTS server, distinct 'cdid'
         values may be supplied by the gateways to the server.  The
         ultimate DOTS server MUST treat those 'cdid' values as
         equivalent.

The final normative statement is clear, but it isn't clear to me how
the server can implement that, unless it is provisioned with the
knowledge that the two certificates are used by the same client.

More subtly, if the server must treat them as equivalent, dependencies
between transactions in one transaction stream apply to the union of
the transaction streams through the two gateways, e.g., the rule that
mid is nearly-monotonic and the consequences thereof.  Handling this
correctly requires that the client knows that transactions through the
two gateways will be handled equivalently by the same server, and that
seems to require that the client also be configured with particular
knowledge.

It seems to me that there are actually two cases: (1) a "dumb" case
where the client happens to access the same server through two
gateways, but neither the client nor the server knows that.  In that
case, the signal channel protocol "just works" normally. (2) a "smart"
case where both the client and server must know that access through the
two gateways is considered equivalent (but the gateways do not need to
know).  In that case, as long as both the client and server agree on
this equivalence, the signal channel protocol also "just works".

It's not clear that it is necessary to document here the "smart" case,
as the needed adjustments are logically determined by the intended use
case.  If it is not needed, the quoted paragraph is probably best
omitted, because trying to implement it generally would tend to cause
the "dumb" case to fail.

   If the mitigation request
   contains the 'alias-name' and other parameters identifying the target
   resources (such as 'target-prefix', 'target-port-range', 'target-
   fqdn', or 'target-uri'), the DOTS server appends the parameter values
   in 'alias-name' with the corresponding parameter values in 'target-
   prefix', 'target-port-range', 'target-fqdn', or 'target-uri'.

This sentence is not connected with any other processing -- what use
is the concatenated value put to?  Also, the processing described will
NOT be done if alias-name is not present, suggesting that in some way
it is optional.  Also, the phrase "the parameter values in
'alias-name'" is undefined, as alias-name is an opaque string value.
I suspect that some aspect of the processing has not been described.

Perhaps the meaning is that an alias is always configured as a set of
values for the other parameters, and that if a request contains both
an alias name and other parameters, the effective request is formed by
merging the two sets of parameter values.  Though if that is meant,
some provision must be made for the situation where the alias gives a
value for a parameter that is contradicted by an explicit parameter in
the request.
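
If that merging interpretation is the intended one, it might look
like the following sketch.  This is my guess at the semantics, not
something the draft states; the function name and the dict-of-lists
representation are hypothetical:

```python
def merge_alias(alias_params: dict, request_params: dict) -> dict:
    """Union the per-parameter value lists of an alias and a request."""
    merged = {}
    for key in sorted(set(alias_params) | set(request_params)):
        values = list(alias_params.get(key, []))
        for value in request_params.get(key, []):
            if value not in values:
                values.append(value)
        merged[key] = values
    return merged


# Hypothetical alias definition and explicit request parameters.
alias = {"target-prefix": ["2001:db8:6401::/96"],
         "target-port-range": ["443"]}
request = {"target-prefix": ["2001:db8:6401::/96", "192.0.2.0/24"],
           "target-fqdn": ["example.com"]}
effective = merge_alias(alias, request)
```

A plain union never reports a contradiction.  Whether a union, an
intersection, or a rejection is wanted when the alias and the explicit
parameters disagree is exactly what the text leaves open.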

   If the DOTS server does not find the 'mid' parameter value conveyed
   in the PUT request in its configuration data [it may interpret it
   in a certain way]

It's not clear what is going on here, as "mid=..." is a mandatory part
of the Uri-Path, and any such request must be rejected.

   A DOTS server could reject mitigation requests when it is
   near capacity or needs to rate-limit a particular client, for
   example.

This should be a separate paragraph, as it applies more broadly than
the conditions of the first sentence of the paragraph.  Also, it
probably merits s/could/MAY/.

   Two mitigation requests from a DOTS
   client have overlapping scopes if there is a common IP address, IP
   prefix, FQDN, URI, or alias.

Probably worth stating explicitly that a common port number is NOT a
factor in determining overlapping scopes.
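
As I read the quoted definition, the overlap test would be roughly
the following sketch.  The function name and the dict layout are
hypothetical, and port ranges are deliberately ignored:

```python
import ipaddress


def scopes_overlap(a: dict, b: dict) -> bool:
    """Overlap test per the quoted definition; ports are not consulted."""
    nets_a = [ipaddress.ip_network(p, strict=False)
              for p in a.get("target-prefix", [])]
    nets_b = [ipaddress.ip_network(p, strict=False)
              for p in b.get("target-prefix", [])]
    for net_a in nets_a:
        for net_b in nets_b:
            if net_a.version == net_b.version and net_a.overlaps(net_b):
                return True
    # A shared FQDN, URI, or alias also counts as an overlap.
    for key in ("target-fqdn", "target-uri", "alias-name"):
        if set(a.get(key, [])) & set(b.get(key, [])):
            return True
    return False


web = {"target-prefix": ["192.0.2.0/24"], "target-port-range": [80]}
tls = {"target-prefix": ["192.0.2.128/25"], "target-port-range": [443]}
other = {"target-prefix": ["198.51.100.0/24"], "target-port-range": [80]}
```

Here scopes_overlap(web, tls) is true even though the port ranges are
disjoint, and scopes_overlap(web, other) is false even though the
ports match.  If that is not the intended behavior, the definition
needs to mention ports.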

   If the DOTS server receives a mitigation request that overlaps with
   an active mitigation request, but both have distinct 'trigger-
   mitigation' types, the DOTS server SHOULD deactivate (absent explicit
   policy/configuration otherwise) the mitigation request with 'trigger-
   mitigation' set to 'false'.

I'm pretty sure I don't know what this means.  What does "deactivate"
mean?  The first time I read it, I thought it meant "delete".  The
second time, I suspected it meant the opposite action of "activate",
which is what happens to a trigger-mitigation = false mitigation when
the signal channel is lost.  The third time, I was wondering why
the reestablishment of the signal channel didn't automatically
cause the trigger-mitigation = false mitigation to be deactivated.

      conflict-scope:  Characterizes the exact conflict scope.  It may
         include a list of IP addresses, a list of prefixes, a list of
         port numbers, a list of target protocols, a list of FQDNs, a
         list of URIs, a list of aliases, or references to conflicting
         ACLs (by an 'acl-name', typically [RFC8783]).

Note this text includes a "list of port numbers", but port numbers are
not a factor in conflicts.

Also, is it really intended that this parameter is, effectively, only
human-readable, since there is no particular way to specify what type
of datum the value contains?

4.4.2.  Retrieve Information Related to a Mitigation

    +-----------+----------------------------------------------------+
    |         4 | Attack has exceeded the mitigation provider        |
    |           | capability.                                        |
    +-----------+----------------------------------------------------+

"mitigation provider" is used in a few places but it appears that the
intended term is "mitigator".

    +-----------+----------------------------------------------------+
    |         6 | Attack mitigation is now terminated.               |
    +-----------+----------------------------------------------------+

It seems like code 6 includes codes 5 and 7.  Is this ambiguity
intended?  I suspect the text that is actually wanted is "DOTS client
has withdrawn the mitigation request and the attack mitigation is now
terminated."  There is a parallel issue in section 10.6.2.

4.4.2.1.  DOTS Servers Sending Mitigation Status

   DOTS
   implementations MUST use the Observe Option for both 'mitigate' and
   'config' (Section 4.2).

It's not clear what "MUST use the Observe Option" means.  Does it mean
that clients MUST use it in GET requests for 'mitigate' and 'config'?
If so, is the client allowed to use "Observe: 1", even though this
section only discusses the "Observe: 0" case?  Or does it just mean
that servers must implement it, and thus respond correctly if a client
sends it?

4.4.2.2.  DOTS Clients Polling for Mitigation Status

   In such case, the DOTS client recalls the mitigation request by
   issuing a DELETE request for this mitigation request (Section 4.4.4).

The term "recall" is used in a few places but it seems like the
correct term is "withdraw" (section 4.4.4).

4.4.3.  Efficacy Update from DOTS Clients

In what way is an "efficacy update" different from an "update"?  Can
"efficacy" be removed without loss, or is it a term of art for updates
to mitigation requests sent during attacks?

It appears that an update is an "efficacy update" if and only if
"attack-status" is present.  This should be stated at the beginning of
the section, as otherwise it's a mystery what distinguishes "efficacy
updates".

4.4.4.  Withdraw a Mitigation

   Once the request is validated, the DOTS server immediately
   acknowledges a DOTS client's request to withdraw the DOTS signal
   using 2.02 (Deleted) Response Code with no response payload.  

s/DOTS signal/DOTS mitigation request/

4.5.  DOTS Signal Channel Session Configuration

   d.  Acceptable signal loss ratio: Maximum retransmissions,
       retransmission timeout value, and other message transmission
       parameters for Confirmable messages over the DOTS signal channel.

What are the names of these parameters in the signal-config structure?

   As such, the transmission-related
   parameters ('missing-hb-allowed' and acceptable signal loss ratio)
   are negotiated only for DOTS over unreliable transports.

It seems this could be said more clearly by listing the permitted
fields:  "only the 'heartbeat-interval' parameter [or whatever] is
negotiated for DOTS over reliable transports".

4.5.1.  Discover Configuration Parameters

   At least one of the attributes 'heartbeat-interval', 'missing-hb-
   allowed', 'probing-rate', 'max-retransmit', 'ack-timeout', and 'ack-
   random-factor' MUST be present in the PUT request.  Note that
   'heartbeat-interval', 'missing-hb-allowed', 'probing-rate', 'max-
   retransmit', 'ack-timeout', and 'ack-random-factor', if present, do
   not need to be provided for both 'mitigating-config', and 'idle-
   config' in a PUT request.

Must both the mitigating and idle configuration sections be present in
the PUT?  Does the requirement "At least one..." apply to both
sections together or each section alone?  If e.g. missing-hb-allowed
is present in one section but not the other, the wording gives a vague
suggestion that the same value is implicitly provided for the other
section.  Is this true?

   The PUT request with a higher numeric 'sid' value overrides the DOTS
   signal channel session configuration data installed by a PUT request
   with a lower numeric 'sid' value.  To avoid maintaining a long list
   of 'sid' requests from a DOTS client, the lower numeric 'sid' MUST be
   automatically deleted and no longer available at the DOTS server.

Does this mean that the PUT with the higher sid installs what values
it provides on top of the current configuration, or does it mean that
the previous PUT's effect is entirely removed, that is, parameters not
given in the higher-sid PUT take their default values?  Note that the
latter is resistant to problems from lost PUT requests but the former
is not.
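
The difference between the two readings can be shown with a small
sketch.  The function names are hypothetical and the default values
are placeholders for illustration, not quoted from the draft:

```python
# Placeholder defaults for illustration only.
DEFAULTS = {"heartbeat-interval": 30, "max-retransmit": 3}


def apply_overlay(current: dict, put_body: dict) -> dict:
    """Reading 1: a higher-sid PUT overlays the current configuration."""
    updated = dict(current)
    updated.update(put_body)
    return updated


def apply_replace(put_body: dict) -> dict:
    """Reading 2: a higher-sid PUT starts again from the defaults."""
    updated = dict(DEFAULTS)
    updated.update(put_body)
    return updated


put_sid_1 = {"heartbeat-interval": 60}
put_sid_2 = {"max-retransmit": 5}

# Both PUTs arrive at the server.
overlay_both = apply_overlay(apply_overlay(DEFAULTS, put_sid_1), put_sid_2)
replace_both = apply_replace(put_sid_2)

# The sid=1 PUT is lost and only the sid=2 PUT arrives.
overlay_lost = apply_overlay(DEFAULTS, put_sid_2)
replace_lost = apply_replace(put_sid_2)
```

Under the overlay reading, the server's final state depends on whether
the sid=1 PUT arrived (heartbeat-interval is 60 in one case and 30 in
the other).  Under the replace reading, the final state is the same
either way.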

   o  If the DOTS server finds the 'sid' parameter value conveyed in the
      PUT request in its configuration data and if the DOTS server has
      accepted the updated configuration parameters, 2.04 (Changed) MUST
      be returned in the response.

Given the earlier statement "'sid' values MUST increase monotonically
(when a new PUT is generated by a DOTS client to convey the
configuration parameters for the signal channel).", if a server
receives a PUT with the same sid as a previous PUT then the client is
misbehaving and the server should send an error response.

   A DOTS client may issue a GET message with a 'sid' Uri-Path parameter
   to retrieve the negotiated configuration.

Does this sid value matter, or is only its presence important?  Also,
you probably want to expand this to "a GET message for 'config' with a
'sid' Uri-Path parameter ...".

4.5.3.  Configuration Freshness and Notifications

The underlying processing is not made clear.  Roughly, it seems that
the idea is the server has the right to change the configuration
unilaterally at any time, but if the client does a GET of the
configuration, the server is required to commit that it won't change
the configuration given in the response within Max-Age Option seconds.

Or is this talking about a mechanism where the server can, at its
initiative, tell the client how the client should behave?  Which is
completely different from section 4.5.2 where the client tells the
server how to behave.

4.5.4.  Delete DOTS Signal Channel Session Configuration

   Upon bootstrapping or reboot, a DOTS client MAY send a DELETE request
   to set the configuration parameters to default values.  Such a
   request does not include any 'sid'.

I would take it as assumed that when the (D)TLS connection is
established, that is, when the DOTS signal channel session is
initiated, it has the default configuration parameters.  Thus the
DELETE described here is guaranteed to have no effect.  But perhaps
the intention is that the signal channel is conceptualized as
persisting longer than the (D)TLS connection, and (perhaps) associated
with the cuid/cdid value.  If so, that should be stated clearly.

4.6.  Redirected Signaling

   If a DOTS server wants to redirect a DOTS client to an alternative
   DOTS server for a signal session, then the Response Code 5.03
   (Service Unavailable) will be returned in the response to the DOTS
   client.

What is "the response"?  It seems that this is only sensible if the
session is just being established, but there doesn't seem to be a
specific session-initiation message.  If you really mean that the
server can redirect the session in response to any request, it would
be helpful to state that directly.  Also, you need to specify whether
the connection to the alternate server is a new session (with
independent state) or whether it is expected to be a continuation of
the existing session (carrying the same state).

4.7.  Heartbeat Mechanism

   For
   example, if a DOTS client receives a 2.04 response for its heartbeat
   messages but no server-initiated heartbeat messages, the DOTS client
   sets 'peer-hb-status' to 'false'.  The DOTS server then will ...

There is a lot of detail left out here, as there are messages and
events involved that are not mentioned explicitly.  I think what is
meant is "For example, if a DOTS client receives a 2.04 response for
its heartbeat messages but no server-initiated heartbeat messages, the
DOTS client sets 'peer-hb-status' to 'false' in its next heartbeat
message.  Upon receiving that message, the DOTS server then will ..."

It might be useful to explicitly state that the bodies of responses to
heartbeat requests are empty.

6.  YANG/JSON Mapping Parameters to CBOR

It might help implementors if the text stated whether this section is
the same as section 6 of RFC 8782.

10.1.  DOTS Signal Channel UDP and TCP Port Number

   IANA has assigned the port number 4646 (the ASCII decimal value for
   ".." (DOTS)) ...

Ow!
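
For anyone who wants to check the pun, a short Python sketch (the
variable names are mine):

```python
# "." is code point 46 in ASCII, so ".." reads as 46 followed by 46.
port_digits = "".join(str(ord(ch)) for ch in "..")
port = int(port_digits)
```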

[END]