Hi,
A little bit of context…
The Diameter base protocol requires an application-layer watchdog, which is defined in RFC 3539 (Authentication, Authorization and Accounting (AAA) Transport Profile), section 3.4, Application Layer Watchdog.
RFC 3539 also provides a “Detailed Watchdog Algorithm” in Appendix A.
The pseudo-C implementation of the watchdog algorithm in Appendix A is much stricter about resetting the Tw timer, specifically in the REOPEN state, than the requirements in the “Algorithm Overview” section. While the requirements say, quote:
[2] When any AAA message is received, Tw is reset. This need not be
a response to a watchdog request. Receiving a watchdog response
from a peer constitutes activity, and Tw should be reset. If the
watchdog timer expires and no watchdog response is pending, then
a watchdog message is sent. On sending a watchdog request, Tw is
reset.
the pseudo-C implementation in the REOPEN state does not even reset the timer on receiving a DWA; it clears the pending DWA flag and sends the next watchdog request only when Tw expires.
Assuming the default value of 30 s for Twinit and 3 successful watchdog transactions for the transition from REOPEN to OKAY, it always takes approx. 1 minute to reach the OKAY state, regardless of what the peer actually does.
The diameter application in Erlang/OTP implements the stricter version, the one in Appendix A. The initial value of the watchdog timer (Tw) and the number of watchdog transactions required to transition from one state to another are, however, configurable. See here:
- diameter — diameter v2.6
- diameter — diameter v2.6: {watchdog_config, [{okay|suspect, non_neg_integer()}]}
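For readers who have not used these options, here is a minimal sketch of where they go. It assumes a service named my_service has already been started and that the peer is reached over SCTP; the service name, the peer address and the values chosen are placeholders for illustration, not recommendations. As far as I read the documentation, an integer watchdog_timer is TwInit in milliseconds and must be at least 6000, and changing watchdog_config away from its default is described there as non-standard behaviour meant for testing.

```erlang
%% Sketch only: my_service, the peer address and the values below are
%% hypothetical placeholders.
TransportOpts =
    [{transport_module, diameter_sctp},
     {transport_config, [{raddr, {192,0,2,10}},   % hypothetical peer address
                         {rport, 3868}]},
     %% TwInit in milliseconds; the documented default is 30000.
     {watchdog_timer, 6000},
     %% okay:    answered DWRs needed for REOPEN -> OKAY
     %% suspect: watchdog timeouts before OKAY -> SUSPECT
     %% The documented default is [{okay, 3}, {suspect, 1}].
     {watchdog_config, [{okay, 1}, {suspect, 1}]}],
{ok, _Ref} = diameter:add_transport(my_service, {connect, TransportOpts}).
```

With these example values the REOPEN phase would end after a single answered DWR, at the cost of deviating from what RFC 3539 prescribes.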
However, as always, things tend to be more complicated in production when different Diameter peers communicate with each other, especially when they are based on different implementations of the Diameter base protocol and RFC 3539.
For example:
1. Two peers already have one SCTP connection carrying Diameter traffic (OKAY state).
2. The connection goes down due to transport-layer or application-layer errors.
3. Both peers mark the connection as DOWN and try to reopen it.
4. As soon as the connection is re-established, after the CER/CEA exchange, the watchdog kicks in on both sides in the REOPEN state.
5. One peer implements the “relaxed” watchdog requirements: Tw is reset on any message answered back (including DWA). The three watchdog transactions finish in about 200 ms because they basically sync on the RTT of the link. At this point the peer starts sending Diameter requests at whatever rate it receives them from the upstream peer; if that rate happens to be high, some other interesting things happen.
6. The other peer (Erlang/OTP based, using the default watchdog configuration) completes the 3 watchdog transactions within 1 minute (!), dropping all traffic received from its peer during this time window.
7. The “fast” watchdog peer detects too many outstanding Diameter transactions, eventually times out based on some kind of application-layer metric, and shuts down the connection again. Everything starts over from step 3.
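To make the timing mismatch in the scenario above concrete, here is a small back-of-the-envelope model. It is not code from the diameter application: the module name wd_timing, the mode names strict and eager, and the RTT value are my own assumptions. strict models what I described for Appendix A (first DWR sent when the connection comes up, each following DWR only when Tw expires); eager models a peer that sends the next DWR as soon as the previous DWA arrives, which is what the roughly 200 ms observation suggests. Jitter on Tw is ignored and the RTT is assumed to be shorter than Tw.

```erlang
-module(wd_timing).
-export([dwa_times/4, time_to_okay/4]).

%% Hypothetical model of the REOPEN phase, for illustration only.
%% Mode  : strict - the next DWR is sent only when Tw expires
%%         eager  - the next DWR is sent as soon as the previous DWA arrives
%% TwMs  : watchdog timer in milliseconds (jitter ignored)
%% RttMs : assumed round-trip time of one DWR/DWA exchange in milliseconds
%% N     : answered DWRs required to leave REOPEN

%% Arrival times (ms after the connection comes up) of the N DWAs,
%% assuming the first DWR is sent at time 0.
dwa_times(strict, TwMs, RttMs, N) ->
    [K * TwMs + RttMs || K <- lists:seq(0, N - 1)];
dwa_times(eager, _TwMs, RttMs, N) ->
    [K * RttMs || K <- lists:seq(1, N)].

%% Time at which the N-th DWA arrives, i.e. when REOPEN -> OKAY happens.
time_to_okay(Mode, TwMs, RttMs, N) when N > 0 ->
    lists:last(dwa_times(Mode, TwMs, RttMs, N)).
```

With Tw = 30000, N = 3 and an assumed RTT of 67 ms, wd_timing:time_to_okay(strict, 30000, 67, 3) returns 60067, while wd_timing:time_to_okay(eager, 30000, 67, 3) returns 201. In this model that leaves a window of almost a full minute in which one side already sends requests and the other side still drops them.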
My questions:
- Is the strict watchdog algorithm (1 minute, 3 WD transactions) necessary on SCTP? For example, SCTP does not even support half-open/half-closed connections. Why not synchronize on the traffic received from the other side?
- Are the default min = 6 s / max = 30 s values for Twinit still relevant for the Diameter applications used today, and for the RTTs and bandwidth of the links used for Diameter transport-layer connections (e.g. 100 Mb - 1 Gb)?
Thanks a lot,
Cristian