Filename: 297-safer-protover-shutdowns.txt
Title: Relaxing the protover-based shutdown rules
Author: Nick Mathewson
Created: 19-Sep-2018
Status: Closed
Target: 0.3.5.x
Implemented-In: 0.4.0.x

IMPLEMENTATION NOTE:

   We went with the proposed change in section 2.  The "release date" is
   now updated by the "make update-versions" target whenever the version
   number is incremented.  Maintainers may also manually set the "release
   date" to the future.

1. Introduction

   In proposal 264 (now implemented) we introduced the subprotocol
   versioning mechanism to better handle forward-compatibility in
   the Tor network.  Included was a mechanism for safely disabling
   obsolete versions of Tor that no longer ran any supported
   protocols.  If a version of Tor receives a consensus that lists
   as "required" any protocol version that it cannot speak, Tor will
   not start--even if the consensus is in its cache.
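
   As a rough illustration of that check, here is a minimal sketch.  It
   assumes protocol lists are written as space-separated "Name=versions"
   entries, where versions are comma-separated integers or ranges.  The
   function names are hypothetical and are not taken from the Tor
   source.

```python
def parse_protocols(line):
    """Parse e.g. "Link=1-5 LinkAuth=1,3" into {name: set of versions}."""
    result = {}
    for entry in line.split():
        name, _, versions = entry.partition("=")
        supported = result.setdefault(name, set())
        for part in versions.split(","):
            if not part:
                continue
            # "1-5" is a range; "3" is a single version.
            low, _, high = part.partition("-")
            supported.update(range(int(low), int(high or low) + 1))
    return result


def missing_required(required_line, supported_line):
    """Return {name: sorted versions} that are required but not supported."""
    required = parse_protocols(required_line)
    supported = parse_protocols(supported_line)
    missing = {}
    for name, versions in required.items():
        gap = versions - supported.get(name, set())
        if gap:
            missing[name] = sorted(gap)
    return missing


def must_shut_down(required_line, supported_line):
    """Original rule: any missing required protocol means 'do not run'."""
    return bool(missing_required(required_line, supported_line))
```

   Under this sketch, missing_required("LinkAuth=1", "Link=1-5 LinkAuth=3")
   reports LinkAuth version 1 as missing, so must_shut_down() is true
   for that pair.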

   The intended use case for this is that once some protocol has
   been provided by all supported versions for a long time, the
   authorities can mark it as "required".  We had mostly thought
   about the case of adding a requirement.

   This past weekend, though, we found an unwanted side-effect: it
   is hard to safely *un*-require a currently required protocol.

   Here's what happened:

      - Long ago, we created the LinkAuth=1 protocol, which required
        direct access to the ClientRandom and ServerRandom fields.
        (0.2.3.6-alpha)

      - Later, once we implemented Ed25519 identity keys, we added
        an improved LinkAuth=3 protocol, which uses the RFC5705 "key
        export" mechanism. (0.3.0.1-alpha)

      - When we added the subprotocols mechanism, we listed
        LinkAuth=1 as required. (backported to 0.2.9.x)

      - While porting Tor to NSS, we found that LinkAuth=1 couldn't
        be supported, because NSS wisely declines to expose the TLS
        fields it uses.  So we removed "LinkAuth=1" from the
        required list (backported to 0.3.2.x), and got a bunch of
        authorities to upgrade.

      - In 0.3.5.1-alpha, once enough authorities had upgraded, we
        removed "LinkAuth=1" from the supported subprotocols list
        when Tor is running with NSS. [*]

      - We found, however, that this change could cause a bug when
        Tor+NSS started with a cached consensus that was created before
        LinkAuth=1 was removed from the requirements.  Tor would
        decline to start, because the (old) consensus told it that
        LinkAuth=1 was required.
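
   The last step of the timeline can be replayed with a toy model.  The
   supported sets below are illustrative values for the intended state
   of each build (ignoring the bug in the footnote), and the required
   sets are assumptions about what the old and new consensuses said
   about LinkAuth; none of the names come from the Tor source.

```python
# Intended LinkAuth support per TLS backend (illustrative values only).
SUPPORTED_LINKAUTH = {
    "openssl": {1, 3},
    "nss": {3},  # LinkAuth=1 needs TLS fields that NSS does not expose
}


def refuses_to_start(required_linkauth, backend):
    """Original rule: refuse to start if any required version is unsupported."""
    return not required_linkauth <= SUPPORTED_LINKAUTH[backend]


# Assumed requirements: the old consensus still lists LinkAuth=1,
# the fresh one no longer lists it.
old_cached_consensus = {1}
fresh_consensus = set()

print(refuses_to_start(old_cached_consensus, "nss"))
print(refuses_to_start(fresh_consensus, "nss"))
```

   With only the cached consensus on disk, the NSS build refuses to
   start even though the authorities have already dropped the
   requirement.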

   This proposal discusses two alternatives for making it safe to
   remove required subprotocol versions in the future.


   [*] There was actually a bug here: Tor built with OpenSSL dropped
       LinkAuth=1 from its supported list too.  That is mostly beside
       the point for this timeline, except that it would have made
       things much worse if people hadn't caught it.

2. Recommended change: consider the consensus date.

   I propose that when deciding whether to shut down because of
   subprotocol requirements, a Tor implementation should only shut
   down if the consensus is dated to some time after the
   implementation's release date.

   With this change, an old cached consensus cannot cause the
   implementation to shut down, but a newer one can.  This makes it
   safe to put out a release that does not support a formerly
   required protocol, so long as the authorities have upgraded to
   stop requiring that protocol.

   (It is safe to use the *scheduled* release date for the
   implementation, plus a few months -- just so long as we don't
   plan to start requiring a subprotocol that's not supported by the
   latest version of Tor.)
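
   A minimal sketch of the rule in this section follows.  The function
   names, the strict "later than" comparison, and all dates are
   assumptions chosen for illustration; a real implementation may
   compare a different consensus timestamp or use a different margin.

```python
from datetime import datetime, timedelta, timezone


def approx_release_date(scheduled_release, slack_days=0):
    """Scheduled release date plus an optional safety margin."""
    return scheduled_release + timedelta(days=slack_days)


def should_shut_down(missing_required, consensus_time, release_date):
    """Shut down only if a required protocol is missing AND the
    consensus is dated after this implementation's release date."""
    return bool(missing_required) and consensus_time > release_date


# Hypothetical dates.
release = datetime(2018, 10, 1, tzinfo=timezone.utc)
old_cached = datetime(2018, 8, 1, tzinfo=timezone.utc)
newer = datetime(2018, 12, 1, tzinfo=timezone.utc)
missing = {"LinkAuth": [1]}

print(should_shut_down(missing, old_cached, release))  # old cache is ignored
print(should_shut_down(missing, newer, release))       # newer consensus counts
```

   Passing approx_release_date(release, slack_days=90) as the release
   date moves the cutoff about three months later, which corresponds
   to the "plus a few months" margin described above.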

3. Not-recommended change: ignore the cached consensus.

   Was it a mistake to have Tor consider a cached consensus when
   deciding whether to shut down?

   The rationale for considering the cached consensus was that when
   a Tor implementation is obsolete, we don't want it hammering on
   the network, probing for new consensuses, and possibly
   reconnecting aggressively as its handshakes fail.  That still
   seems compelling to me, though if we find some problem with the
   approach in section 2 above, we may need to find some other way
   to achieve this goal.