Filename: 307-onionbalance-v3.txt
Title: Onion Balance Support for Onion Service v3
Author: Nick Mathewson
Created: 03-April-2019
Status: Reserve

   [This proposal is currently in reserve status because bug tor#29583 makes
   it unnecessary. (2020 July 31)]

0. Draft Notes

   2019-07-25:

      At this point in time, the cross-certification is not implemented
      correctly in >= tor-0.3.2.1-alpha. See https://trac.torproject.org/29583
      for more details.

      This proposal assumes that this bug is fixed.

1. Introduction

   The OnionBalance tool allows several independent Tor instances to host an
   onion service, while clients can access that onion service without having
   to take its distributed status into account. OnionBalance works by having
   each instance run a separate onion service. Then, a management server
   periodically downloads the descriptors from those onion services, and
   generates a new descriptor containing the introduction points from each
   instance's onion service.

   OnionBalance is used by several high-profile onion services, including
   Facebook and The Tor Project.

   Unfortunately, because of the cross-certification features in v3 onion
   services, OnionBalance no longer works for them. To a certain extent, this
   breakage is because of a security improvement: It's probably a good thing
   that random third parties can no longer grab an onion service's introduction
   points and claim that they are introduction points for a different service.
   But nonetheless, a lack of a working OnionBalance remains an obstacle for
   v3 onion service migration.

   This proposal describes extensions to v3 onion service design to
   accommodate OnionBalance.

2. Background and Solution

   If an OnionBalance management server wants to provide an aggregate
   descriptor for a v3 onion service, it faces several obstacles that it
   didn't have in v2.

   When the management server goes to construct an aggregated descriptor, it
   will have a mismatch on the "auth-key", "enc-key-cert", and
   "legacy-key-cert" fields: these fields are supposed to certify the onion
   service's current descriptor-signing key, but each of these keys will be
   generated independently by each instance. Because they won't match each
   other, there is no possible key that the aggregated descriptor could use
   for its descriptor signing key.

   In this design, we require that each instance should know in advance about
   a descriptor-signing public key that the aggregate descriptor will use for
   each time period. (I'll explain how they can do this later, in section 3
   below.) They don't have to know the corresponding private key.

   When generating their own onion service descriptors for a given time
   period, the instances generate these additional fields to be used for the
   aggregate descriptor:

       "meta-auth-key"
       "meta-enc-key-cert"
       "meta-legacy-key-cert"

   These fields correspond to "auth-key", "enc-key-cert", and
   "legacy-key-cert" respectively, but differ in one regard: the
   descriptor-signing public key that they certify is _not_ the instance's own
   descriptor-signing key, but rather the aggregate public key for the time
   period.

   Ordinary clients ignore these new fields.

   When the management server creates the aggregate descriptor, it checks that
   the signing key for each of these "meta" fields matches the signing key for
   its corresponding non-"meta" field, and that they certify the correct
   descriptor-signing key-- and then uses these fields in place of their
   corresponding non-"meta" variants.
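
   As a rough illustration of the check described above, here is a minimal
   sketch. The Cert and IntroPoint types and their field names are
   hypothetical: they stand in for certificates that have already been parsed
   and had their signatures verified, which this sketch does not attempt.
   Treating the legacy pair as optional is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cert:
    """Hypothetical, already-parsed and signature-checked certificate."""
    signing_key: bytes    # key that produced the certificate
    certified_key: bytes  # descriptor-signing key that it certifies


@dataclass(frozen=True)
class IntroPoint:
    """Hypothetical view of one introduction point in an instance descriptor."""
    auth_key: Cert
    enc_key_cert: Cert
    meta_auth_key: Cert
    meta_enc_key_cert: Cert
    legacy_key_cert: Optional[Cert] = None       # assumed optional
    meta_legacy_key_cert: Optional[Cert] = None  # assumed optional


def meta_fields_ok(ip: IntroPoint, aggregate_key: bytes) -> bool:
    """Return True if every "meta" field may replace its non-"meta" twin."""
    pairs = [
        (ip.auth_key, ip.meta_auth_key),
        (ip.enc_key_cert, ip.meta_enc_key_cert),
    ]
    # Only check the legacy pair when at least one half of it is present.
    if ip.legacy_key_cert is not None or ip.meta_legacy_key_cert is not None:
        pairs.append((ip.legacy_key_cert, ip.meta_legacy_key_cert))

    for plain, meta in pairs:
        if plain is None or meta is None:
            return False  # one half of a pair is missing
        if meta.signing_key != plain.signing_key:
            return False  # "meta" field was signed by a different key
        if meta.certified_key != aggregate_key:
            return False  # "meta" field certifies the wrong key
    return True
```

   A management server following this sketch would call meta_fields_ok() once
   per introduction point, and skip any introduction point that fails.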

2.1. A quick note on synchronization

   In the design above, and in the section below, I frequently refer to "the
   current time period". By this, I mean the time period for which the
   descriptor is encoded, not the time period in which it is generated.

   Instances and management servers should generate descriptors for the two
   closest time periods, as they do today: no additional synchronization
   should be needed here.
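
   For reference, the sketch below shows one way to compute the time period
   number that a descriptor is encoded for. The default period length of 1440
   minutes and the rotation offset of 720 minutes are assumptions taken from
   my reading of rend-spec-v3, not something this proposal defines. Which two
   periods count as "closest" is left to the existing v3 logic.

```python
def time_period_num(unix_seconds: int,
                    period_length_minutes: int = 1440,
                    rotation_offset_minutes: int = 720) -> int:
    """Return the time period number containing the given Unix time.

    The defaults are assumed values; a real implementation would take
    them from the consensus parameters.
    """
    minutes_since_epoch = unix_seconds // 60
    return (minutes_since_epoch - rotation_offset_minutes) // period_length_minutes


# Example: a timestamp in April 2016.
print(time_period_num(1460546101))  # prints 16903
```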

3. How to distribute descriptor-signing keys

   The design requires that every instance of the onion service knows about
   the public descriptor-signing key that will be used for the aggregate onion
   service. Here I'll discuss how this can be achieved.

3.1. If the instances are trusted

   If the management server trusts each of the instances, it can distribute a
   shared secret to each one of them, and use this shared secret to derive
   each time period's private key.

   For example, if the shared secret is SK, then the private descriptor-
   signing key for each time period could be derived as:

        H("meta-descriptor-signing-key-deriv" |
           onion_service_identity |
           INT_8(period_num) |
           INT_8(period_length) |
           SK )

   (Remember that in the terminology of rend-spec-v3, INT_8() denotes a 64-bit
   integer; see section 0.2 in rend-spec-v3.txt.)
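
   The derivation above could be sketched as follows. This proposal does not
   say which hash function H is, so using SHA3-256 (the H of rend-spec-v3) is
   an assumption here, as is the big-endian encoding of INT_8(). How the
   32-byte output is turned into an Ed25519 private key is deliberately left
   out.

```python
import hashlib
import struct

PREFIX = b"meta-descriptor-signing-key-deriv"


def int_8(value: int) -> bytes:
    """Encode an unsigned integer as 8 bytes, big-endian (assumed order)."""
    return struct.pack(">Q", value)


def derive_period_secret(onion_service_identity: bytes,
                         period_num: int,
                         period_length: int,
                         sk: bytes) -> bytes:
    """Hash the concatenated fields; H is assumed to be SHA3-256."""
    h = hashlib.sha3_256()
    for part in (PREFIX,
                 onion_service_identity,
                 int_8(period_num),
                 int_8(period_length),
                 sk):
        h.update(part)
    return h.digest()
```

   Every party holding SK and the onion service identity derives the same
   value for a given period, so no further coordination is needed once SK has
   been distributed.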

   If the shared secret is ever compromised, then an attacker can impersonate the
   onion service until the shared secret is changed, and can correlate all
   past descriptors for the onion service.

3.2. If the instances are not trusted: Option One

   If the management server does not trust the instances with
   descriptor-signing private keys, another option for it is to simply
   distribute a load of public keys in advance, and use them according to a
   schedule.

   In this design, the management server would pre-generate the
   "descriptor-signing-key-cert" fields for a long time in advance, and
   distribute them to the instances offline. Each one would be
   associated with its corresponding time period.
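
   On the instance side, such a schedule might be held as a simple mapping
   from time period number to the pre-generated certificate. The KeySchedule
   class below is hypothetical, and it treats each certificate as an opaque
   byte string; the offline distribution format is not specified here.

```python
from typing import Dict, Optional


class KeySchedule:
    """Hypothetical instance-side store of pre-distributed certificates."""

    def __init__(self, certs_by_period: Dict[int, bytes]) -> None:
        # Copy the mapping so later changes by the caller have no effect.
        self._certs = dict(certs_by_period)

    def cert_for(self, period_num: int) -> Optional[bytes]:
        """Return the certificate for a period, or None if none was provided."""
        return self._certs.get(period_num)

    def periods_remaining(self, current_period: int) -> int:
        """Count scheduled periods at or after current_period."""
        return sum(1 for p in self._certs if p >= current_period)
```

   An operator could watch periods_remaining() to learn when a fresh batch of
   certificates needs to be distributed.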

   If these certificates were revealed to an attacker, the attacker
   could correlate descriptors for the onion service with one another,
   but could not impersonate the service.

3.3. If the instances are not trusted: Option Two

   Another option for the trust model of 3.2 above is to use the same
   key-blinding method as used for v3 onion services. The management server
   would hold a private descriptor-signing key, and use it to derive a
   different private descriptor-signing key for each time period. The instance
   servers would hold the corresponding public key, and use it to derive a
   different public descriptor-signing key for each time period.
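
   The property this option relies on is that blinding the private key and
   blinding the public key lead to a matching key pair. The sketch below
   demonstrates that property only. It is a toy: it uses modular
   exponentiation in place of Ed25519, and the modulus, generator, and
   blinding-factor string are all made up for illustration. It is not the
   key-blinding construction of rend-spec-v3.

```python
import hashlib

P = 2**127 - 1  # toy prime modulus, NOT a group used by Tor
G = 3           # toy generator


def public_from_private(private_key: int) -> int:
    return pow(G, private_key, P)


def blinding_factor(public_key: int, period_num: int) -> int:
    """Derive a per-period factor from public information only."""
    data = (b"toy-meta-blind"
            + public_key.to_bytes(16, "big")
            + period_num.to_bytes(8, "big"))
    return int.from_bytes(hashlib.sha3_256(data).digest(), "big") % (P - 1)


def blind_private(private_key: int, period_num: int) -> int:
    """Management server side: derive the per-period private key."""
    h = blinding_factor(public_from_private(private_key), period_num)
    return (private_key * h) % (P - 1)


def blind_public(public_key: int, period_num: int) -> int:
    """Instance side: derive the per-period public key without any secret."""
    h = blinding_factor(public_key, period_num)
    return pow(public_key, h, P)


# The two derivations agree on the per-period public key.
a = 123456789
A = public_from_private(a)
assert public_from_private(blind_private(a, 7)) == blind_public(A, 7)
```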

   (For security, the key-blinding function in this case should use a
   different nonce than used in the)

   This design would allow the instances to only be configured once, which
   would be simpler than 3.2 above-- but at a cost. The management server's
   use of a long-term private descriptor-signing key would require it to keep
   that key online. (It could keep the derived private descriptor-signing keys
   online, but the parent key could be derived from them.)

   Here, if the instance's knowledge were revealed to an attacker, the attacker
   could correlate descriptors for the onion service with one another, but
   could not impersonate the service.

4. Some features of this proposal

   We retain the property that each instance service remains accessible as a
   working onion service. However, anyone who can access it can identify it as
   an instance of an OnionBalance service, and correlate its descriptor to the
   aggregate descriptor.

   Instances could use client authorization to ensure that only the management
   server can decrypt their introduction points. However, because of the
   key-blinding features of v3 onion services, nobody who doesn't know the
   onion addresses for the instances can access them anyway: It would be
   sufficient to keep these addresses secret.

   Although anybody who successfully accesses an instance can correlate its
   descriptor to the meta-descriptor, this only works for two descriptors
   within a single time period: You can't match an instance descriptor from
   one time period to a meta-descriptor from another.

A. Acknowledgments

   Thanks to the network team for helping me clarify my ideas here, explore
   options, and better understand some of the implementations and challenges
   in this problem space.

   This research was supported by NSF grants CNS-1526306 and CNS-1619454.