Learning when to give up ("timeout") on circuit construction

Since version, Tor clients attempt to learn when to give up on circuits based on network conditions.

Distribution choice

Based on studies of build times, we found that the distribution of circuit build times appears to be a Frechet distribution (and a multi-modal Frechet distribution, if more than one guard or bridge is used). However, estimators and quantile functions of the Frechet distribution are difficult to work with and slow to converge. So instead, since we are only interested in the accuracy of the tail, clients approximate the tail of the multi-modal distribution with a single Pareto curve.

How much data to record

From our observations, the minimum number of circuit build times for a reasonable fit appears to be on the order of 100. However, to keep a good fit over the long term, clients store 1000 most recent circuit build times in a circular array.

These build times only include the times required to build three-hop circuits, and the times required to build the first three hops of circuits with more than three hops. Circuits of fewer than three hops are not recorded, and hops past the third are not recorded.

The Tor client should build test circuits at a rate of one every 'cbttestfreq' (10 seconds) until 'cbtmincircs' (100 circuits) are built, with a maximum of 'cbtmaxopencircs' (default: 10) circuits open at once. This allows a fresh Tor to have a CircuitBuildTimeout estimated within 30 minutes after install or network change (see Detecting Changing Network Conditions below.)

Timeouts are stored on disk in a histogram of 10ms bin width, the same width used to calculate the Xm value above. The timeouts recorded in the histogram must be shuffled after being read from disk, to preserve a proper expiration of old values after restart.

Thus, some build time resolution is lost during restart. Implementations may choose a different persistence mechanism than this histogram, but be aware that build time binning is still needed for parameter estimation.

Parameter estimation

Once 'cbtmincircs' build times are recorded, Tor clients update the distribution parameters and recompute the timeout every circuit completion (though see below for when to pause and reset timeout due to too many circuits timing out).

Tor clients calculate the parameters for a Pareto distribution fitting the data using the maximum likelihood estimator. For derivation, see: https://en.wikipedia.org/wiki/Pareto_distribution#Estimation_of_parameters

Because build times are not a true Pareto distribution, we alter how Xm is computed. In a max likelihood estimator, the mode of the distribution is used directly as Xm.

Instead of using the mode of discrete build times directly, Tor clients compute the Xm parameter using the weighted average of the midpoints of the 'cbtnummodes' (10) most frequently occurring 10ms histogram bins. Ties are broken in favor of earlier bins (that is, in favor of bins corresponding to shorter build times).

(The use of 10 modes was found to minimize error from the selected cbtquantile, with 10ms bins for quantiles 60-80, compared to many other heuristics).

To avoid ln(1.0+epsilon) precision issues, use log laws to rewrite the estimator for 'alpha' as the sum of logs followed by subtraction, rather than multiplication and division:

alpha = n/(Sum_n{ln(MAX(Xm, x_i))} - n\*ln(Xm))

In this, n is the total number of build times that have completed, x_i is the ith recorded build time, and Xm is the modes of x_i as above.

All times below Xm are counted as having the Xm value via the MAX(), because in Pareto estimators, Xm is supposed to be the lowest value. However, since clients use mode averaging to estimate Xm, there can be values below our Xm. Effectively, the Pareto estimator then treats that everything smaller than Xm happened at Xm. One can also see that if clients did not do this, alpha could underflow to become negative, which results in an exponential curve, not a Pareto probability distribution.

The timeout itself is calculated by using the Pareto Quantile function (the inverted CDF) to give us the value on the CDF such that 80% of the mass of the distribution is below the timeout value (parameter 'cbtquantile').

The Pareto Quantile Function (inverse CDF) is:

F(q) = Xm/((1.0-q)^(1.0/alpha))

Thus, clients obtain the circuit build timeout for 3-hop circuits by computing:

timeout_ms = F(0.8) # 'cbtquantile' == 0.8

With this, we expect that the Tor client will accept the fastest 80% of the total number of paths on the network.

Clients obtain the circuit close time to completely abandon circuits as:

close_ms = F(0.99) # 'cbtclosequantile' == 0.99

To avoid waiting an unreasonably long period of time for circuits that simply have relays that are down, Tor clients cap timeout_ms at the max build time actually observed so far, and cap close_ms at twice this max, but at least 60 seconds:

     timeout_ms = MIN(timeout_ms, max_observed_timeout)
     close_ms = MAX(MIN(close_ms, 2*max_observed_timeout), 'cbtinitialtimeout')

Calculating timeouts thresholds for circuits of different lengths

The timeout_ms and close_ms estimates above are good only for 3-hop circuits, since only 3-hop circuits are recorded in the list of build times.

To calculate the appropriate timeouts and close timeouts for circuits of other lengths, the client multiples the timeout_ms and close_ms values by a scaling factor determined by the number of communication hops needed to build their circuits:

timeout_ms\[hops=n\] = timeout_ms * Actions(N) / Actions(3)

close_ms\[hops=n\] = close_ms * Actions(N) / Actions(3)

where Actions(N) = N * (N + 1) / 2.

To calculate timeouts for operations other than circuit building, the client should add X to Actions(N) for every round-trip communication required with the Xth hop.

How to record timeouts

Pareto estimators begin to lose their accuracy if the tail is omitted. Hence, Tor clients actually calculate two timeouts: a usage timeout, and a close timeout.

Circuits that pass the usage timeout are marked as measurement circuits, and are allowed to continue to build until the close timeout corresponding to the point 'cbtclosequantile' (default 99) on the Pareto curve, or 60 seconds, whichever is greater.

The actual completion times for these measurement circuits should be recorded.

Implementations should completely abandon a circuit and ignore the circuit if the total build time exceeds the close threshold. Such closed circuits should be ignored, as this typically means one of the relays in the path is offline.

Detecting Changing Network Conditions

Tor clients attempt to detect both network connectivity loss and drastic changes in the timeout characteristics.

To detect changing network conditions, clients keep a history of the timeout or non-timeout status of the past 'cbtrecentcount' circuits (20 circuits) that successfully completed at least one hop. If more than 90% of these circuits timeout, the client discards all buildtimes history, resets the timeout to 'cbtinitialtimeout' (60 seconds), and then begins recomputing the timeout.

If the timeout was already at least cbtinitialtimeout, the client doubles the timeout.

The records here (of how many circuits succeeded or failed among the most recent 'cbrrecentcount') are not stored as persistent state. On reload, we start with a new, empty state.

Consensus parameters governing behavior

Clients that implement circuit build timeout learning should obey the following consensus parameters that govern behavior, in order to allow us to handle bugs or other emergent behaviors due to client circuit construction. If these parameters are not present in the consensus, the listed default values should be used instead.

        Default: 0
        Min: 0
        Max: 1
        Effect: If 1, all CircuitBuildTime learning code should be
                disabled and history should be discarded. For use in
                emergency situations only.

        Default: 10
        Min: 1
        Max: 20
        Effect: This value governs how many modes to use in the weighted
        average calculation of Pareto parameter Xm. Selecting Xm as the
        average of multiple modes improves accuracy of the Pareto tail
        for quantile cutoffs from 60-80% (see cbtquantile).

        Default: 20
        Min: 3
        Max: 1000
        Effect: This is the number of circuit build outcomes (success vs
                timeout) to keep track of for the following option.

        Default: 18
        Min: 3
        Max: 10000
        Effect: When this many timeouts happen in the last 'cbtrecentcount'
                circuit attempts, the client should discard all of its
                history and begin learning a fresh timeout value.

                Note that if this parameter's value is greater than the value
                of 'cbtrecentcount', then the history will never be
                discarded because of this feature.

        Default: 100
        Min: 1
        Max: 10000
        Effect: This is the minimum number of circuits to build before
                computing a timeout.

                Note that if this parameter's value is higher than 1000 (the
                number of time observations that a client keeps in its
                circular buffer), circuit build timeout calculation is
                effectively disabled, and the default timeouts are used

        Default: 80
        Min: 10
        Max: 99
        Effect: This is the position on the quantile curve to use to set the
                timeout value. It is a percent (10-99).

        Default: 99
        Min: Value of cbtquantile parameter
        Max: 99
        Effect: This is the position on the quantile curve to use to set the
                timeout value to use to actually close circuits. It is a
                percent (0-99).

        Default: 10
        Min: 1
        Max: 2147483647 (INT32_MAX)
        Effect: Describes how often in seconds to build a test circuit to
                gather timeout values. Only applies if less than 'cbtmincircs'
                have been recorded.

        Default: 10
        Min: 10
        Max: 2147483647 (INT32_MAX)
        Effect: This is the minimum allowed timeout value in milliseconds.

        Default: 60000
        Min: Value of cbtmintimeout
        Max: 2147483647 (INT32_MAX)
        Effect: This is the timeout value to use before we have enough data
                to compute a timeout, in milliseconds.  If we do not have
                enough data to compute a timeout estimate (see cbtmincircs),
                then we use this interval both for the close timeout and the
                abandon timeout.

        Default: 180
        Min: 10
        Max: 60000
        Effect: This is how long idle circuits will be kept open while cbt is
                learning a new timeout value.

        Default: 10
        Min: 0
        Max: 14
        Effect: This is the maximum number of circuits that can be open at
                at the same time during the circuit build time learning phase.