Analysis and discussion

Warning: Take all the PoW performance numbers on this page with a large grain of salt. Most of this is based on very early analysis that has not been updated for the current state of implementation.

For current performance numbers on a specific piece of hardware, please run cargo bench from the equix/bench crate within Arti. This framework tests both the C and Rust implementations side-by-side.

Attacker strategies

To design a protocol and choose its parameters, we first need to understand a few high-level attacker strategies to see what we are fighting against.

Overwhelm PoW verification ("top half")

A basic attack here is the adversary spamming with bogus INTRODUCE messages so that the service does not have computing capacity to even verify the proof-of-work. This adversary tries to overwhelm the procedure in the v1 verification algorithm section.

That's why we need the PoW algorithm to have a cheap verification time so that this attack is not possible: we explore this PoW parameter below in the section on PoW verification.

Overwhelm rendezvous capacity ("bottom half")

Given the way the introduction queue works, a very effective strategy for the attacker is to totally overwhelm the queue processing by sending more high-effort introductions than the onion service can handle at any given tick. This adversary tries to overwhelm the process of handling queued introductions.

To do so, the attacker would have to send at least 20 high-effort INTRODUCE messages every 100ms, where high-effort is a PoW which is above the estimated level of "the motivated user".

An easier attack for the adversary, is the same strategy but with INTRODUCE messages that are all above the comfortable level of "the standard user". This would block out all standard users and only allow motivated users to pass.

Hybrid overwhelm strategy

If both the top- and bottom- halves are processed by the same thread, this opens up the possibility for a "hybrid" attack. Given the performance figures for the bottom half (0.31 ms/req.) and the top half (5.5 ms/req.), the attacker can optimally deny service by submitting 91 high-effort requests and 1520 invalid requests per second. This will completely saturate the main loop because:

  0.31*(1520+91) ~ 0.5 sec.
  5.5*91         ~ 0.5 sec.

This attack only has half the bandwidth requirement of a top-half attack and half the compute requirement of a bottom-half attack..

Alternatively, the attacker can adjust the ratio between invalid and high-effort requests depending on their bandwidth and compute capabilities.

Gaming the effort control logic

Another way to beat this system is for the attacker to game the effort control logic. Essentially, there are two attacks that we are trying to avoid:

  • Attacker sets descriptor suggested-effort to a very high value effectively making it impossible for most clients to produce a PoW token in a reasonable timeframe.
  • Attacker sets descriptor suggested-effort to a very small value so that most clients aim for a small value while the attacker comfortably launches an bottom-half attack using medium effort PoW (see this post by tevador on tor-dev from May 2020).

Precomputed PoW attack

The attacker may precompute many valid PoW nonces and submit them all at once before the current seed expires, overwhelming the service temporarily even using a single computer. The current scheme gives the attackers 4 hours to launch this attack since each seed lasts 2 hours and the service caches two seeds.

An attacker with this attack might be aiming to DoS the service for a limited amount of time, or to cause an effort control attack.

Parameter tuning

There are various parameters in this PoW system that need to be tuned:

We first start by tuning the time it takes to verify a PoW token. We do this first because it's fundamental to the performance of onion services and can turn into a DoS vector of its own. We will do this tuning in a way that's agnostic to the chosen PoW function.

We previously considered the concept of a nonzero starting difficulty setting. This analysis still references such a concept, even though the currently recommended implementation uses a starting effort of zero. (We now expect early increases in effort during an attack to be driven primarily by client retry behavior.)

At the end of this section we will estimate the resources that an attacker needs to overwhelm the onion service, the resources that the service needs to verify introduction requests, and the resources that legitimate clients need to get to the onion service.

PoW verification

Verifying a PoW token is the first thing that a service does when it receives an INTRODUCE2 message. Our current implementation is described by the v1 verification algorithm specification.

Verification time is a critical performance parameter. Actual times can be measured by cargo bench now, and the verification speeds we achieve are more like 50-120 microseconds. The specific numbers below are dated, but the analysys below is preserved as an illustration of the design space we are optimizing within.

To defend against a top-half attack it's important that we can quickly perform all the steps in-between receiving an introduction request over the network and adding it to our effort-prioritized queue.

All time spent verifying PoW adds overhead to the already existing "top half" part of handling an INTRODUCE message. Hence we should be careful to add minimal overhead here.

During our performance measurements on tor we learned that the "top half" takes about 0.26 msecs in average, without doing any sort of PoW verification. Using that value we compute the following table, that describes the number of requests we can queue per second (aka times we can perform the "top half" process) for different values of PoW verification time:

PoW Verification TimeTotal "top half" timeRequests Queued per second
0 msec0.26 msec3846
1 msec1.26 msec793
2 msec2.26 msec442
3 msec3.26 msec306
4 msec4.26 msec234
5 msec5.26 msec190
6 msec6.26 msec159
7 msec7.26 msec137
8 msec8.26 msec121
9 msec9.26 msec107
10 msec10.26 msec97

Here is how you can read the table above:

  • For a PoW function with a 1ms verification time, an attacker needs to send 793 dummy introduction requests per second to succeed in a top-half attack.
  • For a PoW function with a 2ms verification time, an attacker needs to send 442 dummy requests per second to succeed in a top-half attack.
  • For a PoW function with a 10ms verification time, an attacker needs to send 97 dummy requests per second to succeed in a top-half attack.

Whether an attacker can succeed at that depends on the attacker's resources, but also on the network's capacity.

Our purpose here is to have the smallest PoW verification overhead possible that also allows us to achieve all our other goals.

Note that the table above is simply the result of a naive multiplication and does not take into account all the auxiliary overheads that happen every second like the time to invoke the mainloop, the bottom-half processes, or pretty much anything other than the "top-half" processing.

During our measurements the time to handle introduction requests dominates any other action time: There might be events that require a long processing time, but these are pretty infrequent (like uploading a new HS descriptor) and hence over a long time they smooth out. Hence extrapolating the total introduction requests queued per second based on a single "top half" time seems like good enough to get some initial intuition. That said, the values of "Requests queued per second" from the table above, are likely much smaller than displayed above because of all the auxiliary overheads.

PoW difficulty analysis

The difficulty setting of our PoW basically dictates how difficult it should be to get a success in our PoW system. An attacker who can get many successes per second can pull a successful bottom-half attack against our system.

In classic PoW systems, "success" is defined as getting a hash output below the "target". However, since our system is dynamic, we define "success" as an abstract high-effort computation.

The original analysis here concluded that we still need a starting difficulty setting that will be used for bootstrapping the system. The client and attacker can still aim higher or lower but for UX purposes and for analysis purposes it was useful to define a starting difficulty, to minimize retries by clients.

In current use it was found that an effort of 1 makes a fine minimum, so we don't normally have a concept of minimum effort. Consider the actual "minimum effort" in v1 now to simply be the expected runtime of one single Equi-X solve.

Analysis based on adversary power

In this section we will try to do an analysis of PoW difficulty without using any sort of Tor-related or PoW-related benchmark numbers.

We created the table (see [REF_TABLE]) below which shows how much time a legitimate client with a single machine should expect to burn before they get a single success.

The x-axis is how many successes we want the attacker to be able to do per second: the more successes we allow the adversary, the more they can overwhelm our introduction queue. The y-axis is how many machines the adversary has in her disposal, ranging from just 5 to 1000.

       ===============================================================
       |    Expected Time (in seconds) Per Success For One Machine   |
 ===========================================================================
 |                                                                          |
 |   Attacker Succeses        1       5       10      20      30      50    |
 |       per second                                                         |
 |                                                                          |
 |            5               5       1       0       0       0       0     |
 |            50              50      10      5       2       1       1     |
 |            100             100     20      10      5       3       2     |
 | Attacker   200             200     40      20      10      6       4     |
 |  Boxes     300             300     60      30      15      10      6     |
 |            400             400     80      40      20      13      8     |
 |            500             500     100     50      25      16      10    |
 |            1000            1000    200     100     50      33      20    |
 |                                                                          |
 ============================================================================

Here is how you can read the table above:

  • If an adversary has a botnet with 1000 boxes, and we want to limit her to 1 success per second, then a legitimate client with a single box should be expected to spend 1000 seconds getting a single success.
  • If an adversary has a botnet with 1000 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 200 seconds getting a single success.
  • If an adversary has a botnet with 500 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 100 seconds getting a single success.
  • If an adversary has access to 50 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 10 seconds getting a single success.
  • If an adversary has access to 5 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 1 seconds getting a single success.

With the above table we can create some profiles for starting values of our PoW difficulty.

Analysis based on Tor's performance

To go deeper here, we can use the performance measurements on tor to get a more specific intuition on the starting difficulty. In particular, we learned that completely handling an introduction request takes 5.55 msecs in average. Using that value, we can compute the following table, that describes the number of introduction requests we can handle per second for different values of PoW verification:

PoW Verification TimeTotal time to handle requestRequests handled per second
0 msec5.55 msec180.18
1 msec6.55 msec152.67
2 msec7.55 msec132.45
3 msec8.55 msec116.96
4 msec9.55 mesc104.71
5 msec10.55 msec94.79
6 msec11.55 msec86.58
7 msec12.55 msec79.68
8 msec13.55 msec73.80
9 msec14.55 msec68.73
10 msec15.55 msec64.31

Here is how you can read the table above:

  • For a PoW function with a 1ms verification time, an attacker needs to send 152 high-effort introduction requests per second to succeed in a bottom-half attack attack.
  • For a PoW function with a 10ms verification time, an attacker needs to send 64 high-effort introduction requests per second to succeed in a bottom-half attack attack.

We can use this table to specify a starting difficulty that won't allow our target adversary to succeed in an bottom-half attack attack.

Note that in practice verification times are much lower; the scale of the above table does not match the current implementation's reality.

User experience

This proposal has user facing UX consequences.

When the client first attempts a pow, it can note how long iterations of the hash function take, and then use this to determine an estimation of the duration of the PoW. This estimation could be communicated via the control port or other mechanism, such that the browser could display how long the PoW is expected to take on their device. If the device is a mobile platform, and this time estimation is large, it could recommend that the user try from a desktop machine.

Future work

Incremental improvements to this proposal

There are various improvements that can be done in this proposal, and while we are trying to keep this v1 version simple, we need to keep the design extensible so that we build more features into it. In particular:

  • End-to-end introduction ACKs

    This proposal suffers from various UX issues because there is no end-to-end mechanism for an onion service to inform the client about its introduction request. If we had end-to-end introduction ACKs many of the problems seen in client-side effort estimation would be alleviated. The problem here is that end-to-end ACKs require modifications on the introduction point code and a network update which is a lengthy process.

  • Multithreading scheduler

    Our scheduler is pretty limited by the fact that Tor has a single-threaded design. If we improve our multithreading support we could handle a much greater amount of introduction requests per second.

Future designs

This is just the beginning in DoS defences for Tor and there are various future designs and schemes that we can investigate. Here is a brief summary of these:

  • "More advanced PoW schemes" -- We could use more advanced memory-hard PoW schemes like MTP-argon2 or Itsuku to make it even harder for adversaries to create successful PoWs. Unfortunately these schemes have much bigger proof sizes, and they won't fit in INTRODUCE1 messages. See #31223 for more details.

  • "Third-party anonymous credentials" -- We can use anonymous credentials and a third-party token issuance server on the clearnet to issue tokens based on PoW or CAPTCHA and then use those tokens to get access to the service. See [REF_CREDS] for more details.

  • "PoW + Anonymous Credentials" -- We can make a hybrid of the above ideas where we present a hard puzzle to the user when connecting to the onion service, and if they solve it we then give the user a bunch of anonymous tokens that can be used in the future. This can all happen between the client and the service without a need for a third party.

All of the above approaches are much more complicated than the v1 design, and hence we want to start easy before we get into more serious projects. The current implementation requires complexity within the Equi-X implementation but its impact on the overall tor network can be relatively simple.

Environment

This algorithm shares a broad concept, proof of work, with some notoriously power hungry and wasteful software. We love the environment, and we too are concerned with how proof of work schemes typically waste huge amounts of energy by doing useless hash iterations.

Nevertheless, there are some massive differences in both the scale and the dynamics of what we are doing here: we are performing fairly small amounts of computation, and it's used as part of a scheme to disincentivize attacks entirely. If we do our job well, people stop computing these proof-of-work functions entirely and find something else to attack.

We think we aren't making a bad situation worse: DoS attacks on the Tor network are already happening and attackers are already burning energy to carry them out. As we see in the denial-of-service overview, attacks on onion services are in a position to cause downstream resource consumption of nearly every type. Each relay involved experiences increased CPU load from the circuit floods they process. We think that asking legitimate clients to carry out PoW computations doesn't affect the equation too much, since an attacker right now can very quickly use the same resources that hundreds of legitimate clients do in a whole day.

We hope to make things better: The hope is that systems like this will make the DoS actors go away and hence the PoW system will not be used. As long as DoS is happening there will be a waste of energy, but if we manage to demotivate them with technical means, the network as a whole will less wasteful. Also see The DoS Catch-22.

Acknowledgements

Thanks a lot to tevador for the various improvements to the proposal and for helping us understand and tweak the RandomX scheme.

Thanks to Solar Designer for the help in understanding the current PoW landscape, the various approaches we could take, and teaching us a few neat tricks.

Scheduler implementation for C tor

This section describes how we will implement this proposal in the "tor" software (little-t tor).

The following should be read as if tor is an onion service and thus the end point of all inbound data.

The Main Loop

Tor uses libevent for its mainloop. For network I/O operations, a mainloop event is used to inform tor if it can read on a certain socket, or a connection object in tor.

From there, this event will empty the connection input buffer (inbuf) by extracting and processing a request at a time. The mainloop is single threaded and thus each request is handled sequentially.

Processing an INTRODUCE2 message at the onion service means a series of operations (in order):

  1. Unpack relay cell from inbuf to local buffer.
  2. Decrypt cell (AES operations).
  3. Parse relay message and process it depending on its RELAY_COMMAND.
  4. INTRODUCE2 handling which means building a rendezvous circuit:
    • Path selection
    • Launch circuit to first hop.
  5. Return to mainloop event which essentially means back to step (1).

Tor will read at most 32 messages out of the inbuf per mainloop round.

Requirements for PoW

With this proposal, in order to prioritize requests by the amount of PoW work it has done, requests can not be processed sequentially as described above.

Thus, we need a way to queue a certain number of requests, prioritize them and then process some request(s) from the top of the queue (that is, the requests that have done the most PoW effort).

We thus require a new request processing flow that is not compatible with current tor design. The elements are:

  • Validate PoW and place requests in a priority queue of introduction requests (the introduction queue).
  • Defer "bottom half" request processing for after requests have been queued into the priority queue.

Proposed scheduler

The intuitive way to address the queueing requirements above would be to do this simple and naive approach:

  1. Mainloop: Empty inbuf of introduction requests into priority queue
  2. Process all requests in pqueue
  3. Goto (1)

However, we are worried that handling all those requests before returning to the mainloop opens possibilities of attack by an adversary since the priority queue is not gonna be kept up to date while we process all those requests. This means that we might spend lots of time dealing with introductions that don't deserve it.

We thus propose to split the INTRODUCE2 handling into two different steps: "top half" and "bottom half" process.

Top half and bottom half

The top half process is responsible for queuing introductions into the priority queue as follows:

  1. Unpack cell from inbuf to local buffer.
  2. Decrypt cell (AES operations).
  3. Parse INTRODUCE2 message and validate PoW.
  4. Return to mainloop event which essentially means step (1).

The top-half basically does all operations from the main loop section above, excepting (4).

An then, the bottom-half process is responsible for handling introductions and doing rendezvous. To achieve this we introduce a new mainloop event to process the priority queue after the top-half event has completed. This new event would do these operations sequentially:

  1. Pop INTRODUCE2 message from priority queue.
  2. Parse and process INTRODUCE2 message.
  3. End event and yield back to mainloop.

Scheduling the bottom half process

The question now becomes: when should the "bottom half" event get triggered from the mainloop?

We propose that this event is scheduled in when the network I/O event queues at least 1 request into the priority queue. Then, as long as it has a request in the queue, it would re-schedule itself for immediate execution meaning at the next mainloop round, it would execute again.

The idea is to try to empty the queue as fast as it can in order to provide a fast response time to an introduction request but always leave a chance for more requests to appear between request processing by yielding back to the mainloop. With this we are aiming to always have the most up-to-date version of the priority queue when we are completing introductions: this way we are prioritizing clients that spent a lot of time and effort completing their PoW.

If the size of the queue drops to 0, it stops scheduling itself in order to not create a busy loop. The network I/O event will re-schedule it in time.

Notice that the proposed solution will make the service handle 1 single introduction request at every main loop event. However, when we do performance measurements we might learn that it's preferable to bump the number of requests in the future from 1 to N where N <= 32.

Performance measurements

This section will detail the performance measurements we've done on tor.git for handling an INTRODUCE2 message and then a discussion on how much more CPU time we can add (for PoW validation) before it badly degrades our performance.

Tor measurements

In this section we will derive measurement numbers for the "top half" and "bottom half" parts of handling an introduction request.

These measurements have been done on tor.git at commit 80031db32abebaf4d0a91c01db258fcdbd54a471.

We've measured several set of actions of the INTRODUCE2 message handling process on Intel(R) Xeon(R) CPU E5-2650 v4. Our service was accessed by an array of clients that sent introduction requests for a period of 60 seconds.

  1. Full Mainloop Event

    We start by measuring the full time it takes for a mainloop event to process an inbuf containing INTRODUCE2 messages. The mainloop event processed 2.42 messages per invocation on average during our measurements.

      Total measurements: 3279
    
        Min: 0.30 msec - 1st Q.: 5.47 msec - Median: 5.91 msec
        Mean: 13.43 msec - 3rd Q.: 16.20 msec - Max: 257.95 msec
    
  2. INTRODUCE2 message processing (bottom-half)

    We also measured how much time the "bottom half" part of the process takes. That's the heavy part of processing an introduction request as seen in step (4) of the main loop section above:

      Total measurements: 7931
    
        Min: 0.28 msec - 1st Q.: 5.06 msec - Median: 5.33 msec
        Mean: 5.29 msec - 3rd Q.: 5.57 msec - Max: 14.64 msec
    
  3. Connection data read (top half)

    Now that we have the above pieces, we can use them to measure just the "top half" part of the procedure. That's when bytes are taken from the connection inbound buffer and parsed into an INTRODUCE2 message where basic validation is done.

    There is an average of 2.42 INTRODUCE2 messages per mainloop event and so we divide that by the full mainloop event mean time to get the time for one message. From that we subtract the "bottom half" mean time to get how much the "top half" takes:

       => 13.43 / (7931 / 3279) = 5.55
       => 5.55 - 5.29 = 0.26
    
       Mean: 0.26 msec
    

To summarize, during our measurements the average number of INTRODUCE2 messages a mainloop event processed is ~2.42 messages (7931 messages for 3279 mainloop invocations).

This means that, taking the mean of mainloop event times, it takes ~5.55msec (13.43/2.42) to completely process an INTRODUCE2 messages. Then if we look deeper we see that the "top half" of INTRODUCE2 message processing takes 0.26 msec in average, whereas the "bottom half" takes around 5.33 msec.

The heavyness of the "bottom half" is to be expected since that's where 95% of the total work takes place: in particular the rendezvous path selection and circuit launch.

References

    [REF_EQUIX]: https://github.com/tevador/equix
                 https://github.com/tevador/equix/blob/master/devlog.md
    [REF_TABLE]: The table is based on the script below plus some manual editing for readability:
                 https://gist.github.com/asn-d6/99a936b0467b0cef88a677baaf0bbd04
    [REF_BOTNET]: https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2009/07/01121538/ynam_botnets_0907_en.pdf
    [REF_CREDS]: https://lists.torproject.org/pipermail/tor-dev/2020-March/014198.html
    [REF_TARGET]: https://en.bitcoin.it/wiki/Target
    [REF_TEVADOR_2]: https://lists.torproject.org/pipermail/tor-dev/2020-June/014358.html
    [REF_TEVADOR_SIM]: https://github.com/mikeperry-tor/scratchpad/blob/master/tor-pow/effort_sim.py#L57