Warning: Take all the PoW performance numbers on this page with a large grain of salt. Most of this is based on very early analysis that has not been updated for the current state of implementation.
To design a protocol and choose its parameters, we first need to understand a few high-level attacker strategies to see what we are fighting against.
A basic attack here is the adversary spamming with bogus INTRO cells so that the service does not have computing capacity to even verify the proof-of-work. This adversary tries to overwhelm the procedure in the
v1 verification algorithm section.
That's why we need the PoW algorithm to have a cheap verification time so that this attack is not possible: we explore this PoW parameter below in the section on PoW verification.
Given the way the introduction queue works, a very effective strategy for the attacker is to totally overwhelm the queue processing by sending more high-effort introductions than the onion service can handle at any given tick. This adversary tries to overwhelm the process of handling queued introductions.
To do so, the attacker would have to send at least 20 high-effort introduction cells every 100ms, where high-effort is a PoW which is above the estimated level of "the motivated user".
An easier attack for the adversary, is the same strategy but with introduction cells that are all above the comfortable level of "the standard user". This would block out all standard users and only allow motivated users to pass.
If both the top- and bottom- halves are processed by the same thread, this opens up the possibility for a "hybrid" attack. Given the performance figures for the bottom half (0.31 ms/req.) and the top half (5.5 ms/req.), the attacker can optimally deny service by submitting 91 high-effort requests and 1520 invalid requests per second. This will completely saturate the main loop because:
0.31*(1520+91) ~ 0.5 sec. 5.5*91 ~ 0.5 sec.
Alternatively, the attacker can adjust the ratio between invalid and high-effort requests depending on their bandwidth and compute capabilities.
Another way to beat this system is for the attacker to game the effort control logic. Essentially, there are two attacks that we are trying to avoid:
- Attacker sets descriptor suggested-effort to a very high value effectively making it impossible for most clients to produce a PoW token in a reasonable timeframe.
- Attacker sets descriptor suggested-effort to a very small value so that most clients aim for a small value while the attacker comfortably launches an bottom-half attack using medium effort PoW (see this post by tevador on tor-dev from May 2020).
The attacker may precompute many valid PoW nonces and submit them all at once before the current seed expires, overwhelming the service temporarily even using a single computer. The current scheme gives the attackers 4 hours to launch this attack since each seed lasts 2 hours and the service caches two seeds.
An attacker with this attack might be aiming to DoS the service for a limited amount of time, or to cause an effort control attack.
There are various parameters in this PoW system that need to be tuned:
We first start by tuning the time it takes to verify a PoW token. We do this first because it's fundamental to the performance of onion services and can turn into a DoS vector of its own. We will do this tuning in a way that's agnostic to the chosen PoW function.
We previously considered the concept of a nonzero starting difficulty setting. This analysis still references such a concept, even though the currently recommended implementation uses a starting effort of zero. (We now expect early increases in effort during an attack to be driven primarily by client retry behavior.)
At the end of this section we will estimate the resources that an attacker needs to overwhelm the onion service, the resources that the service needs to verify introduction requests, and the resources that legitimate clients need to get to the onion service.
Verifying a PoW token is the first thing that a service does when it receives an INTRODUCE2 cell. Our current implementation is described by the
v1 verification algorithm specification.
Verification time is a critical performance parameter. Actual times can be measured by
cargo bench now, and the verification speeds we achieve are more like 50-120 microseconds. The specific numbers below are dated, but the analysys below is preserved as an illustration of the design space we are optimizing within.
To defend against a top-half attack it's important that we can quickly perform all the steps in-between receiving an introduction request over the network and adding it to our effort-prioritized queue.
All time spent verifying PoW adds overhead to the already existing "top half" part of handling an introduction cell. Hence we should be careful to add minimal overhead here.
During our performance measurements on tor we learned that the "top half" takes about 0.26 msecs in average, without doing any sort of PoW verification. Using that value we compute the following table, that describes the number of cells we can queue per second (aka times we can perform the "top half" process) for different values of PoW verification time:
|PoW Verification Time||Total "top half" time||Cells Queued per second|
|0 msec||0.26 msec||3846|
|1 msec||1.26 msec||793|
|2 msec||2.26 msec||442|
|3 msec||3.26 msec||306|
|4 msec||4.26 msec||234|
|5 msec||5.26 msec||190|
|6 msec||6.26 msec||159|
|7 msec||7.26 msec||137|
|8 msec||8.26 msec||121|
|9 msec||9.26 msec||107|
|10 msec||10.26 msec||97|
Here is how you can read the table above:
- For a PoW function with a 1ms verification time, an attacker needs to send 793 dummy introduction cells per second to succeed in a top-half attack.
- For a PoW function with a 2ms verification time, an attacker needs to send 442 dummy introduction cells per second to succeed in a top-half attack.
- For a PoW function with a 10ms verification time, an attacker needs to send 97 dummy introduction cells per second to succeed in a top-half attack.
Whether an attacker can succeed at that depends on the attacker's resources, but also on the network's capacity.
Our purpose here is to have the smallest PoW verification overhead possible that also allows us to achieve all our other goals.
Note that the table above is simply the result of a naive multiplication and does not take into account all the auxiliary overheads that happen every second like the time to invoke the mainloop, the bottom-half processes, or pretty much anything other than the "top-half" processing.
During our measurements the time to handle INTRODUCE2 cells dominates any other action time: There might be events that require a long processing time, but these are pretty infrequent (like uploading a new HS descriptor) and hence over a long time they smooth out. Hence extrapolating the total cells queued per second based on a single "top half" time seems like good enough to get some initial intuition. That said, the values of "Cells queued per second" from the table above, are likely much smaller than displayed above because of all the auxiliary overheads.
The difficulty setting of our PoW basically dictates how difficult it should be to get a success in our PoW system. An attacker who can get many successes per second can pull a successful bottom-half attack against our system.
In classic PoW systems, "success" is defined as getting a hash output below the "target". However, since our system is dynamic, we define "success" as an abstract high-effort computation.
The original analysis here concluded that we still need a starting difficulty setting that will be used for bootstrapping the system. The client and attacker can still aim higher or lower but for UX purposes and for analysis purposes it was useful to define a starting difficulty, to minimize retries by clients.
In current use it was found that an effort of 1 makes a fine minimum, so we don't normally have a concept of minimum effort. Consider the actual "minimum effort" in
v1 now to simply be the expected runtime of one single Equi-X solve.
In this section we will try to do an analysis of PoW difficulty without using any sort of Tor-related or PoW-related benchmark numbers.
We created the table (see
[REF_TABLE]) below which shows how much time a legitimate client with a single machine should expect to burn before they get a single success.
The x-axis is how many successes we want the attacker to be able to do per second: the more successes we allow the adversary, the more they can overwhelm our introduction queue. The y-axis is how many machines the adversary has in her disposal, ranging from just 5 to 1000.
=============================================================== | Expected Time (in seconds) Per Success For One Machine | =========================================================================== | | | Attacker Succeses 1 5 10 20 30 50 | | per second | | | | 5 5 1 0 0 0 0 | | 50 50 10 5 2 1 1 | | 100 100 20 10 5 3 2 | | Attacker 200 200 40 20 10 6 4 | | Boxes 300 300 60 30 15 10 6 | | 400 400 80 40 20 13 8 | | 500 500 100 50 25 16 10 | | 1000 1000 200 100 50 33 20 | | | ============================================================================
Here is how you can read the table above:
- If an adversary has a botnet with 1000 boxes, and we want to limit her to 1 success per second, then a legitimate client with a single box should be expected to spend 1000 seconds getting a single success.
- If an adversary has a botnet with 1000 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 200 seconds getting a single success.
- If an adversary has a botnet with 500 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 100 seconds getting a single success.
- If an adversary has access to 50 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 10 seconds getting a single success.
- If an adversary has access to 5 boxes, and we want to limit her to 5 successes per second, then a legitimate client with a single box should be expected to spend 1 seconds getting a single success.
With the above table we can create some profiles for starting values of our PoW difficulty.
To go deeper here, we can use the performance measurements on tor to get a more specific intuition on the starting difficulty. In particular, we learned that completely handling an introduction cell takes 5.55 msecs in average. Using that value, we can compute the following table, that describes the number of introduction cells we can handle per second for different values of PoW verification:
|PoW Verification Time||Total time to handle introduction cell||Cells handled per second|
|0 msec||5.55 msec||180.18|
|1 msec||6.55 msec||152.67|
|2 msec||7.55 msec||132.45|
|3 msec||8.55 msec||116.96|
|4 msec||9.55 mesc||104.71|
|5 msec||10.55 msec||94.79|
|6 msec||11.55 msec||86.58|
|7 msec||12.55 msec||79.68|
|8 msec||13.55 msec||73.80|
|9 msec||14.55 msec||68.73|
|10 msec||15.55 msec||64.31|
Here is how you can read the table above:
- For a PoW function with a 1ms verification time, an attacker needs to send 152 high-effort introduction cells per second to succeed in a bottom-half attack attack.
- For a PoW function with a 10ms verification time, an attacker needs to send 64 high-effort introduction cells per second to succeed in a bottom-half attack attack.
We can use this table to specify a starting difficulty that won't allow our target adversary to succeed in an bottom-half attack attack.
Note that in practice verification times are much lower; the scale of the above table does not match the current implementation's reality.
This proposal has user facing UX consequences.
When the client first attempts a pow, it can note how long iterations of the hash function take, and then use this to determine an estimation of the duration of the PoW. This estimation could be communicated via the control port or other mechanism, such that the browser could display how long the PoW is expected to take on their device. If the device is a mobile platform, and this time estimation is large, it could recommend that the user try from a desktop machine.
There are various improvements that can be done in this proposal, and while we are trying to keep this
v1 version simple, we need to keep the design extensible so that we build more features into it. In particular:
End-to-end introduction ACKs
This proposal suffers from various UX issues because there is no end-to-end mechanism for an onion service to inform the client about its introduction request. If we had end-to-end introduction ACKs many of the problems seen in client-side effort estimation would be alleviated. The problem here is that end-to-end ACKs require modifications on the introduction point code and a network update which is a lengthy process.
Our scheduler is pretty limited by the fact that Tor has a single-threaded design. If we improve our multithreading support we could handle a much greater amount of introduction requests per second.
This is just the beginning in DoS defences for Tor and there are various future designs and schemes that we can investigate. Here is a brief summary of these:
"More advanced PoW schemes" -- We could use more advanced memory-hard PoW schemes like MTP-argon2 or Itsuku to make it even harder for adversaries to create successful PoWs. Unfortunately these schemes have much bigger proof sizes, and they won't fit in INTRODUCE1 cells. See #31223 for more details.
"Third-party anonymous credentials" -- We can use anonymous credentials and a third-party token issuance server on the clearnet to issue tokens based on PoW or CAPTCHA and then use those tokens to get access to the service. See
[REF_CREDS]for more details.
"PoW + Anonymous Credentials" -- We can make a hybrid of the above ideas where we present a hard puzzle to the user when connecting to the onion service, and if they solve it we then give the user a bunch of anonymous tokens that can be used in the future. This can all happen between the client and the service without a need for a third party.
All of the above approaches are much more complicated than the
v1 design, and hence we want to start easy before we get into more serious projects.
The current implementation requires complexity within the Equi-X implementation but its impact on the overall tor network can be relatively simple.
This algorithm shares a broad concept, proof of work, with some notoriously power hungry and wasteful software. We love the environment, and we too are concerned with how proof of work schemes typically waste huge amounts of energy by doing useless hash iterations.
Nevertheless, there are some massive differences in both the scale and the dynamics of what we are doing here: we are performing fairly small amounts of computation, and it's used as part of a scheme to disincentivize attacks entirely. If we do our job well, people stop computing these proof-of-work functions entirely and find something else to attack.
We think we aren't making a bad situation worse: DoS attacks on the Tor network are already happening and attackers are already burning energy to carry them out. As we see in the denial-of-service overview, attacks on onion services are in a position to cause downstream resource consumption of nearly every type. Each relay involved experiences increased CPU load from the circuit floods they process. We think that asking legitimate clients to carry out PoW computations doesn't affect the equation too much, since an attacker right now can very quickly use the same resources that hundreds of legitimate clients do in a whole day.
We hope to make things better: The hope is that systems like this will make the DoS actors go away and hence the PoW system will not be used. As long as DoS is happening there will be a waste of energy, but if we manage to demotivate them with technical means, the network as a whole will less wasteful. Also see The DoS Catch-22.
Thanks a lot to tevador for the various improvements to the proposal and for helping us understand and tweak the RandomX scheme.
Thanks to Solar Designer for the help in understanding the current PoW landscape, the various approaches we could take, and teaching us a few neat tricks.
This section describes how we will implement this proposal in the "tor" software (little-t tor).
The following should be read as if tor is an onion service and thus the end point of all inbound data.
Tor uses libevent for its mainloop. For network I/O operations, a mainloop event is used to inform tor if it can read on a certain socket, or a connection object in tor.
From there, this event will empty the connection input buffer (inbuf) by extracting and processing a cell at a time. The mainloop is single threaded and thus each cell is handled sequentially.
Processing an INTRODUCE2 cell at the onion service means a series of operations (in order):
- Unpack cell from inbuf to local buffer.
- Decrypt cell (AES operations).
- Parse cell header and process it depending on its RELAY_COMMAND.
- INTRODUCE2 cell handling which means building a rendezvous circuit:
- Path selection
- Launch circuit to first hop.
- Return to mainloop event which essentially means back to step (1).
Tor will read at most 32 cells out of the inbuf per mainloop round.
With this proposal, in order to prioritize cells by the amount of PoW work it has done, cells can not be processed sequentially as described above.
Thus, we need a way to queue a certain number of cells, prioritize them and then process some cell(s) from the top of the queue (that is, the cells that have done the most PoW effort).
We thus require a new cell processing flow that is not compatible with current tor design. The elements are:
- Validate PoW and place cells in a priority queue of INTRODUCE2 cells (the introduction queue).
- Defer "bottom half" INTRO2 cell processing for after cells have been queued into the priority queue.
The intuitive way to address the queueing requirements above would be to do this simple and naive approach:
- Mainloop: Empty inbuf INTRODUCE2 cells into priority queue
- Process all cells in pqueue
- Goto (1)
However, we are worried that handling all those cells before returning to the mainloop opens possibilities of attack by an adversary since the priority queue is not gonna be kept up to date while we process all those cells. This means that we might spend lots of time dealing with introductions that don't deserve it.
We thus propose to split the INTRODUCE2 handling into two different steps: "top half" and "bottom half" process.
The top half process is responsible for queuing introductions into the priority queue as follows:
- Unpack cell from inbuf to local buffer.
- Decrypt cell (AES operations).
- Parse INTRODUCE2 cell header and validate PoW.
- Return to mainloop event which essentially means step (1).
The top-half basically does all operations from the main loop section above, excepting (4).
An then, the bottom-half process is responsible for handling introductions and doing rendezvous. To achieve this we introduce a new mainloop event to process the priority queue after the top-half event has completed. This new event would do these operations sequentially:
- Pop INTRODUCE2 cell from priority queue.
- Parse and process INTRODUCE2 cell.
- End event and yield back to mainloop.
The question now becomes: when should the "bottom half" event get triggered from the mainloop?
We propose that this event is scheduled in when the network I/O event queues at least 1 cell into the priority queue. Then, as long as it has a cell in the queue, it would re-schedule itself for immediate execution meaning at the next mainloop round, it would execute again.
The idea is to try to empty the queue as fast as it can in order to provide a fast response time to an introduction request but always leave a chance for more cells to appear between cell processing by yielding back to the mainloop. With this we are aiming to always have the most up-to-date version of the priority queue when we are completing introductions: this way we are prioritizing clients that spent a lot of time and effort completing their PoW.
If the size of the queue drops to 0, it stops scheduling itself in order to not create a busy loop. The network I/O event will re-schedule it in time.
Notice that the proposed solution will make the service handle 1 single introduction request at every main loop event. However, when we do performance measurements we might learn that it's preferable to bump the number of cells in the future from 1 to N where N <= 32.
This section will detail the performance measurements we've done on
tor.git for handling an INTRODUCE2 cell and then a discussion on how much more CPU time we can add (for PoW validation) before it badly degrades our performance.
In this section we will derive measurement numbers for the "top half" and "bottom half" parts of handling an introduction cell.
These measurements have been done on tor.git at commit
We've measured several set of actions of the INTRODUCE2 cell handling process on Intel(R) Xeon(R) CPU E5-2650 v4. Our service was accessed by an array of clients that sent introduction requests for a period of 60 seconds.
Full Mainloop Event
We start by measuring the full time it takes for a mainloop event to process an inbuf containing INTRODUCE2 cells. The mainloop event processed 2.42 cells per invocation on average during our measurements.
Total measurements: 3279 Min: 0.30 msec - 1st Q.: 5.47 msec - Median: 5.91 msec Mean: 13.43 msec - 3rd Q.: 16.20 msec - Max: 257.95 msec
INTRODUCE2 cell processing (bottom-half)
We also measured how much time the "bottom half" part of the process takes. That's the heavy part of processing an introduction request as seen in step (4) of the main loop section above:
Total measurements: 7931 Min: 0.28 msec - 1st Q.: 5.06 msec - Median: 5.33 msec Mean: 5.29 msec - 3rd Q.: 5.57 msec - Max: 14.64 msec
Connection data read (top half)
Now that we have the above pieces, we can use them to measure just the "top half" part of the procedure. That's when bytes are taken from the connection inbound buffer and parsed into an INTRODUCE2 cell where basic validation is done.
There is an average of 2.42 INTRODUCE2 cells per mainloop event and so we divide that by the full mainloop event mean time to get the time for one cell. From that we subtract the "bottom half" mean time to get how much the "top half" takes:
=> 13.43 / (7931 / 3279) = 5.55 => 5.55 - 5.29 = 0.26 Mean: 0.26 msec
To summarize, during our measurements the average number of INTRODUCE2 cells a mainloop event processed is ~2.42 cells (7931 cells for 3279 mainloop invocations).
This means that, taking the mean of mainloop event times, it takes ~5.55msec (13.43/2.42) to completely process an INTRODUCE2 cell. Then if we look deeper we see that the "top half" of INTRODUCE2 cell processing takes 0.26 msec in average, whereas the "bottom half" takes around 5.33 msec.
The heavyness of the "bottom half" is to be expected since that's where 95% of the total work takes place: in particular the rendezvous path selection and circuit launch.
[REF_EQUIX]: https://github.com/tevador/equix https://github.com/tevador/equix/blob/master/devlog.md [REF_TABLE]: The table is based on the script below plus some manual editing for readability: https://gist.github.com/asn-d6/99a936b0467b0cef88a677baaf0bbd04 [REF_BOTNET]: https://media.kasperskycontenthub.com/wp-content/uploads/sites/43/2009/07/01121538/ynam_botnets_0907_en.pdf [REF_CREDS]: https://lists.torproject.org/pipermail/tor-dev/2020-March/014198.html [REF_TARGET]: https://en.bitcoin.it/wiki/Target [REF_TEVADOR_2]: https://lists.torproject.org/pipermail/tor-dev/2020-June/014358.html [REF_TEVADOR_SIM]: https://github.com/mikeperry-tor/scratchpad/blob/master/tor-pow/effort_sim.py#L57