Filename: 203-https-frontend.txt
Title: Avoiding censorship by impersonating an HTTPS server
Author: Nick Mathewson
Created: 24 Jun 2012
Status: Obsolete

Note: Obsoleted-by pluggable transports.


Overview:

   One frequently proposed approach for censorship resistance is that
   Tor bridges ought to act like another TLS-based service, and deliver
   traffic to Tor only if the client can demonstrate some shared
   knowledge with the bridge.

   In this document, I discuss some design considerations for building
   such systems, and propose a few possible architectures and designs.

Background:

   Most of our previous work on censorship resistance has focused on
   preventing passive attackers from identifying Tor bridges, or from
   doing so cheaply.  But active attackers exist, and exist in the wild:
   right now, the most sophisticated censors use their anti-Tor passive
   attacks only as a first round of filtering before launching a
   secondary active attack to confirm suspected Tor nodes.

   One idea we've been talking about for a while is that of having a
   service that looks like an HTTPS service unless a client does some
   particular secret thing to prove it is allowed to use it as a Tor
   bridge.  Such a system would still succumb to passive traffic
   analysis attacks (since the packet timings and sizes for HTTPS don't
   look that much like Tor), but it would be enough to beat many current
   censors.

Goals and requirements:

   We should make it impossible for a passive attacker who examines only
   a few packets at a time to distinguish Tor->Bridge traffic from an
   HTTPS client talking to an HTTPS server.

   We should make it impossible for an active attacker talking to the
   server to tell a Tor bridge server from a regular HTTPS server.

   We should make it impossible for an active attacker who can MITM the
   server to learn from the client whether it thought it was connecting
   to an HTTPS server or a Tor bridge.  (This implies that an MITM
   attacker shouldn't be able to learn anything that would help it
   convince the server to act like a bridge.)

   It would be nice to minimize the required code changes to Tor, and
   the required code changes to any other software.

   It would be good to avoid any requirement of close integration with
   any particular HTTP or HTTPS implementation.

   If we're replacing our own profile with that of an HTTPS service, we
   should do so in a way that lets us use the profile of a popular
   HTTPS implementation.

   Efficiency would be good: layering TLS inside TLS is best avoided if
   we can.

Discussion:

   We need an actual web server; HTTP and HTTPS are so complicated that
   there's no practical way to behave in a bug-compatible way with any
   popular webserver short of running that webserver.

   More obviously, we need a TLS implementation (or we can't implement
   HTTPS), and we need a Tor bridge (since that's the whole point of
   this exercise).

   So from a top-level point of view, the question becomes: how shall we
   wire these together?

   There are three obvious ways; I'll discuss them in turn below.

Design #1: TLS in Tor

   Under this design, Tor accepts HTTPS connections, decides which ones
   don't look like the Tor protocol, and relays them to a webserver.

                   +--------------------------------------+
     +------+  TLS |  +------------+  http +-----------+  |
     | User |<------> | Tor Bridge |<----->| Webserver |  |
     +------+      |  +------------+       +-----------+  |
                   |     trusted host/network             |
                   +--------------------------------------+

   This approach would let us use a completely unmodified webserver
   implementation, but would require the most extensive changes in Tor:
   we'd need to add yet another flavor to Tor's TLS ice cream parlor,
   and try to emulate a popular webserver's TLS behavior even more
   thoroughly.

   To authenticate, we would need to take a hybrid approach, and begin
   forwarding traffic to the webserver as soon as a webserver
   might respond to the traffic.  This could be pretty complicated,
   since it requires us to have a model of how the webserver would
   respond to any given set of bytes.  As a workaround, we might try
   relaying _all_ input to the webserver, and only replying as Tor in
   the cases where the website hasn't replied.  (This would likely
   create recognizable timing patterns, though.)

   The authentication itself could use a system akin to Tor proposals
   189/190, where an early AUTHORIZE cell shows knowledge of a shared
   secret if the client is a Tor client.

Design #2: TLS in the web server

                   +----------------------------------+
     +------+  TLS |  +------------+  tor0   +-----+  |
     | User |<------> | Webserver  |<------->| Tor |  |
     +------+      |  +------------+         +-----+  |
                   |     trusted host/network         |
                   +----------------------------------+

   In this design, we write an Apache module or something that can
   recognize an authenticator of some kind in an HTTPS header, or
   recognize a valid AUTHORIZE cell, and respond by forwarding the
   traffic to a Tor instance.

   To avoid the efficiency issue of doing an extra local
   encrypt/decrypt, we need to have the webserver talk to Tor over a
   local unencrypted connection. (I've denoted this as "tor0" in the
   diagram above.)  For implementation convenience, we might want to
   implement that as a NULL TLS connection, so that the Tor server code
   wouldn't have to change except to allow local NULL TLS connections in
   this configuration.

   For the Tor handshake to work properly here, we'll need a way for the
   Tor instance to know which public key the webserver is configured to
   use.

   We wouldn't need to support the parts of the Tor link protocol used
   to authenticate clients to servers: relays shouldn't be using this
   subsystem at all.

   The Tor client would need to connect and prove its status as a Tor
   client.  If the client uses some means other than AUTHORIZE cells, or
   if we want to do the authentication in a pluggable transport, and we
   therefore decided to offload the responsibility for TLS itself to the
   pluggable transport, that would scare me: Supporting pluggable
   transports that have the responsibility for TLS would make it fairly
   easy to mess up the crypto, and I'd rather not have it be so easy to
   write a pluggable transport that accidentally makes Tor less secure.

Design #3: Reverse proxy


                   +----------------------------------+
                   |  +-------+  http  +-----------+  |
                   |  |       |<------>| Webserver |  |
     +------+  TLS |  |       |        +-----------+  |
     | User |<------> | Proxy |                       |
     +------+      |  |       |  tor0  +-----------+  |
                   |  |       |<------>|    Tor    |  |
                   |  +-------+        +-----------+  |
                   |     trusted host/network         |
                   +----------------------------------+

   In this design, we write a server-side proxy to sit in front of Tor
   and a webserver, or repurpose some existing HTTPS proxy. Its role
   will be to do TLS, and then forward connections to Tor or the
   webserver as appropriate.  (In the web world, this kind of thing is
   called a "reverse proxy", so that's the term I'm using here.)

   To avoid fingerprinting, we should choose a proxy that's already in
   common use as a TLS front-end for webservers -- nginx, perhaps.
   Unfortunately, the more popular tools here seem to be pretty complex,
   and the simpler tools less widely deployed.  More investigation would
   be needed.

   The authorization considerations would be as in Design #2 above; for
   the reasons discussed there, it's probably a good idea to build the
   necessary authorization into Tor itself.

   I generally like this design best: it lets us isolate the "Check for
   a valid authenticator and/or a valid or invalid HTTP header, and
   react accordingly" question to a single program.

How to authenticate: The easiest way

   Designing a good MITM-resistant AUTHORIZE cell, or an equivalent
   HTTP header, is an open problem that we should solve in proposals
   190 and 191 and their successors.  I'm calling it out-of-scope here;
   please see those proposals, their attendant discussion, and their
   eventual successors.

How to authenticate: a slightly harder way

   Some proposals in this vein have in the past suggested a special
   HTTP header to distinguish Tor connections from non-Tor connections.
   This could work too, though it would require substantially larger
   changes on the Tor client's part, would still require the client
   take measures to avoid MITM attacks, and would also require the
   client to implement a particular browser's http profile.

Some considerations on distinguishability

   Against a passive eavesdropper, the easiest way to avoid
   distinguishability in server responses will be to use an actual web
   server or reverse web proxy's TLS implementation.
   (Distinguishability based on client TLS use is another topic
   entirely.)

   Against an active non-MITM attacker, the best probing attacks will be
   ones designed to provoke the system into acting in ways different from
   those in which a webserver would act: responding earlier than a web
   server would respond, or later, or differently.  We need to make sure
   that, whatever the front-end program is, it answers anything that
   would qualify as a well-formed or ill-formed HTTP request whenever
   the web server would.  This must mean, for example, that whatever the
   correct form of client authorization turns out to be, no prefix of
   that authorization is ever something that the webserver would respond
   to.  With some web servers (I believe), that's as easy as making sure
   that any valid authenticator isn't too long, and doesn't contain a CR
   or LF character.  With others, the authenticator would need to be a
   valid HTTP request, with all the attendant difficulty that would
   raise.

   Against an attacker who can MITM the bridge, the best attacks will be
   to wait for clients to connect and see how they behave.  In this
   case, the client probably needs to be able to authenticate the bridge
   certificate as presented in the initial TLS handshake -- or some
   other aspect of the TLS handshake if we're feeling insane.  If the
   certificate or handshake isn't as expected, the client should behave
   as a web browser that's just received a bad TLS certificate.  (The
   alternative there would be to try to impersonate an HTTPS client that
   has just accepted a self-signed certificate.  But that would probably
   require the Tor client to impersonate a full web browser, which isn't
   realistic.)

Side note: What to put on the webserver?

   To credibly pretend not to be ourselves, we must pretend to be
   something else in particular -- and something not easily identifiable
   or inherently worthless.  We should not, for example, have all
   deployments of this kind use a fixed website, even if that website is
   the default "Welcome to Apache" configuration: A censor would
   probably feel that they weren't breaking anything important by
   blocking all unconfigured websites with nothing on them.

   Therefore, we should probably conceive of a system like this as
   "Something to add to your HTTPS website" rather than as a standalone
   installation.

Related work:

   meek [1] is a pluggable transport that uses HTTP for carrying bytes
   and TLS for obfuscation. Traffic is relayed through a third-party
   server (Google App Engine). It uses a trick to talk to the third
   party so that it looks like it is talking to an unblocked server.

   meek itself is not really about HTTP at all. It uses HTTP only
   because it's convenient and the big Internet services we use as cover
   also use HTTP. meek uses HTTP as a transport, and TLS for
   obfuscation, but the key idea is really "domain fronting," where it
   appears to the censor you are talking to one domain (www.google.com),
   but behind the scenes you are talking to another
   (meek-reflect.appspot.com). The meek-server program is an ordinary
   HTTP (not necessarily even HTTPS!) server, whose communication is
   easily fingerprintable; but that doesn't matter because the censor
   never sees that part of the communication, only the communication
   between the client and CDN.

   One way to think about the difference: if a censor (somehow) learns
   the IP address of a bridge as described in this proposal, it's easy
   and low-cost for the censor to block that bridge by IP address. meek
   aims to make it much more expensive: even if you know a domain is
   being used (in part) for circumvention, in order to block it have to
   block something important like the Google frontend or CloudFlare
   (high collateral damage).

1. https://trac.torproject.org/projects/tor/wiki/doc/meek