Filename: 194-mnemonic-urls.txt
Title: Mnemonic .onion URLs
Author: Sai, Alex Fink
Created: 29-Feb-2012
Status: Superseded

1. Overview

  Currently, canonical Tor .onion URLs consist of a naked 80-bit hash[1]. This
  is not something that users can even recognize for validity, let alone produce
  directly. It is vulnerable to partial-match fuzzing attacks[2], where a
  would-be MITM attacker generates a very similar hash and uses various social
  engineering, wiki poisoning, or other methods to trick the user into visiting
  the spoof site.

  This proposal gives an alternative method for displaying and entering .onion
  and other URLs, such that they will be easily remembered and generated by end
  users, and easily published by hidden service websites, without any dependency
  on a full domain name type system like e.g. namecoin[3]. This makes it easier
  to implement (requiring only a change in the proxy).

  This proposal could equally be used for IPv4, IPv6, etc, if normal DNS is for
  some reason untrusted.

  This is not a petname system[4], in that it does not allow service providers
  or users[5] to associate a name of their choosing to an address[6]. Rather, it
  is a mnemonic system that encodes the 80 bit .onion address into a
  meaningful[7] and memorable sentence. A full petname system (based on
  registration of some kind, and allowing for shorter, service-chosen URLs) can
  be implemented in parallel[8].

  This system has the three properties of being secure, distributed, and
  human-meaningful — it just doesn't also have choice of name (except of course
  by brute force creation of multiple keys to see if one has an encoding the
  operator likes).

  This is inspired by Jonathan Ackerman's "Four Little Words" proposal[9] for
  doing the same thing with IPv4 addresses. We just need to handle 80+ bits, not
  just 32 bits.

  It is similar to Markus Jakobsson & Ruj Akavipat's FastWord system[10], except
  that it does not permit user choice of passphrase, does not know what URL a
  user will enter (vs verifying against a single stored password), and again has
  to encode significantly more data.

  This is also similar to RFC1751[11], RFC2289[12], and multiple other
  fingerprint encoding systems[13] (e.g.  PGPfone[14] using the PGP
  wordlist[15], and Arturo Filatsò's OnionURL[16]), but we aim to make something
  that's as easy as possible for users to remember — and significantly easier
  than just a list of words or pseudowords, which we consider only useful as an
  active confirmation tool, not as something that can be fully memorized and
  recalled, like a normal domain name.

2. Requirements

2.1. encodes at least 80 bits of random data (preferably more, eg for a

2.2. valid, visualizable English sentence — not just a series of words[17]

2.3. words are common enough that non-native speakers and bad spellers will have
minimum difficulty remembering and producing (perhaps with some spellcheck help)

2.4. not syntactically confusable (e.g. order should not matter)

2.5. short enough to be easily memorized and fully recalled at will, not just

2.6. no dependency on an external service

2.7. dictionary size small enough to be reasonable for end users to download as
part of the onion package

2.8. consistent across users (so that websites can e.g. reinforce their random
hash's phrase with a clever drawing)

2.9. not create offensive sentences that service providers will reject

2.10. resistant against semantic fuzzing (e.g. by having uniqueness against
WordNet synsets[18])

3. Possible implementations

  This section is intentionally left unfinished; full listing of template
  sentences and the details of their parser and generating implementation is
  co-dependent on the creation of word class dictionaries fulfilling the above
  criteria. Since that's fairly labor-intensive, we're pausing at this stage for
  input first, to avoid wasting work.

3.1. Have a fixed number of template sentences, such as:

  1. Adj subj adv vtrans adj obj
  2. Subj and subj vtrans adj obj
  3. … etc

  For a 6 word sentence, with 8 (3b) templates, we need ~12b (4k word)
  dictionaries for each word category.

  If multiple words of the same category are used, they must either play
  different grammatical roles (eg subj vs obj, or adj on a different item), be
  chosen from different dictionaries, or there needs to be an order-agnostic way
  to join them at the bit level. Preferably this should be avoided, just to
  prevent users forgetting the order.

3.2. As 3.1, but treat sentence generation as decoding a prefix code, and have
  a Huffman code for each word class.

  We suppose it’s okay if the generated sentence has a few more words than it
  might, as long as they’re common lean words.  E.g., for adjectives, "good"
  might cost only six bits while "unfortunate" costs twelve.

  Choice between different sentence syntaxes could be worked into the prefix
  code as well, and potentially done separately for each syntactic constituent.

4. Usage

  To form mnemonic .onion URL, just join the words with dashes or underscores,
  stripping minimal words like 'a', 'the', 'and' etc., and append '.onion'. This
  can be readily distinguished from standard hash-style .onion URLs by form.

  Translation should take place at the client — though hidden service servers
  should also be able to output the mnemonic form of hashes too, to assist
  website operators in publishing them (e.g. by posting an amusing drawing of
  the described situation on their website to reinforce the mnemonic).

  After the translation stage of name resolution, everything proceeds as normal
  for an 80-bit hash onion URL.

  The user should be notified of the mnemonic form of hash URL in some way, and
  have an easy way in the client UI to translate mnemonics to hashes and vice
  versa. For the purposes of browser URLs and the like though, the mnemonic
  should be treated on par with the hash; if the user enters a mnemonic URL they
  should not become redirected to the hash version. (If anything, the opposite
  may be true, so that users become used to seeing and verifying the mnemonic
  version of hash URLs, and gain the security benefits against partial-match

  Ideally, inputs that don't validly resolve should have a response page served
  by the proxy that uses a simple spell-check system to suggest alternate domain
  names that are valid hash encodings. This could hypothetically be done inline
  in URL input, but would require changes on the browser (normally domain names
  aren't subject so spellcheck), and this avoids that implementation problem.

5. International support

  It is not possible for this scheme to support non-English languages without
  a) (usually) Unicode in domains (which is not yet well supported by browsers),
  b) fully customized dictionaries and phrase patterns per language

  The scheme must not be used in an attempted 'translation' by simply replacing
  English words with glosses in the target language. Several of the necessary
  features would be completely mangled by this (e.g. other languages have
  different synonym, homonym, etc groupings, not to mention completely different

  It is unlikely a priori that URLs constructed using a non-English
  dictionary/pattern setup would in any sense 'translate' semantically to
  English; more likely is that each language would have completely unrelated
  encodings for a given hash.

  We intend to only make an English version at first, to avoid these issues
  during testing.


[6] However, service operators can generate a large number of hidden service
descriptors and check whether their hashes result in a desirable phrasal
encoding (much like certain hidden services currently use brute force generated
hashes to ensure their name is the prefix of their raw hash). This won't get you
whatever phrase you want, but will at least improve the likelihood that it's
something amusing and acceptable.
[7] "Meaningful" here inasmuch as e.g. "Barnaby thoughtfully mangles simplistic
yellow camels" is an absurdist but meaningful sentence. Absurdness is a feature,
not a bug; it decreases the probability of mistakes if the scenario described is
not one that the user would try to fit into a template of things they have
previously encountered IRL. See research into linguistic schema for further