Skip to content


Improving the security of your SSH private key files

Ever wondered how those key files in ~/.ssh actually work? How secure are they actually?

As you probably do too, I use ssh many times every single day — every git fetch and git push, every deploy, every login to a server. And recently I realised that to me, ssh was just some crypto voodoo that I had become accustomed to using, but I didn’t really understand. That’s a shame — I like to know how stuff works. So I went on a little journey of discovery, and here are some of the things I found.

When you start reading about “crypto stuff”, you very quickly get buried in an avalanche of acronyms. I will briefly mention the acronyms as we go along; they don’t help you understand the concepts, but they are useful in case you want to Google for further details.

Quick recap: If you’ve ever used public key authentication, you probably have a file ~/.ssh/id_rsa or ~/.ssh/id_dsa in your home directory. This is your RSA/DSA private key, and ~/.ssh/id_rsa.pub or ~/.ssh/id_dsa.pub is its public key counterpart. Any machine you want to log in to needs to have your public key in ~/.ssh/authorized_keys on that machine. When you try to log in, your SSH client uses a digital signature to prove that you have the private key; the server checks that the signature is valid, and that the public key is authorized for your username; if all is well, you are granted access.

So what is actually inside this private key file?

The unencrypted private key format

Everyone recommends that you protect your private key with a passphrase (otherwise anybody who steals the file from you can log into everything you have access to). If you leave the passphrase blank, the key is not encrypted. Let’s look at this unencrypted format first, and consider passphrase protection later.

A ssh private key file typically looks something like this:

-----BEGIN RSA PRIVATE KEY-----
MIIEogIBAAKCAQEArCQG213utzqE5YVjTVF5exGRCkE9OuM7LCp/FOuPdoHrFUXk
y2MQcwf29J3A4i8zxpES9RdSEU6iIEsow98wIi0x1/Lnfx6jG5Y0/iQsG1NRlNCC
aydGvGaC+PwwWiwYRc7PtBgV4KOAVXMZdMB5nFRaekQ1ksdH/360KCGgljPtzTNl
09e97QBwHFIZ3ea5Eih/HireTrRSnvF+ywmwuxX4ubDr0ZeSceuF2S5WLXH2+TV0
   ... etc ... lots of base64 blah blah ...
-----END RSA PRIVATE KEY-----

The private key is an ASN.1 data structure, serialized to a byte string using DER, and then Base64-encoded. ASN.1 is roughly comparable to JSON (it supports various data types such as integers, booleans, strings and lists/sequences that can be nested in a tree structure). It’s very widely used for cryptographic purposes, but it has somehow fallen out of fashion with the web generation (I don’t know why, it seems like a pretty decent format).

To look inside, let’s generate a fake RSA key without passphrase using ssh-keygen, and then decode it using asn1parse:

$ ssh-keygen -t rsa -N '' -f test_rsa_key
$ openssl asn1parse -in test_rsa_key
    0:d=0  hl=4 l=1189 cons: SEQUENCE
    4:d=1  hl=2 l=   1 prim: INTEGER           :00
    7:d=1  hl=4 l= 257 prim: INTEGER           :C36EB2429D429C7768AD9D879F98C...
  268:d=1  hl=2 l=   3 prim: INTEGER           :010001
  273:d=1  hl=4 l= 257 prim: INTEGER           :A27759F60AEA1F4D1D56878901E27...
  534:d=1  hl=3 l= 129 prim: INTEGER           :F9D23EF31A387694F03AD0D050265...
  666:d=1  hl=3 l= 129 prim: INTEGER           :C84415C26A468934F1037F99B6D14...
  798:d=1  hl=3 l= 129 prim: INTEGER           :D0ACED4635B5CA5FB896F88BB9177...
  930:d=1  hl=3 l= 128 prim: INTEGER           :511810DF9AFD590E11126397310A6...
 1061:d=1  hl=3 l= 129 prim: INTEGER           :E3A296AE14E7CAF32F7E493FDF474...

Alternatively, you can paste the Base64 string into Lapo Luchini’s excellent JavaScript ASN.1 decoder. You can see that ASN.1 structure is quite simple: a sequence of nine integers. Their meaning is defined in RFC2313. The first integer is a version number (0), and the third number is quite small (65537) – the public exponent e. The two important numbers are the 2048-bit integers that appear second and fourth in the sequence: the RSA modulus n, and the private exponent d. These numbers are used directly in the RSA algorithm. The remaining five numbers can be derived from n and d, and are only cached in the key file to speed up certain operations.

DSA keys are similar, a sequence of six integers:

$ ssh-keygen -t dsa -N '' -f test_dsa_key
$ openssl asn1parse -in test_dsa_key
    0:d=0  hl=4 l= 444 cons: SEQUENCE
    4:d=1  hl=2 l=   1 prim: INTEGER           :00
    7:d=1  hl=3 l= 129 prim: INTEGER           :E497DFBFB5610906D18BCFB4C3CCD...
  139:d=1  hl=2 l=  21 prim: INTEGER           :CF2478A96A941FB440C38A86F22CF...
  162:d=1  hl=3 l= 129 prim: INTEGER           :83218C0CA49BA8F11BE40EE1A7C72...
  294:d=1  hl=3 l= 128 prim: INTEGER           :16953EA4012988E914B466B9C37CB...
  425:d=1  hl=2 l=  21 prim: INTEGER           :89A356E922688EDEB1D388258C825...

Passphrase-protected keys

Next, in order to make life harder for an attacker who manages to steal your private key file, you protect it with a passphrase. How does this actually work?

$ ssh-keygen -t rsa -N 'super secret passphrase' -f test_rsa_key
$ cat test_rsa_key
-----BEGIN RSA PRIVATE KEY-----
Proc-Type: 4,ENCRYPTED
DEK-Info: AES-128-CBC,D54228DB5838E32589695E83A22595C7

3+Mz0A4wqbMuyzrvBIHx1HNc2ZUZU2cPPRagDc3M+rv+XnGJ6PpThbOeMawz4Cbu
lQX/Ahbx+UadJZOFrTx8aEWyZoI0ltBh9O5+ODov+vc25Hia3jtayE51McVWwSXg
wYeg2L6U7iZBk78yg+sIKFVijxiWnpA7W2dj2B9QV0X3ILQPxbU/cRAVTd7AVrKT
    ... etc ...
-----END RSA PRIVATE KEY-----

We’ve gained two header lines, and if you try to parse that Base64 text, you’ll find it’s no longer valid ASN.1. That’s because the entire ASN.1 structure we saw above has been encrypted, and the Base64-encoded text is the output of the encryption. The header tells us the encryption algorithm that was used: AES-128 in CBC mode. The 128-bit hex string in the DEK-Info header is the initialization vector (IV) for the cipher. This is pretty standard stuff; all common crypto libraries can handle it.

But how do you get from the passphrase to the AES encryption key? I couldn’t find it documented anywhere, so I had to dig through the OpenSSL source to find it:

  1. Append the first 8 bytes of the IV to the passphrase, without a separator (serves as a salt).
  2. Take the MD5 hash of the resulting string (once).

That’s it. To prove it, let’s decrypt the private key manually (using the IV/salt from the DEK-Info header above):

$ tail -n +4 test_rsa_key | grep -v 'END ' | base64 -D |    # get just the binary blob
  openssl aes-128-cbc -d -iv D54228DB5838E32589695E83A22595C7 -K $(
    ruby -rdigest/md5 -e 'puts Digest::MD5.hexdigest(["super secret passphrase",0xD5,0x42,0x28,0xDB,0x58,0x38,0xE3,0x25].pack("a*cccccccc"))'
  ) |
  openssl asn1parse -inform DER

…which prints out the sequence of integers from the RSA key in the clear. Of course, if you want to inspect the key, it’s much easier to do this:

$ openssl rsa -text -in test_rsa_key -passin 'pass:super secret passphrase'

but I wanted to demonstrate exactly how the AES key is derived from the password. This is important because the private key protection has two weaknesses:

  • The digest algorithm is hard-coded to be MD5, which means that without changing the format, it’s not possible to upgrade to another hash function (e.g. SHA-1). This could be a problem if MD5 turns out not to be good enough.
  • The hash function is only applied once — there is no stretching. This is a problem because MD5 and AES are both fast to compute, and thus a short passphrase is quite easy to break with brute force.

If your private SSH key ever gets into the wrong hands, e.g. because someone steals your laptop or your backup hard drive, the attacker can try a huge number of possible passphrases, even with moderate computing resources. If your passphrase is a dictionary word, it can probably be broken in a matter of seconds.

That was the bad news: the passphrase on your SSH key isn’t as useful as you thought it was. But there is good news: you can upgrade to a more secure private key format, and everything continues to work!

Better key protection with PKCS#8

What we want is to derive a symmetric encryption key from the passphrase, and we want this derivation to be slow to compute, so that an attacker needs to buy more computing time if they want to brute-force the passphrase. If you’ve seen the use bcrypt meme, this should sound very familiar.

For SSH private keys, there are a few standards with clumsy names (acronym alert!) that can help us out:

  • PKCS #5 (RFC 2898) defines PBKDF2 (Password-Based Key Derivation Function 2), an algorithm for deriving an encryption key from a password by applying a hash function repeatedly. PBES2 (Password-Based Encryption Scheme 2) is also defined here; it simply means using a PBKDF2-generated key with a symmetric cipher.
  • PKCS #8 (RFC 5208) defines a format for storing encrypted private keys that supports PBKDF2. OpenSSL transparently supports private keys in PKCS#8 format, and OpenSSH uses OpenSSL, so if you’re using OpenSSH that means you can swap your traditional SSH key files for PKCS#8 files and everything continues to work as normal!

I don’t know why ssh-keygen still generates keys in SSH’s traditional format, even though a better format has been available for years. Compatibility with servers is not a concern, because the private key never leaves your machine. Fortunately it’s easy enough to convert to PKCS#8:

$ mv test_rsa_key test_rsa_key.old
$ openssl pkcs8 -topk8 -v2 des3 \
    -in test_rsa_key.old -passin 'pass:super secret passphrase' \
    -out test_rsa_key -passout 'pass:super secret passphrase'

If you try using this new PKCS#8 file with a SSH client, you should find that it works exactly the same as the file generated by ssh-keygen. But what’s inside it?

$ cat test_rsa_key
-----BEGIN ENCRYPTED PRIVATE KEY-----
MIIFDjBABgkqhkiG9w0BBQ0wMzAbBgkqhkiG9w0BBQwwDgQIOu/S2/v547MCAggA
MBQGCCqGSIb3DQMHBAh4q+o4ELaHnwSCBMjA+ho9K816gN1h9MAof4stq0akPoO0
CNvXdtqLudIxBq0dNxX0AxvEW6exWxz45bUdLOjQ5miO6Bko0lFoNUrOeOo/Gq4H
dMyI7Ot1vL9UvZRqLNj51cj/7B/bmfa4msfJXeuFs8jMtDz9J19k6uuCLUGlJscP
    ... etc ...
-----END ENCRYPTED PRIVATE KEY-----

Notice that the header/footer lines have changed (BEGIN ENCRYPTED PRIVATE KEY instead of BEGIN RSA PRIVATE KEY), and the plaintext Proc-Type and DEK-Info headers have gone. In fact, the whole key file is once again a ASN.1 structure:

$ openssl asn1parse -in test_rsa_key
    0:d=0  hl=4 l=1294 cons: SEQUENCE
    4:d=1  hl=2 l=  64 cons: SEQUENCE
    6:d=2  hl=2 l=   9 prim: OBJECT            :PBES2
   17:d=2  hl=2 l=  51 cons: SEQUENCE
   19:d=3  hl=2 l=  27 cons: SEQUENCE
   21:d=4  hl=2 l=   9 prim: OBJECT            :PBKDF2
   32:d=4  hl=2 l=  14 cons: SEQUENCE
   34:d=5  hl=2 l=   8 prim: OCTET STRING      [HEX DUMP]:3AEFD2DBFBF9E3B3
   44:d=5  hl=2 l=   2 prim: INTEGER           :0800
   48:d=3  hl=2 l=  20 cons: SEQUENCE
   50:d=4  hl=2 l=   8 prim: OBJECT            :des-ede3-cbc
   60:d=4  hl=2 l=   8 prim: OCTET STRING      [HEX DUMP]:78ABEA3810B6879F
   70:d=1  hl=4 l=1224 prim: OCTET STRING      [HEX DUMP]:C0FA1A3D2BCD7A80DD61F4C0287F8B2D...

Use Lapo Luchini’s JavaScript ASN.1 decoder to display a nice ASN.1 tree structure:

Sequence (2 elements)
|- Sequence (2 elements)
|  |- Object identifier: 1.2.840.113549.1.5.13            // using PBES2 from PKCS#5
|  `- Sequence (2 elements)
|     |- Sequence (2 elements)
|     |  |- Object identifier: 1.2.840.113549.1.5.12      // using PBKDF2 -- yay! :)
|     |  `- Sequence (2 elements)
|     |     |- Byte string (8 bytes): 3AEFD2DBFBF9E3B3    // salt
|     |     `- Integer: 2048                              // iteration count
|     `- Sequence (2 elements)
|          Object identifier: 1.2.840.113549.3.7          // encrypted with Triple DES, CBC
|          Byte string (8 bytes): 78ABEA3810B6879F        // initialization vector
`- Byte string (1224 bytes): C0FA1A3D2BCD7A80DD61F4C0287F8B2DAB46A43E...  // encrypted key blob

The format uses OIDs, numeric codes allocated by a registration authority to unambiguously refer to algorithms. The OIDs in this key file tell us that the encryption scheme is pkcs5PBES2, that the key derivation function is PBKDF2, and that the encryption is performed using des-ede3-cbc. The hash function can be explicitly specified if needed; here it’s omitted, which means that it defaults to hMAC-SHA1.

The nice thing about having all those identifiers in the file is that if better algorithms are invented in future, we can upgrade the key file without having to change the container file format.

You can also see that the key derivation function uses an iteration count of 2,048. Compared to just one iteration in the traditional SSH key format, that’s good — it means that it’s much slower to brute-force the passphrase. The number 2,048 is currently hard-coded in OpenSSL; I hope that it will be configurable in future, as you could probably increase it without any noticeable slowdown on a modern computer.

Conclusion: better protection for your SSH private keys

If you already have a strong passphrase on your SSH private key, then converting it from the traditional private key format to PKCS#8 is roughly comparable to adding two extra keystrokes to your passphrase, for free. And if you have a weak passphrase, you can take your private key protection from “easily breakable” to “slightly harder to break”.

It’s so easy, you can do it right now:

$ mv ~/.ssh/id_rsa ~/.ssh/id_rsa.old
$ openssl pkcs8 -topk8 -v2 des3 -in ~/.ssh/id_rsa.old -out ~/.ssh/id_rsa
$ chmod 600 ~/.ssh/id_rsa
# Check that the converted key works; if yes, delete the old one:
$ rm ~/.ssh/id_rsa.old

The openssl pkcs8 command asks for a passphrase three times: once to unlock your existing private key, and twice for the passphrase for the new key. It doesn’t matter whether you use a new passphrase for the converted key or keep it the same as the old key.

Not all software can read the PKCS8 format, but that’s fine — only your SSH client needs to be able to read the private key, after all. From the server’s point of view, storing the private key in a different format changes nothing at all.

Schema evolution in Avro, Protocol Buffers and Thrift

So you have some data that you want to store in a file or send over the network. You may find yourself going through several phases of evolution:

  1. Using your programming language’s built-in serialization, such as Java serialization, Ruby’s marshal, or Python’s pickle. Or maybe you even invent your own format.
  2. Then you realise that being locked into one programming language sucks, so you move to using a widely supported, language-agnostic format like JSON (or XML if you like to party like it’s 1999).
  3. Then you decide that JSON is too verbose and too slow to parse, you’re annoyed that it doesn’t differentiate integers from floating point, and think that you’d quite like binary strings as well as Unicode strings. So you invent some sort of binary format that’s kinda like JSON, but binary (1, 2, 3, 4, 5, 6).
  4. Then you find that people are stuffing all sorts of random fields into their objects, using inconsistent types, and you’d quite like a schema and some documentation, thank you very much. Perhaps you’re also using a statically typed programming language and want to generate model classes from a schema. Also you realize that your binary JSON-lookalike actually isn’t all that compact, because you’re still storing field names over and over again; hey, if you had a schema, you could avoid storing objects’ field names, and you could save some more bytes!

Once you get to the fourth stage, your options are typically Thrift, Protocol Buffers or Avro. All three provide efficient, cross-language serialization of data using a schema, and code generation for the Java folks.

Plenty of comparisons have been written about them already (1, 2, 3, 4). However, many posts overlook a detail that seems mundane at first, but is actually cruicial: What happens if the schema changes?

In real life, data is always in flux. The moment you think you have finalised a schema, someone will come up with a use case that wasn’t anticipated, and wants to “just quickly add a field”. Fortunately Thrift, Protobuf and Avro all support schema evolution: you can change the schema, you can have producers and consumers with different versions of the schema at the same time, and it all continues to work. That is an extremely valuable feature when you’re dealing with a big production system, because it allows you to update different components of the system independently, at different times, without worrying about compatibility.

Which brings us to the topic of today’s post. I would like to explore how Protocol Buffers, Avro and Thrift actually encode data into bytes — and this will also help explain how each of them deals with schema changes. The design choices made by each of the frameworks are interesting, and by comparing them I think you can become a better engineer (by a little bit).

The example I will use is a little object describing a person. In JSON I would write it like this:

{
    "userName": "Martin",
    "favouriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}

This JSON encoding can be our baseline. If I remove all the whitespace it consumes 82 bytes.

Protocol Buffers

The Protocol Buffers schema for the person object might look something like this:

message Person {
    required string user_name        = 1;
    optional int64  favourite_number = 2;
    repeated string interests        = 3;
}

When we encode the data above using this schema, it uses 33 bytes, as follows:

Look exactly at how the binary representation is structured, byte by byte. The person record is just the concatentation of its fields. Each field starts with a byte that indicates its tag number (the numbers 1, 2, 3 in the schema above), and the type of the field. If the first byte of a field indicates that the field is a string, it is followed by the number of bytes in the string, and then the UTF-8 encoding of the string. If the first byte indicates that the field is an integer, a variable-length encoding of the number follows. There is no array type, but a tag number can appear multiple times to represent a multi-valued field.

This encoding has consequences for schema evolution:

  • There is no difference in the encoding between optional, required and repeated fields (except for the number of times the tag number can appear). This means that you can change a field from optional to repeated and vice versa (if the parser is expecting an optional field but sees the same tag number multiple times in one record, it discards all but the last value). required has an additional validation check, so if you change it, you risk runtime errors (if the sender of a message thinks that it’s optional, but the recipient thinks that it’s required).
  • An optional field without a value, or a repeated field with zero values, does not appear in the encoded data at all — the field with that tag number is simply absent. Thus, it is safe to remove that kind of field from the schema. However, you must never reuse the tag number for another field in future, because you may still have data stored that uses that tag for the field you deleted.
  • You can add a field to your record, as long as it is given a new tag number. If the Protobuf parser parser sees a tag number that is not defined in its version of the schema, it has no way of knowing what that field is called. But it does roughly know what type it is, because a 3-bit type code is included in the first byte of the field. This means that even though the parser can’t exactly interpret the field, it can figure out how many bytes it needs to skip in order to find the next field in the record.
  • You can rename fields, because field names don’t exist in the binary serialization, but you can never change a tag number.

This approach of using a tag number to represent each field is simple and effective. But as we’ll see in a minute, it’s not the only way of doing things.

Avro

Avro schemas can be written in two ways, either in a JSON format:

{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",        "type": "string"},
        {"name": "favouriteNumber", "type": ["null", "long"]},
        {"name": "interests",       "type": {"type": "array", "items": "string"}}
    ]
}

…or in an IDL:

record Person {
    string               userName;
    union { null, long } favouriteNumber;
    array<string>        interests;
}

Notice that there are no tag numbers in the schema! So how does it work?

Here is the same example data encoded in just 32 bytes:

Strings are just a length prefix followed by UTF-8 bytes, but there’s nothing in the bytestream that tells you that it is a string. It could just as well be a variable-length integer, or something else entirely. The only way you can parse this binary data is by reading it alongside the schema, and the schema tells you what type to expect next. You need to have the exact same version of the schema as the writer of the data used. If you have the wrong schema, the parser will not be able to make head or tail of the binary data.

So how does Avro support schema evolution? Well, although you need to know the exact schema with which the data was written (the writer’s schema), that doesn’t have to be the same as the schema the consumer is expecting (the reader’s schema). You can actually give two different schemas to the Avro parser, and it uses resolution rules to translate data from the writer schema into the reader schema.

This has some interesting consequences for schema evolution:

  • The Avro encoding doesn’t have an indicator to say which field is next; it just encodes one field after another, in the order they appear in the schema. Since there is no way for the parser to know that a field has been skipped, there is no such thing as an optional field in Avro. Instead, if you want to be able to leave out a value, you can use a union type, like union { null, long } above. This is encoded as a byte to tell the parser which of the possible union types to use, followed by the value itself. By making a union with the null type (which is simply encoded as zero bytes) you can make a field optional.
  • Union types are powerful, but you must take care when changing them. If you want to add a type to a union, you first need to update all readers with the new schema, so that they know what to expect. Only once all readers are updated, the writers may start putting this new type in the records they generate.
  • You can reorder fields in a record however you like. Although the fields are encoded in the order they are declared, the parser matches fields in the reader and writer schema by name, which is why no tag numbers are needed in Avro.
  • Because fields are matched by name, changing the name of a field is tricky. You need to first update all readers of the data to use the new field name, while keeping the old name as an alias (since the name matching uses aliases from the reader’s schema). Then you can update the writer’s schema to use the new field name.
  • You can add a field to a record, provided that you also give it a default value (e.g. null if the field’s type is a union with null). The default is necessary so that when a reader using the new schema parses a record written with the old schema (and hence lacking the field), it can fill in the default instead.
  • Conversely, you can remove a field from a record, provided that it previously had a default value. (This is a good reason to give all your fields default values if possible.) This is so that when a reader using the old schema parses a record written with the new schema, it can fall back to the default.

This leaves us with the problem of knowing the exact schema with which a given record was written. The best solution depends on the context in which your data is being used:

  • In Hadoop you typically have large files containing millions of records, all encoded with the same schema. Object container files handle this case: they just include the schema once at the beginning of the file, and the rest of the file can be decoded with that schema.
  • In an RPC context, it’s probably too much overhead to send the schema with every request and response. But if your RPC framework uses long-lived connections, it can negotiate the schema once at the start of the connection, and amortize that overhead over many requests.
  • If you’re storing records in a database one-by-one, you may end up with different schema versions written at different times, and so you have to annotate each record with its schema version. If storing the schema itself is too much overhead, you can use a hash of the schema, or a sequential schema version number. You then need a schema registry where you can look up the exact schema definition for a given version number.

One way of looking at it: in Protocol Buffers, every field in a record is tagged, whereas in Avro, the entire record, file or network connection is tagged with a schema version.

At first glance it may seem that Avro’s approach suffers from greater complexity, because you need to go to the additional effort of distributing schemas. However, I am beginning to think that Avro’s approach also has some distinct advantages:

  • Object container files are wonderfully self-describing: the writer schema embedded in the file contains all the field names and types, and even documentation strings (if the author of the schema bothered to write some). This means you can load these files directly into interactive tools like Pig, and it Just Works™ without any configuration.
  • As Avro schemas are JSON, you can add your own metadata to them, e.g. describing application-level semantics for a field. And as you distribute schemas, that metadata automatically gets distributed too.
  • A schema registry is probably a good thing in any case, serving as documentation and helping you to find and reuse data. And because you simply can’t parse Avro data without the schema, the schema registry is guaranteed to be up-to-date. Of course you can set up a protobuf schema registry too, but since it’s not required for operation, it’ll end up being on a best-effort basis.

Thrift

Thrift is a much bigger project than Avro or Protocol Buffers, as it’s not just a data serialization library, but also an entire RPC framework. It also has a somewhat different culture: whereas Avro and Protobuf standardize a single binary encoding, Thrift embraces a whole variety of different serialization formats (which it calls “protocols”).

Indeed, Thrift has two different JSON encodings, and no fewer than three different binary encodings. (However, one of the binary encodings, DenseProtocol, is only supported in the C++ implementation; since we’re interested in cross-language serialization, I will focus on the other two.)

All the encodings share the same schema definition, in Thrift IDL:

struct Person {
  1: string       userName,
  2: optional i64 favouriteNumber,
  3: list<string> interests
}

The BinaryProtocol encoding is very straightforward, but also fairly wasteful (it takes 59 bytes to encode our example record):

The CompactProtocol encoding is semantically equivalent, but uses variable-length integers and bit packing to reduce the size to 34 bytes:

As you can see, Thrift’s approach to schema evolution is the same as Protobuf’s: each field is manually assigned a tag in the IDL, and the tags and field types are stored in the binary encoding, which enables the parser to skip unknown fields. Thrift defines an explicit list type rather than Protobuf’s repeated field approach, but otherwise the two are very similar.

In terms of philosophy, the libraries are very different though. Thrift favours the “one-stop shop” style that gives you an entire integrated RPC framework and many choices (with varying cross-language support), whereas Protocol Buffers and Avro appear to follow much more of a “do one thing and do it well” style.

This post has been translated into Korean by Justin Song.

The complexity of user experience

The problem of overly complex software is nothing new; it is almost as old as software itself. Over and over again, software systems become so complex that they become very difficult to maintain and very time-consuming and expensive to modify. Most developers hate working on such systems, yet nevertheless we keep creating new, overly complex systems all the time.

Much has been written about this, including classic papers by Fred Brooks (No Silver Bullet), and Ben Moseley and Peter Marks (Out of the Tar Pit). They are much more worth reading than this post, and it is presumptuous of me to think I could add anything significant to this debate. But I will try nevertheless.

Pretty much everyone agrees that if you have a choice between a simpler software design and a more complex design, all else being equal, that simpler is better. It is also widely thought to be worthwhile to deliberately invest in simplicity — for example, to spend effort refactoring existing code into a cleaner design — because the one-off cost of refactoring today is easily offset by the benefits of easier maintenance tomorrow. Also, much thought by many smart people has gone into finding ways of breaking down complex systems into manageable parts with manageable dependencies. I don’t wish to dispute any of that.

But there is a subtlety that I have been missing in discussions about software complexity, that I feel somewhat ambivalent about, and that I think is worth discussing. It concerns the points where external humans (people outside of the team maintaining the system) touch the system — as developers using an API exposed by the system, or as end users interacting with a user interface. I will concentrate mostly on user interfaces, but much of this discussion applies to APIs too.

Examples

Let me first give a few examples, and then try to extract a pattern from them. They are examples of situations where, if you want, you can go to substantial engineering effort in order to make a user interface a little bit nicer. (Each example based on a true story!)

  • You have an e-commerce site, and need to send out order confirmation emails that explain next steps to the customer. Those next steps differ depending on availability, the tax status of the product, the location of the customer, the type of account they have, and a myriad other parameters. You want the emails to only include the information that is applicable to this particular customer’s situation, and not burden them with edge cases that don’t apply to them. You also want the emails to read as coherent prose, not as a bunch of fragmented bullet points generated by if statements based on the order parameters. So you go and build a natural language grammar model for constructing emails based on sentence snippets (providing pluralisation, agreement, declension in languages that have it, etc), in such a way that for any one out of 100 million possible parameter combinations, the resulting email is grammatically correct and easy to understand.
  • You have a multi-step user flow that is used in various different contexts, but ultimatively achieves the same thing in each context. (For example, Rapportive has several OAuth flows for connecting your account with various social networks, and there are several different buttons in different places that all lead into the same user flow.) The simple solution is to make the flow generic, and not care how the user got there. But if you want to make the user feel good, you need to imagine what state their mind was in when they entered the flow, and customise the images, text and structure of the flow in order to match their goal. This means you have to keep track of where the user came from, what they were trying to do, and thread that context through every step of the flow. This is not fundamentally hard, but it is fiddly, time-consuming and error-prone.
  • You have an application that requires some arcane configuration. You could take the stance that you will give the user a help page and they will have to figure it out from there. Or you could write a sophisticated auto-configuration tool that inspects the user’s environment, analyses thousands of possible software combinations and configurations (and updates this database as new versions of other products in the environment are released), and automatically chooses the correct settings — hopefully without having to ask the user for help. With auto-configuration, the users never even know that they were spared a confusing configuration dialog. But somehow, word gets around that the product “just works”.

What’s a user requirement?

We said above that simplicity is good. However, taking simplicity to an exaggerated extreme, you end up with software that does nothing. This implies that there are aspects of software complexity that are essential to the user’s problem that is being solved. (Note that I don’t mean complexity of the user interface, but complexity of the actual code that implements the solution to the user’s problem.)

Unfortunately, there is a lot of additional complexity introduced by stuff that is not directly visible or useful to users: stuff that is only required to “grease the wheels”, for example to make legacy components work or to improve performance. Moseley and Marks call this latter type accidental complexity, and argue that it should be removed or abstracted away as much as possible. (Other authors define essential and accidental complexity slightly differently, but the exact definition is not important for the purpose of this post.)

This suggests that it is important to understand what user problem is being solved, and that’s where things start getting tricky. When you say that something is essential because it fulfils a user requirement (as opposed to an implementation constraint or a performance optimisation), that presupposes a very utilitarian view of software. It assumes that the user is trying to get a job done, and that they are a rational actor. But what if, say, you are taking an emotional approach and optimising for user delight?

What if the user didn’t know they had a problem, but you solve it anyway? If you introduce complexity in the system for the sake of making things a little nicer for the user (but without providing new core functionality), is that complexity really essential? What if you add a little detail that is surprising but delightful?

You can try to reduce an emotional decision down to a rational one — for example, you can say that when a user plays a game, it is solving the user’s problem of boredom by providing distraction. Thus any feature which substantially contributes towards alleviating boredom may be considered essential. Such reductionism can sometimes provide useful angles of insight, but I think a lot would be lost by ignoring the emotional angle.

You can state categorically that “great user experience is an essential feature”. But what does that mean? By itself, that statement is so general that could be used to argue for anything or nothing. User experience is subjective. What’s preferable for one user may be an annoyance for another user, even if both users are in the application’s target segment. Sometimes it just comes down to taste or fashion. User experience tends to have an emotional angle that makes it hard to fit into a rational reasoning framework.

What I am trying to get at: there are things in software that introduce a lot of complexity (and that we should consequently be wary of), and that can’t be directly mapped to a bullet point on a list of user requirements, but that are nevertheless important and valuable. These things do not necessarily provide important functionality, but they contribute to how the user feels about the application. Their effect may be invisible or subconscious, but that doesn’t make them any less essential.

Data-driven vs. emotional design

Returning to the examples above: as an application developer, you can choose whether to take on substantial additional complexity in the software in order to simplify or improve the experience for the user. The increased software complexity actually reduces the complexity from the user’s point of view. These examples also illustrate how user experience concerns are not just a matter of graphic design, but can also have a big impact on how things are engineered.

The features described above arguably do not contribute to the utility of the software — in the e-commerce example, orders will be fulfilled whether or not the confirmation emails are grammatical. In that sense, the complexity is unnecessary. But I would argue that these kind of user experience improvements are just as important as the utility of the product, because they determine how users feel about it. And how they feel ultimately determines whether they come back, and thus the success or failure of the product.

One could even argue that the utility of a product is a subset of its user experience: if the software doesn’t do the job that it’s supposed to, then that’s one way of creating a pretty bad experience; however, there are also many other ways of creating a bad experience, while remaining fully functional from a utilitarian point of view.

The emotional side of user experience can be a difficult thing for organisations to grapple with, because it doesn’t easily map to metrics. You can measure things like how long a user stayed on your site, how many things they clicked on, conversion rates, funnels, repeat purchase rates, lifetime values… but those numbers tell you very little about how happy you made a user. So you can take a “data-driven” approach to design decisions and say that a feature is worthwhile if and only if it makes the metrics go up — but I fear that an important side of the story is missed if you go solely by the numbers.

Questions

This is as far as my thinking has got: believing that a great user experience is essential for many products; and recognising that building a great UX is hard, can require substantial additional complexity in engineering, and can be hard to justify in terms of logical arguments and metrics. Which leaves me with some unanswered questions:

  • Every budget is finite, so you have to prioritise things, and not everything will get done. When you consider building something that improves user experience without strictly adding utility, it has to be traded off against features that do add utility (is it better to shave a day off the delivery time than to have a nice confirmation email?), and the cost of the increased complexity (will that clever email generator be a nightmare to localise when we translate the site into other languages?). How do you decide about that kind of trade-offs?
  • User experience choices are often emotional and intuitive (no number of focus groups and usability tests can replace good taste). That doesn’t make them any more or less important than rational arguments, but combining emotional and rational arguments can be tricky. Emotionally-driven people tend to let emotional choices overrule rational arguments, and rationally-driven people vice versa. How do you find the healthy middle ground?
  • If you’re aiming for a minimum viable product in order to test out a market (as opposed to improving a mature product), does that change how you prioritise core utility relative to “icing on the cake”?

I suspect that the answers to the questions above are “it depends”. More precisely, “how one thing is valued relative to another is an aspect of your particular organisation’s culture, and there’s no one right answer”. That would imply that each of us should think about it; you should have your own personal answers for how you decide these things in your own projects, and be able to articulate them. But it’s difficult — I don’t think hard-and-fast rules have a chance of working here.

I’d love to hear your thoughts in the comments below. If you liked this post, you can subscribe to email notifications when I write something new :)

Rethinking caching in web apps

Having spent a lot of the last few years worrying about the scalability of data-heavy applications like Rapportive, I have started to get the feeling that maybe we have all been “doing it wrong”. Maybe what we consider to be “state of the art” application architecture is actually holding us back.

I don’t have a definitive answer for how we should be architecting things differently, but in this post I’d like to outline a few ideas that I have been fascinated by recently. My hope is that we can develop ways of better managing scale (in terms of complexity, volume of data and volume of traffic) while keeping our applications nimble, easy and safe to modify, test and iterate.

My biggest problem with web application architecture is how network communication concerns are often intermingled with business logic concerns. This makes it hard to rearrange the logic into new architectures, such as the precomputed cache architecture described below. In this post I explore why it important to be able to try new architectures for things like caching, and what it would take to achieve that flexibility.

An example

To illustrate, consider the clichéd Rails blogging engine example:

class Post < ActiveRecord::Base
  attr_accessible :title, :content, :author
  has_many :comments
end

class Comment < ActiveRecord::Base
  attr_accessible :content, :author
  belongs_to :post
end

class PostsController < ApplicationController
  def show
    @post = Post.find(params[:id])
    respond_to do |format|
      format.html  # show.html.erb
      format.json  { render :json => @post }
    end
  end
end

# posts/show.html.erb:

<h1><%= @post.title %></h1>
<p class="author">By <%= @post.author %></p>
<div class="content">
  <%= simple_format(@post.content) %>
</div>
<h2>Comments</h2>
<ul class="comments">
  <% @post.comments.each do |comment| %>
    <li>
      <blockquote><%= simple_format(comment.content) %></blockquote>
      <p class="author"><%= comment.author %></p>
    </li>
  <% end %>
</ul>

Pretty good code by various standards, but it has always irked me a bit that I can’t see where the network communication (i.e. making database queries) is happening. When I look at that Post.find in the controller, I can guess that probabably translates into a SELECT * FROM posts WHERE id = ? internally – unless the same query was already made recently, and ActiveRecord cached the result. And another database query of the form SELECT * FROM comments WHERE post_id = ? might be made as a result of the @post.comments call in the template. Or maybe the comments were already previously loaded by some model logic, and then cached? Or someone decided to eagerly load comments with the original post? Who knows.

The execution flow for a MVC framework request like PostsController#show probably looks something like this:

Typical MVC request flow

Of course it is deliberately designed that way. Your template and your controller shouldn’t have to worry about database queries — those are encapsulated by the model for many good reasons. I am violating abstraction by even thinking about the database whilst I’m in the template code! I should just think of my models as pure, beautiful pieces of application state. How that state gets loaded from a database is a matter that only the models need to worry about.

Adding complexity

In the example above, the amount of logic in the model is minimal, but it typically doesn’t stay that way for long. As the application becomes popular (say, the blogging engine morphs to become Twitter, Tumblr, Reddit or Pinterest), all sorts of stuff gets added: memcache to stop the database from falling over, spam filtering, analytics features, email sending, notifications, A/B testing, more memcache, premium features, ads, upsells for viral loops, more analytics, even more memcache. As the application inevitably grows in complexity, the big monolithic beast is split into several smaller services, and different services end up being maintained by different teams.

As all of this is happening, the programming model typically stays the same: each service in the architecture (which may be a user-facing web server, or an internal service e.g. for user authentication) communicates over the network with a bunch of other nodes (memcached instances, database servers, other application services), processes and combines the data in some way, and then serves it out to a client.

That processing and combining of data we can abstractly call “business logic”. It might be trivially simple, or it might involve half a million lines of parsing, rendering or machine learning code. It might behave differently depending on which A/B test bucket the user is in. It might deal with hundreds of hairy edge cases. Whatever.

At the root of the matter, business logic should be a pure function. It takes a bunch of inputs (request parameters from the client, data stored in various databases and caches, responses from various other services) and produces a bunch of outputs (data to return to the client, data to write back to various databases and caches). It is usually deterministic: given the same inputs, the business logic should produce exactly the same output again. It is also stateless: any data that is required to produce the output or to make a decision has to be provided as an input.

By contrast, the network communication logic is all about ‘wiring’. It may end up having a lot of complexity in its own right: sending requests to the right node of a sharded database, retrying failed requests with exponential back-off, making requests to different services in parallel, cross-datacenter failover, service authentication, etc. But the network communication logic ought to be general-purpose and completely independent of your application’s business logic.

Both business logic and network communication logic are needed to build a service. But how do you combine the two into a single process? Most commonly, we build abstractions for each type of logic, hiding the gory implementation details. Much like in the blog example above, you end up calling a method somewhere inside the business logic, not really knowing or caring whether it will immediately return a value that the object has already computed, or whether it will talk to another process on the same machine, or load the value from some remote cache, or make a query on a database cluster somewhere.

It’s good that the business logic doesn’t need to worry about how and when the communication happens. And it’s good that the communication logic is general-purpose and not polluted with application-specific concerns. But I think it’s problematic that network communication may happen somewhere deeply inside a business logic call stack. Let me try to explain why.

Precomputed caches

As your volume of data and your number of users grow, database access often becomes a bottleneck (there are more queries competing for I/O, and each query takes longer when there’s more data). The standard answer to the problem is of course caching. You can cache at many different levels: an individual database row, or a model object generated by combining several sources, or even an entire HTML page ready to serve to a client. I will focus on the mid-to-high-level caches, where the raw data has gone through some sort of business logic before it ends up in the cache.

Most commonly, caches are set up in read-through style: on every query, you first check the cache, and return the value from the cache if it’s a hit; otherwise it’s a miss, so you do whatever is required to generate the value (query databases, apply business logic, perform voodoo), and return it to the client whilst also storing it in the cache for next time. As long as you can generate the value on the fly in a reasonable time, this works pretty well.

I will gloss over cache invalidation and expiry for now, and return to it below.

The most apparent problem with a read-through cache is that the first time a value is requested, it’s always slow. (And if your cache is too small to hold the entire dataset, rarely accessed values will get evicted and thus be slow every time.) That may or may not be a problem for you. One reason why it may be a problem is that on many sites, the first client to request a given page is typically the Googlebot, and Google penalises slow sites in rankings. So if you have the kind of site where Google juice is lifeblood, then your SEO guys may tell you that a read-through cache is not good enough.

So, can you make sure that the data is in the cache even before it is requested for the first time? Well, if your dataset isn’t too huge, you can actually precompute every possible cache entry, put them in a big distributed key-value store and serve them with minimal latency. That has a great advantage: cache misses no longer exist. If you’ve precomputed every possible cache entry, and a key isn’t in the cache, you can be sure that there’s no data for that key.

If that sounds crazy to you, consider these points:

  • A database index is a special case of a precomputed cache. For every value you might want to search for, the index tells you where to find occurrences of that value. If it’s not in the index, it’s not in the database. The initial index creation is a one-off batch job, and thereafter the database automatically keeps it in sync with the raw data. Yes, databases have been doing this for a long time.
  • With Hadoop you can process terabytes of data without breaking a sweat. That is truly awesome power.
  • There are several datastores that allow you to precompute their files in Hadoop, which makes them very well suited for serving the cache that you precomputed. We are currently using Voldemort in read-only mode (research paper), but HBase and ElephantDB can do this too.
  • If you’re currently storing data in denormalized form (to avoid joins on read queries), you can stop doing that. You can keep your primary database in a very clean, normalized schema, and any caches you derive from it can denormalize the data to your heart’s content. This gives you the best of both worlds.

Separating communication from business logic

Ok, say you’ve decided that you want to precompute a cache in Hadoop. As we’ve not yet addressed cache invalidation (see below), let’s just say you’re going to rebuild the entire cache once a day. That means the data you serve out of the cache will be stale, out of date by up to a day, but that’s still acceptable for some applications.

The first step is to get your raw data into HDFS. That’s not hard, assuming you have daily database backups: you can take your existing backup, transform it into a more MapReduce-friendly format such as Avro, and write it straight to HDFS. Do that with all your production databases and you’ve got a fantastic resource to work with in Hadoop.

Now, to build your precomputed cache, you need to apply the same business logic to the same data as you would in an uncached service that does it on the fly. As described above, your business logic takes as input the request parameters from the user and any data that is loaded from databases or services in order to serve that request. If you have all that data in HDFS, and you can work out all possible request parameters, then in theory, you should be able to take your existing business logic implementation and run it in Hadoop.

Business logic can be very complex, so you should probably aim to reuse the existing implementation rather than rewriting it. But doing so requires untangling the real business logic from all the network communication logic.

When your business logic is running as a service processing individual requests, you’re used to making several small requests to databases, caches or other services as part of generating a response (see the blog example above). Those small requests constitute gathering all the inputs needed by the business logic in order to produce its output (e.g. a rendered HTML page).

But when you’re running in Hadoop, this is all turned on its head. You don’t want to be making individual random-access requests to data, because that would be an order of magnitude too slow. Instead you need to use MapReduce to gather all the inputs for one particular evaluation of the business logic into one place, and then run the business logic given those inputs without any network communication. Rather than the business logic pulling together all the bits of data it needs in order to produce a response, the MapReduce job has already gathered all the data it knows the business logic is going to need, and pushes it into the business logic function.

Let’s use the blog example to make this more concrete. The data dependency is fairly simple: when the blog post params[:id] is requested, we require the row in the posts table whose id column matches the requested post, and we require all the rows in the comments table whose post_id column matches the requested post. If the posts and comments tables are in HDFS, it’s a very simple MapReduce job to group together the post with id = x and all the comments with post_id = x.

We can then use a stub database implementation to feed those database rows into the existing Post and Comment model objects. That way we can make the models think that they loaded the data from a database, even though actually we had already gathered all the data we knew it was going to need. The model objects can keep doing their job as normally, and the output they produce can be written straight to the cache.

By this point, two problems should be painfully clear:

  • How does the MapReduce job know what inputs the business logic is going to need in order to work?
  • OMG, implementing stub database drivers, isn’t that a bit too much pain for limited gain? (Note that in testing frameworks it’s not unusual to stub out your database, so that you can run your unit tests without a real database. Still, it’s non-trivial and annoying.)

Both problems have the same cause, namely that the network communication logic is triggered from deep inside the business logic.

Data dependencies

When you look at the business logic in the light of precomputing a cache, it seems like the following pattern would make more sense:

  1. Declare your data dependencies: “if you want me to render the blog post with ID x, I’m going to need the row in the posts table with id = x, and also all the rows in the comments table with post_id = x”.
  2. Let the communication logic deal with resolving those dependencies. If you’re running as a normal web app, that means making database (or memcache) queries to one or more databases, and maybe talking to other services. If you’re running in Hadoop, it means configuring the MapReduce job to group together all the pieces of data on which the business logic depends.
  3. Once all the dependencies have been loaded, the business logic is now a pure function, deterministic and side-effect-free, that produces our desired output. It can perform whatever complicated computation it needs to, but it’s not allowed access to the network or data stores that weren’t declared as dependencies up front.

This separation would make application architecture very different from the way it is commonly done today. I think this new style would have several big advantages:

  • By removing the assumption that the business logic is handling one request at a time, it becomes much easier to run the business logic in completely different contexts, such as in a batch job to precompute a cache. (No more stubbing out database drivers.)
  • Testing becomes much easier. All the tricky business logic for which you want to write unit tests is now just a function with a bunch of inputs and a bunch of outputs. You can easily vary what you put in, and easily check that the right thing comes out. Again, no more stubbing out the database.
  • The network communication logic can become a lot more helpful. For example, it can make several queries in parallel without burdening the business logic with a lot of complicated concurrency stuff, and it can deduplicate similar requests.
  • Because the data dependencies are very clearly and explicitly modelled, the system becomes easier to understand, and it becomes easier to move modules around, split a big monolithic beast into smaller services, or combine smaller services into bigger, logical units.

I hope you agree that this is a very exciting prospect. But is it practical?

In most cases, I think it would not be very hard to make business logic pure (i.e. stop making database queries from deep within) — it’s mostly a matter of refactoring. I have done it to substantial chunks of the Rapportive code base, and it was a bit tedious but perfectly doable. And the network communication logic wouldn’t have to change much at all.

The problem of making this architecture practical hinges on having a good mechanism for declaring data dependencies. The idea is not new — for instance, LinkedIn have an internal framework for resolving data dependencies that queries several services in parallel — but I’ve not yet seen a language or framework that really gets to the heart of the problem.

Adapting the blog example above, this is what I imagine such an architecture would look like:

Concept for using a dependency resolver

We still have models, and they are still used as encapsulations of state, but they are no longer wrappers around a database connection. Instead, the dependency resolver can take care of the messy business of talking to the database; the models are pure and can focus on the business logic. The models don’t care whether they are instantiated in a web app or in a Hadoop cluster, and they don’t care whether the data was loaded from a SQL database or from HDFS. That’s the way it should be.

In my spare time I have started working on a language called Flowquery (don’t bother searching, there’s nothing online yet) to solve the problem of declaring data dependencies. If I can figure it out, it should make precomputed caches and all the good things above very easy. But it’s not there yet, so I don’t want to oversell it.

But wait, there is one more thing…

Cache invalidation

There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton

How important is it that the data in your cache is up-to-date and consistent with your “source of truth” database? The answer depends on the application and the circumstances. For example, if the user edits their own data, you almost certainly want to show them an up-to-date version of their own data post-editing, otherwise they will assume that your app is broken. But you might be able to get away with showing stale data to other users for a while. For data that is not directly edited by users, stale data may always be ok.

If staleness is acceptable, caching is fairly simple: on a read-through cache you set an expiry time on a cache key, and when that time is reached, the entry falls out of the cache. On a precomputed cache you do nothing, and just wait until the next time you recompute the entire thing.

In cases where greater consistency is required, you have to explicitly invalidate cache entries when the original data changes. If just one cache key is affected by a change, you can write-through to that cache key when the “source of truth” database is updated. If many keys may be affected, you can use generational caching and clever generalisations thereof. Whatever technique you use, it usually ends up being a lot of manually written, fiddly and error-prone code. Not a great joy to work with, hence the terribly clichéd quote above.

But… observe the following: in our efforts to separate pure business logic from network communication logic, we decided that we needed to explicitly model the data dependencies, and only data sources declared there are permitted as inputs to the business logic. In other words, the data dependency framework knows exactly which pieces of data are required in order to generate a particular piece of output — and conversely, when a piece of (input) data changes, it can know exactly which outputs (cache entries) may be affected by the change!

This means that if we have a real-time feed of changes to the underlying databases, we can feed it into a stream processing framework like Storm, run the data dependency analysis in reverse on every change, recompute the business logic for each output affected by the change in input, and write the results to another datastore. This store sits alongside the precomputed cache we generated in a batch process in Hadoop. When you want to query the cache, check both the output of the batch process and the output of the stream process. If the stream process has generated more recent data, use that, otherwise use the batch process output.

If you’ve been following recent news in Big Data, you may recognise this as an application of Nathan Marz’ lambda architecture (described in detail in his upcoming book). I cannot thank Nathan enough for his amazing work in this area.

In this architecture, you get the benefits of a precomputed cache (every request is fast, including the first one), it keeps itself up-to-date with the underlying data, and because you have already declared your data dependencies, you don’t need to manually write cache invalidation code! The same dependency declaration can be used in three different ways:

  1. In ‘online’ mode in a service or web app, for driving the network communication logic in order to make all the required queries and requests in order to serve an incoming request, and to help with read-through caching.
  2. In ‘offline’ mode in Hadoop, to configure a MapReduce pipeline that brings together all the required data in order to run it through the business logic and generate a precomputed cache of all possible queries.
  3. In ‘nearline’ mode in Storm, to configure a stream processing topology that tracks changes to the underlying data, determines which cache keys need to be invalidated, and recomputes the cache values for those keys using the business logic.

I am designing Flowquery so that it can be used in all three modes — you should be able to write your data dependencies just once, and let the framework take care of bringing all the necessary data together so that the business logic can act on it.

My hope is to make caching and cache invalidation as simple as database indexes. You declare an index once, the database runs a one-off batch job to build the index, and thereafter automatically keeps it up-to-date as the table contents change. It’s so simple to use that we don’t even think about it, and that’s what we should be aiming for in the realm of caching.

The project is still at a very early stage, but hopefully I’ll be posting more about it as it progresses. If you’d like to hear more, please leave your email address and I’ll send you a brief note when I post more. Or you can follow me on Twitter or App.net.

Thanks to Nathan Marz, Pete Warden, Conrad Irwin, Rahul Vohra and Sam Stokes for feedback on drafts of this post.

Java's hashCode is not safe for distributed systems

As you probably know, hash functions serve many different purposes:

  1. Network and storage systems use them (in the guise of checksums) to detect accidental corruption of data.
  2. Crypographic systems use them to detect malicious corruption of data and to implement signatures.
  3. Password authentication systems use them to make it harder to extract plaintext passwords from a database.
  4. Programming languages use them for hash maps, to determine in which hash bucket a key is placed.
  5. Distributed systems use them to determine which worker in a cluster should handle a part of a large job.

All those purposes have different requirements, and different hash functions exist for the various purposes. For example, CRC32 is fine for detecting bit corruption in Ethernet, as it’s really fast and easy to implement in hardware, but it’s useless for cryptographic purposes. SHA-1 is fine for protecting the integrity of a message against attackers, as it’s cryptographically secure and also reasonably fast to compute; but if you’re storing passwords, you’re probably better off with something like bcrypt, which is deliberately slow in order to make brute-force attacks harder.

Anyway, that’s all old news. Today I want to talk about points 4 and 5, and why they are also very different from each other.

Hashes for hash tables

We use hash tables (dictionaries) in programming languages all the time without thinking twice. When you insert an item into a hash table, the language computes a hash code (an integer) for the key, uses that number to choose a bucket in the hash table (typically mod n for a table of size n), and then puts the key and value in that bucket in the table. If there’s already a value there (a hash collision), a linked list typically takes care of storing the keys and values within the same hash bucket. In Ruby, for example:

$ ruby --version
ruby 1.8.7 (2011-06-30 patchlevel 352) [i686-darwin11.0.0]

$ pry
[1] pry(main)> hash_table = {'answer' => 42}
=> {"answer"=>42}
[2] pry(main)> 'answer'.hash
=> -1246806696
[3] pry(main)> 'answer'.hash
=> -1246806696
[4] pry(main)> ^D

$ pry
[1] pry(main)> 'answer'.hash
=> -1246806696
[2] pry(main)> "don't panic".hash
=> -464783873
[3] pry(main)> ^D

When you add the key 'answer' to the hash table, Ruby internally calls the #hash method on that string object. The method returns an arbitrary number, and as you see above, the number is always the same for the same string. A different string usually has a different hash code. Occasionally you might get two keys with the same hash code, but it’s extremely unlikely that you get a large number of collisions in normal operation.

The problem with the example above: when I quit Ruby (^D) and start it again, and compute the hash for the same string, I still get the same result. But why is that a problem, you say, isn’t that what a hash function is supposed to do? – Well, the problem is that I can now put on my evil genius hat, and generate a list of strings that all have the same hash code:

$ pry
[1] pry(main)> "a".hash
=> 100
[2] pry(main)> "\0a".hash
=> 100
[3] pry(main)> "\0\0a".hash
=> 100
[4] pry(main)> "\0\0\0a".hash
=> 100
[5] pry(main)> "\0\0\0\0a".hash
=> 100
[6] pry(main)> "\0\0\0\0\0a".hash
=> 100

Any server in the world running the same version of Ruby will get the same hash values. This means that I can send a specially crafted web request to your server, in which the request parameters contain lots of those strings with the same hash code. Your web framework will probably parse the parameters into a hash table, and they will all end up in the same hash bucket, no matter how big you make the hash table. Whenever you want to access the parameters, you now have to iterate over a long list of hash collisions, and your swift O(1) hash table lookup is suddenly a crawling slow O(n).

I just need to make a small number of these evil requests to your server and I’ve brought it to its knees. This type of denial of service attack was already described back in 2003, but it only became widely known last year, when Java, Ruby, Python, PHP and Node.js all suddenly scrambled to fix the issue.

The solution is for the hash code to be consistent within one process, but to be different for different processes. For example, here is a more recent version in Ruby, in which the flaw is fixed:

$ ruby --version
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.3.0]

$ pry
[1] pry(main)> 'answer'.hash
=> 968518855724416885
[2] pry(main)> 'answer'.hash
=> 968518855724416885
[3] pry(main)> ^D

$ pry
[1] pry(main)> 'answer'.hash
=> -150894376904371785
[2] pry(main)> ^D

When I quit Ruby and start it again, and ask for the hash code of the same string, I get a completely different answer. This is obviously not what you want for cryptographic hashes or checksums, since it would render them useless — but for hash tables, it’s exactly right.

Hashes for distributed systems

Now let’s talk about distributed systems — systems in which you have more than process, probably on more than one machine, and they are talking to each other. If you have something that’s too big to fit on one machine (too much data to fit on one machine’s disks, too many requests to be handled by one machine’s CPUs, etc), you need to spread it across multiple machines.

How do you know which machine to use for a given request? Unless you have some application-specific partitioning that makes more sense, a hash function is a simple and effective solution: hash the name of the thing you’re requesting, mod number of servers, and that’s your server number. (Though if you ever want to change the number of machines, consistent hashing is probably a better bet.)

For this setup you obviously don’t want a hash function in which different processes may compute different hash codes for the same value, because you’d end up routing requests to the wrong server. You can’t use the same hash function as the programming language uses for hash tables.

Unfortunately, this is exactly what Hadoop does. Storm, a stream processing framework, does too. Both use the Java Virtual Machine’s Object.hashCode() method.

I understand the use of hashCode() — it’s very tempting. On strings, numbers and collection classes, hashCode() always returns a consistent value, apparently even across different JVM vendors. It’s like that despite the documentation for hashCode() explicitly not guaranteeing consistency across different processes:

Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

And once in a while, a bold library comes along that actually returns different hashCode() values in different processes – Protocol Buffers, for example – and people get quite confused.

The problem is that although the documentation says hashCode() doesn’t provide a consistency guarantee, the Java standard library behaves as if it did provide the guarantee. People start relying on it, and since backwards-compatibility is rated so highly in the Java community, it will probably never ever be changed, even though the documentation would allow it to be changed. So the JVM gets the worst of both worlds: a hash table implementation that is open to DoS attacks, but also a hash function that can’t always safely be used for communication between processes. :(

Therefore…

So what I’d like to ask for is this: if you’re building a distributed framework based on the JVM, please don’t use Java’s hashCode() for anything that needs to work across different processes. Because it’ll look like it works fine when you use it with strings and numbers, and then someday a brave soul will use (e.g.) a protocol buffers object, and then spend days banging their head against a wall trying to figure out why messages are getting sent to the wrong servers.

What should you use instead? First, you probably need to serialize the object to a byte stream (which you need to do anyway if you’re going to send it over the network). If you’re using a serialization that always maps the same values to the same sequence of bytes, you can just hash that byte stream. A cryptographic hash such as MD5 or SHA-1 would be ok for many cases, but might be a bit heavyweight if you’re dealing with a really high-throughput service. I’ve heard good things about MurmurHash, which is non-cryptographic but lightweight and claims to be well-behaved.

If your serialization doesn’t always produce the same sequence of bytes for a given value, then you can still define a hash function on the objects themselves. Just please don’t use hashCode(). It’s ok for in-process hash tables, but distributed systems are a different matter.

(Oh, and in case you were wondering: it looks like the web servers affected by Java’s hashCode collisions fixed the problem not by changing to a different hash function, but simply by limiting the number of parameters: Tomcat, Jetty.)