Skip to content

Six things I wish we had known about scaling

Looking back at the last few years of building Rapportive and LinkedIn Intro, I realised that there were a number of lessons that we had to learn the hard way. We built some reasonably large data systems, and there are a few things I really wish we had known beforehand.

None of these lessons are particularly obscure – they are all well-documented, if you know where to look. They are the kind of things that made me think “I can’t believe I didn’t know that, I’m so stupid #facepalm” in retrospect. But perhaps I’m not the only one who started out not knowing these things, so I’ll write them down for the benefit of anyone else who finds themself having to scale a system.

The kind of system I’m talking about is the data backend of a consumer web/mobile app with a million users (order of magnitude). At the scale of Google, LinkedIn, Facebook or Twitter (hundreds of millions of users), you’ll have an entirely different set of problems, but you’ll also have a bigger team of experienced developers and operations people around you. The mid-range scale of about a million users is interesting, because it’s quite feasible for a small startup team to get there with some luck and good marketing skills. If that sounds like you, here are a few things to keep in mind.

1. Realistic load testing is hard

Improving the performance of a system is ideally a very scientific process. You have in your head a model of what your system is doing, and a theory of where the expensive operations are. You propose a change to the system, and predict what the outcome will be. Then you make the change, observe the system’s behaviour under laboratory conditions, and thus gather evidence which either confirms or contradicts your theory. That way you iterate your way to a better theory, and also a better-performing implementation.

Sadly, we hardly ever managed to do it that way in practice. If we were optimising a microbenchmark, running the same code a million times in a tight loop, it would be easy. But we are dealing with large volumes of data, spread out across multiple machines. If you read the same item a million times in a loop, it will simply be cached, and the load test tells you nothing. If you want meaningful results, the load test needs to simulate a realistically large working set, a realistic mixture of reads and writes, realistic distribution of requests over time, and so on. And that is difficult.

It’s difficult enough to simply know what your access patterns actually are, let alone simulate them. As a starting point, you can replay a few hours worth of access logs against a copy of your real dataset. However, that only really works for read requests. Simulating writes is harder, as you may need to account for business logic rules (e.g. a sequential workflow must first update A, then update B, then update C) and deal with changes that can happen only once (if your write changes state from D to E, you can’t change from D to E again later in the test, as you’re already in state E). That means you have to synchronise your access logs with your database snapshot, or somehow generate suitable synthetic write load.

Even harder if you want to test with a dataset that is larger than the one you actually have (so that you can find out what happens when you double your userbase, and prepare for that event). Now you have to work out the statistical properties of your dataset (the distribution of friends per user is a power law with x parameters, the correlation between one user’s number of friends and the number of friends that their friends have is y, etc) and generate a synthetic dataset with those parameters. You are now in deep, deep yak shaving territory. Step back from that yak.

In practice, it hardly ever works that way. We’re lucky if, sometimes, we can run the old code and the new code side-by-side, and observe how they perform in comparison. Often, not even that is possible. Usually we often just cross our fingers, deploy, and roll back if the change seems to have made things worse. That is deeply unsatisfying for a scientifically-minded person, but it more or less gets the job done.

2. Data evolution is difficult

Being able to rapidly respond to change is one of the biggest advantages of a small startup. Agility in product and process means you also need the freedom to change your mind about the structure of your code and your data. There is lot of talk about making code easy to change, eg. with good automated tests. But what about changing the structure of your data?

Schema changes have a reputation of being very painful, a reputation that is chiefly MySQL’s fault: simply adding a column to a table requires the entire table to be copied. On a large table, that might mean several hours during which you can’t write to the table. Various tools exist to make that less painful, but I find it unbelievable that the world’s most popular open source database handles such a common operation so badly.

Postgres can make simple schema changes without copying the table, which means they are almost instant. And of course the avoidance of schema changes is a primary selling point of document databases such as MongoDB (so it’s up to application code to deal with a database that uses different schemas for different documents). But simple schema changes, such as adding a new field or two, don’t tell the entire story.

Not all your data is in databases; some might be in archived log files or some kind of blob storage. How do you deal with changing the schema of that data? And sometimes you need to make complex changes to the data, such as breaking a large thing apart, or combining several small things, or migrating from one datastore to another. Standard tools don’t help much here, and document databases don’t make it any easier.

We’ve written large migration jobs that break the entire dataset into chunks, process chunks gradually over the course of a weekend, retry failed chunks, track which things were modified while the migration was happening, and finally catch up on the missed updates. A whole lot of complexity just for a one-off data migration. Sometimes that’s unavoidable, but it’s heavy lifting that you’d rather not have to do in the first place.

Hadoop data pipelines can help with this sort of thing, but now you have to set up a Hadoop cluster, learn how to use it, figure out how to get your data into it, and figure out how to get the transformed data out to your live systems again. Big companies like LinkedIn have figured out how to do that, but in a small team it can be a massive time-sink.

3. Database connections are a real limitation

In PostgreSQL, each client connection to the database is handled by a separate unix process; in MySQL, each connection uses a separate thread. Both of these models impose a fairly low limit on the number of connections you can have to the database – typically a few hundred. Every connection adds overhead, so the entire database slows down, even if those connections aren’t actively processing queries. For example, Heroku Postgres limits you to 60 connections on the smallest plan, and 500 connections on the largest plan, although having anywhere near 500 connections is actively discouraged.

In a fast-growing app, it doesn’t take long before you reach a few hundred connections. Each instance of your application server uses at least one. Each background worker process that needs to access the database uses one. Adding more machines running your application is fairly easy if they are stateless, but every machine you add means more connections.

Partitioning (sharding) and read replicas probably won’t help you with your connection limit, unless you can somehow load-balance requests so that all the requests for a particular partition are handled by a particular server instance. A better bet is to use a connection pooler, or to write your own data access layer which wraps database access behind an internal API.

That’s all doable, but it doesn’t seem a particularly valuable use of your time when you’re also trying to iterate on product features. And every additional service you deploy is another thing that can go wrong, another thing that needs to be monitored and maintained.

(Databases that use a lightweight connection model don’t have this problem, but they may have other problems instead.)

4. Read replicas are an operational pain

A common architecture is to designate one database instance as a leader (also known as master) and to send all database writes to that instance. The writes are then replicated to other database instances (called read replicas, followers or slaves), and many read-only queries can be served from the replicas, which takes load off the leader. This architecture is also good for fault tolerance, since it gives you a warm standby – if your leader dies, you can quickly promote one of the replicas to be the new leader (you wouldn’t want to be offline for hours while you restore the database from a backup).

What they don’t tell you is that setting up and maintaining replicas is significant operational pain. MySQL is particularly bad in this regard: in order to set up a new replica, you have to first lock the leader to stop all writes and take a consistent snapshot (which may take hours on a large database). How does your app cope if it can’t write to the database? What do your users think if they can’t post stuff?

With Postgres, you don’t need to stop writes to set up a replica, but it’s still some hassle. One of the things I like most about Heroku Postgres is that it wraps all the complexity of replication and WAL archiving behind a straightforward command-line tool.

Even so, you still need to failover manually if your leader fails. You need to monitor and maintain the replicas. Your database library may not support read replicas out of the box, so you may need to add that. Some reads need to be made on the leader, so that a user sees their own writes, even if there is replication lag. That’s all doable, but it’s additional complexity, and doesn’t add any value from users’ point of view.

Some distributed datastores such as MongoDB, RethinkDB and Couchbase also use this replication model, and they automate the replica creation and master failover processes. Just because they do that doesn’t mean they automatically give you magic scaling sauce, but it is a very valuable feature.

5. Think about memory efficiency

At various times, we puzzled about weird latency spikes in our database activity. After many PagerDuty alerts and troubleshooting, it usually turned out that we could fix the issue by throwing more RAM at the problem, either in the form of a bigger database instance, or separate caches in front of it. It’s sad, but true: many performance problems can be solved by simply buying more RAM. And if you’re in a hurry because your hair is on fire, it’s often the best thing to do. There are limitations to that approach, of course – a m2.4xlarge instance on EC2 costs quite a bit of money, and eventually there are no bigger machines to turn to.

Besides buying more RAM, an effective solution is to use RAM more efficiently in the first place, so that a bigger part of your dataset fits in RAM. In order to decide where to optimise, you need to know what all your memory is being used for – and that’s surprisingly non-trivial. With a bit of digging, you can usually get your database to report how much disk space each of your tables and indexes is taking. Figuring out the working set, and how much memory is actually used for what, is harder.

As a rule of thumb, your performance will probably be more predictable if your indexes completely fit in RAM – so that there’s a maximum of one disk read per query, which reduces your exposure to fluctuations in I/O latency. But indexes can get rather large if you have a lot of data, so this can be an expensive proposition.

At one point we found ourselves reading up about the internal structure of an index in Postgres, and realised that we could save a few bytes per row by indexing on the hash of a string column rather than the string itself. (More on that in another post.) That reduced the memory pressure on the system, and helped keep things ticking along for another few months. That’s just one example of how it can be helpful to think about using memory efficiently.

6. Change capture is under-appreciated

So far I’ve only talked about things that suck – sorry about the negativity. As final point, I’d like to mention a technique which is awesome, but not nearly as widely known and appreciated as it should be: change capture.

The idea of change capture is simple: let the application consume a feed of all writes to the database. In other words, you have a background process which gets notified every time something changes in the database (insert, update or delete).

You could achieve a similar thing if, every time you write something to the database, you also post it to a message queue. However, change capture is better because it contains exactly the same data as what was committed to the database (avoiding race conditions). A good change capture system also allows you to stream through the entire existing dataset, and then seamlessly switch to consuming real-time updates when it has caught up.

Consumers of this changelog are decoupled from the app that generates the writes, which gives you great freedom to experiment without fear of bringing down the main site. You can use the changelog for updating and invalidating caches, for maintaining full-text indexes, for calculating analytics, for sending out emails and push notifications, for importing the data into Hadoop, and much more.

LinkedIn built a technology called Databus to do this. The open source release of Databus is for Oracle DB, and there is a proof-of-concept MySQL version (which is different from the version of Databus for MySQL that LinkedIn uses in production).

The new project I am working on, Apache Samza, also sits squarely in this space – it is a framework for processing real-time data feeds, somewhat like MapReduce for streams. I am excited about it because I think this pattern of processing change capture streams can help many people build apps that scale better, are easier to maintain and more reliable than many apps today. It’s open source, and you should go and try it out.

In conclusion

The problems discussed in this post are primarily data systems problems. That’s no coincidence: if you write your applications in a stateless way, they are pretty easy to scale, since you can just run more copies of them. Thus, whether you use Rails or Express.js or whatever framework du jour really doesn’t matter much. The hard part is scaling the stateful parts of your system: your databases.

There are no easy solutions for these problems. Some new technologies and services can help – for example, the new generation of distributed datastores tries to solve some of the above problems (especially around automating replication and failover), but they have other limitations. There certainly is no panacea.

Personally I’m totally fine with using new and experimental tools for derived data, such as caches and analytics, where data loss is annoying but not end of your business. I’m more cautious with the system of record (also known as source of truth). Every system has operational quirks, and the devil you know may let you sleep better at night than the one you don’t. I don’t really mind what that devil is in your particular case.

I’m interested to see whether database-as-a-service offerings such as Firebase, Orchestrate or Fauna can help (I’ve not used any of them seriously, so I can’t vouch for them at this point). I see big potential advantages for small teams in outsourcing operations, but also a big potential risk in locking yourself to a system that you couldn’t choose to host yourself if necessary.

Building scalable systems is not all sexy roflscale fun. It’s a lot of plumbing and yak shaving. A lot of hacking together tools that really ought to exist already, but all the open source solutions out there are too bad (and yours ends up bad too, but at least it solves your particular problem).

On the other hand, consider yourself lucky. If you’ve got scaling problems, you must be doing something right – you must be making something that people want.

LinkedIn Intro: Doing the Impossible on iOS

This is a copy of a post I originally wrote on the LinkedIn engineering blog.

We recently launched LinkedIn Intro — a new product that shows you LinkedIn profiles, right inside the native iPhone mail client. That’s right: we have extended Apple’s built-in iOS Mail app, a feat that many people consider to be impossible. This post is a short summary of how Intro works, and some of the ways we bent technology to our will.

With Intro, you can see at a glance the picture of the person who’s emailing you, learn more about their background, and connect with them on LinkedIn. This is what it looks like:

The iPhone mail app, before and after Intro
The iPhone mail app, before and after Intro

How Intro Came to Be

The origins of Intro go back to before the acquisition of Rapportive by LinkedIn. At Rapportive, we had built a browser extension that modified Gmail to show the profile of an email’s sender within the Gmail page. The product was popular, but people kept asking: “I love Rapportive in Gmail, when can I have it on mobile too?”

The magic of Rapportive is that you don’t have to remember to use it. Once you have it installed, it is right there inside your email, showing you everything you need to know about your contacts. You don’t need to fire up a new app or do a search in another browser tab, because the information is right there when you need it. It just feels natural.

At LinkedIn, we want to work wherever our members work. And we know that professionals spend a lot of time on their phone, checking and replying to emails — so we had to figure out how to enhance mobile email, giving professionals the information they need to be brilliant with people.

But how do we do that? Ask any iOS engineer: there is no API for extending the built-in mail app on the iPhone. If you wanted to build something like Rapportive, most people would tell you that it is impossible. Yet we figured it out.

Impossible #1: Extending the iOS Mail Client

Our key insight was this: we cannot extend the mail client, but we can add information to the messages themselves. One way to do this would be to modify the messages on the server — but then the modification would appear on all your clients, both desktop and mobile. That would not be what users want.

Instead, we can add information to messages by using a proxy server.

Rewriting messages using an IMAP proxy
Rewriting messages using an IMAP proxy

Normally your device connects directly to the servers of your email provider (Gmail, Yahoo, AOL, etc.), but we can configure the device to connect to the Intro proxy server instead.

The Intro proxy server speaks the IMAP protocol just like an email provider, but it doesn’t store messages itself. Instead, it forwards requests from the device to your email provider, and forwards responses from the email provider back to the device. En route, it inserts Intro information at the beginning of each message body — we call this the top bar.

The great thing about this approach: the proxy server can tailor the top bar to the device, since it knows which device is downloading the message. It can adapt the layout to be appropriate to the screen size, and it can take advantage of the client’s latest features, because it doesn’t need to worry about compatibility with other devices.

Our proxy server is written in Ruby using EventMachine, which allows it to efficiently handle many concurrent IMAP connections. We have developed some libraries to make the evented programming model nicer to work with, including Deferrable Gratification and LSpace.

Impossible #2: Interactive UI in Email

Ok, we have a way of adding information about the sender to a message — but so far it’s just a static piece of HTML. The top bar is deliberately minimal, because we don’t want it to get in the way. But wouldn’t it be awesome if you could tap the top bar and see the full LinkedIn profile… without leaving the mail app?

“But that’s impossible,” they cry, “you can’t run JavaScript in the mail client!” And that’s true — any JavaScript in an email is simply ignored. But iOS Mail does have powerful CSS capabilities, since it uses the same rendering engine as Safari.

Recall that CSS has a :hover state that is triggered when you hover the mouse over an element. This is used for popup menus in the navigation of many websites, or for tooltips. But what do you do on a touchscreen device, where there is no hovering or clicking, only tapping?

A little-known fact about CSS on Mobile Safari: in certain circumstances, tapping a link once simulates a :hover state on that link, and tapping it twice has the effect of a click. Thanks to this feature, popup menus and tooltips still work on iOS.

With some creativity, we figured out how to use this effect to create an interactive user interface within a message! Just tap the top bar to see the full LinkedIn profile:

With CSS tricks we can embed an entire LinkedIn profile in a message
With CSS tricks we can embed an entire LinkedIn profile in a message

Impossible #3: Dynamic Content in Email

This :hover trick allows us to have some interactivity within a message, but for more complex interactions we have to take you to the browser (where we can run a normal web app, without the mail app’s limitations). For example, if you want to connect with your contact on LinkedIn, we take you to Safari.

That’s fine, but it leaves us with a problem: the top bar needs to show if you’re already connected with someone. Say you send an invitation, and the other person accepts — now you’re connected, but if you open the same email again, it still says that you’re not connected!

This is because once a message has been downloaded, an IMAP client may assume that the message will never change. It is cached on the device, and unlike a web page, it never gets refreshed. Now that you’re connected, the top bar content needs to change. How do we update it?

Our solution: the connect button is in a tiny <iframe> which is refreshed every time you open the message. And if you open the message while your device is offline? No problem: the iframe is positioned on top of an identical-looking button in the static top bar HTML. If the iframe fails to load, it simply falls back to the connection status at the time when the message was downloaded.

This allows the top bar to contain dynamic content, even though it’s impossible for the server to modify a message once it has been downloaded by the device.

Using an embedded iframe to keep the connection status up-to-date, within an otherwise static top bar
Using an embedded iframe to keep the connection status up-to-date, within an otherwise static top bar

Impossible #4: Easy Installation

Once we got the IMAP proxy working, we were faced with another problem: how do we configure a device to use the proxy? We cannot expect users to manually enter IMAP and SMTP hostnames, choose the correct TLS settings, etc — it’s too tedious and error-prone.

Fortunately, Apple provides a friendly way of setting up email accounts by using configuration profiles — a facility that is often used in enterprise deployments of iOS devices. Using this technique, we can simply ask the user for their email address and password, autodiscover the email provider settings, and send a configuration profile to the device. The user just needs to tap “ok” a few times, and then they have a new mail account.

Moreover, for Gmail and Google Apps accounts, we can use OAuth, and never need to ask for the user’s password. Even better!

iOS configuration profiles make setup of new email accounts a breeze
iOS configuration profiles make setup of new email accounts a breeze

Security and Privacy

We understand that operating an email proxy server carries great responsibility. We respect the fact that your email may contain very personal or sensitive information, and we will do everything we can to make sure that it is safe. Our principles and key security measures are detailed in our pledge of privacy.


When we first built Rapportive for Gmail, people thought that we were crazy — writing a browser extension that modified the Gmail page on the fly, effectively writing an application inside someone else’s application! But it turned out to be a great success, and many others have since followed our footsteps and written browser extensions for Gmail.

Similarly, Intro’s approach of proxying IMAP is a novel way of delivering software to users. It operates at the limit of what is technically possible, but it has a big advantage: we can enhance the apps you already use. Of course the idea isn’t limited to the iPhone, so watch out for new platforms coming your way soon :)

This post has only scratched the surface of the interesting challenges we have overcome while building Intro. In follow-up posts we will talk about some of our CSS techniques, testing and monitoring tools, things we do to achieve high performance and high reliability, and more. In the meantime, check out Intro and let us know what you think!

System operations over seven centuries

On a walk in the Alps last week we came across a wonderful piece of engineering, more successful than most software systems could claim to be. It is the system of Waale, an ancient irrigation system in the Vinschgau, South Tyrol.

The climate in the Vinschgau is sunny, dry and windy. Without irrigation, agriculture would barely be possible, but if water from mountain streams is channelled to the fields, apple trees and meadows can flourish. The area has been inhabited at least since the Bronze Age, and it is likely that artificial irrigation started early. The oldest documents on the Waal system date from the 12th century, and some Waale built in the 14th century are still in use today.

The pictures in this post show the Leitenwaal and the Berkwaal near the village of Schluderns in South Tyrol, northern Italy. These two conduits carry water from a mountain stream (the Saldurbach) to the fields and meadows around Schluderns. Along their combined length of about six kilometers, they overcome many obstacles: twisting along the face of steep mountainsides, crossing aqueducts over deep ravines, tunnelling underneath boulders, before they finally arrive at the fields they supply.

Some sections look almost like a natural stream – except that they flow across the mountainside, not down, because they are designed to cover the greatest possible distance with the smallest possible loss in altitude. Other sections are more obviously artificial, where the furrow has been lined with flat stones or blanks of wood.

This system was originally built almost 700 years ago, using the technology available at the time: spade, axe, hammer and chisel. Of course, nowadays, electric pumps can take water from the river at the valley floor, and sprinkle it on the fields on the slopes above. But for many centuries, the only feasible option was to take water from a stream at high altitude, and let it flow down from there.

Here a feed of water is taken from a stream, and carried along a wooden gulley: the input to the irrigation system. Along the way, gates regulate the flow of water in the direction of various farms. For centuries, the details of water distribution – how much water shall be directed towards which farm at which time – have been governed by detailed agreements, and led to many disputes between farmers.

If the system were to fail for too long, crops would wither, so it was important that the system was always well-maintained and operational. And of course, parts of the system would fail from time to time – erosion, landslides, decay, accidents or any number of other faults could occur. When a part of the system broke, it was replaced using whatever technology was available at the time.

Thus, the system is now a patchwork of different water-carrying technologies from different ages. The oldest “pipes” were made from hollowed-out tree trunks, and some of them are still in use (water flows through tree trunks across a ravine in the left picture below). Later replacements have been made with concrete, steel or plastic pipes – whatever is believed to be the most reliable solution in the long term.

Perhaps the most impressive aspect of this system are its operability features, i.e. the things that help the operator of the Waal in his job of keeping the system running smoothly. For example, at regular intervals, the water flows through gratings which filter out twigs or other objects before they can cause blockages in pipes. The gratings are cleaned regularly, and tools for clearing out pipes are kept near the Waal. Routine inspections help detect problems early, before they escalate and cause further damage.

After heavy rainfall or melting of snow, the influx of water may exceed the Waal’s capacity. This is problematic: if the Waal bursts its banks, those banks would be damaged by erosion or washed away, making the problem much worse. Thus, the system includes overflow points at which water is channelled back into the natural stream if the Waal is over capacity (left photo below).

There is even an ingenious monitoring system (right photo below). A waterwheel is placed in the stream, and a cowbell is attached so that it rings on each rotation of the wheel (video). Thus, the operator can tell the rate of water flow from a distance, simply by listening for the rhythm of the bell.

The Waaler, the operator in charge of maintenance of the Waal, is an important and highly-regarded member of the local community. Traditionally, this role is elected every year on the first Sunday of Lent. The operator can be re-elected by the community if they were satisfied with his work in the previous year.

Looking at the lessons from this ancient irrigation system, and adapting them to software systems, my take-aways are:

  • Good interface design can survive through multiple generations of technology. A stream of water, flowing downhill, is a simple interface that can be implemented in stone-lined furrows, hollowed-out tree trunks, concrete, steel and plastic pipes, and more.
  • When replacing obsolete technology with new technology, some work is required to join them up – two pieces of standardised plastic piping may fit snugly, but you can’t expect the same from a hollow tree trunk interfacing with a plastic pipe.
  • New technology is not necessarily better than old technology. Hollow tree trunks are still used to feed water into 21st-century sprinkler irrigation systems.
  • API rate limits are not a new thing.
  • Continuously monitor the health of your system, and detect problems early.
  • Operations doesn’t just happen; it has to be someone’s job.
  • If a system solves an important problem, is well-engineered and well-operated, it can stick around for a very, very long time.

Improving the security of your SSH private key files

Ever wondered how those key files in ~/.ssh actually work? How secure are they actually?

As you probably do too, I use ssh many times every single day — every git fetch and git push, every deploy, every login to a server. And recently I realised that to me, ssh was just some crypto voodoo that I had become accustomed to using, but I didn’t really understand. That’s a shame — I like to know how stuff works. So I went on a little journey of discovery, and here are some of the things I found.

When you start reading about “crypto stuff”, you very quickly get buried in an avalanche of acronyms. I will briefly mention the acronyms as we go along; they don’t help you understand the concepts, but they are useful in case you want to Google for further details.

Quick recap: If you’ve ever used public key authentication, you probably have a file ~/.ssh/id_rsa or ~/.ssh/id_dsa in your home directory. This is your RSA/DSA private key, and ~/.ssh/ or ~/.ssh/ is its public key counterpart. Any machine you want to log in to needs to have your public key in ~/.ssh/authorized_keys on that machine. When you try to log in, your SSH client uses a digital signature to prove that you have the private key; the server checks that the signature is valid, and that the public key is authorized for your username; if all is well, you are granted access.

So what is actually inside this private key file?

The unencrypted private key format

Everyone recommends that you protect your private key with a passphrase (otherwise anybody who steals the file from you can log into everything you have access to). If you leave the passphrase blank, the key is not encrypted. Let’s look at this unencrypted format first, and consider passphrase protection later.

A ssh private key file typically looks something like this:

   ... etc ... lots of base64 blah blah ...

The private key is an ASN.1 data structure, serialized to a byte string using DER, and then Base64-encoded. ASN.1 is roughly comparable to JSON (it supports various data types such as integers, booleans, strings and lists/sequences that can be nested in a tree structure). It’s very widely used for cryptographic purposes, but it has somehow fallen out of fashion with the web generation (I don’t know why, it seems like a pretty decent format).

To look inside, let’s generate a fake RSA key without passphrase using ssh-keygen, and then decode it using asn1parse:

$ ssh-keygen -t rsa -N '' -f test_rsa_key
$ openssl asn1parse -in test_rsa_key
    0:d=0  hl=4 l=1189 cons: SEQUENCE
    4:d=1  hl=2 l=   1 prim: INTEGER           :00
    7:d=1  hl=4 l= 257 prim: INTEGER           :C36EB2429D429C7768AD9D879F98C...
  268:d=1  hl=2 l=   3 prim: INTEGER           :010001
  273:d=1  hl=4 l= 257 prim: INTEGER           :A27759F60AEA1F4D1D56878901E27...
  534:d=1  hl=3 l= 129 prim: INTEGER           :F9D23EF31A387694F03AD0D050265...
  666:d=1  hl=3 l= 129 prim: INTEGER           :C84415C26A468934F1037F99B6D14...
  798:d=1  hl=3 l= 129 prim: INTEGER           :D0ACED4635B5CA5FB896F88BB9177...
  930:d=1  hl=3 l= 128 prim: INTEGER           :511810DF9AFD590E11126397310A6...
 1061:d=1  hl=3 l= 129 prim: INTEGER           :E3A296AE14E7CAF32F7E493FDF474...

Alternatively, you can paste the Base64 string into Lapo Luchini’s excellent JavaScript ASN.1 decoder. You can see that ASN.1 structure is quite simple: a sequence of nine integers. Their meaning is defined in RFC2313. The first integer is a version number (0), and the third number is quite small (65537) – the public exponent e. The two important numbers are the 2048-bit integers that appear second and fourth in the sequence: the RSA modulus n, and the private exponent d. These numbers are used directly in the RSA algorithm. The remaining five numbers can be derived from n and d, and are only cached in the key file to speed up certain operations.

DSA keys are similar, a sequence of six integers:

$ ssh-keygen -t dsa -N '' -f test_dsa_key
$ openssl asn1parse -in test_dsa_key
    0:d=0  hl=4 l= 444 cons: SEQUENCE
    4:d=1  hl=2 l=   1 prim: INTEGER           :00
    7:d=1  hl=3 l= 129 prim: INTEGER           :E497DFBFB5610906D18BCFB4C3CCD...
  139:d=1  hl=2 l=  21 prim: INTEGER           :CF2478A96A941FB440C38A86F22CF...
  162:d=1  hl=3 l= 129 prim: INTEGER           :83218C0CA49BA8F11BE40EE1A7C72...
  294:d=1  hl=3 l= 128 prim: INTEGER           :16953EA4012988E914B466B9C37CB...
  425:d=1  hl=2 l=  21 prim: INTEGER           :89A356E922688EDEB1D388258C825...

Passphrase-protected keys

Next, in order to make life harder for an attacker who manages to steal your private key file, you protect it with a passphrase. How does this actually work?

$ ssh-keygen -t rsa -N 'super secret passphrase' -f test_rsa_key
$ cat test_rsa_key
Proc-Type: 4,ENCRYPTED
DEK-Info: AES-128-CBC,D54228DB5838E32589695E83A22595C7

    ... etc ...

We’ve gained two header lines, and if you try to parse that Base64 text, you’ll find it’s no longer valid ASN.1. That’s because the entire ASN.1 structure we saw above has been encrypted, and the Base64-encoded text is the output of the encryption. The header tells us the encryption algorithm that was used: AES-128 in CBC mode. The 128-bit hex string in the DEK-Info header is the initialization vector (IV) for the cipher. This is pretty standard stuff; all common crypto libraries can handle it.

But how do you get from the passphrase to the AES encryption key? I couldn’t find it documented anywhere, so I had to dig through the OpenSSL source to find it:

  1. Append the first 8 bytes of the IV to the passphrase, without a separator (serves as a salt).
  2. Take the MD5 hash of the resulting string (once).

That’s it. To prove it, let’s decrypt the private key manually (using the IV/salt from the DEK-Info header above):

$ tail -n +4 test_rsa_key | grep -v 'END ' | base64 -D |    # get just the binary blob
  openssl aes-128-cbc -d -iv D54228DB5838E32589695E83A22595C7 -K $(
    ruby -rdigest/md5 -e 'puts Digest::MD5.hexdigest(["super secret passphrase",0xD5,0x42,0x28,0xDB,0x58,0x38,0xE3,0x25].pack("a*cccccccc"))'
  ) |
  openssl asn1parse -inform DER

…which prints out the sequence of integers from the RSA key in the clear. Of course, if you want to inspect the key, it’s much easier to do this:

$ openssl rsa -text -in test_rsa_key -passin 'pass:super secret passphrase'

but I wanted to demonstrate exactly how the AES key is derived from the password. This is important because the private key protection has two weaknesses:

  • The digest algorithm is hard-coded to be MD5, which means that without changing the format, it’s not possible to upgrade to another hash function (e.g. SHA-1). This could be a problem if MD5 turns out not to be good enough.
  • The hash function is only applied once — there is no stretching. This is a problem because MD5 and AES are both fast to compute, and thus a short passphrase is quite easy to break with brute force.

If your private SSH key ever gets into the wrong hands, e.g. because someone steals your laptop or your backup hard drive, the attacker can try a huge number of possible passphrases, even with moderate computing resources. If your passphrase is a dictionary word, it can probably be broken in a matter of seconds.

That was the bad news: the passphrase on your SSH key isn’t as useful as you thought it was. But there is good news: you can upgrade to a more secure private key format, and everything continues to work!

Better key protection with PKCS#8

What we want is to derive a symmetric encryption key from the passphrase, and we want this derivation to be slow to compute, so that an attacker needs to buy more computing time if they want to brute-force the passphrase. If you’ve seen the use bcrypt meme, this should sound very familiar.

For SSH private keys, there are a few standards with clumsy names (acronym alert!) that can help us out:

  • PKCS #5 (RFC 2898) defines PBKDF2 (Password-Based Key Derivation Function 2), an algorithm for deriving an encryption key from a password by applying a hash function repeatedly. PBES2 (Password-Based Encryption Scheme 2) is also defined here; it simply means using a PBKDF2-generated key with a symmetric cipher.
  • PKCS #8 (RFC 5208) defines a format for storing encrypted private keys that supports PBKDF2. OpenSSL transparently supports private keys in PKCS#8 format, and OpenSSH uses OpenSSL, so if you’re using OpenSSH that means you can swap your traditional SSH key files for PKCS#8 files and everything continues to work as normal!

I don’t know why ssh-keygen still generates keys in SSH’s traditional format, even though a better format has been available for years. Compatibility with servers is not a concern, because the private key never leaves your machine. Fortunately it’s easy enough to convert to PKCS#8:

$ mv test_rsa_key test_rsa_key.old
$ openssl pkcs8 -topk8 -v2 des3 \
    -in test_rsa_key.old -passin 'pass:super secret passphrase' \
    -out test_rsa_key -passout 'pass:super secret passphrase'

If you try using this new PKCS#8 file with a SSH client, you should find that it works exactly the same as the file generated by ssh-keygen. But what’s inside it?

$ cat test_rsa_key
    ... etc ...

Notice that the header/footer lines have changed (BEGIN ENCRYPTED PRIVATE KEY instead of BEGIN RSA PRIVATE KEY), and the plaintext Proc-Type and DEK-Info headers have gone. In fact, the whole key file is once again a ASN.1 structure:

$ openssl asn1parse -in test_rsa_key
    0:d=0  hl=4 l=1294 cons: SEQUENCE
    4:d=1  hl=2 l=  64 cons: SEQUENCE
    6:d=2  hl=2 l=   9 prim: OBJECT            :PBES2
   17:d=2  hl=2 l=  51 cons: SEQUENCE
   19:d=3  hl=2 l=  27 cons: SEQUENCE
   21:d=4  hl=2 l=   9 prim: OBJECT            :PBKDF2
   32:d=4  hl=2 l=  14 cons: SEQUENCE
   34:d=5  hl=2 l=   8 prim: OCTET STRING      [HEX DUMP]:3AEFD2DBFBF9E3B3
   44:d=5  hl=2 l=   2 prim: INTEGER           :0800
   48:d=3  hl=2 l=  20 cons: SEQUENCE
   50:d=4  hl=2 l=   8 prim: OBJECT            :des-ede3-cbc
   60:d=4  hl=2 l=   8 prim: OCTET STRING      [HEX DUMP]:78ABEA3810B6879F
   70:d=1  hl=4 l=1224 prim: OCTET STRING      [HEX DUMP]:C0FA1A3D2BCD7A80DD61F4C0287F8B2D...

Use Lapo Luchini’s JavaScript ASN.1 decoder to display a nice ASN.1 tree structure:

Sequence (2 elements)
|- Sequence (2 elements)
|  |- Object identifier: 1.2.840.113549.1.5.13            // using PBES2 from PKCS#5
|  `- Sequence (2 elements)
|     |- Sequence (2 elements)
|     |  |- Object identifier: 1.2.840.113549.1.5.12      // using PBKDF2 -- yay! :)
|     |  `- Sequence (2 elements)
|     |     |- Byte string (8 bytes): 3AEFD2DBFBF9E3B3    // salt
|     |     `- Integer: 2048                              // iteration count
|     `- Sequence (2 elements)
|          Object identifier: 1.2.840.113549.3.7          // encrypted with Triple DES, CBC
|          Byte string (8 bytes): 78ABEA3810B6879F        // initialization vector
`- Byte string (1224 bytes): C0FA1A3D2BCD7A80DD61F4C0287F8B2DAB46A43E...  // encrypted key blob

The format uses OIDs, numeric codes allocated by a registration authority to unambiguously refer to algorithms. The OIDs in this key file tell us that the encryption scheme is pkcs5PBES2, that the key derivation function is PBKDF2, and that the encryption is performed using des-ede3-cbc. The hash function can be explicitly specified if needed; here it’s omitted, which means that it defaults to hMAC-SHA1.

The nice thing about having all those identifiers in the file is that if better algorithms are invented in future, we can upgrade the key file without having to change the container file format.

You can also see that the key derivation function uses an iteration count of 2,048. Compared to just one iteration in the traditional SSH key format, that’s good — it means that it’s much slower to brute-force the passphrase. The number 2,048 is currently hard-coded in OpenSSL; I hope that it will be configurable in future, as you could probably increase it without any noticeable slowdown on a modern computer.

Conclusion: better protection for your SSH private keys

If you already have a strong passphrase on your SSH private key, then converting it from the traditional private key format to PKCS#8 is roughly comparable to adding two extra keystrokes to your passphrase, for free. And if you have a weak passphrase, you can take your private key protection from “easily breakable” to “slightly harder to break”.

It’s so easy, you can do it right now:

$ mv ~/.ssh/id_rsa ~/.ssh/id_rsa.old
$ openssl pkcs8 -topk8 -v2 des3 -in ~/.ssh/id_rsa.old -out ~/.ssh/id_rsa
$ chmod 600 ~/.ssh/id_rsa
# Check that the converted key works; if yes, delete the old one:
$ rm ~/.ssh/id_rsa.old

The openssl pkcs8 command asks for a passphrase three times: once to unlock your existing private key, and twice for the passphrase for the new key. It doesn’t matter whether you use a new passphrase for the converted key or keep it the same as the old key.

Not all software can read the PKCS8 format, but that’s fine — only your SSH client needs to be able to read the private key, after all. From the server’s point of view, storing the private key in a different format changes nothing at all.

Update: Brendan Thompson has wrapped this conversion in a handy shell script called keycrypt.

Update: to undo this change

On Mac OS X 10.9 (Mavericks), the default installation of OpenSSH no longer supports PKCS#8 private keys for some reason. If you followed the instructions above, you may no longer be able to log into your servers. Fortunately, it’s easy to convert your private key from PKCS#8 format back into the traditional key format:

$ mv ~/.ssh/id_rsa ~/.ssh/id_rsa.pkcs8
$ openssl pkcs8 -in ~/.ssh/id_rsa.pkcs8 -out ~/.ssh/id_rsa
$ chmod 600 ~/.ssh/id_rsa
$ ssh-keygen -f ~/.ssh/id_rsa -p

The openssl command decrypts the key, and the ssh-keygen command re-encrypts it using the traditional SSH key format.

Schema evolution in Avro, Protocol Buffers and Thrift

So you have some data that you want to store in a file or send over the network. You may find yourself going through several phases of evolution:

  1. Using your programming language’s built-in serialization, such as Java serialization, Ruby’s marshal, or Python’s pickle. Or maybe you even invent your own format.
  2. Then you realise that being locked into one programming language sucks, so you move to using a widely supported, language-agnostic format like JSON (or XML if you like to party like it’s 1999).
  3. Then you decide that JSON is too verbose and too slow to parse, you’re annoyed that it doesn’t differentiate integers from floating point, and think that you’d quite like binary strings as well as Unicode strings. So you invent some sort of binary format that’s kinda like JSON, but binary (1, 2, 3, 4, 5, 6).
  4. Then you find that people are stuffing all sorts of random fields into their objects, using inconsistent types, and you’d quite like a schema and some documentation, thank you very much. Perhaps you’re also using a statically typed programming language and want to generate model classes from a schema. Also you realize that your binary JSON-lookalike actually isn’t all that compact, because you’re still storing field names over and over again; hey, if you had a schema, you could avoid storing objects’ field names, and you could save some more bytes!

Once you get to the fourth stage, your options are typically Thrift, Protocol Buffers or Avro. All three provide efficient, cross-language serialization of data using a schema, and code generation for the Java folks.

Plenty of comparisons have been written about them already (1, 2, 3, 4). However, many posts overlook a detail that seems mundane at first, but is actually cruicial: What happens if the schema changes?

In real life, data is always in flux. The moment you think you have finalised a schema, someone will come up with a use case that wasn’t anticipated, and wants to “just quickly add a field”. Fortunately Thrift, Protobuf and Avro all support schema evolution: you can change the schema, you can have producers and consumers with different versions of the schema at the same time, and it all continues to work. That is an extremely valuable feature when you’re dealing with a big production system, because it allows you to update different components of the system independently, at different times, without worrying about compatibility.

Which brings us to the topic of today’s post. I would like to explore how Protocol Buffers, Avro and Thrift actually encode data into bytes — and this will also help explain how each of them deals with schema changes. The design choices made by each of the frameworks are interesting, and by comparing them I think you can become a better engineer (by a little bit).

The example I will use is a little object describing a person. In JSON I would write it like this:

    "userName": "Martin",
    "favouriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]

This JSON encoding can be our baseline. If I remove all the whitespace it consumes 82 bytes.

Protocol Buffers

The Protocol Buffers schema for the person object might look something like this:

message Person {
    required string user_name        = 1;
    optional int64  favourite_number = 2;
    repeated string interests        = 3;

When we encode the data above using this schema, it uses 33 bytes, as follows:

Look exactly at how the binary representation is structured, byte by byte. The person record is just the concatentation of its fields. Each field starts with a byte that indicates its tag number (the numbers 1, 2, 3 in the schema above), and the type of the field. If the first byte of a field indicates that the field is a string, it is followed by the number of bytes in the string, and then the UTF-8 encoding of the string. If the first byte indicates that the field is an integer, a variable-length encoding of the number follows. There is no array type, but a tag number can appear multiple times to represent a multi-valued field.

This encoding has consequences for schema evolution:

  • There is no difference in the encoding between optional, required and repeated fields (except for the number of times the tag number can appear). This means that you can change a field from optional to repeated and vice versa (if the parser is expecting an optional field but sees the same tag number multiple times in one record, it discards all but the last value). required has an additional validation check, so if you change it, you risk runtime errors (if the sender of a message thinks that it’s optional, but the recipient thinks that it’s required).
  • An optional field without a value, or a repeated field with zero values, does not appear in the encoded data at all — the field with that tag number is simply absent. Thus, it is safe to remove that kind of field from the schema. However, you must never reuse the tag number for another field in future, because you may still have data stored that uses that tag for the field you deleted.
  • You can add a field to your record, as long as it is given a new tag number. If the Protobuf parser parser sees a tag number that is not defined in its version of the schema, it has no way of knowing what that field is called. But it does roughly know what type it is, because a 3-bit type code is included in the first byte of the field. This means that even though the parser can’t exactly interpret the field, it can figure out how many bytes it needs to skip in order to find the next field in the record.
  • You can rename fields, because field names don’t exist in the binary serialization, but you can never change a tag number.

This approach of using a tag number to represent each field is simple and effective. But as we’ll see in a minute, it’s not the only way of doing things.


Avro schemas can be written in two ways, either in a JSON format:

    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",        "type": "string"},
        {"name": "favouriteNumber", "type": ["null", "long"]},
        {"name": "interests",       "type": {"type": "array", "items": "string"}}

…or in an IDL:

record Person {
    string               userName;
    union { null, long } favouriteNumber;
    array<string>        interests;

Notice that there are no tag numbers in the schema! So how does it work?

Here is the same example data encoded in just 32 bytes:

Strings are just a length prefix followed by UTF-8 bytes, but there’s nothing in the bytestream that tells you that it is a string. It could just as well be a variable-length integer, or something else entirely. The only way you can parse this binary data is by reading it alongside the schema, and the schema tells you what type to expect next. You need to have the exact same version of the schema as the writer of the data used. If you have the wrong schema, the parser will not be able to make head or tail of the binary data.

So how does Avro support schema evolution? Well, although you need to know the exact schema with which the data was written (the writer’s schema), that doesn’t have to be the same as the schema the consumer is expecting (the reader’s schema). You can actually give two different schemas to the Avro parser, and it uses resolution rules to translate data from the writer schema into the reader schema.

This has some interesting consequences for schema evolution:

  • The Avro encoding doesn’t have an indicator to say which field is next; it just encodes one field after another, in the order they appear in the schema. Since there is no way for the parser to know that a field has been skipped, there is no such thing as an optional field in Avro. Instead, if you want to be able to leave out a value, you can use a union type, like union { null, long } above. This is encoded as a byte to tell the parser which of the possible union types to use, followed by the value itself. By making a union with the null type (which is simply encoded as zero bytes) you can make a field optional.
  • Union types are powerful, but you must take care when changing them. If you want to add a type to a union, you first need to update all readers with the new schema, so that they know what to expect. Only once all readers are updated, the writers may start putting this new type in the records they generate.
  • You can reorder fields in a record however you like. Although the fields are encoded in the order they are declared, the parser matches fields in the reader and writer schema by name, which is why no tag numbers are needed in Avro.
  • Because fields are matched by name, changing the name of a field is tricky. You need to first update all readers of the data to use the new field name, while keeping the old name as an alias (since the name matching uses aliases from the reader’s schema). Then you can update the writer’s schema to use the new field name.
  • You can add a field to a record, provided that you also give it a default value (e.g. null if the field’s type is a union with null). The default is necessary so that when a reader using the new schema parses a record written with the old schema (and hence lacking the field), it can fill in the default instead.
  • Conversely, you can remove a field from a record, provided that it previously had a default value. (This is a good reason to give all your fields default values if possible.) This is so that when a reader using the old schema parses a record written with the new schema, it can fall back to the default.

This leaves us with the problem of knowing the exact schema with which a given record was written. The best solution depends on the context in which your data is being used:

  • In Hadoop you typically have large files containing millions of records, all encoded with the same schema. Object container files handle this case: they just include the schema once at the beginning of the file, and the rest of the file can be decoded with that schema.
  • In an RPC context, it’s probably too much overhead to send the schema with every request and response. But if your RPC framework uses long-lived connections, it can negotiate the schema once at the start of the connection, and amortize that overhead over many requests.
  • If you’re storing records in a database one-by-one, you may end up with different schema versions written at different times, and so you have to annotate each record with its schema version. If storing the schema itself is too much overhead, you can use a hash of the schema, or a sequential schema version number. You then need a schema registry where you can look up the exact schema definition for a given version number.

One way of looking at it: in Protocol Buffers, every field in a record is tagged, whereas in Avro, the entire record, file or network connection is tagged with a schema version.

At first glance it may seem that Avro’s approach suffers from greater complexity, because you need to go to the additional effort of distributing schemas. However, I am beginning to think that Avro’s approach also has some distinct advantages:

  • Object container files are wonderfully self-describing: the writer schema embedded in the file contains all the field names and types, and even documentation strings (if the author of the schema bothered to write some). This means you can load these files directly into interactive tools like Pig, and it Just Works™ without any configuration.
  • As Avro schemas are JSON, you can add your own metadata to them, e.g. describing application-level semantics for a field. And as you distribute schemas, that metadata automatically gets distributed too.
  • A schema registry is probably a good thing in any case, serving as documentation and helping you to find and reuse data. And because you simply can’t parse Avro data without the schema, the schema registry is guaranteed to be up-to-date. Of course you can set up a protobuf schema registry too, but since it’s not required for operation, it’ll end up being on a best-effort basis.


Thrift is a much bigger project than Avro or Protocol Buffers, as it’s not just a data serialization library, but also an entire RPC framework. It also has a somewhat different culture: whereas Avro and Protobuf standardize a single binary encoding, Thrift embraces a whole variety of different serialization formats (which it calls “protocols”).

Indeed, Thrift has two different JSON encodings, and no fewer than three different binary encodings. (However, one of the binary encodings, DenseProtocol, is only supported in the C++ implementation; since we’re interested in cross-language serialization, I will focus on the other two.)

All the encodings share the same schema definition, in Thrift IDL:

struct Person {
  1: string       userName,
  2: optional i64 favouriteNumber,
  3: list<string> interests

The BinaryProtocol encoding is very straightforward, but also fairly wasteful (it takes 59 bytes to encode our example record):

The CompactProtocol encoding is semantically equivalent, but uses variable-length integers and bit packing to reduce the size to 34 bytes:

As you can see, Thrift’s approach to schema evolution is the same as Protobuf’s: each field is manually assigned a tag in the IDL, and the tags and field types are stored in the binary encoding, which enables the parser to skip unknown fields. Thrift defines an explicit list type rather than Protobuf’s repeated field approach, but otherwise the two are very similar.

In terms of philosophy, the libraries are very different though. Thrift favours the “one-stop shop” style that gives you an entire integrated RPC framework and many choices (with varying cross-language support), whereas Protocol Buffers and Avro appear to follow much more of a “do one thing and do it well” style.

This post has been translated into Korean by Justin Song.