Skip to content

Upcoming conference talks about Samza

After my talk about Samza fault tolerance at Berlin Buzzwords was well received a few months ago, I submitted several more talk proposals to a variety of conferences. To my surprise, all the proposals were accepted, so I’m now going to have a fairly busy time in the next few months!

Here are the four conferences at which I’ll be speaking between September and November. All the talks are about Apache Samza, the stream processing project I’ve been working on. However, all the talks are different, each focussing on a different aspect and perspective.

If you don’t yet have a ticket for these conferences, there are a few discount codes below. Hope to see you there :-)

Turning the database inside out with Apache Samza
Strange Loop, September 18–19 in St. Louis, Missouri. (Lanyrd, Twitter)

The Strange Loop conference explores the future of software development from a wonderfully eclectic range of viewpoints, ranging from functional programming to distributed systems. In this talk I’ll discuss the potential of stream processing as a fundamental programming model, which has big advantages compared to the way we usually build applications today.

Building real-time data products at LinkedIn with Apache Samza
Strata + Hadoop World, October 15–17 in New York. (Lanyrd, Twitter)
Use discount code SPEAKER20 to get 20% off.

MapReduce and its cousins are powerful tools for building data products such as recommendation engines, detecting anomalies and improving relevance. However, with batch processing there may be several hours delay before new data is reflected in the output. With stream processing, you can potentially respond in seconds rather than hours, but you have to learn a whole new way of thinking in order to write your jobs. In this talk I’ll discuss some real-life examples of stream processing at LinkedIn, and show how to use Samza to solve real-time data problems.

Staying agile in the face of the data deluge
Span conference, October 28 in London, UK. (Lanyrd, Twitter)
Use this link to get a 20% discount.

An often-overlooked but important aspect of tools is their plasticity: if your application’s requirements change, how easily do the tools let you adapt your existing code and data to the new requirements? Samza is designed with plasticity in mind. In this talk I’ll discuss how re-processing of data streams can keep your application development agile.

Scalable stream processing with Apache Samza and Apache Kafka
ApacheCon Europe, November 17–21 in Budapest, Hungary. (Lanyrd, Twitter)

Many of the most important open source data infrastructure tools are projects of the Apache Software Foundatation: Hadoop, Zookeeper, Storm and Spark, to name just a few. In this talk I’ll focus on how Samza and Kafka (also Apache projects) fit into this lively open source ecosystem.

Background reading

If you don’t yet know about Samza, don’t worry: I’ll start each talk with a quick introduction to Samza, and not assume any prior knowledge.

But if you want to ask smart-ass questions and embarrass me in front of the audience, you can begin by reading the Samza documentation (thoroughly updated over the last few months by yours truly), and start thinking of particularly tricky questions to ask.

You may also be interested in this excellent series of articles by Jay Kreps, which are relevant to the upcoming talks:

Six things I wish we had known about scaling

Looking back at the last few years of building Rapportive and LinkedIn Intro, I realised that there were a number of lessons that we had to learn the hard way. We built some reasonably large data systems, and there are a few things I really wish we had known beforehand.

None of these lessons are particularly obscure – they are all well-documented, if you know where to look. They are the kind of things that made me think “I can’t believe I didn’t know that, I’m so stupid #facepalm” in retrospect. But perhaps I’m not the only one who started out not knowing these things, so I’ll write them down for the benefit of anyone else who finds themself having to scale a system.

The kind of system I’m talking about is the data backend of a consumer web/mobile app with a million users (order of magnitude). At the scale of Google, LinkedIn, Facebook or Twitter (hundreds of millions of users), you’ll have an entirely different set of problems, but you’ll also have a bigger team of experienced developers and operations people around you. The mid-range scale of about a million users is interesting, because it’s quite feasible for a small startup team to get there with some luck and good marketing skills. If that sounds like you, here are a few things to keep in mind.

1. Realistic load testing is hard

Improving the performance of a system is ideally a very scientific process. You have in your head a model of what your system is doing, and a theory of where the expensive operations are. You propose a change to the system, and predict what the outcome will be. Then you make the change, observe the system’s behaviour under laboratory conditions, and thus gather evidence which either confirms or contradicts your theory. That way you iterate your way to a better theory, and also a better-performing implementation.

Sadly, we hardly ever managed to do it that way in practice. If we were optimising a microbenchmark, running the same code a million times in a tight loop, it would be easy. But we are dealing with large volumes of data, spread out across multiple machines. If you read the same item a million times in a loop, it will simply be cached, and the load test tells you nothing. If you want meaningful results, the load test needs to simulate a realistically large working set, a realistic mixture of reads and writes, realistic distribution of requests over time, and so on. And that is difficult.

It’s difficult enough to simply know what your access patterns actually are, let alone simulate them. As a starting point, you can replay a few hours worth of access logs against a copy of your real dataset. However, that only really works for read requests. Simulating writes is harder, as you may need to account for business logic rules (e.g. a sequential workflow must first update A, then update B, then update C) and deal with changes that can happen only once (if your write changes state from D to E, you can’t change from D to E again later in the test, as you’re already in state E). That means you have to synchronise your access logs with your database snapshot, or somehow generate suitable synthetic write load.

Even harder if you want to test with a dataset that is larger than the one you actually have (so that you can find out what happens when you double your userbase, and prepare for that event). Now you have to work out the statistical properties of your dataset (the distribution of friends per user is a power law with x parameters, the correlation between one user’s number of friends and the number of friends that their friends have is y, etc) and generate a synthetic dataset with those parameters. You are now in deep, deep yak shaving territory. Step back from that yak.

In practice, it hardly ever works that way. We’re lucky if, sometimes, we can run the old code and the new code side-by-side, and observe how they perform in comparison. Often, not even that is possible. Usually we often just cross our fingers, deploy, and roll back if the change seems to have made things worse. That is deeply unsatisfying for a scientifically-minded person, but it more or less gets the job done.

2. Data evolution is difficult

Being able to rapidly respond to change is one of the biggest advantages of a small startup. Agility in product and process means you also need the freedom to change your mind about the structure of your code and your data. There is lot of talk about making code easy to change, eg. with good automated tests. But what about changing the structure of your data?

Schema changes have a reputation of being very painful, a reputation that is chiefly MySQL’s fault: simply adding a column to a table requires the entire table to be copied. On a large table, that might mean several hours during which you can’t write to the table. Various tools exist to make that less painful, but I find it unbelievable that the world’s most popular open source database handles such a common operation so badly.

Postgres can make simple schema changes without copying the table, which means they are almost instant. And of course the avoidance of schema changes is a primary selling point of document databases such as MongoDB (so it’s up to application code to deal with a database that uses different schemas for different documents). But simple schema changes, such as adding a new field or two, don’t tell the entire story.

Not all your data is in databases; some might be in archived log files or some kind of blob storage. How do you deal with changing the schema of that data? And sometimes you need to make complex changes to the data, such as breaking a large thing apart, or combining several small things, or migrating from one datastore to another. Standard tools don’t help much here, and document databases don’t make it any easier.

We’ve written large migration jobs that break the entire dataset into chunks, process chunks gradually over the course of a weekend, retry failed chunks, track which things were modified while the migration was happening, and finally catch up on the missed updates. A whole lot of complexity just for a one-off data migration. Sometimes that’s unavoidable, but it’s heavy lifting that you’d rather not have to do in the first place.

Hadoop data pipelines can help with this sort of thing, but now you have to set up a Hadoop cluster, learn how to use it, figure out how to get your data into it, and figure out how to get the transformed data out to your live systems again. Big companies like LinkedIn have figured out how to do that, but in a small team it can be a massive time-sink.

3. Database connections are a real limitation

In PostgreSQL, each client connection to the database is handled by a separate unix process; in MySQL, each connection uses a separate thread. Both of these models impose a fairly low limit on the number of connections you can have to the database – typically a few hundred. Every connection adds overhead, so the entire database slows down, even if those connections aren’t actively processing queries. For example, Heroku Postgres limits you to 60 connections on the smallest plan, and 500 connections on the largest plan, although having anywhere near 500 connections is actively discouraged.

In a fast-growing app, it doesn’t take long before you reach a few hundred connections. Each instance of your application server uses at least one. Each background worker process that needs to access the database uses one. Adding more machines running your application is fairly easy if they are stateless, but every machine you add means more connections.

Partitioning (sharding) and read replicas probably won’t help you with your connection limit, unless you can somehow load-balance requests so that all the requests for a particular partition are handled by a particular server instance. A better bet is to use a connection pooler, or to write your own data access layer which wraps database access behind an internal API.

That’s all doable, but it doesn’t seem a particularly valuable use of your time when you’re also trying to iterate on product features. And every additional service you deploy is another thing that can go wrong, another thing that needs to be monitored and maintained.

(Databases that use a lightweight connection model don’t have this problem, but they may have other problems instead.)

4. Read replicas are an operational pain

A common architecture is to designate one database instance as a leader (also known as master) and to send all database writes to that instance. The writes are then replicated to other database instances (called read replicas, followers or slaves), and many read-only queries can be served from the replicas, which takes load off the leader. This architecture is also good for fault tolerance, since it gives you a warm standby – if your leader dies, you can quickly promote one of the replicas to be the new leader (you wouldn’t want to be offline for hours while you restore the database from a backup).

What they don’t tell you is that setting up and maintaining replicas is significant operational pain. MySQL is particularly bad in this regard: in order to set up a new replica, you have to first lock the leader to stop all writes and take a consistent snapshot (which may take hours on a large database). How does your app cope if it can’t write to the database? What do your users think if they can’t post stuff?

With Postgres, you don’t need to stop writes to set up a replica, but it’s still some hassle. One of the things I like most about Heroku Postgres is that it wraps all the complexity of replication and WAL archiving behind a straightforward command-line tool.

Even so, you still need to failover manually if your leader fails. You need to monitor and maintain the replicas. Your database library may not support read replicas out of the box, so you may need to add that. Some reads need to be made on the leader, so that a user sees their own writes, even if there is replication lag. That’s all doable, but it’s additional complexity, and doesn’t add any value from users’ point of view.

Some distributed datastores such as MongoDB, RethinkDB and Couchbase also use this replication model, and they automate the replica creation and master failover processes. Just because they do that doesn’t mean they automatically give you magic scaling sauce, but it is a very valuable feature.

5. Think about memory efficiency

At various times, we puzzled about weird latency spikes in our database activity. After many PagerDuty alerts and troubleshooting, it usually turned out that we could fix the issue by throwing more RAM at the problem, either in the form of a bigger database instance, or separate caches in front of it. It’s sad, but true: many performance problems can be solved by simply buying more RAM. And if you’re in a hurry because your hair is on fire, it’s often the best thing to do. There are limitations to that approach, of course – a m2.4xlarge instance on EC2 costs quite a bit of money, and eventually there are no bigger machines to turn to.

Besides buying more RAM, an effective solution is to use RAM more efficiently in the first place, so that a bigger part of your dataset fits in RAM. In order to decide where to optimise, you need to know what all your memory is being used for – and that’s surprisingly non-trivial. With a bit of digging, you can usually get your database to report how much disk space each of your tables and indexes is taking. Figuring out the working set, and how much memory is actually used for what, is harder.

As a rule of thumb, your performance will probably be more predictable if your indexes completely fit in RAM – so that there’s a maximum of one disk read per query, which reduces your exposure to fluctuations in I/O latency. But indexes can get rather large if you have a lot of data, so this can be an expensive proposition.

At one point we found ourselves reading up about the internal structure of an index in Postgres, and realised that we could save a few bytes per row by indexing on the hash of a string column rather than the string itself. (More on that in another post.) That reduced the memory pressure on the system, and helped keep things ticking along for another few months. That’s just one example of how it can be helpful to think about using memory efficiently.

6. Change capture is under-appreciated

So far I’ve only talked about things that suck – sorry about the negativity. As final point, I’d like to mention a technique which is awesome, but not nearly as widely known and appreciated as it should be: change capture.

The idea of change capture is simple: let the application consume a feed of all writes to the database. In other words, you have a background process which gets notified every time something changes in the database (insert, update or delete).

You could achieve a similar thing if, every time you write something to the database, you also post it to a message queue. However, change capture is better because it contains exactly the same data as what was committed to the database (avoiding race conditions). A good change capture system also allows you to stream through the entire existing dataset, and then seamlessly switch to consuming real-time updates when it has caught up.

Consumers of this changelog are decoupled from the app that generates the writes, which gives you great freedom to experiment without fear of bringing down the main site. You can use the changelog for updating and invalidating caches, for maintaining full-text indexes, for calculating analytics, for sending out emails and push notifications, for importing the data into Hadoop, and much more.

LinkedIn built a technology called Databus to do this. The open source release of Databus is for Oracle DB, and there is a proof-of-concept MySQL version (which is different from the version of Databus for MySQL that LinkedIn uses in production).

The new project I am working on, Apache Samza, also sits squarely in this space – it is a framework for processing real-time data feeds, somewhat like MapReduce for streams. I am excited about it because I think this pattern of processing change capture streams can help many people build apps that scale better, are easier to maintain and more reliable than many apps today. It’s open source, and you should go and try it out.

In conclusion

The problems discussed in this post are primarily data systems problems. That’s no coincidence: if you write your applications in a stateless way, they are pretty easy to scale, since you can just run more copies of them. Thus, whether you use Rails or Express.js or whatever framework du jour really doesn’t matter much. The hard part is scaling the stateful parts of your system: your databases.

There are no easy solutions for these problems. Some new technologies and services can help – for example, the new generation of distributed datastores tries to solve some of the above problems (especially around automating replication and failover), but they have other limitations. There certainly is no panacea.

Personally I’m totally fine with using new and experimental tools for derived data, such as caches and analytics, where data loss is annoying but not end of your business. I’m more cautious with the system of record (also known as source of truth). Every system has operational quirks, and the devil you know may let you sleep better at night than the one you don’t. I don’t really mind what that devil is in your particular case.

I’m interested to see whether database-as-a-service offerings such as Firebase, Orchestrate or Fauna can help (I’ve not used any of them seriously, so I can’t vouch for them at this point). I see big potential advantages for small teams in outsourcing operations, but also a big potential risk in locking yourself to a system that you couldn’t choose to host yourself if necessary.

Building scalable systems is not all sexy roflscale fun. It’s a lot of plumbing and yak shaving. A lot of hacking together tools that really ought to exist already, but all the open source solutions out there are too bad (and yours ends up bad too, but at least it solves your particular problem).

On the other hand, consider yourself lucky. If you’ve got scaling problems, you must be doing something right – you must be making something that people want.

LinkedIn Intro: Doing the Impossible on iOS

This is a copy of a post I originally wrote on the LinkedIn engineering blog.

We recently launched LinkedIn Intro — a new product that shows you LinkedIn profiles, right inside the native iPhone mail client. That’s right: we have extended Apple’s built-in iOS Mail app, a feat that many people consider to be impossible. This post is a short summary of how Intro works, and some of the ways we bent technology to our will.

With Intro, you can see at a glance the picture of the person who’s emailing you, learn more about their background, and connect with them on LinkedIn. This is what it looks like:

The iPhone mail app, before and after Intro
The iPhone mail app, before and after Intro

How Intro Came to Be

The origins of Intro go back to before the acquisition of Rapportive by LinkedIn. At Rapportive, we had built a browser extension that modified Gmail to show the profile of an email’s sender within the Gmail page. The product was popular, but people kept asking: “I love Rapportive in Gmail, when can I have it on mobile too?”

The magic of Rapportive is that you don’t have to remember to use it. Once you have it installed, it is right there inside your email, showing you everything you need to know about your contacts. You don’t need to fire up a new app or do a search in another browser tab, because the information is right there when you need it. It just feels natural.

At LinkedIn, we want to work wherever our members work. And we know that professionals spend a lot of time on their phone, checking and replying to emails — so we had to figure out how to enhance mobile email, giving professionals the information they need to be brilliant with people.

But how do we do that? Ask any iOS engineer: there is no API for extending the built-in mail app on the iPhone. If you wanted to build something like Rapportive, most people would tell you that it is impossible. Yet we figured it out.

Impossible #1: Extending the iOS Mail Client

Our key insight was this: we cannot extend the mail client, but we can add information to the messages themselves. One way to do this would be to modify the messages on the server — but then the modification would appear on all your clients, both desktop and mobile. That would not be what users want.

Instead, we can add information to messages by using a proxy server.

Rewriting messages using an IMAP proxy
Rewriting messages using an IMAP proxy

Normally your device connects directly to the servers of your email provider (Gmail, Yahoo, AOL, etc.), but we can configure the device to connect to the Intro proxy server instead.

The Intro proxy server speaks the IMAP protocol just like an email provider, but it doesn’t store messages itself. Instead, it forwards requests from the device to your email provider, and forwards responses from the email provider back to the device. En route, it inserts Intro information at the beginning of each message body — we call this the top bar.

The great thing about this approach: the proxy server can tailor the top bar to the device, since it knows which device is downloading the message. It can adapt the layout to be appropriate to the screen size, and it can take advantage of the client’s latest features, because it doesn’t need to worry about compatibility with other devices.

Our proxy server is written in Ruby using EventMachine, which allows it to efficiently handle many concurrent IMAP connections. We have developed some libraries to make the evented programming model nicer to work with, including Deferrable Gratification and LSpace.

Impossible #2: Interactive UI in Email

Ok, we have a way of adding information about the sender to a message — but so far it’s just a static piece of HTML. The top bar is deliberately minimal, because we don’t want it to get in the way. But wouldn’t it be awesome if you could tap the top bar and see the full LinkedIn profile… without leaving the mail app?

“But that’s impossible,” they cry, “you can’t run JavaScript in the mail client!” And that’s true — any JavaScript in an email is simply ignored. But iOS Mail does have powerful CSS capabilities, since it uses the same rendering engine as Safari.

Recall that CSS has a :hover state that is triggered when you hover the mouse over an element. This is used for popup menus in the navigation of many websites, or for tooltips. But what do you do on a touchscreen device, where there is no hovering or clicking, only tapping?

A little-known fact about CSS on Mobile Safari: in certain circumstances, tapping a link once simulates a :hover state on that link, and tapping it twice has the effect of a click. Thanks to this feature, popup menus and tooltips still work on iOS.

With some creativity, we figured out how to use this effect to create an interactive user interface within a message! Just tap the top bar to see the full LinkedIn profile:

With CSS tricks we can embed an entire LinkedIn profile in a message
With CSS tricks we can embed an entire LinkedIn profile in a message

Impossible #3: Dynamic Content in Email

This :hover trick allows us to have some interactivity within a message, but for more complex interactions we have to take you to the browser (where we can run a normal web app, without the mail app’s limitations). For example, if you want to connect with your contact on LinkedIn, we take you to Safari.

That’s fine, but it leaves us with a problem: the top bar needs to show if you’re already connected with someone. Say you send an invitation, and the other person accepts — now you’re connected, but if you open the same email again, it still says that you’re not connected!

This is because once a message has been downloaded, an IMAP client may assume that the message will never change. It is cached on the device, and unlike a web page, it never gets refreshed. Now that you’re connected, the top bar content needs to change. How do we update it?

Our solution: the connect button is in a tiny <iframe> which is refreshed every time you open the message. And if you open the message while your device is offline? No problem: the iframe is positioned on top of an identical-looking button in the static top bar HTML. If the iframe fails to load, it simply falls back to the connection status at the time when the message was downloaded.

This allows the top bar to contain dynamic content, even though it’s impossible for the server to modify a message once it has been downloaded by the device.

Using an embedded iframe to keep the connection status up-to-date, within an otherwise static top bar
Using an embedded iframe to keep the connection status up-to-date, within an otherwise static top bar

Impossible #4: Easy Installation

Once we got the IMAP proxy working, we were faced with another problem: how do we configure a device to use the proxy? We cannot expect users to manually enter IMAP and SMTP hostnames, choose the correct TLS settings, etc — it’s too tedious and error-prone.

Fortunately, Apple provides a friendly way of setting up email accounts by using configuration profiles — a facility that is often used in enterprise deployments of iOS devices. Using this technique, we can simply ask the user for their email address and password, autodiscover the email provider settings, and send a configuration profile to the device. The user just needs to tap “ok” a few times, and then they have a new mail account.

Moreover, for Gmail and Google Apps accounts, we can use OAuth, and never need to ask for the user’s password. Even better!

iOS configuration profiles make setup of new email accounts a breeze
iOS configuration profiles make setup of new email accounts a breeze

Security and Privacy

We understand that operating an email proxy server carries great responsibility. We respect the fact that your email may contain very personal or sensitive information, and we will do everything we can to make sure that it is safe. Our principles and key security measures are detailed in our pledge of privacy.


When we first built Rapportive for Gmail, people thought that we were crazy — writing a browser extension that modified the Gmail page on the fly, effectively writing an application inside someone else’s application! But it turned out to be a great success, and many others have since followed our footsteps and written browser extensions for Gmail.

Similarly, Intro’s approach of proxying IMAP is a novel way of delivering software to users. It operates at the limit of what is technically possible, but it has a big advantage: we can enhance the apps you already use. Of course the idea isn’t limited to the iPhone, so watch out for new platforms coming your way soon :)

This post has only scratched the surface of the interesting challenges we have overcome while building Intro. In follow-up posts we will talk about some of our CSS techniques, testing and monitoring tools, things we do to achieve high performance and high reliability, and more. In the meantime, check out Intro and let us know what you think!

System operations over seven centuries

On a walk in the Alps last week we came across a wonderful piece of engineering, more successful than most software systems could claim to be. It is the system of Waale, an ancient irrigation system in the Vinschgau, South Tyrol.

The climate in the Vinschgau is sunny, dry and windy. Without irrigation, agriculture would barely be possible, but if water from mountain streams is channelled to the fields, apple trees and meadows can flourish. The area has been inhabited at least since the Bronze Age, and it is likely that artificial irrigation started early. The oldest documents on the Waal system date from the 12th century, and some Waale built in the 14th century are still in use today.

The pictures in this post show the Leitenwaal and the Berkwaal near the village of Schluderns in South Tyrol, northern Italy. These two conduits carry water from a mountain stream (the Saldurbach) to the fields and meadows around Schluderns. Along their combined length of about six kilometers, they overcome many obstacles: twisting along the face of steep mountainsides, crossing aqueducts over deep ravines, tunnelling underneath boulders, before they finally arrive at the fields they supply.

Some sections look almost like a natural stream – except that they flow across the mountainside, not down, because they are designed to cover the greatest possible distance with the smallest possible loss in altitude. Other sections are more obviously artificial, where the furrow has been lined with flat stones or blanks of wood.

This system was originally built almost 700 years ago, using the technology available at the time: spade, axe, hammer and chisel. Of course, nowadays, electric pumps can take water from the river at the valley floor, and sprinkle it on the fields on the slopes above. But for many centuries, the only feasible option was to take water from a stream at high altitude, and let it flow down from there.

Here a feed of water is taken from a stream, and carried along a wooden gulley: the input to the irrigation system. Along the way, gates regulate the flow of water in the direction of various farms. For centuries, the details of water distribution – how much water shall be directed towards which farm at which time – have been governed by detailed agreements, and led to many disputes between farmers.

If the system were to fail for too long, crops would wither, so it was important that the system was always well-maintained and operational. And of course, parts of the system would fail from time to time – erosion, landslides, decay, accidents or any number of other faults could occur. When a part of the system broke, it was replaced using whatever technology was available at the time.

Thus, the system is now a patchwork of different water-carrying technologies from different ages. The oldest “pipes” were made from hollowed-out tree trunks, and some of them are still in use (water flows through tree trunks across a ravine in the left picture below). Later replacements have been made with concrete, steel or plastic pipes – whatever is believed to be the most reliable solution in the long term.

Perhaps the most impressive aspect of this system are its operability features, i.e. the things that help the operator of the Waal in his job of keeping the system running smoothly. For example, at regular intervals, the water flows through gratings which filter out twigs or other objects before they can cause blockages in pipes. The gratings are cleaned regularly, and tools for clearing out pipes are kept near the Waal. Routine inspections help detect problems early, before they escalate and cause further damage.

After heavy rainfall or melting of snow, the influx of water may exceed the Waal’s capacity. This is problematic: if the Waal bursts its banks, those banks would be damaged by erosion or washed away, making the problem much worse. Thus, the system includes overflow points at which water is channelled back into the natural stream if the Waal is over capacity (left photo below).

There is even an ingenious monitoring system (right photo below). A waterwheel is placed in the stream, and a cowbell is attached so that it rings on each rotation of the wheel (video). Thus, the operator can tell the rate of water flow from a distance, simply by listening for the rhythm of the bell.

The Waaler, the operator in charge of maintenance of the Waal, is an important and highly-regarded member of the local community. Traditionally, this role is elected every year on the first Sunday of Lent. The operator can be re-elected by the community if they were satisfied with his work in the previous year.

Looking at the lessons from this ancient irrigation system, and adapting them to software systems, my take-aways are:

  • Good interface design can survive through multiple generations of technology. A stream of water, flowing downhill, is a simple interface that can be implemented in stone-lined furrows, hollowed-out tree trunks, concrete, steel and plastic pipes, and more.
  • When replacing obsolete technology with new technology, some work is required to join them up – two pieces of standardised plastic piping may fit snugly, but you can’t expect the same from a hollow tree trunk interfacing with a plastic pipe.
  • New technology is not necessarily better than old technology. Hollow tree trunks are still used to feed water into 21st-century sprinkler irrigation systems.
  • API rate limits are not a new thing.
  • Continuously monitor the health of your system, and detect problems early.
  • Operations doesn’t just happen; it has to be someone’s job.
  • If a system solves an important problem, is well-engineered and well-operated, it can stick around for a very, very long time.

Improving the security of your SSH private key files

Ever wondered how those key files in ~/.ssh actually work? How secure are they actually?

As you probably do too, I use ssh many times every single day — every git fetch and git push, every deploy, every login to a server. And recently I realised that to me, ssh was just some crypto voodoo that I had become accustomed to using, but I didn’t really understand. That’s a shame — I like to know how stuff works. So I went on a little journey of discovery, and here are some of the things I found.

When you start reading about “crypto stuff”, you very quickly get buried in an avalanche of acronyms. I will briefly mention the acronyms as we go along; they don’t help you understand the concepts, but they are useful in case you want to Google for further details.

Quick recap: If you’ve ever used public key authentication, you probably have a file ~/.ssh/id_rsa or ~/.ssh/id_dsa in your home directory. This is your RSA/DSA private key, and ~/.ssh/ or ~/.ssh/ is its public key counterpart. Any machine you want to log in to needs to have your public key in ~/.ssh/authorized_keys on that machine. When you try to log in, your SSH client uses a digital signature to prove that you have the private key; the server checks that the signature is valid, and that the public key is authorized for your username; if all is well, you are granted access.

So what is actually inside this private key file?

The unencrypted private key format

Everyone recommends that you protect your private key with a passphrase (otherwise anybody who steals the file from you can log into everything you have access to). If you leave the passphrase blank, the key is not encrypted. Let’s look at this unencrypted format first, and consider passphrase protection later.

A ssh private key file typically looks something like this:

   ... etc ... lots of base64 blah blah ...

The private key is an ASN.1 data structure, serialized to a byte string using DER, and then Base64-encoded. ASN.1 is roughly comparable to JSON (it supports various data types such as integers, booleans, strings and lists/sequences that can be nested in a tree structure). It’s very widely used for cryptographic purposes, but it has somehow fallen out of fashion with the web generation (I don’t know why, it seems like a pretty decent format).

To look inside, let’s generate a fake RSA key without passphrase using ssh-keygen, and then decode it using asn1parse:

$ ssh-keygen -t rsa -N '' -f test_rsa_key
$ openssl asn1parse -in test_rsa_key
    0:d=0  hl=4 l=1189 cons: SEQUENCE
    4:d=1  hl=2 l=   1 prim: INTEGER           :00
    7:d=1  hl=4 l= 257 prim: INTEGER           :C36EB2429D429C7768AD9D879F98C...
  268:d=1  hl=2 l=   3 prim: INTEGER           :010001
  273:d=1  hl=4 l= 257 prim: INTEGER           :A27759F60AEA1F4D1D56878901E27...
  534:d=1  hl=3 l= 129 prim: INTEGER           :F9D23EF31A387694F03AD0D050265...
  666:d=1  hl=3 l= 129 prim: INTEGER           :C84415C26A468934F1037F99B6D14...
  798:d=1  hl=3 l= 129 prim: INTEGER           :D0ACED4635B5CA5FB896F88BB9177...
  930:d=1  hl=3 l= 128 prim: INTEGER           :511810DF9AFD590E11126397310A6...
 1061:d=1  hl=3 l= 129 prim: INTEGER           :E3A296AE14E7CAF32F7E493FDF474...

Alternatively, you can paste the Base64 string into Lapo Luchini’s excellent JavaScript ASN.1 decoder. You can see that ASN.1 structure is quite simple: a sequence of nine integers. Their meaning is defined in RFC2313. The first integer is a version number (0), and the third number is quite small (65537) – the public exponent e. The two important numbers are the 2048-bit integers that appear second and fourth in the sequence: the RSA modulus n, and the private exponent d. These numbers are used directly in the RSA algorithm. The remaining five numbers can be derived from n and d, and are only cached in the key file to speed up certain operations.

DSA keys are similar, a sequence of six integers:

$ ssh-keygen -t dsa -N '' -f test_dsa_key
$ openssl asn1parse -in test_dsa_key
    0:d=0  hl=4 l= 444 cons: SEQUENCE
    4:d=1  hl=2 l=   1 prim: INTEGER           :00
    7:d=1  hl=3 l= 129 prim: INTEGER           :E497DFBFB5610906D18BCFB4C3CCD...
  139:d=1  hl=2 l=  21 prim: INTEGER           :CF2478A96A941FB440C38A86F22CF...
  162:d=1  hl=3 l= 129 prim: INTEGER           :83218C0CA49BA8F11BE40EE1A7C72...
  294:d=1  hl=3 l= 128 prim: INTEGER           :16953EA4012988E914B466B9C37CB...
  425:d=1  hl=2 l=  21 prim: INTEGER           :89A356E922688EDEB1D388258C825...

Passphrase-protected keys

Next, in order to make life harder for an attacker who manages to steal your private key file, you protect it with a passphrase. How does this actually work?

$ ssh-keygen -t rsa -N 'super secret passphrase' -f test_rsa_key
$ cat test_rsa_key
Proc-Type: 4,ENCRYPTED
DEK-Info: AES-128-CBC,D54228DB5838E32589695E83A22595C7

    ... etc ...

We’ve gained two header lines, and if you try to parse that Base64 text, you’ll find it’s no longer valid ASN.1. That’s because the entire ASN.1 structure we saw above has been encrypted, and the Base64-encoded text is the output of the encryption. The header tells us the encryption algorithm that was used: AES-128 in CBC mode. The 128-bit hex string in the DEK-Info header is the initialization vector (IV) for the cipher. This is pretty standard stuff; all common crypto libraries can handle it.

But how do you get from the passphrase to the AES encryption key? I couldn’t find it documented anywhere, so I had to dig through the OpenSSL source to find it:

  1. Append the first 8 bytes of the IV to the passphrase, without a separator (serves as a salt).
  2. Take the MD5 hash of the resulting string (once).

That’s it. To prove it, let’s decrypt the private key manually (using the IV/salt from the DEK-Info header above):

$ tail -n +4 test_rsa_key | grep -v 'END ' | base64 -D |    # get just the binary blob
  openssl aes-128-cbc -d -iv D54228DB5838E32589695E83A22595C7 -K $(
    ruby -rdigest/md5 -e 'puts Digest::MD5.hexdigest(["super secret passphrase",0xD5,0x42,0x28,0xDB,0x58,0x38,0xE3,0x25].pack("a*cccccccc"))'
  ) |
  openssl asn1parse -inform DER

…which prints out the sequence of integers from the RSA key in the clear. Of course, if you want to inspect the key, it’s much easier to do this:

$ openssl rsa -text -in test_rsa_key -passin 'pass:super secret passphrase'

but I wanted to demonstrate exactly how the AES key is derived from the password. This is important because the private key protection has two weaknesses:

  • The digest algorithm is hard-coded to be MD5, which means that without changing the format, it’s not possible to upgrade to another hash function (e.g. SHA-1). This could be a problem if MD5 turns out not to be good enough.
  • The hash function is only applied once — there is no stretching. This is a problem because MD5 and AES are both fast to compute, and thus a short passphrase is quite easy to break with brute force.

If your private SSH key ever gets into the wrong hands, e.g. because someone steals your laptop or your backup hard drive, the attacker can try a huge number of possible passphrases, even with moderate computing resources. If your passphrase is a dictionary word, it can probably be broken in a matter of seconds.

That was the bad news: the passphrase on your SSH key isn’t as useful as you thought it was. But there is good news: you can upgrade to a more secure private key format, and everything continues to work!

Better key protection with PKCS#8

What we want is to derive a symmetric encryption key from the passphrase, and we want this derivation to be slow to compute, so that an attacker needs to buy more computing time if they want to brute-force the passphrase. If you’ve seen the use bcrypt meme, this should sound very familiar.

For SSH private keys, there are a few standards with clumsy names (acronym alert!) that can help us out:

  • PKCS #5 (RFC 2898) defines PBKDF2 (Password-Based Key Derivation Function 2), an algorithm for deriving an encryption key from a password by applying a hash function repeatedly. PBES2 (Password-Based Encryption Scheme 2) is also defined here; it simply means using a PBKDF2-generated key with a symmetric cipher.
  • PKCS #8 (RFC 5208) defines a format for storing encrypted private keys that supports PBKDF2. OpenSSL transparently supports private keys in PKCS#8 format, and OpenSSH uses OpenSSL, so if you’re using OpenSSH that means you can swap your traditional SSH key files for PKCS#8 files and everything continues to work as normal!

I don’t know why ssh-keygen still generates keys in SSH’s traditional format, even though a better format has been available for years. Compatibility with servers is not a concern, because the private key never leaves your machine. Fortunately it’s easy enough to convert to PKCS#8:

$ mv test_rsa_key test_rsa_key.old
$ openssl pkcs8 -topk8 -v2 des3 \
    -in test_rsa_key.old -passin 'pass:super secret passphrase' \
    -out test_rsa_key -passout 'pass:super secret passphrase'

If you try using this new PKCS#8 file with a SSH client, you should find that it works exactly the same as the file generated by ssh-keygen. But what’s inside it?

$ cat test_rsa_key
    ... etc ...

Notice that the header/footer lines have changed (BEGIN ENCRYPTED PRIVATE KEY instead of BEGIN RSA PRIVATE KEY), and the plaintext Proc-Type and DEK-Info headers have gone. In fact, the whole key file is once again a ASN.1 structure:

$ openssl asn1parse -in test_rsa_key
    0:d=0  hl=4 l=1294 cons: SEQUENCE
    4:d=1  hl=2 l=  64 cons: SEQUENCE
    6:d=2  hl=2 l=   9 prim: OBJECT            :PBES2
   17:d=2  hl=2 l=  51 cons: SEQUENCE
   19:d=3  hl=2 l=  27 cons: SEQUENCE
   21:d=4  hl=2 l=   9 prim: OBJECT            :PBKDF2
   32:d=4  hl=2 l=  14 cons: SEQUENCE
   34:d=5  hl=2 l=   8 prim: OCTET STRING      [HEX DUMP]:3AEFD2DBFBF9E3B3
   44:d=5  hl=2 l=   2 prim: INTEGER           :0800
   48:d=3  hl=2 l=  20 cons: SEQUENCE
   50:d=4  hl=2 l=   8 prim: OBJECT            :des-ede3-cbc
   60:d=4  hl=2 l=   8 prim: OCTET STRING      [HEX DUMP]:78ABEA3810B6879F
   70:d=1  hl=4 l=1224 prim: OCTET STRING      [HEX DUMP]:C0FA1A3D2BCD7A80DD61F4C0287F8B2D...

Use Lapo Luchini’s JavaScript ASN.1 decoder to display a nice ASN.1 tree structure:

Sequence (2 elements)
|- Sequence (2 elements)
|  |- Object identifier: 1.2.840.113549.1.5.13            // using PBES2 from PKCS#5
|  `- Sequence (2 elements)
|     |- Sequence (2 elements)
|     |  |- Object identifier: 1.2.840.113549.1.5.12      // using PBKDF2 -- yay! :)
|     |  `- Sequence (2 elements)
|     |     |- Byte string (8 bytes): 3AEFD2DBFBF9E3B3    // salt
|     |     `- Integer: 2048                              // iteration count
|     `- Sequence (2 elements)
|          Object identifier: 1.2.840.113549.3.7          // encrypted with Triple DES, CBC
|          Byte string (8 bytes): 78ABEA3810B6879F        // initialization vector
`- Byte string (1224 bytes): C0FA1A3D2BCD7A80DD61F4C0287F8B2DAB46A43E...  // encrypted key blob

The format uses OIDs, numeric codes allocated by a registration authority to unambiguously refer to algorithms. The OIDs in this key file tell us that the encryption scheme is pkcs5PBES2, that the key derivation function is PBKDF2, and that the encryption is performed using des-ede3-cbc. The hash function can be explicitly specified if needed; here it’s omitted, which means that it defaults to hMAC-SHA1.

The nice thing about having all those identifiers in the file is that if better algorithms are invented in future, we can upgrade the key file without having to change the container file format.

You can also see that the key derivation function uses an iteration count of 2,048. Compared to just one iteration in the traditional SSH key format, that’s good — it means that it’s much slower to brute-force the passphrase. The number 2,048 is currently hard-coded in OpenSSL; I hope that it will be configurable in future, as you could probably increase it without any noticeable slowdown on a modern computer.

Conclusion: better protection for your SSH private keys

If you already have a strong passphrase on your SSH private key, then converting it from the traditional private key format to PKCS#8 is roughly comparable to adding two extra keystrokes to your passphrase, for free. And if you have a weak passphrase, you can take your private key protection from “easily breakable” to “slightly harder to break”.

It’s so easy, you can do it right now:

$ mv ~/.ssh/id_rsa ~/.ssh/id_rsa.old
$ openssl pkcs8 -topk8 -v2 des3 -in ~/.ssh/id_rsa.old -out ~/.ssh/id_rsa
$ chmod 600 ~/.ssh/id_rsa
# Check that the converted key works; if yes, delete the old one:
$ rm ~/.ssh/id_rsa.old

The openssl pkcs8 command asks for a passphrase three times: once to unlock your existing private key, and twice for the passphrase for the new key. It doesn’t matter whether you use a new passphrase for the converted key or keep it the same as the old key.

Not all software can read the PKCS8 format, but that’s fine — only your SSH client needs to be able to read the private key, after all. From the server’s point of view, storing the private key in a different format changes nothing at all.

Update: Brendan Thompson has wrapped this conversion in a handy shell script called keycrypt.

Update: to undo this change

On Mac OS X 10.9 (Mavericks), the default installation of OpenSSH no longer supports PKCS#8 private keys for some reason. If you followed the instructions above, you may no longer be able to log into your servers. Fortunately, it’s easy to convert your private key from PKCS#8 format back into the traditional key format:

$ mv ~/.ssh/id_rsa ~/.ssh/id_rsa.pkcs8
$ openssl pkcs8 -in ~/.ssh/id_rsa.pkcs8 -out ~/.ssh/id_rsa
$ chmod 600 ~/.ssh/id_rsa
$ ssh-keygen -f ~/.ssh/id_rsa -p

The openssl command decrypts the key, and the ssh-keygen command re-encrypts it using the traditional SSH key format.