My book, Designing Data-Intensive Applications, was published by O’Reilly in March 2017.
Published by Martin Kleppmann on 26 Mar 2014.
Looking back at the last few years of building Rapportive and LinkedIn Intro, I realised that there were a number of lessons that we had to learn the hard way. We built some reasonably large data systems, and there are a few things I really wish we had known beforehand.
None of these lessons are particularly obscure – they are all well-documented, if you know where to look. They are the kind of things that made me think “I can’t believe I didn’t know that, I’m so stupid #facepalm” in retrospect. But perhaps I’m not the only one who started out not knowing these things, so I’ll write them down for the benefit of anyone else who finds themself having to scale a system.
The kind of system I’m talking about is the data backend of a consumer web/mobile app with a million users (order of magnitude). At the scale of Google, LinkedIn, Facebook or Twitter (hundreds of millions of users), you’ll have an entirely different set of problems, but you’ll also have a bigger team of experienced developers and operations people around you. The mid-range scale of about a million users is interesting, because it’s quite feasible for a small startup team to get there with some luck and good marketing skills. If that sounds like you, here are a few things to keep in mind.
Improving the performance of a system is ideally a very scientific process. You have in your head a model of what your system is doing, and a theory of where the expensive operations are. You propose a change to the system, and predict what the outcome will be. Then you make the change, observe the system’s behaviour under laboratory conditions, and thus gather evidence which either confirms or contradicts your theory. That way you iterate your way to a better theory, and also a better-performing implementation.
Sadly, we hardly ever managed to do it that way in practice. If we were optimising a microbenchmark, running the same code a million times in a tight loop, it would be easy. But we are dealing with large volumes of data, spread out across multiple machines. If you read the same item a million times in a loop, it will simply be cached, and the load test tells you nothing. If you want meaningful results, the load test needs to simulate a realistically large working set, a realistic mixture of reads and writes, realistic distribution of requests over time, and so on. And that is difficult.
It’s difficult enough to simply know what your access patterns actually are, let alone simulate them. As a starting point, you can replay a few hours worth of access logs against a copy of your real dataset. However, that only really works for read requests. Simulating writes is harder, as you may need to account for business logic rules (e.g. a sequential workflow must first update A, then update B, then update C) and deal with changes that can happen only once (if your write changes state from D to E, you can’t change from D to E again later in the test, as you’re already in state E). That means you have to synchronise your access logs with your database snapshot, or somehow generate suitable synthetic write load.
Even harder if you want to test with a dataset that is larger than the one you actually have (so that you can find out what happens when you double your userbase, and prepare for that event). Now you have to work out the statistical properties of your dataset (the distribution of friends per user is a power law with x parameters, the correlation between one user’s number of friends and the number of friends that their friends have is y, etc) and generate a synthetic dataset with those parameters. You are now in deep, deep yak shaving territory. Step back from that yak.
In practice, it hardly ever works that way. We’re lucky if, sometimes, we can run the old code and the new code side-by-side, and observe how they perform in comparison. Often, not even that is possible. Usually we often just cross our fingers, deploy, and roll back if the change seems to have made things worse. That is deeply unsatisfying for a scientifically-minded person, but it more or less gets the job done.
Being able to rapidly respond to change is one of the biggest advantages of a small startup. Agility in product and process means you also need the freedom to change your mind about the structure of your code and your data. There is lot of talk about making code easy to change, eg. with good automated tests. But what about changing the structure of your data?
Schema changes have a reputation of being very painful, a reputation that is chiefly MySQL’s fault: simply adding a column to a table requires the entire table to be copied. On a large table, that might mean several hours during which you can’t write to the table. Various tools exist to make that less painful, but I find it unbelievable that the world’s most popular open source database handles such a common operation so badly.
Postgres can make simple schema changes without copying the table, which means they are almost instant. And of course the avoidance of schema changes is a primary selling point of document databases such as MongoDB (so it’s up to application code to deal with a database that uses different schemas for different documents). But simple schema changes, such as adding a new field or two, don’t tell the entire story.
Not all your data is in databases; some might be in archived log files or some kind of blob storage. How do you deal with changing the schema of that data? And sometimes you need to make complex changes to the data, such as breaking a large thing apart, or combining several small things, or migrating from one datastore to another. Standard tools don’t help much here, and document databases don’t make it any easier.
We’ve written large migration jobs that break the entire dataset into chunks, process chunks gradually over the course of a weekend, retry failed chunks, track which things were modified while the migration was happening, and finally catch up on the missed updates. A whole lot of complexity just for a one-off data migration. Sometimes that’s unavoidable, but it’s heavy lifting that you’d rather not have to do in the first place.
Hadoop data pipelines can help with this sort of thing, but now you have to set up a Hadoop cluster, learn how to use it, figure out how to get your data into it, and figure out how to get the transformed data out to your live systems again. Big companies like LinkedIn have figured out how to do that, but in a small team it can be a massive time-sink.
In PostgreSQL, each client connection to the database is handled by a separate unix process; in MySQL, each connection uses a separate thread. Both of these models impose a fairly low limit on the number of connections you can have to the database – typically a few hundred. Every connection adds overhead, so the entire database slows down, even if those connections aren’t actively processing queries. For example, Heroku Postgres limits you to 60 connections on the smallest plan, and 500 connections on the largest plan, although having anywhere near 500 connections is actively discouraged.
In a fast-growing app, it doesn’t take long before you reach a few hundred connections. Each instance of your application server uses at least one. Each background worker process that needs to access the database uses one. Adding more machines running your application is fairly easy if they are stateless, but every machine you add means more connections.
Partitioning (sharding) and read replicas probably won’t help you with your connection limit, unless you can somehow load-balance requests so that all the requests for a particular partition are handled by a particular server instance. A better bet is to use a connection pooler, or to write your own data access layer which wraps database access behind an internal API.
That’s all doable, but it doesn’t seem a particularly valuable use of your time when you’re also trying to iterate on product features. And every additional service you deploy is another thing that can go wrong, another thing that needs to be monitored and maintained.
(Databases that use a lightweight connection model don’t have this problem, but they may have other problems instead.)
A common architecture is to designate one database instance as a leader (also known as master) and to send all database writes to that instance. The writes are then replicated to other database instances (called read replicas, followers or slaves), and many read-only queries can be served from the replicas, which takes load off the leader. This architecture is also good for fault tolerance, since it gives you a warm standby – if your leader dies, you can quickly promote one of the replicas to be the new leader (you wouldn’t want to be offline for hours while you restore the database from a backup).
What they don’t tell you is that setting up and maintaining replicas is significant operational pain. MySQL is particularly bad in this regard: in order to set up a new replica, you have to first lock the leader to stop all writes and take a consistent snapshot (which may take hours on a large database). How does your app cope if it can’t write to the database? What do your users think if they can’t post stuff?
With Postgres, you don’t need to stop writes to set up a replica, but it’s still some hassle. One of the things I like most about Heroku Postgres is that it wraps all the complexity of replication and WAL archiving behind a straightforward command-line tool.
Even so, you still need to failover manually if your leader fails. You need to monitor and maintain the replicas. Your database library may not support read replicas out of the box, so you may need to add that. Some reads need to be made on the leader, so that a user sees their own writes, even if there is replication lag. That’s all doable, but it’s additional complexity, and doesn’t add any value from users’ point of view.
Some distributed datastores such as MongoDB, RethinkDB and Couchbase also use this replication model, and they automate the replica creation and master failover processes. Just because they do that doesn’t mean they automatically give you magic scaling sauce, but it is a very valuable feature.
At various times, we puzzled about weird latency spikes in our database activity. After many PagerDuty alerts and troubleshooting, it usually turned out that we could fix the issue by throwing more RAM at the problem, either in the form of a bigger database instance, or separate caches in front of it. It’s sad, but true: many performance problems can be solved by simply buying more RAM. And if you’re in a hurry because your hair is on fire, it’s often the best thing to do. There are limitations to that approach, of course – a m2.4xlarge instance on EC2 costs quite a bit of money, and eventually there are no bigger machines to turn to.
Besides buying more RAM, an effective solution is to use RAM more efficiently in the first place, so that a bigger part of your dataset fits in RAM. In order to decide where to optimise, you need to know what all your memory is being used for – and that’s surprisingly non-trivial. With a bit of digging, you can usually get your database to report how much disk space each of your tables and indexes is taking. Figuring out the working set, and how much memory is actually used for what, is harder.
As a rule of thumb, your performance will probably be more predictable if your indexes completely fit in RAM – so that there’s a maximum of one disk read per query, which reduces your exposure to fluctuations in I/O latency. But indexes can get rather large if you have a lot of data, so this can be an expensive proposition.
At one point we found ourselves reading up about the internal structure of an index in Postgres, and realised that we could save a few bytes per row by indexing on the hash of a string column rather than the string itself. (More on that in another post.) That reduced the memory pressure on the system, and helped keep things ticking along for another few months. That’s just one example of how it can be helpful to think about using memory efficiently.
So far I’ve only talked about things that suck – sorry about the negativity. As final point, I’d like to mention a technique which is awesome, but not nearly as widely known and appreciated as it should be: change capture.
The idea of change capture is simple: let the application consume a feed of all writes to the database. In other words, you have a background process which gets notified every time something changes in the database (insert, update or delete).
You could achieve a similar thing if, every time you write something to the database, you also post it to a message queue. However, change capture is better because it contains exactly the same data as what was committed to the database (avoiding race conditions). A good change capture system also allows you to stream through the entire existing dataset, and then seamlessly switch to consuming real-time updates when it has caught up.
Consumers of this changelog are decoupled from the app that generates the writes, which gives you great freedom to experiment without fear of bringing down the main site. You can use the changelog for updating and invalidating caches, for maintaining full-text indexes, for calculating analytics, for sending out emails and push notifications, for importing the data into Hadoop, and much more.
LinkedIn built a technology called Databus to do this. The open source release of Databus is for Oracle DB, and there is a proof-of-concept MySQL version (which is different from the version of Databus for MySQL that LinkedIn uses in production).
The new project I am working on, Apache Samza, also sits squarely in this space – it is a framework for processing real-time data feeds, somewhat like MapReduce for streams. I am excited about it because I think this pattern of processing change capture streams can help many people build apps that scale better, are easier to maintain and more reliable than many apps today. It’s open source, and you should go and try it out.
The problems discussed in this post are primarily data systems problems. That’s no coincidence: if you write your applications in a stateless way, they are pretty easy to scale, since you can just run more copies of them. Thus, whether you use Rails or Express.js or whatever framework du jour really doesn’t matter much. The hard part is scaling the stateful parts of your system: your databases.
There are no easy solutions for these problems. Some new technologies and services can help – for example, the new generation of distributed datastores tries to solve some of the above problems (especially around automating replication and failover), but they have other limitations. There certainly is no panacea.
Personally I’m totally fine with using new and experimental tools for derived data, such as caches and analytics, where data loss is annoying but not end of your business. I’m more cautious with the system of record (also known as source of truth). Every system has operational quirks, and the devil you know may let you sleep better at night than the one you don’t. I don’t really mind what that devil is in your particular case.
I’m interested to see whether database-as-a-service offerings such as Firebase, Orchestrate or Fauna can help (I’ve not used any of them seriously, so I can’t vouch for them at this point). I see big potential advantages for small teams in outsourcing operations, but also a big potential risk in locking yourself to a system that you couldn’t choose to host yourself if necessary.
Building scalable systems is not all sexy roflscale fun. It’s a lot of plumbing and yak shaving. A lot of hacking together tools that really ought to exist already, but all the open source solutions out there are too bad (and yours ends up bad too, but at least it solves your particular problem).
On the other hand, consider yourself lucky. If you’ve got scaling problems, you must be doing something right – you must be making something that people want.