Drawing a map of distributed data systems

Published by Martin Kleppmann on 15 Mar 2017.

How we created an illustrated guide to help you find your way through the data landscape.

Designing Data-Intensive Applications, the book I’ve been working on for four years, is finally finished, and should be available in your favorite bookstore in the next week or two. An incomplete beta (Early Release) edition has been available for the last 2½ years as I continued working on the final chapters.

Throughout that process, we have been quietly working on a surprise. Something that has not been part of any of the Early Releases of the book. In fact, something that I have never seen in any tech book. And today we are excited to share it with you.

In Designing Data-Intensive Applications, each of the 12 chapters is accompanied by a map. The map is a kind of graphical table of contents of the chapter, showing how some of the main ideas in the chapter relate to each other.

Here is an example, from Chapter 3 (on storage engines):

Figure 1. Map illustration from Designing Data-Intensive Applications, O’Reilly Media, 2017.

Don’t take it too seriously—some of it is a little tongue-in-cheek, we have taken some artistic license, and the things included on the map are not exhaustive.

But it does reflect the structure of the chapter: political or geographic regions represent ways of doing something, and cities represent particular implementations of those approaches. Similar things are more likely to be close together, and roads or rivers represent concepts that connect different implementations or regions.

Most computing books describe one particular piece of software and discuss all the aspects of how it works. This book is structured differently: it starts with the concepts—discussing the high-level approaches of how you might solve some problem, and comparing the pros and cons of each—and then points out which pieces of software use which approach. The maps use the same structure: the region in which a city is located tells you what approach it uses.

For example, in the map above, you can see a high-level subdivision into two countries: transaction processing and analytics. Within transaction processing, there are two regions: log-structured storage and B-trees, which are two ways of implementing OLTP storage engines. Within the B-tree region, you see databases like MySQL and PostgreSQL^[1], while within the log-structured region you see databases like Cassandra and HBase. On the analytics side, you can see that the mountain range representing column storage reaches into both the data warehousing and the Hadoop regions, since the approach applies to both.

The maps are in black and white, both because of practicalities of printing and also because I was looking for a Tolkien-esque style. You are, of course, welcome to color them in yourself. In fact, by coloring them in, you would be following a fine tradition: for over three centuries, maps were printed in black and white from an engraved copper plate, and then colored in by hand.

Each of the chapters has a map like that, focusing on the particular aspects discussed in that chapter. This means that some cities appear on multiple islands—the data landscape is multidimensional, so a city may lie in more than one (conceptual) realm. For example, the map below is for Chapter 5 (on the topic of replication):

Figure 2. Map illustration from Designing Data-Intensive Applications, O’Reilly Media, 2017.

Cities representing Cassandra, MongoDB, MySQL, and others appear on both this map, the Chapter 3 map above, and some other maps, too.

Shipping routes connect some of the ports shown in the maps, in cases where there is a noteworthy link between chapters. Most of the maps are of islands, but there are some exceptions. (I won’t give away too much, but I just want to say…beware of the Kraken.)

I am incredibly delighted that O’Reilly was willing to take on this crazy idea of creating maps. It took a whole team to make them happen: from my rough pencil sketches (which showed the structure but had absolutely zero artistic value), Shabbir Diwan, Edie Freedman, and Ron Bilodeau created the beautifully illustrated versions you see above, and Marie Beaugureau patiently managed several rounds of revisions, in which we painstakingly polished all the details.

Perhaps you’re curious to know how we got onto the idea of creating maps. Early on in the Early Release of the book, some readers told me they would love some kind of flowchart to help them decide quickly which database they should use for their application. Such flowcharts have been attempted, but I never liked them much—it is tempting to read them out of context and jump to conclusions too quickly, and they have to simplify the issues to the point of almost being intellectually dishonest.

My goal for Designing Data-Intensive Applications was different. I can’t in good faith give you a recommendation for one particular tool because I don’t know enough about your particular requirements. However, I can teach you what questions to ask and how to evaluate vendors’ claims critically. That requires more subtlety and detail than can be expressed in a one-page flowchart, which is why the book is 600 pages long, not one page.

However, I did think that some kind of graphical representation of the main ideas and structure would be useful. I thought about Venn diagrams, but they’re excruciatingly boring. I thought about mind maps, and then started taking the “maps” bit more literally. I thought about the Atlas of Experience by Louise van Swaaij and Jean Klare, a sublime book that represents aspects of human life as fictitious places on a geographic map. (It is originally in Dutch; the English translation is, sadly, out of print.) The Phantom Tollbooth by Norton Juster and Jules Feiffer does a similar thing.

Closer to technology, similar map-style visualizations have been used to represent online communities, to visualize the history of relational databases and programming languages, and to explain libraries related to Facebook’s React. I simply love the style.

Other inspirations for me are the ornate maps produced in the medieval and renaissance periods, especially when it comes to whimsical sea monsters. For example, around 1590, Abraham Ortelius published a wonderful map of Iceland, with Mount Hekla spewing fire, and surrounded by fantastical sea monsters:

Figure 3. “Islandia” map by Abraham Ortelius, ca. 1590. Source: Wikimedia Commons

I feel those maps and sea monsters reflect the 16th-century sense of excitement to explore the earth and discover new places, as well as the dangers of sailing across unknown seas. And perhaps a bit of that excitement exists today in our exploration of technologies for storing, processing, and using data. There seems to be a lot of potential, but we also don’t really know what we’re doing, and it’s sometimes a bit dangerous (raise your hand if you’ve lost data at some point because something went wrong).

We hope the maps in Designing Data-Intensive Applications will help convey some of that excitement, and also make you smile. In both the print and ebook editions, the map for each chapter appears at the start of each chapter.

What’s more, we have taken all the individual chapter maps and assembled them into a poster—an archipelago of islands representing technologies in the sea of distributed data. The poster also includes some bonus sea monsters (of course).

If you are at Strata + Hadoop World, San Jose this week, you can drop by the O’Reilly booth to pick up a free print of the poster so you have something geeky and cool to hang on the wall in your office. Alternatively, you can download a JPG version for free from the O’Reilly website for your personal, noncommercial use.

We hope you enjoy Designing Data-Intensive Applications and the maps as much as we enjoyed making them!

Figure 4. Me holding the poster with the maps.

[1] Footnote for the well-actually crowd: yes, I know about hash indexes and GiST in PostgreSQL, various other index types in other databases, and the fact that in MySQL the index type is actually a matter of the storage engine (such as InnoDB), but those details are beside the point here. I am highlighting a dichotomy between a page-oriented update-in-place approach and a log-structured, compaction-based approach. This distinction is best explained with concrete examples, and the graphical representation cannot capture all the subtleties that are discussed in the text of the book.

If you found this post useful, please support me on Patreon so that I can write more like it!

To get notified when I write something new, follow me on Bluesky or Mastodon, or enter your email address:

I won't give your address to anyone else, won't send you any spam, and you can unsubscribe at any time.

Drawing a map of distributed data systems

Recent posts

Conference talks

Drawing a map of distributed data systems

Subscribe

My book

Recent posts

Conference talks