One of the things I realised while doing research for my book is that
contemporary software engineering still has a lot to learn from the 1970s. As we’re in such
a fast-moving field, we often have a tendency to dismiss older ideas as irrelevant – and
consequently, we end up having to learn the same lessons over and over again, the hard way.
Although computers have got faster, data has got bigger and requirements have become more complex,
many old ideas are actually still highly relevant today.
In this blog post I’d like to highlight one particular set of old ideas that I think deserves more
attention today: the Unix philosophy. I’ll show how this philosophy is very different from the
design approach of mainstream databases, and explore what it would look like if modern distributed
data systems learnt a thing or two from Unix.
In particular, I’m going to argue that there are a lot of similarities between Unix pipes and Apache
Kafka, and that this similarity enables good architectural styles for large-scale applications. But
before we get into that, let me remind you of the foundations of the Unix philosophy. You’ve
probably seen the power of Unix tools before – but to get started, let me give you a concrete
example that we can talk about.
Say you have a web server that writes an entry to a log file every time it serves a request. For
example, using the nginx default access log format, one line of the log might look like this:
216.58.210.78 - - [27/Feb/2015:17:55:11 +0000] "GET /css/typography.css HTTP/1.1"
200 3377 "http://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X
10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36"
(That is actually one line, it’s only broken up into multiple lines here for readability.) This line
of the log indicates that on 27 February 2015 at 17:55:11 UTC, the server received a request for the
file /css/typography.css from the client IP address 216.58.210.78. It then goes on to note various
other details, including the browser’s user-agent string.
Various tools can take these log files and produce pretty reports about your website traffic, but
for the sake of the exercise, let’s build our own, using basic Unix tools. Let’s determine the
5 most popular URLs on our website. To start with, we need to extract the path of the URL that was
requested, for which we can use awk.
awk doesn’t know about the format of nginx logs – it just treats the log file as text. By
default, awk takes one line of input at a time, splits it by whitespace, and makes the
whitespace-separated components available as variables $1, $2, etc. In the nginx log example,
the requested URL path is the seventh whitespace-separated component:
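awk '{print $7}' access.log   # assuming the access log lives in a file called access.log
(The file name access.log is just a stand-in here; point awk at wherever your web server actually writes its log.)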
Now that you’ve extracted the path, you can determine the 5 most popular pages on your website as
follows:
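awk '{print $7}' access.log | sort | uniq -c | sort -r -n | head -n 5
(The same caveat about the file name applies. The pipeline sorts the paths so that repeated requests for the same URL end up next to each other, counts each distinct path with uniq -c, sorts again by that count in descending numeric order, and keeps only the first five lines.)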
The output of that series of commands looks something like this:
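   4189 /favicon.ico
   3631 /css/typography.css
   2124 /
   1369 /images/profile.jpg
    915 /about.html
(The counts and paths above are made up for illustration; what matters is the shape of the output: a request count followed by a URL path, most popular first.)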
Although the above command line looks a bit obscure if you’re unfamiliar with Unix tools, it is
incredibly powerful. It will process gigabytes of log files in a matter of seconds, and you can
easily modify the analysis to suit your needs. For example, if you want to count top client IP
addresses instead of top pages, change the awk argument to '{print $1}'.
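In other words, the whole pipeline becomes something like:
awk '{print $1}' access.log | sort | uniq -c | sort -r -n | head -n 5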
Many data analyses can be done in a few minutes using some combination of awk, sed, grep,
sort, uniq and xargs, and they perform
surprisingly well.
This is no coincidence: it is a direct result of the design philosophy of Unix.
The Unix philosophy is a set of principles that emerged gradually during the design and
implementation of Unix systems in the late 1960s and ’70s. There are various interpretations of
the Unix philosophy, but two points that particularly stand out were
described by Doug McIlroy, Elliot Pinson and Berk Tague
as follows in 1978:
Make each program do one thing well. To do a new job, build afresh rather than complicate old
programs by adding new “features.”
Expect the output of every program to become the input to another, as yet unknown, program.
These principles are the foundation for chaining together programs into pipelines that can
accomplish complex processing tasks. The key idea here is that a program does not know or care
where its input is coming from, or where its output is going to: it may be a file, or another
program that’s part of the operating system, or another program written by someone else entirely.
The tools that come with the operating system are generic, but they are designed such that they can
be composed together into larger programs that can perform application-specific tasks.
The benefits that the designers of Unix derived from this design approach sound quite like the ideas
of the Agile and DevOps movements that appeared decades later: scripting and automation, rapid
prototyping, incremental iteration, being friendly to experimentation, and breaking down large
projects into manageable chunks. Plus ça change.
When you join two commands with the pipe character in your shell, the shell starts both programs at
the same time, and attaches the output of the first process to the second process’ input. This
attachment mechanism uses the pipe syscall provided by the
operating system.
Note that this wiring is not done by the programs themselves, but by the shell – this allows them
to be loosely coupled, and not worry about where
their input is coming from, or where their output is going.
The pipe had been invented in 1964 by Doug McIlroy, who first
described it like this in an internal Bell Labs memo: “We should
have some ways of connecting programs like [a] garden hose – screw in another segment when it
becomes necessary to massage data in another way.” Dennis Ritchie later
wrote up his perspective on the memo.
They also realised early that the inter-process communication mechanism (pipes) can look very
similar to the mechanism for reading and writing files. We now call this input redirection (using
the contents of a file as input to a process) and output redirection (writing the output of
a process to a file).
The reason that Unix programs can be composed so flexibly is that they all conform to the same
interface: most programs have one stream for input data (stdin) and two output streams (stdout
for regular output data, and stderr for errors and diagnostic messages).
Programs may also do other things besides reading stdin and writing stdout, such as reading and
writing files, communicating over the network, or drawing a graphical user interface. However, the
stdin/stdout communication is considered to be the main way in which data flows from one Unix tool to
another.
And the great thing about the stdin/stdout interface is that anyone can implement it easily, in
any programming language. You can develop your own tool that conforms to this interface, and it will
play nicely with all the standard tools that ship as part of the operating system.
For example, when analysing a web server log file, perhaps you want to find out how many visitors
you have from each country. The log doesn’t tell you the country, but it does tell you the IP
address, which you can translate into a country using an IP geolocation database. Such a database
isn’t included with your operating system by default, but you can write your own tool that takes
IP addresses on stdin and outputs country codes on stdout.
Once you’ve written that tool, you can include it in the data processing pipeline we discussed
previously, and it will work just fine. This may seem painfully obvious if you’ve been working with
Unix for a while, but I’d like to emphasise how remarkable this is: your own code runs on equal
terms with the tools provided by the operating system.
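For instance, if we call that hypothetical tool ip2country (a made-up name; it could be any program that reads IP addresses line by line on stdin and writes country codes to stdout), it slots straight into a pipeline next to the standard tools:
awk '{print $1}' access.log | ip2country | sort | uniq -c | sort -r -n
This extracts the client IP address from each log line, translates each address into a country code, and counts how many requests came from each country.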
Apps with graphical user interfaces or web apps cannot simply be extended and wired together like
this. You can’t just pipe Gmail into a separate search engine app, and post results to a wiki.
Today it’s an exception, not the norm, to have programs that work together as smoothly as Unix tools
do.
Change of scene. Around the same time as Unix was being developed, the
relational data model
was proposed, which in time became SQL, and was implemented in many popular databases. Many
databases actually run on Unix systems. Does that mean they also follow the Unix philosophy?
The dataflow in most database systems is very different from Unix tools. Rather than using stdin
and stdout as communication channels, there is a database server, and several clients. The
clients send queries to read or write data on the server, the server handles the queries and sends
responses to the clients. This relationship is fundamentally asymmetric: clients and servers are
distinct roles.
What about the composability and extensibility that we find in Unix systems? Clients can do anything
they like (since they are application code), but database servers are mostly in the business of
storing and retrieving your data. Letting you run arbitrary code is not their top priority.
That said, many databases do provide some ways of extending the database server with your own code.
For example, many relational databases let you write stored procedures in their own, rudimentary
procedural language such as PL/SQL
(and some let you run code in a general-purpose programming language such as
JavaScript). However, the
things you can do in stored procedures are limited.
Other extension points in some databases are support for custom data types (this was one of the
early design goals of Postgres), or pluggable
storage engines. Essentially, these are plugin APIs: you can run your code in the database server,
provided that your module adheres to a plugin API exposed by the database server for a particular
purpose.
This kind of extensibility is not the same as the arbitrary composability we saw with Unix tools.
The plugin API is totally controlled by the database server, and subordinate to it. Your extension
code is a guest in the database server’s home, not an equal partner.
A consequence of this design is that you can’t just pipe one database into another, even if they
have the same data model. Nor can you insert your own code into the database’s internal processing
pipelines (unless the server has specifically provided an extension point for you, such as
triggers).
I feel the design of databases is very self-centred. A database seems to assume that it’s the
centre of your universe: the only place where you might want to store and query your data, the
source of truth, and the destination for all queries. The closest you can get to piping data in and
out of it is through bulk-loading and bulk-dumping (backup) operations, but those operations don’t
really use any of the database’s features, such as query planning and indexes.
If a database was designed according to the Unix philosophy, it would be based on a small number of
core primitives that you could easily combine, extend and replace at will. Instead, databases are
tremendously complicated, monolithic beasts. While Unix acknowledges that the operating system will
never do everything you might want, and thus encourages you to extend it, databases try to implement
all the features you may need in a single program.
Perhaps that design is fine in simple applications where a single database is indeed sufficient.
However, many complex applications find that they have to use their data in
various different ways:
they need fast random access for OLTP, big sequential scans for analytics, inverted indexes for
full-text search, graph indexes for connected data, machine learning systems for recommendation
engines, a push mechanism for notifications, various different cached representations of the data
for fast reads, and so on.
A general-purpose database may try to do all of those things in one product
(“one size fits all”), but in all likelihood it will not
perform as well as a tool that is specialized for one particular purpose. In practice, you can often
get the best results by combining various different data storage and indexing systems: for example,
you may take the same data and store it in a relational database for random access, in Elasticsearch
for full-text search, in a columnar format in Hadoop for analytics, and cached in a denormalized
form in memcached.
When you need to integrate different databases, the lack of Unix-style composability is a severe
limitation. (I’ve
done some work
on piping data out of Postgres into other applications, but there’s still a long way to go before we
can simply pipe any database into any other database.)
We said that Unix tools are composable because they all implement the same interface of stdin,
stdout and stderr – and each of these is a file descriptor, i.e. a stream of bytes that you
can read or write like a file. This interface is simple enough that anyone can easily implement it,
but it is also powerful enough that you can use it for anything.
Because all Unix tools implement the same interface, we call it a uniform interface. That’s why
you can pipe the output of gunzip to wc without a second thought, even though those two tools
appear to have nothing in common. It’s like lego bricks, which all implement the same pattern of
knobbly bits and grooves, allowing you to stack any lego brick on any other, regardless of their
shape, size or colour.
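Concretely, counting the lines in a compressed log file is as simple as:
gunzip -c access.log.gz | wc -l
gunzip knows nothing about counting lines, and wc knows nothing about compression, but the shared byte-stream interface lets them cooperate. (As before, access.log.gz is just an example file name.)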
The uniform interface of file descriptors in Unix doesn’t just apply to the input and output of
processes, but it’s a very broadly applied pattern. If you open a file on the filesystem, you get
a file descriptor. Pipes and unix sockets provide file descriptors that are a communication channel
to another process on the same machine. On Linux, the virtual files in /dev are the interfaces of
device drivers, so you might be talking to a USB port or even a GPU. The virtual files in /proc
are an API for the kernel, but since they’re exposed as files, you can access them with the same
tools as regular files.
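For example, on a Linux machine you can poke at the kernel's view of the hardware with the same text tools we used on the log file; this counts how many CPU cores of each model the kernel reports:
grep 'model name' /proc/cpuinfo | sort | uniq -c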
Even a TCP connection to a process on another machine is a file descriptor, although the BSD sockets
API (which is most commonly used to establish TCP connections) is arguably not as Unixy as it could
be. Plan 9 shows that even the network
could have been cleanly integrated into the same uniform interface.
To a first approximation, everything on Unix is a file. This uniformity means the logic of Unix
tools is separated from the wiring, making it more composable. sed doesn’t need to care whether
it’s talking to a pipe to another process, or a socket, or a device driver, or a real file on the
filesystem. It’s all the same.
A file is a stream of bytes, perhaps with an end-of-file (EOF) marker at some point, indicating
that the stream has ended (a stream can be of arbitrary length, and a process may not know in
advance how long its input is going to be).
A few tools (e.g. gzip) operate purely on byte streams, and don’t care about the structure of the
data. But most tools need to parse their input in order to do anything useful with it. For this,
most Unix tools use ASCII, with each record on one line, and fields separated by tabs or spaces, or
maybe commas.
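That whitespace convention is what made the $7 trick work at the start of this post, and most tools let you override the separator when the data uses something else (the file names below are just examples):
cut -d',' -f2 countries.csv          # fields separated by commas
awk -F'\t' '{print $2}' report.tsv   # fields separated by tabs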
Files are totally obvious to us today, which shows that a byte stream turned out to be a good
uniform interface. However, the implementors of Unix could have decided to do it very differently.
For example, it could have been a function callback interface, using a schema to pass records from
process to process. Or it could have been shared memory (like
System V IPC or
mmap, which came along later). Or it could have
been a bit stream rather than a byte stream.
In a sense, a byte stream is a lowest common denominator – the simplest possible interface.
Everything can be expressed in terms of a stream of bytes, and it’s fairly agnostic to the transport
medium (pipe from another process, file on disk, TCP connection, tape, etc). But this is also a
disadvantage, as we shall discuss later.
We’ve seen that Unix developed some very good design principles for software development, and that
databases have taken a very different route. I would love to see a future in which we can learn from
both paths of development, and combine the best ideas from each.
How can we make 21st-century data systems better by learning from the Unix philosophy? In the rest
of this post I’d like to explore what it might look like if we bring the Unix philosophy to the
world of databases.
First, let’s acknowledge that Unix is not perfect. Although I think the simple, uniform interface of
byte streams was very successful at enabling an ecosystem of flexible, composable, powerful tools,
Unix has some limitations:
It’s designed for use on a single machine. As our applications get ever more data and traffic,
and have higher uptime requirements, moving to distributed systems is becoming
increasingly inevitable. Although a TCP connection
can be made to look somewhat like a file, I don’t think that’s the right answer: it only works if
both sides of the connection are up, and it has
somewhat messy
edge case semantics. TCP is good, but by itself it’s too low-level to serve as a distributed pipe
implementation.
A Unix pipe is designed to have a single sender process, and a single recipient. You can’t use
pipes to send output to several processes, or to collect input from several processes. (You can
branch a pipeline with tee, but a pipe itself is always
one-to-one.)
ASCII text (or rather, UTF-8) is great for making data easily explorable, but it quickly gets
messy. Every process needs to be set up with its own input parsing: first breaking the byte stream
into records (usually separated by newline, though some advocate 0x1e, the ASCII record
separator).
Then a record needs to be broken up into fields, like the $7 in the awk example at the
beginning. Separator characters that appear in the data need to be escaped somehow. Even
a fairly simple tool like xargs has about half
a dozen command-line options to specify how its input should be parsed. Text-based interfaces work
tolerably well, but in retrospect, I am pretty sure that a richer data model with
explicit schemas
would have worked better.
Unix processes are generally assumed to be fairly short-running. For example, if a process
in the middle of a pipeline crashes, there is no way for it to resume processing from its input
pipe – the entire pipeline fails and must be re-run from scratch. That’s no problem if the
commands run only for a few seconds, but if an application is expected to run continuously for
years, better fault tolerance is needed.
I think we can find a solution that overcomes these downsides, while retaining the Unix philosophy’s
benefits.
The cool thing is that this solution already exists, and is implemented in
Kafka and Samza, two open source projects
that work together to provide distributed stream processing.
As you probably already know from
other posts
on this blog, Kafka is a scalable distributed message broker, and Samza is a framework that lets you
write code to consume and produce data streams.
In fact, when you look at it through the Unix lens, Kafka looks quite like the pipe that connects
the output of one process to the input of another. And Samza looks quite like a standard library
that helps you read stdin and write stdout (along with a few helpful additions, such as a deployment
mechanism,
state management,
metrics, and monitoring).
The style of stream processing jobs that you can write with Kafka and Samza closely follows the Unix
tradition of small, composable tools:
In Unix, the operating system kernel provides the pipe, a transport mechanism for getting a stream
of bytes from one process to another.
In stream processing, Kafka provides publish-subscribe streams, a transport mechanism for getting
messages from one stream processing job to another.
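The parallel even extends to quick experimentation at the command line. Kafka ships with console producer and consumer scripts that feel a lot like writing to and reading from a pipe by hand. The exact flags have changed between Kafka versions (older releases used --broker-list and --zookeeper rather than --bootstrap-server), and the topic name page-views below is just an example:
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic page-views
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic page-views --from-beginning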
Kafka addresses the downsides of Unix pipes that we discussed previously:
The single-machine limitation is lifted: Kafka itself is distributed by default, and any stream
processors that use it can also be distributed across multiple machines.
A Unix pipe connects exactly one process output with exactly one process input, whereas a stream
in Kafka can have many producers and many consumers. Having many producers is important for services
that are distributed across multiple machines, and having many consumers makes a Kafka stream more like a broadcast
channel. This is very useful, since it allows the same data stream to be consumed independently
for several different purposes (including monitoring and audit purposes, which are often outside
of the application itself). Kafka consumers can come and go without affecting other consumers.
Kafka also provides good fault tolerance: data is replicated across multiple Kafka nodes, so if
one node fails, another node can automatically take over. If a stream processor node fails and is
restarted, it can resume processing at its last checkpoint.
Rather than a stream of bytes, Kafka provides a stream of messages, which saves the first step of
input parsing (breaking the stream of bytes into a sequence of records). Each message is just an
array of bytes, so you can use your
favourite serialisation format
for individual messages: JSON, XML, Avro, Thrift or Protocol Buffers are all reasonable choices.
It’s well worth standardising on one
encoding, and Confluent provides particularly good
schema management support for Avro.
This allows applications to work with objects that have meaningful field names, and not have to
worry about input parsing or output escaping. It also provides good support for schema evolution
without breaking compatibility.
There are a few more things that Kafka does differently from Unix pipes, which are worth calling out
briefly:
As mentioned, Unix pipes provide a byte stream, whereas Kafka provides a stream of messages. This
is especially important if several processes are concurrently writing to the same stream: in
a byte stream, the bytes from different writers can be interleaved, leading to an unparseable
mess. Since messages are coarser-grained and self-contained, they can be safely interleaved,
making it safe for multiple processes to concurrently write to the same stream.
Unix pipes are just a small in-memory buffer, whereas Kafka durably writes all messages to disk.
In this regard, Kafka is less like a pipe, and more like one process writing to a temporary file,
while several other processes continuously read that file using tail -f (each consumer tails the
file independently). Kafka’s approach provides better fault tolerance, since it allows a consumer
to fail and restart without skipping messages. Kafka automatically splits those ‘temporary’ files
into segments and garbage-collects old segments on a configurable schedule.
In Unix, if the consuming process of a pipe is slow to read the data, the buffer fills up and the
sending process is blocked from writing to the pipe. This is a kind of backpressure. In Kafka, the
producer and consumer are more decoupled: a slow consumer has its input buffered, so it doesn’t
slow down the producer or other consumers. As long as the buffer fits within Kafka’s available
disk space, the slow consumer can catch up later. This makes the system less sensitive to
individual slow components, and more robust overall.
A data stream in Kafka is called a topic, and you can refer to it by name (which makes it more
like a Unix named pipe).
A pipeline of Unix programs is usually started all at once, so the pipes normally don’t need
explicit names. On the other hand, a long-running application usually has bits added, removed or
replaced gradually over time, so you need names in order to tell the system what you want to
connect to. Naming also helps with discovery and management.
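To make those analogies a little more tangible, here is roughly what the two patterns look like in plain Unix terms. The produce-events program and the file names are made up for the sake of the sketch:
# one process appends to a log file, and any number of readers follow it independently
./produce-events >> events.log &
tail -f events.log | grep ERROR &
tail -f events.log | awk '{print $1}' &
# a named pipe, by contrast, connects one writer to one reader
mkfifo /tmp/events.pipe
./produce-events > /tmp/events.pipe &
wc -l < /tmp/events.pipe
Kafka topics behave much more like the first pattern than the second, with the added benefits of replication, retention and distribution across machines.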
Despite those differences, I still think it makes sense to think of Kafka as Unix pipes for
distributed data. For example, one thing they have in common is that Kafka keeps messages in a fixed
order (like Unix pipes, which keep the byte stream in a fixed order). This is a
very useful property
for event log data: the order in which things happened is often meaningful and needs to be preserved.
Other types of message broker, such as those based on AMQP or JMS, do not have this ordering property.
So we’ve got Unix tools and stream processors that look quite similar. Both read some input stream,
modify or transform it in some way, and produce an output stream that is somehow derived from the
input.
Importantly, the processing does not modify the input itself: it remains immutable. If you run awk
on some input file, the file remains unmodified (unless you explicitly choose to overwrite it).
Also, most Unix tools are deterministic, i.e. if you give them the same input, they always produce
the same output. This means you can re-run the same command as many times as you want, and gradually
iterate your way towards a working program. It’s great for experimentation, because you can always
go back to your original data if you mess up the processing.
This deterministic and side-effect-free processing looks a lot like functional programming. That
doesn’t mean you have to use a functional programming language like Haskell (although you’re welcome
to do so if you want), but you still get many of the benefits of functional code.
The Unix-like design principles of Kafka enable building composable systems at a large scale. In
a large organisation, different teams can each publish their data to Kafka. Each team can
independently develop and maintain stream processing jobs that consume streams and produce new
streams. Since a stream can have any number of independent consumers, no coordination is required to
set up a new consumer.
We’ve been calling this idea a stream data platform.
In this kind of architecture, the data streams in Kafka act as the communication channel between
different teams’ systems. Each team focusses on making their particular part of the system do one
thing well. While Unix tools can be composed to accomplish a data processing task, distributed
streaming systems can be composed to form the entire dataflow of a large organisation.
A Unixy approach manages the complexity of a large system by encouraging loose coupling: thanks to
the uniform interface of streams, different components can be developed and deployed independently.
Thanks to the fault tolerance and buffering of the pipe (Kafka), when a problem occurs in one part
of the system, it remains localised. And
schema management allows changes to data
structures to be made safely, so that each team can move fast without breaking things for other
teams.
To wrap up this post, let’s consider a real-life example of how this works at LinkedIn. As you may
know, companies can post their job openings on LinkedIn, and jobseekers can browse and apply for
those jobs. What happens if a LinkedIn member (user) views one of those job postings?
It’s very useful to know who has looked at which jobs, so the service that handles job views
publishes an event to Kafka, saying something like “member 123 viewed job 456 at time 789”. Now that
this information is in Kafka, it can be used for many
good purposes:
Monitoring systems: Companies pay LinkedIn to post their job openings, so it’s important that
the site is working correctly. If the rate of job views drops unexpectedly, alarms should go off,
because it indicates a problem that needs to be investigated.
Relevance and recommendations: It’s annoying for users to see the same thing over and over
again, so it’s good to track how many times a user has seen a job posting, and feed that into
the scoring process. Keeping track of who viewed what also allows for collaborative filtering
recommendations (people who viewed X also viewed Y).
Preventing abuse: LinkedIn doesn’t want people to be able to scrape all the jobs, submit
spam, or otherwise violate the terms of service. Knowing who is doing what is the first step
towards detecting and blocking abuse.
Job poster analytics: The companies who post their job openings want to see
stats
(in the style of Google Analytics) about who is viewing their postings, for example so that
they can test which wording attracts the best candidates.
Import into Hadoop and Data Warehouse: For LinkedIn’s internal business analytics, for senior
management’s dashboards, for crunching numbers that are reported to Wall Street, for evaluating
A/B tests, and so on.
All of those systems are complex in their own right, and are maintained by different teams. Kafka
provides a fault-tolerant, scalable implementation of a pipe. A stream data platform based on Kafka
allows all of these various systems to be developed independently, and to be connected and composed
in a robust way.