Schema evolution in Avro, Protocol Buffers and Thrift

So you have some data that you want to store in a file or send over the network. You may find yourself going through several phases of evolution:

  1. Using your programming language’s built-in serialization, such as Java serialization, Ruby’s marshal, or Python’s pickle. Or maybe you even invent your own format.
  2. Then you realise that being locked into one programming language sucks, so you move to using a widely supported, language-agnostic format like JSON (or XML if you like to party like it’s 1999).
  3. Then you decide that JSON is too verbose and too slow to parse, you’re annoyed that it doesn’t differentiate integers from floating point, and think that you’d quite like binary strings as well as Unicode strings. So you invent some sort of binary format that’s kinda like JSON, but binary (1, 2, 3, 4, 5, 6).
  4. Then you find that people are stuffing all sorts of random fields into their objects, using inconsistent types, and you’d quite like a schema and some documentation, thank you very much. Perhaps you’re also using a statically typed programming language and want to generate model classes from a schema. Also you realize that your binary JSON-lookalike actually isn’t all that compact, because you’re still storing field names over and over again; hey, if you had a schema, you could avoid storing objects’ field names, and you could save some more bytes!

Once you get to the fourth stage, your options are typically Thrift, Protocol Buffers or Avro. All three provide efficient, cross-language serialization of data using a schema, and code generation for the Java folks.

Plenty of comparisons have been written about them already (1, 2, 3, 4). However, many posts overlook a detail that seems mundane at first, but is actually crucial: What happens if the schema changes?

In real life, data is always in flux. The moment you think you have finalised a schema, someone will come up with a use case that wasn’t anticipated, and wants to “just quickly add a field”. Fortunately Thrift, Protobuf and Avro all support schema evolution: you can change the schema, you can have producers and consumers with different versions of the schema at the same time, and it all continues to work. That is an extremely valuable feature when you’re dealing with a big production system, because it allows you to update different components of the system independently, at different times, without worrying about compatibility.

Which brings us to the topic of today’s post. I would like to explore how Protocol Buffers, Avro and Thrift actually encode data into bytes, and this will also help explain how each of them deals with schema changes. The design choices made by each of the frameworks are interesting, and by comparing them I think you can become a better engineer (by a little bit).

The example I will use is a little object describing a person. In JSON I would write it like this:

{
    "userName": "Martin",
    "favouriteNumber": 1337,
    "interests": ["daydreaming", "hacking"]
}

This JSON encoding can be our baseline. If I remove all the whitespace it consumes 82 bytes.
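The 82-byte figure is easy to check. A minimal sketch using Python's standard json module (the variable names are mine, not part of the original example):

```python
import json

person = {
    "userName": "Martin",
    "favouriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# separators=(",", ":") drops the spaces json.dumps inserts by default,
# which gives the "all whitespace removed" form of the document.
compact = json.dumps(person, separators=(",", ":"))
encoded = compact.encode("utf-8")

print(compact)
print(len(encoded), "bytes")
```

Most of those bytes are field names and punctuation rather than data, which is what the binary formats below set out to eliminate.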

Protocol Buffers

The Protocol Buffers schema for the person object might look something like this:

message Person {
    required string user_name        = 1;
    optional int64  favourite_number = 2;
    repeated string interests        = 3;
}

When we encode the data above using this schema, it uses 33 bytes.

Look closely at how the binary representation is structured, byte by byte. The person record is just the concatenation of its fields. Each field starts with a byte that indicates its tag number (the numbers 1, 2, 3 in the schema above), and the type of the field. If the first byte of a field indicates that the field is a string, it is followed by the number of bytes in the string, and then the UTF-8 encoding of the string. If the first byte indicates that the field is an integer, a variable-length encoding of the number follows. There is no array type, but a tag number can appear multiple times to represent a multi-valued field.
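The layout described above can be reproduced by hand. The following is a sketch of the wire format for this one message only, not the Protocol Buffers library; the helper names are hypothetical, and it assumes non-negative integers (negative int64 values are encoded differently).

```python
WIRE_VARINT = 0            # integers
WIRE_LENGTH_DELIMITED = 2  # strings, bytes, nested messages


def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 bits per byte, high bit set on all but the last."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def key(tag: int, wire_type: int) -> bytes:
    # The tag number and a 3-bit type code share the field's first byte.
    return encode_varint((tag << 3) | wire_type)


def string_field(tag: int, value: str) -> bytes:
    data = value.encode("utf-8")
    return key(tag, WIRE_LENGTH_DELIMITED) + encode_varint(len(data)) + data


def int_field(tag: int, value: int) -> bytes:
    return key(tag, WIRE_VARINT) + encode_varint(value)


def encode_person(user_name, favourite_number, interests) -> bytes:
    out = string_field(1, user_name)
    if favourite_number is not None:   # an absent optional field emits nothing
        out += int_field(2, favourite_number)
    for interest in interests:         # repeated: same tag, once per value
        out += string_field(3, interest)
    return out


record = encode_person("Martin", 1337, ["daydreaming", "hacking"])
print(len(record), record.hex(" "))
```

The record comes to 33 bytes: 8 for the name, 3 for the number, and 13 plus 9 for the two interests.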

This encoding has consequences for schema evolution:

  • There is no difference in the encoding between optional, required and repeated fields (except for the number of times the tag number can appear). This means that you can change a field from optional to repeated and vice versa (if the parser is expecting an optional field but sees the same tag number multiple times in one record, it discards all but the last value). required has an additional validation check, so if you change it, you risk runtime errors (if the sender of a message thinks that it’s optional, but the recipient thinks that it’s required).
  • An optional field without a value, or a repeated field with zero values, does not appear in the encoded data at all — the field with that tag number is simply absent. Thus, it is safe to remove that kind of field from the schema. However, you must never reuse the tag number for another field in future, because you may still have data stored that uses that tag for the field you deleted.
  • You can add a field to your record, as long as it is given a new tag number. If the Protobuf parser sees a tag number that is not defined in its version of the schema, it has no way of knowing what that field is called. But it does roughly know what type it is, because a 3-bit type code is included in the first byte of the field. This means that even though the parser can’t exactly interpret the field, it can figure out how many bytes it needs to skip in order to find the next field in the record.
  • You can rename fields, because field names don’t exist in the binary serialization, but you can never change a tag number.
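To make the skipping behaviour concrete, here is a rough sketch of a reader that only knows some of the tag numbers. It is a toy decoder, not the Protobuf library: the function names and the known_fields mapping are assumptions for illustration, and every value is collected into a list for simplicity.

```python
def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode(buf: bytes, known_fields: dict) -> dict:
    """known_fields maps tag number -> field name in the reader's schema."""
    record, pos = {}, 0
    while pos < len(buf):
        field_key, pos = read_varint(buf, pos)
        tag, wire_type = field_key >> 3, field_key & 0x07
        # The wire type alone tells us how far the value extends.
        if wire_type == 0:            # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:          # length-delimited
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 1:          # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 5:          # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        name = known_fields.get(tag)
        if name is None:
            continue                  # unknown tag: value already skipped
        record.setdefault(name, []).append(value)
    return record


# The 33-byte example record, written by a schema that has tags 1, 2 and 3.
data = (bytes([0x0A, 6]) + b"Martin" + bytes([0x10, 0xB9, 0x0A])
        + bytes([0x1A, 11]) + b"daydreaming" + bytes([0x1A, 7]) + b"hacking")

# A reader whose schema does not define tag 2 still parses the rest.
old_reader = decode(data, {1: "user_name", 3: "interests"})
print(old_reader)
```

Because names never appear on the wire, the reader could equally call tag 1 something else entirely; only the numbers have to agree.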

This approach of using a tag number to represent each field is simple and effective. But as we’ll see in a minute, it’s not the only way of doing things.

Avro

Avro schemas can be written in two ways, either in a JSON format:

{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",        "type": "string"},
        {"name": "favouriteNumber", "type": ["null", "long"]},
        {"name": "interests",       "type": {"type": "array", "items": "string"}}
    ]
}

…or in an IDL:

record Person {
    string               userName;
    union { null, long } favouriteNumber;
    array<string>        interests;
}

Notice that there are no tag numbers in the schema! So how does it work?

The same example data encodes in just 32 bytes.

Strings are just a length prefix followed by UTF-8 bytes, but there’s nothing in the bytestream that tells you that it is a string. It could just as well be a variable-length integer, or something else entirely. The only way you can parse this binary data is by reading it alongside the schema, and the schema tells you what type to expect next. You need to have the exact same version of the schema as the writer of the data used. If you have the wrong schema, the parser will not be able to make head or tail of the binary data.
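As a hand-rolled sketch of that encoding (not the Avro library; the helper names are mine), note how the encoder simply walks the schema's fields in order and never writes a tag or a type code:

```python
def encode_long(n: int) -> bytes:
    """Avro long: zigzag-encode, then write as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)   # zigzag maps small negatives to small positives
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_person(user_name, favourite_number, interests) -> bytes:
    # Field 1: string
    out = encode_string(user_name)
    # Field 2: union { null, long } is the branch index, then the value.
    if favourite_number is None:
        out += encode_long(0)              # branch 0: null, no further bytes
    else:
        out += encode_long(1) + encode_long(favourite_number)
    # Field 3: array is a block of items followed by a zero-count terminator.
    if interests:
        out += encode_long(len(interests))
        for item in interests:
            out += encode_string(item)
    out += encode_long(0)
    return out


record = encode_person("Martin", 1337, ["daydreaming", "hacking"])
print(len(record), record.hex(" "))
```

The first byte, 0x0c, is the zigzag-encoded length 6. Nothing distinguishes it from the integer 6 except the schema saying that a string comes first.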

So how does Avro support schema evolution? Well, although you need to know the exact schema with which the data was written (the writer’s schema), that doesn’t have to be the same as the schema the consumer is expecting (the reader’s schema). You can actually give two different schemas to the Avro parser, and it uses resolution rules to translate data from the writer schema into the reader schema.

This has some interesting consequences for schema evolution:

  • The Avro encoding doesn’t have an indicator to say which field is next; it just encodes one field after another, in the order they appear in the schema. Since there is no way for the parser to know that a field has been skipped, there is no such thing as an optional field in Avro. Instead, if you want to be able to leave out a value, you can use a union type, like union { null, long } above. This is encoded as a byte to tell the parser which of the possible union types to use, followed by the value itself. By making a union with the null type (which is simply encoded as zero bytes) you can make a field optional.
  • Union types are powerful, but you must take care when changing them. If you want to add a type to a union, you first need to update all readers with the new schema, so that they know what to expect. Only once all readers are updated may the writers start putting this new type in the records they generate.
  • You can reorder fields in a record however you like. Although the fields are encoded in the order they are declared, the parser matches fields in the reader and writer schema by name, which is why no tag numbers are needed in Avro.
  • Because fields are matched by name, changing the name of a field is tricky. You need to first update all readers of the data to use the new field name, while keeping the old name as an alias (since the name matching uses aliases from the reader’s schema). Then you can update the writer’s schema to use the new field name.
  • You can add a field to a record, provided that you also give it a default value (e.g. null if the field’s type is a union with null). The default is necessary so that when a reader using the new schema parses a record written with the old schema (and hence lacking the field), it can fill in the default instead.
  • Conversely, you can remove a field from a record, provided that it previously had a default value. (This is a good reason to give all your fields default values if possible.) This is so that when a reader using the old schema parses a record written with the new schema, it can fall back to the default.
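The name matching, aliases and defaults described in the list above can be sketched in a few lines. This is a toy resolver under heavy assumptions: schemas are plain lists of field dicts, only "string" and "long" are supported, and every writer field is decoded before projection. The real Avro resolution rules cover far more cases.

```python
def read_long(buf, pos):
    z, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos    # undo zigzag


def read_string(buf, pos):
    length, pos = read_long(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


READERS = {"long": read_long, "string": read_string}


def resolve(buf, writer_schema, reader_schema):
    # Step 1: decode in the writer's field order, the only order the bytes have.
    written, pos = {}, 0
    for field in writer_schema:
        written[field["name"]], pos = READERS[field["type"]](buf, pos)
    # Step 2: project onto the reader's schema, matching by name or alias.
    record = {}
    for field in reader_schema:
        for name in [field["name"]] + field.get("aliases", []):
            if name in written:
                record[field["name"]] = written[name]
                break
        else:
            if "default" not in field:
                raise ValueError(f"no value or default for {field['name']}")
            record[field["name"]] = field["default"]
    return record


writer_v1 = [{"name": "userName", "type": "string"},
             {"name": "favouriteNumber", "type": "long"}]
reader_v2 = [{"name": "favouriteNumber", "type": "long"},           # reordered
             {"name": "name", "type": "string", "aliases": ["userName"]},
             {"name": "country", "type": "string", "default": "unknown"}]

buf = bytes([0x0C]) + b"Martin" + bytes([0xF2, 0x14])
resolved = resolve(buf, writer_v1, reader_v2)
print(resolved)
```

The reader sees the renamed field under its new name and the added field with its default, even though neither exists in the bytes.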

This leaves us with the problem of knowing the exact schema with which a given record was written. The best solution depends on the context in which your data is being used:

  • In Hadoop you typically have large files containing millions of records, all encoded with the same schema. Object container files handle this case: they just include the schema once at the beginning of the file, and the rest of the file can be decoded with that schema.
  • In an RPC context, it’s probably too much overhead to send the schema with every request and response. But if your RPC framework uses long-lived connections, it can negotiate the schema once at the start of the connection, and amortize that overhead over many requests.
  • If you’re storing records in a database one-by-one, you may end up with different schema versions written at different times, and so you have to annotate each record with its schema version. If storing the schema itself is too much overhead, you can use a hash of the schema, or a sequential schema version number. You then need a schema registry where you can look up the exact schema definition for a given version number.
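For the database case, a sketch of the "hash of the schema plus registry" idea might look like the following. The SchemaRegistry class, the 8-byte prefix and the use of sorted-key JSON with truncated SHA-256 are all assumptions chosen for illustration; Avro's specification defines its own canonical form and fingerprint algorithms, which a real system would more likely use.

```python
import hashlib
import json

FINGERPRINT_BYTES = 8  # arbitrary prefix length for this sketch


class SchemaRegistry:
    """In-memory stand-in for a shared schema registry service."""

    def __init__(self):
        self._schemas = {}

    def register(self, schema: dict) -> bytes:
        # Serialize deterministically so equal schemas hash equally.
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(canonical.encode("utf-8")).digest()
        fingerprint = digest[:FINGERPRINT_BYTES]
        self._schemas[fingerprint] = schema
        return fingerprint

    def lookup(self, fingerprint: bytes) -> dict:
        return self._schemas[fingerprint]


def tag_record(fingerprint: bytes, payload: bytes) -> bytes:
    """Prefix an encoded record with the fingerprint of its writer's schema."""
    return fingerprint + payload


def split_record(stored: bytes):
    return stored[:FINGERPRINT_BYTES], stored[FINGERPRINT_BYTES:]


registry = SchemaRegistry()
schema_v1 = {"type": "record", "name": "Person",
             "fields": [{"name": "userName", "type": "string"}]}
fp = registry.register(schema_v1)

stored = tag_record(fp, bytes([0x0C]) + b"Martin")
found_fp, payload = split_record(stored)
writer_schema = registry.lookup(found_fp)
print(writer_schema["name"], payload)
```

Each stored record costs 8 extra bytes here, regardless of how large the schema itself is.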

One way of looking at it: in Protocol Buffers, every field in a record is tagged, whereas in Avro, the entire record, file or network connection is tagged with a schema version.

At first glance it may seem that Avro’s approach suffers from greater complexity, because you need to go to the additional effort of distributing schemas. However, I am beginning to think that Avro’s approach also has some distinct advantages:

  • Object container files are wonderfully self-describing: the writer schema embedded in the file contains all the field names and types, and even documentation strings (if the author of the schema bothered to write some). This means you can load these files directly into interactive tools like Pig, and it Just Works™ without any configuration.
  • As Avro schemas are JSON, you can add your own metadata to them, e.g. describing application-level semantics for a field. And as you distribute schemas, that metadata automatically gets distributed too.
  • A schema registry is probably a good thing in any case, serving as documentation and helping you to find and reuse data. And because you simply can’t parse Avro data without the schema, the schema registry is guaranteed to be up-to-date. Of course you can set up a protobuf schema registry too, but since it’s not required for operation, it’ll end up being on a best-effort basis.

Thrift

Thrift is a much bigger project than Avro or Protocol Buffers, as it’s not just a data serialization library, but also an entire RPC framework. It also has a somewhat different culture: whereas Avro and Protobuf standardize a single binary encoding, Thrift embraces a whole variety of different serialization formats (which it calls “protocols”).

Indeed, Thrift has two different JSON encodings, and no fewer than three different binary encodings. (However, one of the binary encodings, DenseProtocol, is only supported in the C++ implementation; since we’re interested in cross-language serialization, I will focus on the other two.)

All the encodings share the same schema definition, in Thrift IDL:

struct Person {
  1: string       userName,
  2: optional i64 favouriteNumber,
  3: list<string> interests
}

The BinaryProtocol encoding is very straightforward, but also fairly wasteful (it takes 59 bytes to encode our example record).

The CompactProtocol encoding is semantically equivalent, but uses variable-length integers and bit packing to reduce the size to 34 bytes.

As you can see, Thrift’s approach to schema evolution is the same as Protobuf’s: each field is manually assigned a tag in the IDL, and the tags and field types are stored in the binary encoding, which enables the parser to skip unknown fields. Thrift defines an explicit list type rather than Protobuf’s repeated field approach, but otherwise the two are very similar.
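Both Thrift sizes can be reproduced with a hand-written sketch of the two layouts. This is not the Thrift library: the function names are hypothetical, only the three field types of the example are handled, and the compact version assumes the short forms (field-id gaps of at most 15, fewer than 15 list items).

```python
import struct

T_I64, T_STRING, T_LIST = 10, 11, 15   # BinaryProtocol type ids
C_I64, C_BINARY, C_LIST = 6, 8, 9      # CompactProtocol type ids


def varint(n: int) -> bytes:
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def zigzag(n: int) -> int:
    return (n << 1) ^ (n >> 63)


def binary_person(user_name, favourite_number, interests) -> bytes:
    def string(s):
        data = s.encode("utf-8")
        return struct.pack(">i", len(data)) + data       # 4-byte length

    # Each field header is a 1-byte type plus a 2-byte field id.
    out = struct.pack(">bh", T_STRING, 1) + string(user_name)
    if favourite_number is not None:
        out += struct.pack(">bh", T_I64, 2) + struct.pack(">q", favourite_number)
    out += struct.pack(">bh", T_LIST, 3)
    out += struct.pack(">bi", T_STRING, len(interests))  # item type, count
    for item in interests:
        out += string(item)
    return out + b"\x00"                                 # stop byte


def compact_person(user_name, favourite_number, interests) -> bytes:
    if len(interests) >= 15:
        raise ValueError("sketch only handles the short list header")

    def string(s):
        data = s.encode("utf-8")
        return varint(len(data)) + data

    fields = [(1, C_BINARY, string(user_name))]
    if favourite_number is not None:
        fields.append((2, C_I64, varint(zigzag(favourite_number))))
    items = b"".join(string(item) for item in interests)
    list_header = bytes([(len(interests) << 4) | C_BINARY])
    fields.append((3, C_LIST, list_header + items))

    out, last_id = bytearray(), 0
    for field_id, ctype, payload in fields:
        # Field id delta and type are packed into a single byte.
        out.append(((field_id - last_id) << 4) | ctype)
        last_id = field_id
        out += payload
    out.append(0)                                        # stop byte
    return bytes(out)


args = ("Martin", 1337, ["daydreaming", "hacking"])
print(len(binary_person(*args)), len(compact_person(*args)))
```

The saving comes almost entirely from headers and lengths: the string data itself is identical in both encodings.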

In terms of philosophy, the libraries are very different though. Thrift favours the “one-stop shop” style that gives you an entire integrated RPC framework and many choices (with varying cross-language support), whereas Protocol Buffers and Avro appear to follow much more of a “do one thing and do it well” style.

The complexity of user experience

The problem of overly complex software is nothing new; it is almost as old as software itself. Over and over again, software systems become so complex that they become very difficult to maintain and very time-consuming and expensive to modify. Most developers hate working on such systems, yet nevertheless we keep creating new, overly complex systems all the time.

Much has been written about this, including classic papers by Fred Brooks (No Silver Bullet), and Ben Moseley and Peter Marks (Out of the Tar Pit). They are far more worth reading than this post, and it is presumptuous of me to think I could add anything significant to this debate. But I will try nevertheless.

Pretty much everyone agrees that if you have a choice between a simpler software design and a more complex design, all else being equal, simpler is better. It is also widely thought to be worthwhile to deliberately invest in simplicity (for example, to spend effort refactoring existing code into a cleaner design), because the one-off cost of refactoring today is easily offset by the benefits of easier maintenance tomorrow. Also, much thought by many smart people has gone into finding ways of breaking down complex systems into manageable parts with manageable dependencies. I don’t wish to dispute any of that.

But there is a subtlety that I have been missing in discussions about software complexity, that I feel somewhat ambivalent about, and that I think is worth discussing. It concerns the points where external humans (people outside of the team maintaining the system) touch the system — as developers using an API exposed by the system, or as end users interacting with a user interface. I will concentrate mostly on user interfaces, but much of this discussion applies to APIs too.

Examples

Let me first give a few examples, and then try to extract a pattern from them. They are examples of situations where, if you want, you can go to substantial engineering effort in order to make a user interface a little bit nicer. (Each example is based on a true story!)

  • You have an e-commerce site, and need to send out order confirmation emails that explain next steps to the customer. Those next steps differ depending on availability, the tax status of the product, the location of the customer, the type of account they have, and a myriad other parameters. You want the emails to only include the information that is applicable to this particular customer’s situation, and not burden them with edge cases that don’t apply to them. You also want the emails to read as coherent prose, not as a bunch of fragmented bullet points generated by if statements based on the order parameters. So you go and build a natural language grammar model for constructing emails based on sentence snippets (providing pluralisation, agreement, declension in languages that have it, etc), in such a way that for any one out of 100 million possible parameter combinations, the resulting email is grammatically correct and easy to understand.
  • You have a multi-step user flow that is used in various different contexts, but ultimately achieves the same thing in each context. (For example, Rapportive has several OAuth flows for connecting your account with various social networks, and there are several different buttons in different places that all lead into the same user flow.) The simple solution is to make the flow generic, and not care how the user got there. But if you want to make the user feel good, you need to imagine what state their mind was in when they entered the flow, and customise the images, text and structure of the flow in order to match their goal. This means you have to keep track of where the user came from, what they were trying to do, and thread that context through every step of the flow. This is not fundamentally hard, but it is fiddly, time-consuming and error-prone.
  • You have an application that requires some arcane configuration. You could take the stance that you will give the user a help page and they will have to figure it out from there. Or you could write a sophisticated auto-configuration tool that inspects the user’s environment, analyses thousands of possible software combinations and configurations (and updates this database as new versions of other products in the environment are released), and automatically chooses the correct settings — hopefully without having to ask the user for help. With auto-configuration, the users never even know that they were spared a confusing configuration dialog. But somehow, word gets around that the product “just works”.

What’s a user requirement?

We said above that simplicity is good. However, taking simplicity to an exaggerated extreme, you end up with software that does nothing. This implies that there are aspects of software complexity that are essential to the user’s problem that is being solved. (Note that I don’t mean complexity of the user interface, but complexity of the actual code that implements the solution to the user’s problem.)

Unfortunately, there is a lot of additional complexity introduced by stuff that is not directly visible or useful to users: stuff that is only required to “grease the wheels”, for example to make legacy components work or to improve performance. Moseley and Marks call this latter type accidental complexity, and argue that it should be removed or abstracted away as much as possible. (Other authors define essential and accidental complexity slightly differently, but the exact definition is not important for the purpose of this post.)

This suggests that it is important to understand what user problem is being solved, and that’s where things start getting tricky. When you say that something is essential because it fulfils a user requirement (as opposed to an implementation constraint or a performance optimisation), that presupposes a very utilitarian view of software. It assumes that the user is trying to get a job done, and that they are a rational actor. But what if, say, you are taking an emotional approach and optimising for user delight?

What if the user didn’t know they had a problem, but you solve it anyway? If you introduce complexity in the system for the sake of making things a little nicer for the user (but without providing new core functionality), is that complexity really essential? What if you add a little detail that is surprising but delightful?

You can try to reduce an emotional decision down to a rational one — for example, you can say that when a user plays a game, it is solving the user’s problem of boredom by providing distraction. Thus any feature which substantially contributes towards alleviating boredom may be considered essential. Such reductionism can sometimes provide useful angles of insight, but I think a lot would be lost by ignoring the emotional angle.

You can state categorically that “great user experience is an essential feature”. But what does that mean? By itself, that statement is so general that it could be used to argue for anything or nothing. User experience is subjective. What’s preferable for one user may be an annoyance for another user, even if both users are in the application’s target segment. Sometimes it just comes down to taste or fashion. User experience tends to have an emotional angle that makes it hard to fit into a rational reasoning framework.

What I am trying to get at: there are things in software that introduce a lot of complexity (and that we should consequently be wary of), and that can’t be directly mapped to a bullet point on a list of user requirements, but that are nevertheless important and valuable. These things do not necessarily provide important functionality, but they contribute to how the user feels about the application. Their effect may be invisible or subconscious, but that doesn’t make them any less essential.

Data-driven vs. emotional design

Returning to the examples above: as an application developer, you can choose whether to take on substantial additional complexity in the software in order to simplify or improve the experience for the user. The increased software complexity actually reduces the complexity from the user’s point of view. These examples also illustrate how user experience concerns are not just a matter of graphic design, but can also have a big impact on how things are engineered.

The features described above arguably do not contribute to the utility of the software: in the e-commerce example, orders will be fulfilled whether or not the confirmation emails are grammatical. In that sense, the complexity is unnecessary. But I would argue that these kinds of user experience improvements are just as important as the utility of the product, because they determine how users feel about it. And how they feel ultimately determines whether they come back, and thus the success or failure of the product.

One could even argue that the utility of a product is a subset of its user experience: if the software doesn’t do the job that it’s supposed to, then that’s one way of creating a pretty bad experience; however, there are also many other ways of creating a bad experience, while remaining fully functional from a utilitarian point of view.

The emotional side of user experience can be a difficult thing for organisations to grapple with, because it doesn’t easily map to metrics. You can measure things like how long a user stayed on your site, how many things they clicked on, conversion rates, funnels, repeat purchase rates, lifetime values… but those numbers tell you very little about how happy you made a user. So you can take a “data-driven” approach to design decisions and say that a feature is worthwhile if and only if it makes the metrics go up — but I fear that an important side of the story is missed if you go solely by the numbers.

Questions

This is as far as my thinking has got: believing that a great user experience is essential for many products; and recognising that building a great UX is hard, can require substantial additional complexity in engineering, and can be hard to justify in terms of logical arguments and metrics. Which leaves me with some unanswered questions:

  • Every budget is finite, so you have to prioritise things, and not everything will get done. When you consider building something that improves user experience without strictly adding utility, it has to be traded off against features that do add utility (is it better to shave a day off the delivery time than to have a nice confirmation email?), and the cost of the increased complexity (will that clever email generator be a nightmare to localise when we translate the site into other languages?). How do you decide on that kind of trade-off?
  • User experience choices are often emotional and intuitive (no number of focus groups and usability tests can replace good taste). That doesn’t make them any more or less important than rational arguments, but combining emotional and rational arguments can be tricky. Emotionally-driven people tend to let emotional choices overrule rational arguments, and rationally-driven people vice versa. How do you find the healthy middle ground?
  • If you’re aiming for a minimum viable product in order to test out a market (as opposed to improving a mature product), does that change how you prioritise core utility relative to “icing on the cake”?

I suspect that the answers to the questions above are “it depends”. More precisely, “how one thing is valued relative to another is an aspect of your particular organisation’s culture, and there’s no one right answer”. That would imply that each of us should think about it; you should have your own personal answers for how you decide these things in your own projects, and be able to articulate them. But it’s difficult — I don’t think hard-and-fast rules have a chance of working here.

I’d love to hear your thoughts in the comments below.

Rethinking caching in web apps

Having spent a lot of the last few years worrying about the scalability of data-heavy applications like Rapportive, I have started to get the feeling that maybe we have all been “doing it wrong”. Maybe what we consider to be “state of the art” application architecture is actually holding us back.

I don’t have a definitive answer for how we should be architecting things differently, but in this post I’d like to outline a few ideas that I have been fascinated by recently. My hope is that we can develop ways of better managing scale (in terms of complexity, volume of data and volume of traffic) while keeping our applications nimble, easy and safe to modify, test and iterate.

My biggest problem with web application architecture is how network communication concerns are often intermingled with business logic concerns. This makes it hard to rearrange the logic into new architectures, such as the precomputed cache architecture described below. In this post I explore why it is important to be able to try new architectures for things like caching, and what it would take to achieve that flexibility.

An example

To illustrate, consider the clichéd Rails blogging engine example:

class Post < ActiveRecord::Base
  attr_accessible :title, :content, :author
  has_many :comments
end

class Comment < ActiveRecord::Base
  attr_accessible :content, :author
  belongs_to :post
end

class PostsController < ApplicationController
  def show
    @post = Post.find(params[:id])
    respond_to do |format|
      format.html  # show.html.erb
      format.json  { render :json => @post }
    end
  end
end

# posts/show.html.erb:

<h1><%= @post.title %></h1>
<p class="author">By <%= @post.author %></p>
<div class="content">
  <%= simple_format(@post.content) %>
</div>
<h2>Comments</h2>
<ul class="comments">
  <% @post.comments.each do |comment| %>
    <li>
      <blockquote><%= simple_format(comment.content) %></blockquote>
      <p class="author"><%= comment.author %></p>
    </li>
  <% end %>
</ul>

Pretty good code by various standards, but it has always irked me a bit that I can’t see where the network communication (i.e. making database queries) is happening. When I look at that Post.find in the controller, I can guess that it probably translates into a SELECT * FROM posts WHERE id = ? internally – unless the same query was already made recently, and ActiveRecord cached the result. And another database query of the form SELECT * FROM comments WHERE post_id = ? might be made as a result of the @post.comments call in the template. Or maybe the comments were already previously loaded by some model logic, and then cached? Or someone decided to eagerly load comments with the original post? Who knows.

The execution flow for an MVC framework request like PostsController#show probably looks something like this:

Typical MVC request flow

Of course it is deliberately designed that way. Your template and your controller shouldn’t have to worry about database queries — those are encapsulated by the model for many good reasons. I am violating abstraction by even thinking about the database whilst I’m in the template code! I should just think of my models as pure, beautiful pieces of application state. How that state gets loaded from a database is a matter that only the models need to worry about.

Adding complexity

In the example above, the amount of logic in the model is minimal, but it typically doesn’t stay that way for long. As the application becomes popular (say, the blogging engine morphs to become Twitter, Tumblr, Reddit or Pinterest), all sorts of stuff gets added: memcache to stop the database from falling over, spam filtering, analytics features, email sending, notifications, A/B testing, more memcache, premium features, ads, upsells for viral loops, more analytics, even more memcache. As the application inevitably grows in complexity, the big monolithic beast is split into several smaller services, and different services end up being maintained by different teams.

As all of this is happening, the programming model typically stays the same: each service in the architecture (which may be a user-facing web server, or an internal service e.g. for user authentication) communicates over the network with a bunch of other nodes (memcached instances, database servers, other application services), processes and combines the data in some way, and then serves it out to a client.

That processing and combining of data we can abstractly call “business logic”. It might be trivially simple, or it might involve half a million lines of parsing, rendering or machine learning code. It might behave differently depending on which A/B test bucket the user is in. It might deal with hundreds of hairy edge cases. Whatever.

At the root of the matter, business logic should be a pure function. It takes a bunch of inputs (request parameters from the client, data stored in various databases and caches, responses from various other services) and produces a bunch of outputs (data to return to the client, data to write back to various databases and caches). It is usually deterministic: given the same inputs, the business logic should produce exactly the same output again. It is also stateless: any data that is required to produce the output or to make a decision has to be provided as an input.
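As a rough illustration of what "business logic as a pure function" could look like for the blog example, here is a sketch in which every input is passed in explicitly and every side effect is returned as data. The Inputs and Outputs types, the show_post function and the field names are all hypothetical, invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inputs:
    params: dict      # request parameters from the client
    post: dict        # row already fetched from the database
    comments: tuple   # rows already fetched from a database or cache


@dataclass(frozen=True)
class Outputs:
    response: dict    # data to return to the client
    writes: tuple     # (store, key, value) triples for the wiring layer


def show_post(inputs: Inputs) -> Outputs:
    """No I/O happens in here: same inputs, same outputs."""
    visible = [c for c in inputs.comments if not c.get("spam", False)]
    if inputs.params.get("order") == "newest":
        visible = sorted(visible, key=lambda c: c["created_at"], reverse=True)
    response = {
        "title": inputs.post["title"],
        "author": inputs.post["author"],
        "comments": [c["content"] for c in visible],
    }
    # Describe the write instead of performing it.
    writes = (("view_counts", inputs.post["id"], inputs.post["views"] + 1),)
    return Outputs(response, writes)


example = Inputs(
    params={"order": "newest"},
    post={"id": 7, "title": "Hello", "author": "Martin", "views": 41},
    comments=(
        {"content": "First!", "created_at": 1},
        {"content": "Buy pills", "created_at": 2, "spam": True},
        {"content": "Nice post", "created_at": 3},
    ),
)
result = show_post(example)
print(result.response["comments"], result.writes)
```

A separate wiring layer would be responsible for gathering the Inputs and applying the writes, which is what makes it possible to call the same function from a request handler or from a job that fills a cache ahead of time.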

By contrast, the network communication logic is all about ‘wiring’. It may end up having a lot of complexity in its own right: sending requests to the right node of a sharded database, retrying failed requests with exponential back-off, making requests to different services in parallel, cross-datacenter failover, service authentication, etc. But the network communication logic ought to be general-purpose and completely independent of your application’s business logic.

Both business logic and network communication logic are needed to build a service. But how do you combine the two into a single process? Most commonly, we build abstractions for each type of logic, hiding the gory implementation details. Much like in the blog example above, you end up calling a method somewhere inside the business logic, not really knowing or caring whether it will immediately return a value that the object has already computed, or whether it will talk to another process on the same machine, or load the value from some remote cache, or make a query on a database cluster somewhere.

It’s good that the business logic doesn’t need to worry about how and when the communication happens. And it’s good that the communication logic is general-purpose and not polluted with application-specific concerns. But I think it’s problematic that network communication may happen somewhere deeply inside a business logic call stack. Let me try to explain why.

Precomputed caches

As your volume of data and your number of users grow, database access often becomes a bottleneck (there are more queries competing for I/O, and each query takes longer when there’s more data). The standard answer to the problem is of course caching. You can cache at many different levels: an individual database row, or a model object generated by combining several sources, or even an entire HTML page ready to serve to a client. I will focus on the mid-to-high-level caches, where the raw data has gone through some sort of business logic before it ends up in the cache.

Most commonly, caches are set up in read-through style: on every query, you first check the cache, and return the value from the cache if it’s a hit; otherwise it’s a miss, so you do whatever is required to generate the value (query databases, apply business logic, perform voodoo), and return it to the client whilst also storing it in the cache for next time. As long as you can generate the value on the fly in a reasonable time, this works pretty well.
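To make the read-through pattern concrete, here is a minimal sketch in Ruby. The class name ReadThroughCache, the misses counter and the in-memory Hash standing in for the cache server are all assumptions made for illustration; a real setup would talk to memcached or similar.

```ruby
# Minimal read-through cache sketch. ReadThroughCache is a hypothetical name,
# and the Hash below merely stands in for a real cache server.
class ReadThroughCache
  attr_reader :misses

  def initialize(&generator)
    @store = {}            # stand-in for memcached or similar
    @generator = generator # slow path: database queries plus business logic
    @misses = 0
  end

  def fetch(key)
    return @store[key] if @store.key?(key) # hit: serve straight from the cache

    @misses += 1
    @store[key] = @generator.call(key)     # miss: generate, store, return
  end
end

cache = ReadThroughCache.new { |post_id| "<h1>Post #{post_id}</h1>" }
cache.fetch(42) # first request: a miss, so the generator block runs
cache.fetch(42) # second request: a hit, served from the store
```

Note that the slow path still runs once per key, which is exactly the first-request problem discussed next.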

I will gloss over cache invalidation and expiry for now, and return to it below.

The most apparent problem with a read-through cache is that the first time a value is requested, it’s always slow. (And if your cache is too small to hold the entire dataset, rarely accessed values will get evicted and thus be slow every time.) That may or may not be a problem for you. One reason why it may be a problem is that on many sites, the first client to request a given page is typically the Googlebot, and Google penalises slow sites in rankings. So if you have the kind of site where Google juice is lifeblood, then your SEO guys may tell you that a read-through cache is not good enough.

So, can you make sure that the data is in the cache even before it is requested for the first time? Well, if your dataset isn’t too huge, you can actually precompute every possible cache entry, put them in a big distributed key-value store and serve them with minimal latency. That has a great advantage: cache misses no longer exist. If you’ve precomputed every possible cache entry, and a key isn’t in the cache, you can be sure that there’s no data for that key.

If that sounds crazy to you, consider these points:

  • A database index is a special case of a precomputed cache. For every value you might want to search for, the index tells you where to find occurrences of that value. If it’s not in the index, it’s not in the database. The initial index creation is a one-off batch job, and thereafter the database automatically keeps it in sync with the raw data. Yes, databases have been doing this for a long time.
  • With Hadoop you can process terabytes of data without breaking a sweat. That is truly awesome power.
  • There are several datastores that allow you to precompute their files in Hadoop, which makes them very well suited for serving the cache that you precomputed. We are currently using Voldemort in read-only mode (research paper), but HBase and ElephantDB can do this too.
  • If you’re currently storing data in denormalized form (to avoid joins on read queries), you can stop doing that. You can keep your primary database in a very clean, normalized schema, and any caches you derive from it can denormalize the data to your heart’s content. This gives you the best of both worlds.

Separating communication from business logic

Ok, say you’ve decided that you want to precompute a cache in Hadoop. As we’ve not yet addressed cache invalidation (see below), let’s just say you’re going to rebuild the entire cache once a day. That means the data you serve out of the cache will be stale, out of date by up to a day, but that’s still acceptable for some applications.

The first step is to get your raw data into HDFS. That’s not hard, assuming you have daily database backups: you can take your existing backup, transform it into a more MapReduce-friendly format such as Avro, and write it straight to HDFS. Do that with all your production databases and you’ve got a fantastic resource to work with in Hadoop.

Now, to build your precomputed cache, you need to apply the same business logic to the same data as you would in an uncached service that does it on the fly. As described above, your business logic takes as input the request parameters from the user and any data that is loaded from databases or services in order to serve that request. If you have all that data in HDFS, and you can work out all possible request parameters, then in theory, you should be able to take your existing business logic implementation and run it in Hadoop.

Business logic can be very complex, so you should probably aim to reuse the existing implementation rather than rewriting it. But doing so requires untangling the real business logic from all the network communication logic.

When your business logic is running as a service processing individual requests, you’re used to making several small requests to databases, caches or other services as part of generating a response (see the blog example above). Those small requests constitute gathering all the inputs needed by the business logic in order to produce its output (e.g. a rendered HTML page).

But when you’re running in Hadoop, this is all turned on its head. You don’t want to be making individual random-access requests to data, because that would be an order of magnitude too slow. Instead you need to use MapReduce to gather all the inputs for one particular evaluation of the business logic into one place, and then run the business logic given those inputs without any network communication. Rather than the business logic pulling together all the bits of data it needs in order to produce a response, the MapReduce job has already gathered all the data it knows the business logic is going to need, and pushes it into the business logic function.

Let’s use the blog example to make this more concrete. The data dependency is fairly simple: when the blog post params[:id] is requested, we require the row in the posts table whose id column matches the requested post, and we require all the rows in the comments table whose post_id column matches the requested post. If the posts and comments tables are in HDFS, it’s a very simple MapReduce job to group together the post with id = x and all the comments with post_id = x.
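As a rough sketch of that grouping step, the following Ruby imitates the map and reduce phases in memory. It is not Hadoop code, and the row hashes and field names are made up for the example.

```ruby
# In-memory imitation of the map and reduce phases; a real job would run on
# Hadoop. The rows and field names here are invented for illustration.
posts = [
  { id: 1, title: "Hello" },
  { id: 2, title: "World" }
]
comments = [
  { id: 10, post_id: 1, body: "First!" },
  { id: 11, post_id: 2, body: "Nice" },
  { id: 12, post_id: 1, body: "Agreed" }
]

# Map phase: emit [key, tagged record] pairs, keyed by the post ID.
pairs = posts.map { |p| [p[:id], [:post, p]] } +
        comments.map { |c| [c[:post_id], [:comment, c]] }

# Shuffle and reduce phase: gather everything with the same key in one place.
inputs_by_post = Hash.new { |h, k| h[k] = { post: nil, comments: [] } }
pairs.each do |post_id, (kind, record)|
  if kind == :post
    inputs_by_post[post_id][:post] = record
  else
    inputs_by_post[post_id][:comments] << record
  end
end
```

Each value in inputs_by_post now holds everything needed to render one blog post, with no further lookups required.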

We can then use a stub database implementation to feed those database rows into the existing Post and Comment model objects. That way we can make the models think that they loaded the data from a database, even though actually we had already gathered all the data we knew they were going to need. The model objects can keep doing their job as normal, and the output they produce can be written straight to the cache.

By this point, two problems should be painfully clear:

  • How does the MapReduce job know what inputs the business logic is going to need in order to work?
  • OMG, implementing stub database drivers, isn’t that a bit too much pain for limited gain? (Note that in testing frameworks it’s not unusual to stub out your database, so that you can run your unit tests without a real database. Still, it’s non-trivial and annoying.)

Both problems have the same cause, namely that the network communication logic is triggered from deep inside the business logic.

Data dependencies

When you look at the business logic in the light of precomputing a cache, it seems like the following pattern would make more sense:

  1. Declare your data dependencies: “if you want me to render the blog post with ID x, I’m going to need the row in the posts table with id = x, and also all the rows in the comments table with post_id = x”.
  2. Let the communication logic deal with resolving those dependencies. If you’re running as a normal web app, that means making queries to one or more databases (or memcached), and maybe talking to other services. If you’re running in Hadoop, it means configuring the MapReduce job to group together all the pieces of data on which the business logic depends.
  3. Once all the dependencies have been loaded, the business logic is now a pure function, deterministic and side-effect-free, that produces our desired output. It can perform whatever complicated computation it needs to, but it’s not allowed access to the network or data stores that weren’t declared as dependencies up front.
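The three steps above could be sketched roughly as follows. Everything here is hypothetical: BlogPostPage, InMemoryResolver and the shape of the dependency declaration are invented for this example, and are not meant to represent any existing framework.

```ruby
# Hypothetical sketch: all names and the dependency format are made up.
module BlogPostPage
  # Step 1: declare which data is needed for a given request.
  def self.dependencies(post_id)
    {
      post:     { table: :posts,    where: { id: post_id } },
      comments: { table: :comments, where: { post_id: post_id } }
    }
  end

  # Step 3: a pure function of the resolved inputs; no I/O happens in here.
  def self.render(post:, comments:)
    title = post.first[:title]
    "#{title} (#{comments.size} comments)"
  end
end

# Step 2: one possible resolver. This one scans in-memory tables; a web app
# would issue database queries, and a batch job would configure a grouping.
class InMemoryResolver
  def initialize(tables)
    @tables = tables
  end

  def resolve(deps)
    deps.transform_values do |dep|
      @tables.fetch(dep[:table]).select do |row|
        dep[:where].all? { |col, val| row[col] == val }
      end
    end
  end
end

tables = {
  posts:    [{ id: 1, title: "Hello" }],
  comments: [{ id: 10, post_id: 1 }, { id: 11, post_id: 1 }]
}
inputs = InMemoryResolver.new(tables).resolve(BlogPostPage.dependencies(1))
page = BlogPostPage.render(**inputs)
```

Swapping InMemoryResolver for a different resolver would leave BlogPostPage untouched, which is the point of the separation.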

This separation would make application architecture very different from the way it is commonly done today. I think this new style would have several big advantages:

  • By removing the assumption that the business logic is handling one request at a time, it becomes much easier to run the business logic in completely different contexts, such as in a batch job to precompute a cache. (No more stubbing out database drivers.)
  • Testing becomes much easier. All the tricky business logic for which you want to write unit tests is now just a function with a bunch of inputs and a bunch of outputs. You can easily vary what you put in, and easily check that the right thing comes out. Again, no more stubbing out the database.
  • The network communication logic can become a lot more helpful. For example, it can make several queries in parallel without burdening the business logic with a lot of complicated concurrency stuff, and it can deduplicate similar requests.
  • Because the data dependencies are very clearly and explicitly modelled, the system becomes easier to understand, and it becomes easier to move modules around, split a big monolithic beast into smaller services, or combine smaller services into bigger, logical units.

I hope you agree that this is a very exciting prospect. But is it practical?

In most cases, I think it would not be very hard to make business logic pure (i.e. stop making database queries from deep within) — it’s mostly a matter of refactoring. I have done it to substantial chunks of the Rapportive code base, and it was a bit tedious but perfectly doable. And the network communication logic wouldn’t have to change much at all.

The problem of making this architecture practical hinges on having a good mechanism for declaring data dependencies. The idea is not new — for instance, LinkedIn have an internal framework for resolving data dependencies that queries several services in parallel — but I’ve not yet seen a language or framework that really gets to the heart of the problem.

Adapting the blog example above, this is what I imagine such an architecture would look like:

[Figure: Concept for using a dependency resolver]

We still have models, and they are still used as encapsulations of state, but they are no longer wrappers around a database connection. Instead, the dependency resolver can take care of the messy business of talking to the database; the models are pure and can focus on the business logic. The models don’t care whether they are instantiated in a web app or in a Hadoop cluster, and they don’t care whether the data was loaded from a SQL database or from HDFS. That’s the way it should be.

In my spare time I have started working on a language called Flowquery (don’t bother searching, there’s nothing online yet) to solve the problem of declaring data dependencies. If I can figure it out, it should make precomputed caches and all the good things above very easy. But it’s not there yet, so I don’t want to oversell it.

But wait, there is one more thing…

Cache invalidation

There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton

How important is it that the data in your cache is up-to-date and consistent with your “source of truth” database? The answer depends on the application and the circumstances. For example, if the user edits their own data, you almost certainly want to show them an up-to-date version of their own data post-editing, otherwise they will assume that your app is broken. But you might be able to get away with showing stale data to other users for a while. For data that is not directly edited by users, stale data may always be ok.

If staleness is acceptable, caching is fairly simple: on a read-through cache you set an expiry time on a cache key, and when that time is reached, the entry falls out of the cache. On a precomputed cache you do nothing, and just wait until the next time you recompute the entire thing.

In cases where greater consistency is required, you have to explicitly invalidate cache entries when the original data changes. If just one cache key is affected by a change, you can write-through to that cache key when the “source of truth” database is updated. If many keys may be affected, you can use generational caching and clever generalisations thereof. Whatever technique you use, it usually ends up being a lot of manually written, fiddly and error-prone code. Not a great joy to work with, hence the terribly clichéd quote above.

But… observe the following: in our efforts to separate pure business logic from network communication logic, we decided that we needed to explicitly model the data dependencies, and only data sources declared there are permitted as inputs to the business logic. In other words, the data dependency framework knows exactly which pieces of data are required in order to generate a particular piece of output — and conversely, when a piece of (input) data changes, it can know exactly which outputs (cache entries) may be affected by the change!

This means that if we have a real-time feed of changes to the underlying databases, we can feed it into a stream processing framework like Storm, run the data dependency analysis in reverse on every change, recompute the business logic for each output affected by the change in input, and write the results to another datastore. This store sits alongside the precomputed cache we generated in a batch process in Hadoop. When you want to query the cache, check both the output of the batch process and the output of the stream process. If the stream process has generated more recent data, use that, otherwise use the batch process output.
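Here is a rough sketch of both halves of that idea for the blog example: mapping a changed row back to the affected cache keys, and merging the two stores at read time. The stores are plain hashes, and the entry format with value and computed_at fields is an assumption made for this example, not something Storm or Hadoop prescribes.

```ruby
# Hypothetical sketch; the stores are plain hashes and the entry format
# ({ value:, computed_at: }) is an assumption made for this example.

# Reverse dependency analysis for the blog example: a changed row maps back
# to the cache keys (post IDs) whose output depends on it.
def affected_post_ids(table, row)
  case table
  when :posts    then [row[:id]]
  when :comments then [row[:post_id]]
  else []
  end
end

# Read path: prefer whichever store holds the more recent entry.
def lookup(key, batch_store, stream_store)
  batch = batch_store[key]
  fresh = stream_store[key]
  return batch if fresh.nil?
  return fresh if batch.nil?
  fresh[:computed_at] > batch[:computed_at] ? fresh : batch
end

batch_store  = { 1 => { value: "Hello (0 comments)", computed_at: 100 } }
stream_store = {}

# A new comment arrives on the change feed after the batch run, so the
# stream process recomputes the affected entry.
affected_post_ids(:comments, { id: 10, post_id: 1 }).each do |post_id|
  stream_store[post_id] = { value: "Hello (1 comments)", computed_at: 150 }
end

current = lookup(1, batch_store, stream_store)
```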

If you’ve been following recent news in Big Data, you may recognise this as an application of Nathan Marz’ lambda architecture (described in detail in his upcoming book). I cannot thank Nathan enough for his amazing work in this area.

In this architecture, you get the benefits of a precomputed cache (every request is fast, including the first one), it keeps itself up-to-date with the underlying data, and because you have already declared your data dependencies, you don’t need to manually write cache invalidation code! The same dependency declaration can be used in three different ways:

  1. In ‘online’ mode in a service or web app, to drive the network communication logic so that it makes all the queries and requests required to serve an incoming request, and to help with read-through caching.
  2. In ‘offline’ mode in Hadoop, to configure a MapReduce pipeline that brings together all the required data in order to run it through the business logic and generate a precomputed cache of all possible queries.
  3. In ‘nearline’ mode in Storm, to configure a stream processing topology that tracks changes to the underlying data, determines which cache keys need to be invalidated, and recomputes the cache values for those keys using the business logic.

I am designing Flowquery so that it can be used in all three modes — you should be able to write your data dependencies just once, and let the framework take care of bringing all the necessary data together so that the business logic can act on it.

My hope is to make caching and cache invalidation as simple as database indexes. You declare an index once, the database runs a one-off batch job to build the index, and thereafter automatically keeps it up-to-date as the table contents change. It’s so simple to use that we don’t even think about it, and that’s what we should be aiming for in the realm of caching.

The project is still at a very early stage, but hopefully I’ll be posting more about it as it progresses. If you’d like to hear more, please leave your email address and I’ll send you a brief note when I post more. Or you can follow me on Twitter or App.net.

Thanks to Nathan Marz, Pete Warden, Conrad Irwin, Rahul Vohra and Sam Stokes for feedback on drafts of this post.

Java's hashCode is not safe for distributed systems

As you probably know, hash functions serve many different purposes:

  1. Network and storage systems use them (in the guise of checksums) to detect accidental corruption of data.
  2. Cryptographic systems use them to detect malicious corruption of data and to implement signatures.
  3. Password authentication systems use them to make it harder to extract plaintext passwords from a database.
  4. Programming languages use them for hash maps, to determine in which hash bucket a key is placed.
  5. Distributed systems use them to determine which worker in a cluster should handle a part of a large job.

All those purposes have different requirements, and different hash functions exist for the various purposes. For example, CRC32 is fine for detecting bit corruption in Ethernet, as it’s really fast and easy to implement in hardware, but it’s useless for cryptographic purposes. SHA-1 is fine for protecting the integrity of a message against attackers, as it’s cryptographically secure and also reasonably fast to compute; but if you’re storing passwords, you’re probably better off with something like bcrypt, which is deliberately slow in order to make brute-force attacks harder.

Anyway, that’s all old news. Today I want to talk about points 4 and 5, and why they are also very different from each other.

Hashes for hash tables

We use hash tables (dictionaries) in programming languages all the time without thinking twice. When you insert an item into a hash table, the language computes a hash code (an integer) for the key, uses that number to choose a bucket in the hash table (typically mod n for a table of size n), and then puts the key and value in that bucket in the table. If there’s already a value there (a hash collision), a linked list typically takes care of storing the keys and values within the same hash bucket. In Ruby, for example:

$ ruby --version
ruby 1.8.7 (2011-06-30 patchlevel 352) [i686-darwin11.0.0]

$ pry
[1] pry(main)> hash_table = {'answer' => 42}
=> {"answer"=>42}
[2] pry(main)> 'answer'.hash
=> -1246806696
[3] pry(main)> 'answer'.hash
=> -1246806696
[4] pry(main)> ^D

$ pry
[1] pry(main)> 'answer'.hash
=> -1246806696
[2] pry(main)> "don't panic".hash
=> -464783873
[3] pry(main)> ^D

When you add the key 'answer' to the hash table, Ruby internally calls the #hash method on that string object. The method returns an arbitrary number, and as you see above, the number is always the same for the same string. A different string usually has a different hash code. Occasionally you might get two keys with the same hash code, but it’s extremely unlikely that you get a large number of collisions in normal operation.

The problem with the example above: when I quit Ruby (^D) and start it again, and compute the hash for the same string, I still get the same result. But why is that a problem, you say, isn’t that what a hash function is supposed to do? – Well, the problem is that I can now put on my evil genius hat, and generate a list of strings that all have the same hash code:

$ pry
[1] pry(main)> "a".hash
=> 100
[2] pry(main)> "\0a".hash
=> 100
[3] pry(main)> "\0\0a".hash
=> 100
[4] pry(main)> "\0\0\0a".hash
=> 100
[5] pry(main)> "\0\0\0\0a".hash
=> 100
[6] pry(main)> "\0\0\0\0\0a".hash
=> 100

Any server in the world running the same version of Ruby will get the same hash values. This means that I can send a specially crafted web request to your server, in which the request parameters contain lots of those strings with the same hash code. Your web framework will probably parse the parameters into a hash table, and they will all end up in the same hash bucket, no matter how big you make the hash table. Whenever you want to access the parameters, you now have to iterate over a long list of hash collisions, and your swift O(1) hash table lookup is suddenly a crawling slow O(n).
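To see how the buckets degenerate, here is a toy illustration. The function java_style_hash is a hypothetical helper that mimics the well-known h = 31*h + c scheme of Java’s String.hashCode(); it is not the function Ruby 1.8.7 actually used (so the numbers differ from the session above), and it ignores 32-bit overflow for simplicity. It does share the property that leading NUL bytes contribute nothing to the result.

```ruby
# Toy illustration only. java_style_hash imitates the h = 31*h + c scheme;
# it is NOT Ruby 1.8.7's real hash function and it skips 32-bit overflow.
def java_style_hash(str)
  str.each_byte.reduce(0) { |h, byte| 31 * h + byte }
end

# Leading NUL bytes leave h at zero, so every one of these keys collides.
evil_keys = (0...1000).map { |n| "\0" * n + "a" }

num_buckets = 4096
buckets = Hash.new { |h, k| h[k] = [] }
evil_keys.each { |key| buckets[java_style_hash(key) % num_buckets] << key }

buckets.size              # => 1
buckets.values.first.size # => 1000
```

With all 1000 keys in a single bucket, every lookup has to walk that one list, regardless of how many buckets the table has.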

I just need to make a small number of these evil requests to your server and I’ve brought it to its knees. This type of denial of service attack was already described back in 2003, but it only became widely known last year, when Java, Ruby, Python, PHP and Node.js all suddenly scrambled to fix the issue.

The solution is for the hash code to be consistent within one process, but to be different for different processes. For example, here is a more recent version in Ruby, in which the flaw is fixed:

$ ruby --version
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.3.0]

$ pry
[1] pry(main)> 'answer'.hash
=> 968518855724416885
[2] pry(main)> 'answer'.hash
=> 968518855724416885
[3] pry(main)> ^D

$ pry
[1] pry(main)> 'answer'.hash
=> -150894376904371785
[2] pry(main)> ^D

When I quit Ruby and start it again, and ask for the hash code of the same string, I get a completely different answer. This is obviously not what you want for cryptographic hashes or checksums, since it would render them useless — but for hash tables, it’s exactly right.

Hashes for distributed systems

Now let’s talk about distributed systems: systems in which you have more than one process, probably on more than one machine, and they are talking to each other. If you have something that’s too big to fit on one machine (too much data to fit on one machine’s disks, too many requests to be handled by one machine’s CPUs, etc), you need to spread it across multiple machines.

How do you know which machine to use for a given request? Unless you have some application-specific partitioning that makes more sense, a hash function is a simple and effective solution: hash the name of the thing you’re requesting, mod number of servers, and that’s your server number. (Though if you ever want to change the number of machines, consistent hashing is probably a better bet.)

For this setup you obviously don’t want a hash function in which different processes may compute different hash codes for the same value, because you’d end up routing requests to the wrong server. You can’t use the same hash function as the programming language uses for hash tables.

Unfortunately, this is exactly what Hadoop does. Storm, a stream processing framework, does too. Both use the Java Virtual Machine’s Object.hashCode() method.

I understand the use of hashCode() — it’s very tempting. On strings, numbers and collection classes, hashCode() always returns a consistent value, apparently even across different JVM vendors. It’s like that despite the documentation for hashCode() explicitly not guaranteeing consistency across different processes:

Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

And once in a while, a bold library comes along that actually returns different hashCode() values in different processes – Protocol Buffers, for example – and people get quite confused.

The problem is that although the documentation says hashCode() doesn’t provide a consistency guarantee, the Java standard library behaves as if it did provide the guarantee. People start relying on it, and since backwards-compatibility is rated so highly in the Java community, it will probably never ever be changed, even though the documentation would allow it to be changed. So the JVM gets the worst of both worlds: a hash table implementation that is open to DoS attacks, but also a hash function that can’t always safely be used for communication between processes. :(

Therefore…

So what I’d like to ask for is this: if you’re building a distributed framework based on the JVM, please don’t use Java’s hashCode() for anything that needs to work across different processes. Because it’ll look like it works fine when you use it with strings and numbers, and then someday a brave soul will use (e.g.) a protocol buffers object, and then spend days banging their head against a wall trying to figure out why messages are getting sent to the wrong servers.

What should you use instead? First, you probably need to serialize the object to a byte stream (which you need to do anyway if you’re going to send it over the network). If you’re using a serialization that always maps the same values to the same sequence of bytes, you can just hash that byte stream. A cryptographic hash such as MD5 or SHA-1 would be ok for many cases, but might be a bit heavyweight if you’re dealing with a really high-throughput service. I’ve heard good things about MurmurHash, which is non-cryptographic but lightweight and claims to be well-behaved.
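As a minimal sketch of that approach in Ruby, assuming the key has already been serialized to a byte string in a deterministic way (server_for is a hypothetical helper name, and taking the first four bytes of the digest is just one simple way to turn it into a number):

```ruby
require 'digest'

# Sketch, assuming serialized_key is a deterministic byte string for the key.
# server_for is a hypothetical helper, not part of any framework.
def server_for(serialized_key, num_servers)
  digest = Digest::MD5.digest(serialized_key) # 16 raw bytes, identical in every process
  digest.unpack('N').first % num_servers      # first 4 bytes as a big-endian integer
end

server_for("user:1234", 8) # same server number on every machine and every run
```

Because the result depends only on the bytes of the key, any process computes the same server number, unlike a per-process randomized hash. The plain mod still reshuffles most keys when num_servers changes, which is where consistent hashing comes in.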

If your serialization doesn’t always produce the same sequence of bytes for a given value, then you can still define a hash function on the objects themselves. Just please don’t use hashCode(). It’s ok for in-process hash tables, but distributed systems are a different matter.

(Oh, and in case you were wondering: it looks like the web servers affected by Java’s hashCode collisions fixed the problem not by changing to a different hash function, but simply by limiting the number of parameters: Tomcat, Jetty.)

My FounderLY interview

Matthew from FounderLY wondered what it would have been like to watch raw video footage of Steve Jobs, Bill Gates, and other tech founders during their formative years. So he’s been going around interviewing young startup founders, for posterity and for other founders’ inspiration. A pretty interesting effort.

A few weeks ago he asked whether he could interview me for the site. Although it would be rather presumptuous to put myself in the category of potential future Steve Jobses, I agreed.

So here you go – a tidily scripted set of questions from Matthew, and some chaotically unscripted stream-of-consciousness replies from me. The video comes in two parts, about 22 minutes in total, and a transcript is below.

Martin Kleppmann interview, part 1 from FounderLY on Vimeo

Martin Kleppmann interview, part 2 from FounderLY on Vimeo

Transcript

Matthew Wise: Hi this is Matthew Wise with FounderLY.com. We empower entrepreneurs to have a voice and share their story with the world, enabling others to learn about building products and starting companies.

I’m really excited today because I’m here with Martin Kleppmann, founder of Rapportive. Rapportive shows you everything about your contacts inside your email box, enabling you to see who people are and where they are based, so that you can connect and collaborate over shared interests. So, Martin, we’d love you to give our audience a brief bio.

Martin: Sure. I’m originally from Germany, which explains my weird accent, and then I went to the UK for several years to study computer science. That was in Cambridge. After that, I started a startup; it was called Go Test It, we made a tool for automated cross-browser testing of websites. That was pretty cool, and it was acquired a few years ago. After that, I was looking around for something new to do, and together with two friends we started Rapportive.

What we do now is to pull photos, job details from LinkedIn, recent tweets and all of this stuff into Gmail, and show it right there.

Matthew: What makes Rapportive unique, who is it for and why are you so passionate about it?

Martin: It’s really for people who do a lot of email, particularly emailing with people who you don’t really know well. If you only ever email with ten different people, then you wouldn’t need it — but most of us, particularly startup founders, are constantly dealing with investors, outside advisors, users emailing us, potential customers, potential partners, people on emailing lists… all of these people, we vaguely know who they are, but not really. And actually, it is really important that you build this personal contact with them, and get to know them personally.

Previously, when people got an email from someone, they would go and search Google, try to find their Twitter account, try to find them on LinkedIn, and this just takes a lot of time. And we’ve just automated all of that. The idea is that now, you can actually respond to people personally and build up that personal connection. It’s little things: even just being able to see the photo of someone in your email… firstly, that’s a deep visceral connection: you connect much more with them than if you’re just looking at a wall of text; and also, if you meet them in real life, well, you’re much more likely to be able to recognise them. I think that makes your email a better place; it’s really excellent.

Matthew: What are some of the technology and market trends that currently exist, and how do you see things developing in the future in your space?

Martin: I’m not sure about the big trends. There are a lot of things, but they are all very subtle things. For example, people caring a lot about user experience, and we take that really seriously. We put ridiculous amounts of effort into making sure that stuff works really nicely.

Other things that are happening: we are having to deal with more and more people, and people expect that you don’t just get an automated stock reply, but that people actually engage with you personally. That’s the future, I think. We’ve already got that in one-to-one communication between individuals, but the big trend is that companies as a whole are starting to be more personal with the outside. They are no longer this corporate brand, this cold, anonymous thing, but you actually expect to be able to see the people behind that brand, and be able to engage with them directly and build a relationship. And those relationships are what matter, because… if you’re just competing on price, your customers can just go somewhere else, but if you can build up a relationship with your customers, that’s really really powerful.

We think that’s what we are enabling, by giving you this social substrate for your communications.

Matthew: Can you tell us what inspired you to start Rapportive? Was there an “aha” moment, or did market research lead you to the opportunity? What’s the story behind it?

Martin: It really came from something we wanted ourselves. I think everyone says this! In my previous startup (and my cofounders also had a previous startup), we were all trying to do a lot of engaging with people personally, getting out there, learning a lot from people, really understanding where they were coming from. And that was so much effort! I’d keep lists of people in a custom database or in spreadsheets or in CRM systems like Highrise, and I’d have to keep them up to date by hand. I’d make a lot of notes about people, even just for myself, just so that I could remember when I came back to them six months later: what interactions I’d had with them, what we’d talked about.

But I then found that all of this information would go stale: for example, I had entered someone’s job details and then they’d change job… and I’m not going to go and re-enter all of this stuff! It’s already out there on the web — really, software should just do this stuff automatically; there’s no reason why I should have to type this in again.

And then, also, why should I have to always change over to another browser tab in order to search for something, and have five tabs open with different searches for stuff? It’s just ridiculous, this stuff should be in the tool which I use all the time anyway, which is email.

And so, those are the two premises we started with. We wanted something which keeps itself up-to-date automatically from all the data which is already out there; you shouldn’t have to re-enter anything. And secondly, it should be in the workflow of the tool you already use, which, for most of us is primarily email. And on that premise we said: what can we build? Oh, well, let’s just stick something on the side of Gmail, see how it works. And people loved it.

Matthew: Excellent! Who is your cofounder, how did you meet, what qualities were you looking for in a cofounder, and how did you know they’d be a good fit?

Martin: I have two cofounders, Rahul and Sam; there are three of us. They are both really excellent people. I had known them for a while before starting: we were together in an office space, a kind of co-working space in Cambridge, UK. They were working on their previous startup, and I was working on my previous startup; we worked together a bit, we had lunch together every day, and just ended up talking about a lot of things.

We found that partly we thought the same in a lot of ways, and partly we also had different but nicely complementary ways of thinking. We had a shared culture but often different perspectives, which helped us find the best way of doing stuff together. And that’s really the basis on which we work. I think we have a very strong sense of a culture and making sure we work together very well, so we are constantly getting better at what we do.

Matthew: From idea to product launch, how long did it take, and when did you actually launch?

Martin: It was pretty quick actually: from first UI mockups to launch it was less than two months. We weren’t actually intending to launch: we had just put up this little website. We were applying for Y Combinator at the time and we also had some other people who were interested, so we wanted to show some potential investors what we were doing. We put up a little website; it wasn’t protected, but was just at an unknown URL.

And then somehow the press got hold of this, and within a day we found ourselves with 10,000 users on our hands, because it just went wild through all of the blogs. That was a totally crazy experience: we had thought, “well, we’ve built this little thing, let’s give it to 10 people and see how it works”, and suddenly we have this massive load of people coming in. And we were working, working very hard, firstly trying to keep the servers up, but fortunately they held up quite nicely. Then also responding to all of the tweets, responding to all of the emails that were coming in. There was lots and lots of stuff happening very quickly; at that point we knew that we were on to something pretty exciting.

Matthew: And then you formally launched when?

Martin: We considered that our launch after the fact; we then said, “Well, OK, I guess we’ve launched now. Oh well, we’ve launched.” And then since then we’ve, at times, launched new features but that original bit of press we regard as our real launch.

Matthew: Are there any unique metrics or social proof about Rapportive that you’d like to share with our audience?

Martin: I think the thing I find most exciting: we always have a Twitter search going on — we have a big screen in the office, showing what people are saying about Rapportive on Twitter — and there’s just this constant stream of people loving it. I’m really humbled all of the time I see this. Every hour there’s stuff coming in from people saying things like, “This product has changed our life.”

And that’s just amazing: when people will actually go out of their way to say something like that, and we’re not even particularly prompting them. So yeah, we have hundreds of thousands of users at the moment, but the important thing is really how much people care about it.

Matthew: We know founders face unique challenges when they start a company. What was the hardest part about launching or starting Rapportive, and how did you overcome this obstacle?

Martin: So we had a bit of a frustrating phase over the last summer. We were working very, very hard and there was lots going on, but our product was making very little visible progress, because we were spending all of our time firefighting, scaling our database because we had so much stuff coming in that we had to do a lot of work to re-architect it. We were doing a lot of groundwork for features which are just coming out now, but in technical groundwork there are months of work which is just invisible. We were moving country because we were all coming from the UK, moving to San Francisco, and we were fighting with US immigration. We were also spending a lot of time on support — which is good, it’s really valuable, because we learn a lot about the problems that people have, but again it’s very time consuming.

So, with all of those things, it’s all useful stuff; there’s nothing really wasteful there. But on the other hand, our product wasn’t making progress, and people were starting to ask, “Well, you’ve been around for six months now, nine months now, and you’ve not really released any exciting new features. What’s going on?” And we were just saying, “Yeah, we’re trying to get to it, we’re doing what we can!”

And then I was so happy when, towards the end of 2010, we got over this big hump of stuff, and now we’re putting out features again and there is much more visible progress. So that was a fairly hard phase to go through, but I’m really glad we got over it. In the end you just have to work through it. You just have to not give up, just keep on going, keep on going, even if it’s getting tough.

Matthew: Since you’ve been in operation, what have you learned about your business and your users that you didn’t realize before you launched?

Martin: When we first launched I was a bit cautious. I was wondering: “are people going to be really freaked out by seeing how much information is actually publicly available about them on the web?” You know, when you think about it rationally, it’s obvious: you can just search for someone on Google, and for most people you’ll actually get a pretty good idea of who this person is just by looking at the search results. And we’ve just taken away a step by automating a lot of that search, making it more convenient by putting it in email.

And so I was expecting that there’d be a lot of people who would go, “Oh my God, no, privacy is dead!” But we tried to manage that very carefully: whenever anyone was concerned, we listened to them and responded to any concerns very quickly, and explained what we’re doing, why we’re doing it and why we think it’s absolutely fine. We are all very privacy conscious and we make that very clear as well. We don’t mess around with people’s private data; we only show information which people actually want to be public.

And that is something I found surprising: just how quickly we can defuse any situation. If anyone was upset we’d just talk to them quietly, patiently, and explain what’s going on. If there was any problem, fix it quickly — and all the problems suddenly go away. And that’s really encouraging, because it means that we seem to actually be doing the right thing: pushing the envelope a bit. But yes, it works.

Matthew: What is it that you make look easy? What skill or talent comes easy or intuitively to you, and what has been difficult and how do you manage that?

Martin: I’d say: what we, as a team, are particularly good at is product design. Making something which is very neat, stays out of your way, but is still powerful; which does exactly the stuff you need, not more, not less; and just behaves the way people expect it to behave, without running into a weird corner where you don’t know what to do.

And that is actually really hard to achieve. The amount of time we spend on optimizing the workflows for different users, depending on which starting state they’re coming from, which screens they have to go through and exactly what button we can show in which place, exactly what copy we use, what words we use to describe things, then taking them through the flow… and then, to the user, all it looks like is: “oh, I clicked a button, a pop-up appeared, I clicked another button and it worked.”

That’s something we really enjoy: making that look easy, but a lot of work goes into it. In the end people just appreciate it as a product which is really nicely designed, which just works and which gives them a kind of warm, fuzzy feeling.

Matthew: What’s the most important lesson you’ve learned since launching Rapportive?

Martin: The most important lesson? I’ve not really graded them in a particular priority.

I’d say, off the top of my head: caring about user experience and caring about users was something we thought from the start was really important — and that has really been validated. People appreciate us for having a product which just works nicely, and which has the little details thought out.

People appreciate that we get back to them quickly, that we’re always very friendly when responding to them, that we’re trying to be personal where we can.

Matthew: Martin, what bit of advice do you wish you would have known before starting Rapportive?

Martin: I think what’s really interesting is that in a startup everything is magnified. If you have any issue early on, that will just continue, continue, get bigger and bigger, so if you have any issue early on then make sure you fix it early on. I think we’ve generally done a pretty good job of that. But it’s worth doing that really consciously.

Certain things are really hard, but you need to get good at them. For example, communicating and sharing intuitions, that’s a topic that I’ve been thinking about a lot. We find that, since we’re three cofounders, we often have similar ideas about things, but then often find that they differ in subtle ways. Really what we want to do is to combine our three intuitions into one, so that together, we have a really good broad and also deep insight into what people want. That requires that you find ways of explaining to the others not just what you think, but why you think it.

And that’s really hard to learn, and we’ve gradually been getting better at that. As you go about things, just be conscious of the fact that it’s going to take a lot of effort and time, even just to learn to speak the same language. You think you all speak English, but then you find, of course, that you make up your own words to describe the domain you’re working in. A lot of things are just completely non-obvious.

You get a lot of conflicting advice from outside mentors. We have a lot of really good investors, advisors, mentors, and often they say completely contradictory things — and that’s fine. You just need to learn to absorb those things into your own intuition, and within the team work out how you can share those intuitions. Then you can have a coherent vision, all together, for what you’re going to build, why it’s important, how you’re going to go forward.

Matthew: What bit of advice would you like to share with our audience about launching a startup? If you have to distill it, what are the key elements?

Martin: One thing, which worked in our favor but is not necessarily particularly replicable: if your product works well for journalists, then journalists will write about it quite a lot. We didn’t realize this initially, but it happened to be the case that Rapportive works really well for people who deal with a lot of incoming weird stuff from lots of people they don’t know, and need to assess very quickly whether the sources are reliable. And, well, that’s pretty much what journalists do.

It was also the case that when we started Rapportive, a lot of the data we had about people was not particularly great, but bloggers tend to be the kind of people who are very present on social media, so we had really great data for them! And that worked in our favor. Since then we’ve got a lot better at data for everyone else, and now we’ve got a pretty high coverage rate for everyone. But for that initial launch, just working well for reporters and bloggers was pretty good.

But of course, you can’t choose your startup based on the fact that it’s going to be useful for bloggers, so that’s not very useful advice.

There are lots of different schools of thought for launching and they all kind of make sense. There’s the “launch small and make sure that you’re continuously learning” school, and that makes a lot of sense. And then there’s also the school which observes that, if you can get a lot of very quick press that generates a lot of excitement and a lot of buzz, that’s also valuable. In the end, with these things there’s never a right answer; you just have to take in all of the bits of advice you hear and create your personal conglomerate of what makes sense.

Matthew: Before we close, I would love for you to give our audience your vision of Rapportive and how you hope it will change the world.

Martin: We’ve got a lot of really exciting things coming. I don’t want to talk about them in too much detail, but to give a rough outline:

I think, firstly, the inbox is a really, really interesting place, because that’s where all of your communications come together. Email is the primary one we use at the moment; I don’t know, maybe it’ll be Facebook mail within two years’ time, but that doesn’t really matter, that’s beside the point.

The point is that people are really, really opinionated about which tool they want to use, and getting people to change tool is really, really hard. So we’re building Rapportive with the philosophy that we don’t want people to change behavior; we just want people to continue doing what they’re doing already, and just make it better.

Just add those little magic touches, add little things which either save you time, or which take something which was previously laborious (and required switching to other browser tabs and required re-entering of data), and make all of that go away. Just make it be there, and make common tasks feel natural.

That’s the philosophy with which we’re going about things, and that seems to be working pretty well.

Matthew: Excellent. Well, Martin, it’s been a pleasure having you as a guest on FounderLY. We’re rooting for your success at Rapportive. For those in our audience who’d like to learn more you can visit their website at www.rapportive.com and register to become a user and join their community. This is Matthew Wise with FounderLY. Thanks so much, Martin.

Martin: Thank you, Matthew.