An Introduction to Protocol Buffers

Aug 28 2015

Protocol buffers have come up in a few conversations that I’ve had recently. They seem to be gaining mindshare with developers, but I had to confess that I didn’t know too much about them. Until now, that is, because I’ve done some research which I am sharing here in hopes that it’ll help someone else.

What are protocol buffers?

This one is easy: I can just copy and paste from the official website. Protocol buffers are “Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data.” OK, post over. Think of them as a nice middle ground between JSON and XML: simple and easy to use like JSON, yet structured like XML. Protocol buffer messages can be represented in a text format for humans to read, but are encoded in a binary format when sent over the wire. This makes them faster to transmit and quicker to parse. According to Google’s guide, compared with XML messages they’re “3 to 10 times” smaller and “20 to 100 times faster.”

How do they work?

Implementing protocol buffers involves three different parts:

  • A .proto file or files to define message formats
  • A protocol buffers compiler
  • A protocol buffers API in your programming language of choice to read and write messages

In a nutshell: you first define your message schema in one or more .proto files. You then run a protocol buffers compiler for your language to generate a class with setters and getters that handle the reading and writing of the data for you. And obviously you’ll use the API for your programming language to integrate this writing and reading of data with the rest of your code.

So what does a .proto format look like? This is shamelessly taken straight from the Google docs:

package tutorial;

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

message AddressBook {
  repeated Person person = 1;
}

The first time I saw this format I was struck by how easy it was to understand. You can define messages, such as the Person entity above. Messages are made up of typed fields that can be required, optional or repeated (zero or more times). Standard primitive types are available to you, but you can also define custom ones such as the nested PhoneNumber message above. You can also restrict a field to a set of predefined values with an enum, as seen in PhoneType above, and give an optional field a default. The numbers are unique “tags” that identify each field in the binary encoding, which saves space because the field names themselves are never sent.
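
To make the “tags” idea concrete, here is a minimal sketch in plain Python of how a field number ends up in the binary encoding. It is not the official library: the helper names (encode_varint, encode_key, encode_person and friends) are made up for illustration, it only handles non-negative integers and strings, and it only covers the two required fields of Person. It relies on the documented wire format rule that each field starts with a key of (field_number << 3) | wire_type, written as a varint.

```python
WIRE_VARINT = 0            # wire type used for int32, enum, bool, ...
WIRE_LENGTH_DELIMITED = 2  # wire type used for string, bytes, nested messages


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)         # high bit clear: last byte
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """The tag from the .proto file and the wire type share one varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Key, then the byte length, then the UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_key(field_number, WIRE_LENGTH_DELIMITED) + encode_varint(len(data)) + data


def encode_int32_field(field_number: int, value: int) -> bytes:
    """Key, then the value as a varint (non-negative values only here)."""
    return encode_key(field_number, WIRE_VARINT) + encode_varint(value)


def encode_person(name: str, person_id: int) -> bytes:
    """Hypothetical helper: encode Person's required fields, name = 1 and id = 2."""
    return encode_string_field(1, name) + encode_int32_field(2, person_id)


if __name__ == "__main__":
    print(encode_person("Ann", 150).hex())  # prints 0a03416e6e109601
```

A Person named "Ann" with id 150 comes out as eight bytes. Notice that the words "name" and "id" appear nowhere in the output; only the numbers 1 and 2 do, packed into the key bytes 0a and 10.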

Which languages can you use protocol buffers with?

Google offers official compilers for Java, C++ and Python, and there is a plethora of third-party libraries to support many other languages, probably including yours. There are even some for JavaScript, although quality may vary.

But why?

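First off, as stated earlier, protocol buffers are smaller than text formats like XML, which makes them quicker to transmit. They’re also faster to parse, because the schema is fixed ahead of time: the parser reads compact numeric tags and binary values instead of scanning and interpreting text.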
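
Here is a rough sketch of what that looks like from the reading side, again in plain Python rather than the real library. The bytes in WIRE_BYTES are a hand-written encoding of a Person with name "Ann" and id 150, and PERSON_FIELDS, read_varint and decode_person are names I made up for illustration. The sketch only understands the varint and length-delimited wire types and does no validation of truncated input, so treat it as a toy and not a real parser.

```python
import json

# Hand-written wire bytes for a Person with name = "Ann" and id = 150.
WIRE_BYTES = b"\x0a\x03Ann\x10\x96\x01"

# Field number -> field name, taken from the Person message definition.
# The nested phone field (4) is left out, so the decoder simply skips it.
PERSON_FIELDS = {1: "name", 2: "id", 3: "email"}


def read_varint(buf: bytes, pos: int):
    """Read a base-128 varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def decode_person(buf: bytes) -> dict:
    """Decode the scalar Person fields, skipping field numbers we don't know."""
    person = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint value
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited value
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        name = PERSON_FIELDS.get(field_number)
        if name is None:
            continue  # unknown field: already consumed, so just move on
        person[name] = value.decode("utf-8") if wire_type == 2 else value
    return person


if __name__ == "__main__":
    person = decode_person(WIRE_BYTES)
    as_json = json.dumps(person).encode("utf-8")
    print(person, len(WIRE_BYTES), len(as_json))
```

For this tiny message the binary form is 8 bytes, while the same data as default json.dumps output is a few times larger, mostly because JSON repeats the field names in every message. One record is not a benchmark, of course.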

Another reason may be apparent from the above answers. In a polyglot, microservices world, it’s nice to have a standard representation of your data that you define and share via the .proto files. You could version and share these .proto files in a Git repo so that they are available to any consumer of your services.

This line from Michael Bernstein’s take got me:

“There is a certain painful irony to the fact that we carefully craft our data models inside our databases, maintain layers of code to keep these data models in check, and then allow all of that forethought to fly out the window when we want to send that data over the wire to another service.”

You can almost think of protocol buffers as an extension of your database schema (assuming of course that your database of choice has a schema) all the way to the consumer of your data. For applications written in dynamic languages this can reduce the amount of parsing code and testing to confirm that you are correctly handling the data you are sending and receiving.

What about Thrift, Avro and others?

I don’t want to get into that here, but this is a good comparison.

There’s more to protocol buffers (backward compatibility is particularly important) that I won’t cover here; the official Google docs are a good starting place.

Now that I know what they are and how they work, I’ll definitely consider using protocol buffers on my next project. It’ll be interesting to see where they show up in the future: imagine a REST API, maybe even an open data one from a government, including .proto files to download along with their documentation of resources.

In my next blog post I’ll experiment with sharing data between JavaScript and Go and show some actual code.

Dave Walk is a software developer, basketball nerd and wannabe runner living in Philadelphia. He enjoys constantly learning and creating solutions with Go, JavaScript and Python. This is his website.