
Apache Avro

Apache Avro is a data serialization system and remote-procedure-call framework. Schemas are written in JSON; data is encoded in a compact, untagged binary form; and — this is the move that organises everything else — the schema is carried alongside the data rather than assumed by the code that reads it. A reader always has both the schema the data was written with and the schema it expects, and reconciles the two at read time. Because the schema is always present, no generated, compiled class is required to read or write a record. The current release is 1.12.1 (October 2025); the canonical reference is the Avro specification.
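
The read-time reconciliation described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the Avro library's API: the `User` record, the two schemas, and the `resolve` helper are hypothetical names invented for the example, and the sketch covers only name matching and defaults, leaving out the type promotion, aliases, and union handling that the specification also defines.

```python
import json

# Writer's schema: what the data was actually written with (hypothetical example).
WRITER_SCHEMA = json.loads("""
{"type": "record", "name": "User", "fields": [
  {"name": "name", "type": "string"},
  {"name": "age",  "type": "int"}
]}
""")

# Reader's schema: what the reading code expects. It no longer wants "age",
# has added "email" with a default, and lists its fields in a different order.
READER_SCHEMA = json.loads("""
{"type": "record", "name": "User", "fields": [
  {"name": "email", "type": "string", "default": ""},
  {"name": "name",  "type": "string"}
]}
""")


def resolve(writer_schema, reader_schema, written_values):
    """Project values decoded in writer field order onto the reader's schema.

    Simplified: matches fields by name and fills in defaults only.
    """
    # Pair each decoded value with the writer field it belongs to.
    by_name = {
        field["name"]: value
        for field, value in zip(writer_schema["fields"], written_values)
    }
    record = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in by_name:
            record[name] = by_name[name]      # present in both schemas
        elif "default" in field:
            record[name] = field["default"]   # reader-only field with a default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Writer-only fields (here "age") are simply never copied across.
    return record


resolved = resolve(WRITER_SCHEMA, READER_SCHEMA, ["Ada", 36])
print(resolved)
```

Neither schema needs a generated class here: both are plain JSON loaded at runtime, which is the property the paragraph above relies on.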

This page gives the through-line; the sub-pages give the structured depth, each linking out to the specification for exhaustive detail rather than reproducing it.

Design philosophy

Four commitments distinguish Avro from the binary serialization formats it grew up beside, Apache Thrift and Protocol Buffers:

The schema travels with the data. An Avro object container file embeds the writer’s schema in its header; an RPC connection exchanges schemas during a handshake. The bytes are never self-describing on their own, but the schema needed to read them is never far away.
Code generation is not required. Thrift and Protocol Buffers lean on classes generated from a schema and compiled into the program. Avro can do this too, but its native mode is dynamic: a schema read at runtime is enough to read or write conforming data. This makes it comfortable in dynamic languages and in systems where schemas are discovered rather than fixed at build time.
No field tags in the data. Protocol Buffers tag every field with a number that must stay stable across versions. Avro writes field values in schema order with nothing between them: no tags, no names, no delimiters. A reader decodes the values in the writer's field order, then matches them to its own fields by name rather than by a number carried in the stream.
Compactness. With no tags or field names in the payload and integers written in a variable-length encoding, the encoded form is small. The schema, held once per file or once per connection, carries the structure the bytes omit.
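
To make the last two points concrete, here is a hand-rolled sketch of how a small record could be laid out in Avro's binary encoding. It assumes only what the specification states for these types (ints and longs as zigzag variable-length integers, strings as a length followed by UTF-8 bytes); the helper names and the `User` schema are hypothetical, and only `string`, `int`, and `long` fields are handled. Real programs would use an Avro library instead.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed integer (64-bit range) as a zigzag variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)  # final byte, continuation bit clear
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string as its byte length (as a long) followed by UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_record(schema: dict, record: dict) -> bytes:
    """Concatenate field values in schema order: no tags, names, or delimiters."""
    out = bytearray()
    for field in schema["fields"]:
        value = record[field["name"]]
        if field["type"] == "string":
            out += encode_string(value)
        elif field["type"] in ("int", "long"):
            out += zigzag_varint(value)
        else:
            raise NotImplementedError(f"type not covered by this sketch: {field['type']}")
    return bytes(out)


# Hypothetical schema, written as the dict its JSON form would parse to.
USER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

payload = encode_record(USER_SCHEMA, {"name": "Ada", "age": 36})
print(payload.hex())  # 0641646148
```

The five bytes are the string length (`06`), the three bytes of `Ada`, and the age (`48`). Nothing in the payload says which value is which, so it cannot be decoded without the schema that wrote it.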

Origin

Avro was created by Doug Cutting — the creator of Lucene and Hadoop — and proposed as a Hadoop sub-project in 2009. Hadoop needed a serialization format and a wire protocol for moving data between processes written in different languages, and the existing options required code generation and stable field tags that sat awkwardly with Hadoop’s dynamic, multi-language environment. Avro’s answer was to keep the schema with the data and resolve differences at read time. It graduated to a top-level Apache project in May 2010 and is now used well beyond its origin — most visibly in the Apache Kafka ecosystem, in Apache Spark, and across data lakes and pipelines where schemas change over time.

Persons

Doug Cutting: creator of Avro, Lucene, and Hadoop.
Martin Kleppmann: author of the article "Schema evolution in Avro, Protocol Buffers and Thrift" (2012) and of Designing Data-Intensive Applications (2017), whose Chapter 4 and the article are the clearest narrative treatments of the reader/writer model.

Sources

Apache Avro documentation and specification — the canonical reference.
Kleppmann, M. (2012). Schema evolution in Avro, Protocol Buffers and Thrift.
Kleppmann, M. (2017). Designing Data-Intensive Applications, Ch. 4. O’Reilly.
