Schemas and encoding

An Avro schema describes the shape of the data; an encoding turns a value conforming to that schema into bytes. The two are inseparable — without the schema, the bytes cannot be read, because the encoding records no type information of its own.

Schemas in JSON

Avro schemas are written in JSON. A schema is one of three things: a JSON string naming a type ("int", "string"), a JSON object describing a type with attributes ({"type": "array", "items": "string"}), or a JSON array, which denotes a union of the listed types. A record schema, the workhorse, is a JSON object naming the record and listing its fields, each with a name and a type:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
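The three schema forms can be told apart mechanically once the JSON is parsed. The following is a minimal sketch using only Python's standard json module; the helper name schema_form is hypothetical and is not part of any Avro library.

```python
import json

# The User record schema from the text, as a JSON document.
SCHEMA_TEXT = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
"""


def schema_form(schema):
    """Classify a parsed schema by its JSON form (hypothetical helper)."""
    if isinstance(schema, str):
        return "type name"                      # e.g. "int", "string"
    if isinstance(schema, list):
        return "union"                          # JSON array of branch schemas
    if isinstance(schema, dict):
        return "type with attributes: " + schema["type"]
    raise ValueError("not a valid schema form")


user_schema = json.loads(SCHEMA_TEXT)

# Classify each field's type: two bare type names and one union.
field_forms = {f["name"]: schema_form(f["type"]) for f in user_schema["fields"]}
```

Here the record itself is the object form, id and name use the string form, and email uses the array form, a union of null and string.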

A higher-level alternative, Avro IDL (.avdl), offers a C-like declaration syntax for schemas and protocols that compiles down to this JSON form. The JSON representation remains the canonical one.

The type system

Avro defines eight primitive types: null, boolean, int (32-bit signed), long (64-bit signed), float, double, bytes (a byte sequence), and string (a UTF-8 character sequence).

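It defines six complex types: records (named, ordered fields), enums (a named set of symbols), arrays (items of a single schema), maps (string keys to values of a single schema), unions (one of several listed schemas), and fixed (a named byte sequence of fixed length).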

Records, enums, and fixed types are named: they carry a name (and optional namespace and aliases), which is what later allows a reader and writer to line their schemas up.
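As a rough illustration of how a name and namespace combine, the sketch below computes the full name of a named type from its parsed schema. It assumes the usual rules: a name containing a dot is already a full name, otherwise the namespace attribute (or, failing that, the enclosing namespace) is prefixed. The function full_name and its enclosing_namespace parameter are hypothetical, not an Avro library API.

```python
def full_name(schema: dict, enclosing_namespace: str = "") -> str:
    """Return the dotted full name of a record, enum, or fixed schema."""
    name = schema["name"]
    if "." in name:
        # Already a full name; any namespace attribute is ignored.
        return name
    # Fall back to the enclosing namespace when none is given explicitly.
    namespace = schema.get("namespace", enclosing_namespace)
    return f"{namespace}.{name}" if namespace else name


user = {"type": "record", "name": "User", "namespace": "com.example"}
print(full_name(user))  # com.example.User
```

The namespace com.example is an invented example value.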

Binary encoding

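The binary encoding is compact and carries no type tags: the schema supplies the structure, so the bytes carry only values. An int or long is written as a variable-length zigzag integer, a string or bytes value as a length followed by that many bytes, a record as the encodings of its fields concatenated in schema order with no separators, and a union as the zero-based index of the chosen branch followed by the value encoded under that branch.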

The specification gives the exhaustive byte-level rules.
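To make the absence of type tags concrete, here is a minimal hand-rolled sketch of the binary encoding for the User record shown earlier. It relies on a few of the specification's rules: int and long are zigzag-coded and written in 7-bit groups, a string is its UTF-8 length followed by its bytes, a record is its fields in order, and a union is the branch index followed by the value. The function names are hypothetical; a real application would normally use an Avro library rather than code like this.

```python
def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to unsigned so small magnitudes stay small."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def encode_long(n: int) -> bytes:
    """Encode an int or long: zigzag, then 7-bit groups, low-order group first."""
    z = zigzag(n)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)                      # final group, high bit clear
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_user(user: dict) -> bytes:
    """Encode a User record: fields in schema order, no names, no tags."""
    out = encode_long(user["id"]) + encode_string(user["name"])
    email = user["email"]
    if email is None:
        out += encode_long(0)                         # branch 0 ("null"): no payload
    else:
        out += encode_long(1) + encode_string(email)  # branch 1 ("string")
    return out


print(encode_user({"id": 1, "name": "Al", "email": None}).hex())  # 0204416c00
```

The five output bytes are the id (02), the name length (04), the two letters of the name (41 6c), and the union branch index (00). Nothing in them says which field is which, which is why a reader needs the schema.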

JSON encoding

Alongside the binary encoding, Avro defines a JSON encoding, useful for debugging and for human-readable interchange. Most values map to their natural JSON form. The exception worth knowing is the union: in binary a union value is a numeric branch index, but in JSON a non-null union value is wrapped in a single-member object whose key is the branch type’s name — {"string": "abc"} for a string branch — while a null branch is written as bare JSON null. This makes the chosen branch explicit where JSON has no positional index to carry it.
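The union rule can be sketched in a few lines of Python. The helpers union_to_json and user_to_json are hypothetical names used for illustration, and the branch name is passed in by the caller rather than inferred from a schema.

```python
import json


def union_to_json(value, branch_name: str):
    """Wrap a union value as the JSON encoding requires."""
    if value is None:
        return None                # null branch: bare JSON null
    return {branch_name: value}    # other branches: {"<type name>": value}


def user_to_json(user: dict) -> str:
    """JSON-encode a User record whose email field is ["null", "string"]."""
    doc = {
        "id": user["id"],
        "name": user["name"],
        "email": union_to_json(user["email"], "string"),
    }
    return json.dumps(doc)


print(user_to_json({"id": 1, "name": "Al", "email": "a@b"}))
# {"id": 1, "name": "Al", "email": {"string": "a@b"}}
print(user_to_json({"id": 1, "name": "Al", "email": None}))
# {"id": 1, "name": "Al", "email": null}
```

Non-union fields such as id and name pass through unchanged; only the union field gains the wrapper.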
