JSONPull - A JSON pull parser for Erlang

Hex.pm package - Source code


I would like to start this post by referring you to the talk “A Fast, Compliant JSON Pull Parser for Writing Robust Applications” by Jonathan Müller at CppCon 2023. Watching that talk is what inspired me to create this library, and it explains all of these concepts better than I can. I will try to sum it up briefly, but you can skip ahead if you have already watched it.

JSON parsing is very familiar to anyone who has been programming for the web in the past decade. A number of libraries already exist for this, so what makes this one different? Well, it’s a completely different paradigm.

First of all, let’s consider the most basic JSON parser: a DOM parser. It simply reads a JSON document and translates it, as best it can, into a native data type. However, we are not writing JavaScript. We don’t have objects and arrays; we have maps and lists. In some ways they are equivalent; in others, they are not. In many cases, you end up parsing your JSON data into a native type, and then “parsing” that native type into the format you actually want.
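To make the two-step “parse, then parse again” pattern concrete, here is a minimal sketch using the json module that ships with Erlang/OTP 27 and later. The module name dom_example, the function talk_from_json/1 and the unexpected_shape atom are made up for illustration.

```erlang
-module(dom_example).
-export([talk_from_json/1]).

%% Step 1: the DOM parser turns the whole document into generic terms
%% (objects become maps with binary keys).
%% Step 2: we "parse" those terms a second time to get the shape we want.
%% Note: json:decode/1 raises an exception on invalid JSON.
talk_from_json(Bin) ->
    case json:decode(Bin) of
        #{<<"name">> := Name, <<"speaker">> := IsSpeaker}
          when is_binary(Name), is_boolean(IsSpeaker) ->
            {ok, #{name => Name, is_speaker => IsSpeaker}};
        _Other ->
            %% Hypothetical error atom, chosen for this sketch only.
            {error, unexpected_shape}
    end.
```

The second step is where the structure is actually checked, and it only runs after the entire document has already been materialized.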

The second type of parser is a SAX parser. Instead of giving you the entire object, it invokes a callback every time a value is read, leaving you to reconstruct a type from a chain of events. This sort of solves one problem, but it means your code becomes very difficult to follow.

From a quick search on Hex.pm, here is how I would categorize the currently available libraries:

  • DOM parsers: jason, poison, jsx, yamerl, thoas, jsone, json
  • SAX parsers: jaxon

So what is a pull parser? It works like the lazy iterators available in many languages, or like Streams in Elixir (from what I understand, I’m not really an Elixir person). You only pull data when you want to, and what the pull parser lets you do is manage your expectations: you ask for the type you expect next, and you get either that or an error. You get a few basic building blocks, and on top of these you can build complex constructions.

Here are the JSON primitives:

{ok, {null, Rest}} = jsonpull:null(<<"null">>).
{ok, {true, Rest}} = jsonpull:boolean(<<"true">>).
% Strings may be pulled as binary strings or iolists
{ok, {<<"Hello, world!">>, Rest}} = jsonpull:string(<<"\"Hello, world!\"">>).
% Numbers may be pulled as the raw string interpretation or asserted to be an integer or float
{ok, {<<"12345">>, Rest}} = jsonpull:number(<<"12345">>).
{ok, {12345, Rest}} = jsonpull:integer(<<"12345">>).
{ok, {123.45, Rest}} = jsonpull:float(<<"123.45">>).

Simple enough, but what about structures? Let’s start with an array.

{ok, {begin_array, R1}} = jsonpull:array(<<"[1,2,3]">>).
{ok, {element, R2}} = jsonpull:element(R1). % We use this to know there is an element to read.
{ok, {1, R3}} = jsonpull:integer(R2).
{ok, {element, R4}} = jsonpull:element(R3). % It also skips the comma and any whitespace.
{ok, {2, R5}} = jsonpull:integer(R4).
{ok, {element, R6}} = jsonpull:element(R5).
{ok, {3, R7}} = jsonpull:integer(R6).
{ok, {end_array, <<>>}} = jsonpull:element(R7).

Sure, this is verbose for now, but at every step of the way we are certain that we are reading 3 integers from an array in this JSON, and nothing else. We have confirmed the structure of our data in the parsing step, and any invalid input has already been rejected.

And now, on top of this we can already think of some constructs to make use of this iteration. For example, to read an array of numbers we can just do:

{ok, {[1,2,3], <<>>}} = jsonpull_construct:list(<<"[1,2,3]">>, fun jsonpull:integer/1).
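For a sense of what such a construct saves you from writing, here is a rough hand-written equivalent built only from the primitives shown above. It is a sketch, not the library’s implementation: the module name pull_list_sketch is made up, and the {error, _} clauses assume the error shape described for the jsonpull module later in this post.

```erlang
-module(pull_list_sketch).
-export([integers/1]).

%% Read a JSON array of integers by pulling one element at a time.
integers(Bin) ->
    case jsonpull:array(Bin) of
        {ok, {begin_array, Rest}} -> loop(Rest, []);
        {error, _} = Error -> Error
    end.

loop(Bin, Acc) ->
    case jsonpull:element(Bin) of
        {ok, {element, R1}} ->
            %% There is another element: it must be an integer.
            case jsonpull:integer(R1) of
                {ok, {Int, R2}} -> loop(R2, [Int | Acc]);
                {error, _} = Error -> Error
            end;
        {ok, {end_array, Rest}} ->
            %% Elements were accumulated in reverse order.
            {ok, {lists:reverse(Acc), Rest}};
        {error, _} = Error ->
            Error
    end.
```

Swapping jsonpull:integer/1 for another pull function is all it takes to read a different element type, which is why the construct takes that function as an argument.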

I’ll talk about the different modules later. For now, let’s proceed to the last type: objects. Objects are a difficult problem in any Erlang-adjacent JSON parser because you never know what you will get, or what you want. A proplist? A map? How about this: You decide.

JSON = <<"{\"id\":42,\"name\":\"foo\",\"speaker\":true}">>,
{ok, {begin_object, R1}} = jsonpull:object(JSON),
{ok, {<<"id">>, R2}} = jsonpull:key(R1),
R3 = jsonpull:skip_value(R2),
{ok, {<<"name">>, R4}} = jsonpull:key(R3),
{ok, {Name, R5}} = jsonpull:string(R4),
{ok, {<<"speaker">>, R6}} = jsonpull:key(R5),
{ok, {IsSpeaker, R7}} = jsonpull:boolean(R6),
{ok, {end_object, <<>>}} = jsonpull:key(R7),
#{ name => Name, is_speaker => IsSpeaker }.

This has one problem besides being clunky: keys in JSON objects are unordered, so we cannot assume they arrive in the order shown. Therefore, we need to pull a key first and see what we’re dealing with before we act on it. But sure enough, this is an easy problem to solve and there is another easy construct you can use:

JSON = <<"{\"id\":42,\"name\":\"foo\",\"speaker\":true}">>,
jsonpull_construct:map(JSON, [
  {required, [{<<"name">>, fun (Bin, Acc) ->
    {ok, {N, Rest}} = jsonpull:string(Bin),
    {ok, {Acc#{name => N}, Rest}}
  end}]},
  {optional, [{<<"speaker">>, fun (Bin, Acc) ->
    {ok, {B, Rest}} = jsonpull:boolean(Bin),
    {ok, {Acc#{is_speaker => B}, Rest}}
  end}]}
]).

And we can make that even better with some shortcuts:

JSON = <<"{\"id\":42,\"name\":\"foo\",\"speaker\":true}">>,
jsonpull_construct:map(JSON, [
  {required, [{<<"name">>, {set, name, string}}]},
  {optional, [{<<"speaker">>, {set, is_speaker, boolean}}]}
]).

All of these examples return the same map. But what if we could do something no other Erlang JSON parser can do?

-record(talk, {
  name,
  is_speaker
}).

JSON = <<"{\"id\":42,\"name\":\"foo\",\"speaker\":true}">>,
{ok, {Rec, <<>>}} = jsonpull_construct:fold(JSON, #talk{}, [
  {required, [{<<"name">>, {set, #talk.name, string}}]},
  {optional, [{<<"speaker">>, {set, #talk.is_speaker, boolean}}]}
]),
#talk{name = <<"foo">>, is_speaker = true} = Rec.

JSONPull is an experimental library and I have not yet completely nailed down the structure. I wouldn’t recommend using it in a production setting unless you are prepared to change your code down the line in case of breaking changes.

Nevertheless, the different modules currently available in JSONPull follow:

jsonpull_read

This module is the completely barebones JSON reading tool that serves as the basis for the entire library. It will attempt to read exact types and return their binary data without modification. The common function signature here is {ReadBinary, RestBinary} | ErrorAtom = jsonpull_read:TYPE(JSON)

jsonpull

This module is the basic frontend. It reads and converts values to the type you want. The common function signature here is {ok, {Value, RestBinary}} | {error, ErrorAtom} = jsonpull:TYPE(JSON)

jsonpull_expect

In the pursuit of less boilerplate, I threw together this module. It is identical to the basic module except anything but ok will be thrown as an error, with a bit of error_info to match. The common function signature is {Value, RestBinary} = jsonpull_expect:TYPE(JSON)
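As an illustration of the reduced boilerplate, here is a sketch that reads a two-element array such as [1,2] into a tuple. It assumes that jsonpull_expect exposes the same function names as jsonpull (array, element, integer) and returns begin_array, element and end_array in the value position; the module name expect_sketch and the function point/1 are made up.

```erlang
-module(expect_sketch).
-export([point/1]).

%% Every step either matches or raises, so there is no {ok, ...} wrapper
%% to unpack. A third element would fail the end_array match with badmatch.
point(Bin) ->
    {begin_array, R1} = jsonpull_expect:array(Bin),
    {element, R2} = jsonpull_expect:element(R1),
    {X, R3} = jsonpull_expect:integer(R2),
    {element, R4} = jsonpull_expect:element(R3),
    {Y, R5} = jsonpull_expect:integer(R4),
    {end_array, Rest} = jsonpull_expect:element(R5),
    {{X, Y}, Rest}.
```

The trade-off is that failures surface as exceptions rather than return values, so the caller decides where to catch them.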

jsonpull_construct

This module holds the helpful constructs that make working with the structured types more enjoyable. For the elements of a list and the values of an object, the constructs call a user-supplied function to read the JSON binary and produce a value. This can be your own custom fun, a reference to a jsonpull fun (such as fun jsonpull:integer/1), or just an atom, which will be taken as jsonpull:ATOM/1.

For jsonpull_construct:list(JSON, Fun), any function that returns {ok, {Value, RestBinary}} will work.

jsonpull_construct:fold/3 and map/2 are slightly different. To make sure that the object was constructed successfully, you specify constraints that have to be met for the entire call to succeed. Each constraint specifies whether it is required or optional, and has a list of the keys it is looking for and what to do with the value if found. The “what to do” can either be a function that works like the one passed to a fold, or a helper like {set, key, TypeFun}, where TypeFun follows the same rules as the list fun.
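To make the constraint idea concrete, here is roughly what one would otherwise write by hand for the name/speaker example: pull each key, dispatch on it, skip anything unknown, and check the required key at the end. This is a sketch under assumptions, not the library’s code. It assumes jsonpull:skip_value/1 returns the remaining binary directly, as in the earlier object example, and the module name pull_object_sketch and the missing_required_key atom are made up.

```erlang
-module(pull_object_sketch).
-export([talk/1]).

talk(Bin) ->
    case jsonpull:object(Bin) of
        {ok, {begin_object, Rest}} -> keys(Rest, #{});
        {error, _} = Error -> Error
    end.

%% Keys may arrive in any order, so pull the key first, then decide.
keys(Bin, Acc) ->
    case jsonpull:key(Bin) of
        {ok, {end_object, Rest}} ->
            finish(Acc, Rest);
        {ok, {<<"name">>, R}} ->
            value(fun jsonpull:string/1, name, R, Acc);
        {ok, {<<"speaker">>, R}} ->
            value(fun jsonpull:boolean/1, is_speaker, R, Acc);
        {ok, {_OtherKey, R}} ->
            %% Assumption: skip_value/1 returns the rest of the binary.
            keys(jsonpull:skip_value(R), Acc);
        {error, _} = Error ->
            Error
    end.

value(Pull, Field, Bin, Acc) ->
    case Pull(Bin) of
        {ok, {Value, Rest}} -> keys(Rest, Acc#{Field => Value});
        {error, _} = Error -> Error
    end.

%% "name" plays the role of the required constraint; "speaker" is optional.
finish(#{name := _} = Acc, Rest) -> {ok, {Acc, Rest}};
finish(_Acc, _Rest) -> {error, missing_required_key}.
```

The constraint list passed to map/2 or fold/3 expresses the same dispatch table and the same final check as data instead of clauses.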


Here are some tidbits in closing:

I think it would be nice to have a parser module in the future. It would hold the binary inside itself and enable you to write code like:


true = jsonpull_parser:boolean(P).

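Because of the architecture, the entire library is somewhat stream-compatible, although detection of when a value is cut off is lacking, particularly with numbers (a number has no closing delimiter, so a chunk that ends in 123 may or may not be the whole value).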

I have not benchmarked the library against other options, but I have tried my best to abide by Erlang’s efficiency guide when it comes to binaries. It is written in 100% native Erlang with no NIFs.

The library is tested with unit tests, property tests (including fuzzing) and against the JSONTestSuite.

Currently, there is no JSON encoding in this library. I believe there is room for a similar paradigm when it comes to JSON encoding, but it is only a thought at the moment.
