A new release of Jiffy is out. There are quite a few performance improvements, some new features, bug fixes, and increased test coverage.
Performance
I used the Benchee-based JSON benchmark in nickva/bench to test performance. Any perf numbers mentioned below come from running it. Some individual benchmarks are referred to by name, such as “Issue 90” or “Canada”. Those are just types of inputs used in benchmarking: some have mostly numbers, some mostly strings, and some are mixed.
- SIMD vectorization for ASCII scan-ahead loops in both decoder string parsing and encoder
  string emission. This meant replacing the byte-at-a-time scans with 16/32-byte chunked compares.
  The most interesting part is that it was done without writing a single line of assembly, just by
  relying on compiler auto-vectorization. This showed a 15x performance improvement on encoding
  large strings like in the “Issue 90” benchmark. To get even better auto-vectorizer behavior, it’s
  advisable to set `-march=native` or `-march=x86-64-v3`. That can make the auto-vectorizers on
  recent compilers switch to using 256-bit AVX2 registers and instructions.
- UTF-8 skip-ahead in the encoder and faster UTF-8 validation. This is like the scan-ahead loop for
  ASCII, but it’s for UTF-8 validation. This helps quite a bit on non-ASCII, Unicode-heavy inputs.
  The “UTF-8 unescaped” benchmark got a 5.8x speedup from it.
- Use Ryu for number encoding. This is the exact Ryu version from the latest Erlang/OTP release
  with all the updates and tweaks they added. This makes the float output the same as Erlang’s.
  However, this means the output is not exactly the same as before for Jiffy (we used to emit more
  fractional digits, now it switches to scientific notation a bit earlier). Number-heavy
  benchmarks like “Canada” showed a 2x speedup.
- ffc.h for number parsing in the decoder. This is the fastest C number parser around at this
  time. I worked with the upstream author to add a new API to it that parses a JSON number in a
  single call and returns either an integer or a double, as opposed to pre-parsing first to figure
  out which is which. Using this library yielded a 4x speedup in the number-heavy “Canada”
  benchmark on decoding.
- Faster array and map creation for building the result term in fewer steps. This bulk creation
  improved decoding across the board. Some examples are 2.5x for “JSON Generator”,
  2.6x for “Github” and 3.3x for “Blockchain”. Most of those are mixed inputs, so number
  parsing and scan-ahead played a role in there as well.
- Branch prediction hints on encoder hot paths. I saw the QuickJS library doing this, so I
  experimented with it and saw a few percent speedup.
- Unity build. Having handled a few issues over the years related to enabling, disabling and
  detecting LTO (link-time optimization) compiler features, I decided to sidestep it and go with a
  unity build. This is where we include all the source files into one `jiffy.c` file and compile
  that. We get all the benefits of LTO but without having to juggle linker flags.
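To illustrate the chunked scan-ahead idea from the first item, here is a minimal sketch of a loop shape that compilers can usually auto-vectorize. It is not Jiffy's actual code: the function name `scan_plain_ascii`, the `CHUNK` size and the exact set of "special" bytes are assumptions made for the example. The key point is that the inner loop has a fixed trip count and no early exit, so the whole chunk can be turned into a few wide compares. Whether a given compiler actually vectorizes it depends on the compiler version and flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chunk size for this sketch; 16 bytes matches one SSE register. */
#define CHUNK 16

/* Returns 1 if the byte can be copied to the output as is:
 * printable ASCII that is neither a quote nor a backslash. */
static inline int is_plain(uint8_t c)
{
    return c >= 0x20 && c < 0x80 && c != '"' && c != '\\';
}

/* Returns the index of the first byte that needs special handling
 * (control character, quote, backslash, or non-ASCII), or len if none. */
size_t scan_plain_ascii(const uint8_t *buf, size_t len)
{
    size_t i = 0;

    /* Chunked part: test CHUNK bytes at once. The inner loop is
     * branch-free with a fixed trip count, which is the shape that
     * auto-vectorizers handle well. */
    while (len - i >= CHUNK) {
        uint8_t special = 0;
        for (size_t j = 0; j < CHUNK; j++) {
            uint8_t c = buf[i + j];
            special |= (uint8_t)((c < 0x20) | (c >= 0x80) |
                                 (c == '"') | (c == '\\'));
        }
        if (special) {
            break; /* let the byte loop below find the exact position */
        }
        i += CHUNK;
    }

    /* Byte-at-a-time part: handles the tail and pinpoints the special byte. */
    while (i < len && is_plain(buf[i])) {
        i++;
    }
    return i;
}
```

A caller would copy `buf[0..i)` to the output in one go and then handle the byte at `i` (escape it, or validate a UTF-8 sequence) before resuming the scan.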
Yielding & scheduler behavior
- Reduction count bumped to 4000 to match current Erlang VM defaults.
- Bytes per reduction lowered so cooperative yields fire more often on long input.
  This results in better latency under contention without a measurable throughput hit.
Since Jiffy is a NIF, it’s crucial for it to never block schedulers and always yield appropriately.
As the concurrency increases it should degrade gracefully in proportion to the applied load. This is
not a trivial task to accomplish in a NIF, in general. Some JSON library NIFs use dirty schedulers.
However, in the cases where Jiffy is used that wouldn’t work: dirty schedulers are still a limited
resource, and under high concurrency they would become a bottleneck.
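As a rough illustration of how the two knobs above interact, here is a small sketch of the bookkeeping a yielding NIF might do. The 4000 reductions figure is the one mentioned above. The `BYTES_PER_REDUCTION` value and both function names are hypothetical and only there to make the arithmetic concrete; they are not Jiffy's actual settings or API. In a real NIF the computed percentage would typically be reported with `enif_consume_timeslice`, and the work rescheduled once the slice is used up.

```c
#include <stddef.h>

/* Reductions in one scheduler timeslice (the figure from the post). */
#define REDUCTIONS_PER_SLICE 4000

/* Hypothetical value for this sketch: input bytes charged as one reduction.
 * Lowering it makes the same input consume the slice sooner, so yields
 * happen more often. */
#define BYTES_PER_REDUCTION 20

/* Percentage (1..100) of the timeslice used after processing bytes_done
 * bytes since the last yield. */
int timeslice_percent(size_t bytes_done)
{
    size_t reds = bytes_done / BYTES_PER_REDUCTION;
    size_t pct = reds * 100 / REDUCTIONS_PER_SLICE;

    if (pct < 1) {
        pct = 1;
    }
    if (pct > 100) {
        pct = 100;
    }
    return (int)pct;
}

/* Returns 1 once the work done since the last yield has used a full slice. */
int should_yield(size_t bytes_done)
{
    return bytes_done / BYTES_PER_REDUCTION >= REDUCTIONS_PER_SLICE;
}
```

With these example numbers a full slice corresponds to 80000 input bytes; halving `BYTES_PER_REDUCTION` would make the encoder or decoder yield after 40000 bytes instead.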
A separate benchmark, `bench_scheduling.sh` in nickva/bench, runs concurrent JSON
encoding and decoding scaled by the number of schedulers. Testing with a few Erlang JSON libraries
shows something like this:
```
./bench_scheduling.sh
...
scheduler responsiveness check
input: citm-catalog.json duration: 2000
schedulers: 12 online
impls: json, jiffy, simdjsone, jsone, jsx
[json]
1x encdec n=84 p50=135.0ms p95=182.9ms p99=191.9ms max=196.7ms
12x encdec n=86 p50=129.7ms p95=189.9ms p99=203.0ms max=206.2ms
24x encdec n=87 p50=263.0ms p95=461.2ms p99=506.1ms max=527.1ms
[jiffy]
1x encdec n=309 p50=38.3ms p95=51.9ms p99=57.4ms max=66.5ms
12x encdec n=300 p50=41.2ms p95=52.5ms p99=59.7ms max=66.2ms
24x encdec n=306 p50=80.2ms p95=111.8ms p99=118.8ms max=140.1ms
[simdjsone]
1x encdec n=20 p50=690.1ms p95=784.6ms p99=784.6ms max=784.8ms
12x encdec n=16 p50=790.9ms p95=887.5ms p99=887.5ms max=899.9ms
24x encdec n=24 p50=1448.4ms p95=1876.7ms p99=1879.5ms max=1882.7ms
[jsone]
1x encdec n=60 p50=213.1ms p95=261.8ms p99=263.9ms max=264.8ms
12x encdec n=60 p50=204.9ms p95=329.8ms p99=345.0ms max=350.9ms
24x encdec n=52 p50=440.1ms p95=700.3ms p99=773.3ms max=817.3ms
[jsx]
1x encdec n=24 p50=398.8ms p95=539.0ms p99=544.1ms max=548.3ms
12x encdec n=24 p50=391.5ms p95=684.9ms p99=687.0ms max=689.6ms
24x encdec n=24 p50=1181.3ms p95=1479.0ms p99=1558.1ms max=1654.7ms
```
Here we measure both the latency of sending a term back and forth between two encoder/decoder
processes and the throughput (n is how many times we managed to do that).
New Features
- Pre-encoded JSON: embed already-encoded JSON fragments directly in a value being encoded,
  saving a round-trip through the decoder. Use `{json, IoData}` terms and they will be embedded in
  the emitted stream as is. This was a surprisingly popular feature request over the years. Paul J.
  Davis (Jiffy’s original author) suggested a nice and quick patch to make it work, so I went with
  that.
- Encode UTF-8 atoms: atoms with non-ASCII bytes now encode as their UTF-8 source. Unfortunately
  this is for OTP 26+ only.
- Number-as-key encoding: integer/float map keys are encoded as string keys instead of
  erroring. Both Python and Erlang/OTP’s built-in `json` already do this.
Correctness & compliance
- RFC 8259 100% compliance. A new test suite based on nst/JSONTestSuite is wired in and all
  conformance tests pass.
- Big List of Naughty Strings (BLNS) added to the test mix.
Build & CI
- OTP 21 is the new minimum.
- C coverage checks added so the test suite reports per-file C line coverage; several uncovered
  paths were closed during this work.
(There was a previous brief post about the 2.0 release, but I messed up the library update: I didn’t have an existing library forum link, so that update was lost when I created a proper Jiffy library link. This is a redo of that post.)