Joe asked this on EF a few years ago:
Wondering whether people make use of it in Erlang - do you use it?
Here’s what he had to say about it:
When I’m about to use term_to_binary, the first thing that comes to mind is dets, and there’s a question there too.
Great idea!
Finally, I need term_to_binary
Not so much anymore. Mostly I use term_to_iovec/1 now.
In Zotonic we are using term_to_binary regularly to serialize arbitrary data structures. For backing stores, but also for data that is sent along with user requests (we are adding signatures for those).
It is a great way to “pass by reference” if copying messages between processes becomes a bottleneck. Admittedly with some overhead.
For example, you may have a pool of workers and a dispatcher, and a stream of relatively large messages coming in. In a naive implementation your message will be copied into the dispatcher only to be immediately copied again into one of the workers. If you, instead, transform the message into a binary all you copy between the processes is a reference. You just need to leave enough of a message as a term in order to dispatch it appropriately.
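A minimal sketch of that dispatcher pattern (the module name, message shapes, and routing-by-hash are made up for illustration, not taken from any real codebase):

```erlang
%% Sketch: the dispatcher keeps only the routing Key as a term and wraps
%% the large Payload in a binary, so forwarding copies just a reference.
-module(dispatch_sketch).
-export([dispatcher/1, worker/0]).

dispatcher(Workers) ->
    receive
        {msg, Key, Payload} ->
            Bin = term_to_binary(Payload),   % large part becomes a binary
            %% pick a worker by hashing the key (illustrative choice)
            Worker = lists:nth(erlang:phash2(Key, length(Workers)) + 1, Workers),
            Worker ! {job, Key, Bin},        % only a reference is copied
            dispatcher(Workers)
    end.

worker() ->
    receive
        {job, _Key, Bin} ->
            Payload = binary_to_term(Bin),   % unpack once, in the worker
            do_work(Payload),
            worker()
    end.

do_work(_Payload) -> ok.
```

Note that only binaries larger than 64 bytes are reference-counted and shared between processes; small heap binaries are copied along with the message anyway, so this trick pays off only for large payloads.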
Oh that’s clever! Thanks for the tip
Also, base64:encode(term_to_binary(Anything)) is a great way to piggyback Erlang things over almost any (text-based or string-capable) protocol.
For example, we had a need to ship Prometheus metrics from IoT devices in the field, which already had an existing semi-persistent web socket channel open talking JSON-RPC. What did we do? Exported the data from the Prometheus library on device, base64-encoded it and shipped it in a JSON field in a web socket packet. Then we just unpacked it on the server side and sent it to Prometheus when it was scraping for metrics.
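A round-trip sketch of that trick (the pack/unpack names are mine, not from any library):

```erlang
%% Pack any Erlang term into a JSON-safe base64 string, and recover it
%% on the other side.
pack(Term) ->
    base64:encode(term_to_binary(Term)).

unpack(B64) ->
    %% The 'safe' option refuses to create new atoms or funs, which
    %% matters when the encoded data arrives from an untrusted peer.
    binary_to_term(base64:decode(B64), [safe]).
```

The resulting string can be dropped into any JSON field and survives anything that can carry text.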
Apache CouchDB uses term_to_binary to store all its data on disk.
The only thing you have to be aware of here is that when you want to inspect the data, you have to unpack the binary into a copy of its original term. So while you might save copying the actual message, you still have to make a copy when you view the data. I personally think it is probably more efficient to let the BEAM copy the term instead of first making the binary and then unpacking it in every process.
However, one situation where it is more efficient is when you want to pass the message through a chain of processes that don’t need to unpack it, just pass it on. E.g. when processing pictures or video.
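As a sketch, a relay stage in such a chain just forwards the binary and never calls binary_to_term, so only the reference crosses each mailbox (the message shape here is invented for illustration):

```erlang
%% A pass-through stage: forwards the packed frame to the next process
%% without decoding it.
relay(Next) ->
    receive
        {frame, Bin} when is_binary(Bin) ->
            Next ! {frame, Bin},   % no binary_to_term here
            relay(Next)
    end.
```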
I’m interested in this. I saw that RabbitMQ is using it for writing to files, but did you do any benchmarks outside of that scope?
I did some recent benchmarks, and while the tool I used may be at fault (I need to retry with erlperf and write my own test independent of those tools), I found that term_to_iovec was slower than term_to_binary, which I did not expect; perhaps my expectation is wrong. Tests using term_to_iovec to write to files with a raw file handle were faster, but not significantly so.
Quick follow-up: benchmarking with erlperf, I found that term_to_iovec was on average 9% faster.
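For anyone wanting to reproduce a comparison like this, here is a crude timer:tc sketch (not erlperf; warmup, GC, and scheduler effects are ignored, so treat the numbers as rough):

```erlang
%% Time N iterations of term_to_binary vs term_to_iovec on the same term.
bench(Term, N) ->
    {T1, ok} = timer:tc(fun() -> loop(fun erlang:term_to_binary/1, Term, N) end),
    {T2, ok} = timer:tc(fun() -> loop(fun erlang:term_to_iovec/1, Term, N) end),
    io:format("term_to_binary: ~p us, term_to_iovec: ~p us~n", [T1, T2]).

loop(_F, _Term, 0) -> ok;
loop(F, Term, N) -> _ = F(Term), loop(F, Term, N - 1).
```

term_to_iovec/1 avoids copying large off-heap binaries into the result, so the gap should widen when the term contains big binaries.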
Sorry for necroposting (not sure about the forum rules on this):
We use term_to_binary for most of our Kafka messages, switching to JSON for interop with other languages (though I recently started digging into that subject; I think we could parse the term_to_binary format in Go directly with some packages).
Is ‘one file per user’ in this context to be in the format of .dat or .txt or …?
You lost me… what do you mean?
This is originally from Joe’s (quoted) text at the top. I assume that Joe meant that you could use term_to_binary/1 to serialize each user and then write each to a separate file.
This means that none of the files get too large, and you don’t need to worry about concurrent updates (as long as each user is treated serially). It’s basically Joe’s way of saying “shard by user”, I suspect.
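A minimal sketch of that approach, with a made-up directory layout and ".bin" naming (the function names are invented for illustration):

```erlang
%% "One file per user": serialize each user's term to its own file.
save_user(Dir, UserId, User) when is_binary(UserId) ->
    file:write_file(filename:join(Dir, <<UserId/binary, ".bin">>),
                    term_to_binary(User)).

load_user(Dir, UserId) when is_binary(UserId) ->
    {ok, Bin} = file:read_file(filename:join(Dir, <<UserId/binary, ".bin">>)),
    binary_to_term(Bin).
```

As long as all writes for a given user are serialized (e.g. through one process per user), no file-level locking is needed.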
I don’t understand this, though. I’ll take a guess.
Since term_to_binary/1 returns binary data (see External Term Format — erts v15.0.1 for details – I guess you know this already, @Maria-12648430), you’d probably want to name your files .bin or .dat (if that’s a convention you prefer).
Thanks, Roger. Your response has answered my question. Happy coding!
With ‘one file per user’ [.dat or .bin], would mnesia be suitable for metadata about those files? If so, is there a link to an example?
No.
Think about it like this: mnesia is a cluster-replicated data store. “one file per user” (being stored in the filesystem) isn’t.
So you have a mismatch between the intended use patterns of the two.
Thank you for your insightful and concise explanation, Roger.