Khepri - a tree-like replicated on-disk database library for Erlang and Elixir (introduction & feedback)

Hi!

On behalf of the RabbitMQ team, I’m proud to announce the Khepri database library!

Khepri is a tree-like replicated on-disk database library for Erlang and Elixir. This is a library we plan to use inside RabbitMQ as a replacement for Mnesia. Indeed, we want to leverage Ra, our implementation of the Raft consensus algorithm, to have a uniform behavior, especially w.r.t. network partitions, across all types of data managed by RabbitMQ.

Data are organized as nodes in a tree. All tree nodes may contain data, not only the leaves. Here is a quick example:

%% Store "150" under /stock/wood/oak.
khepri:insert([stock, wood, oak], 150),

%% The tree now looks like:
%% .
%% `-- stock
%%     `-- wood
%%         `-- oak = 150

%% Get what is stored under /stock/wood/oak.
Ret = khepri:get([stock, wood, oak]),
{ok, #{[stock, wood, oak] := #{data := Units}}} = Ret,

%% Check /stock/wood/oak exists.
true = khepri:exists([stock, wood, oak]),

%% Delete /stock/wood/oak.
khepri:delete([stock, wood, oak]).
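
Updates can also be grouped in transactions written as anonymous functions. Here is a rough sketch of what one could look like (the khepri_tx function names and return shapes here are assumptions based on the documentation; double-check them there):

%% Atomically move 10 units of oak from /stock/wood/oak to
%% /reserved/wood/oak. Inside the fun, khepri_tx is used in place of
%% khepri.
khepri:transaction(
    fun() ->
        {ok, #{[stock, wood, oak] := #{data := Units}}} =
            khepri_tx:get([stock, wood, oak]),
        case Units >= 10 of
            true ->
                khepri_tx:put([stock, wood, oak], Units - 10),
                khepri_tx:put([reserved, wood, oak], 10);
            false ->
                khepri_tx:abort(not_enough_stock)
        end
    end).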

Most of the concepts and part of the API are described in the documentation. Note that the documentation is still far from complete.

The implementation is very alpha at this stage. The internal design should not change drastically now but the API and ABI might.

I would love to have a library which is as easy to use and intuitive as possible. That’s why I’m reaching out to the community! If anyone is interested, could you please take a look at the documentation and/or the code and share your comments?

Thank you very much!

33 Likes

First of all thanks for sharing this project as open-source!

I assume that the main use case is service configuration (knobs, parameters, constants, etc.), or is there more to it?

Also, although this is perhaps more a question about Ra: how does it behave if the cluster is built across a WAN (say, cross-continent, setting aside security concerns, VPNs, etc.)?

5 Likes

In the context of RabbitMQ and in the first phase of the integration, we plan to use Khepri to manage metadata such as vhosts, internal users (and their permissions) and perhaps runtime parameters. Once this is considered stable, we will move things such as queue records (this does not include message bodies or the queue index maintaining the order of messages), bindings and exchanges.

Outside of RabbitMQ, Khepri should be suitable for small pieces of data. The reason is that Ra commands (the messages passed between members of a Ra cluster and then applied by all Ra state machines) should be small if possible.

Now, for use over a WAN: Raft is designed to handle latency spikes and unreliable networks and to recover from problematic situations. As long as a majority of cluster members are working fine, the cluster can make progress. That said, I don’t know the internals of Ra and how it determines whether a cluster member is gone. Let me summon @kjnilsson, the author of Ra :slight_smile:

2 Likes

So, assuming that the values are small (say, a record with a few fields), what about the number of stored path-value pairs? Will it work for a couple of million such pairs, or should one keep it to a thousand or so?

Also, regarding performance and reliability, I am wondering:

  • what is the memory impact? (does it store everything in memory, or just some metadata?)
  • I assume writes should be rare, but can it handle read-intensive patterns? (or should one cache values?)
  • is it crash resistant? (I would assume so, given it would store RabbitMQ internal data.)

Perhaps it would help you answer my questions if I describe my intended usage: I intend to implement a SaaS product in Erlang, and I’m thinking of storing the user data (profile, credentials, etc.) and product configuration in Mnesia as opposed to PostgreSQL (or perhaps keeping a copy of the data in PostgreSQL just in case). For the initial iterations, I expect at most a couple of (tens of) thousands of users, so scalability is not that big of an issue, and everything could fit into RAM / swap.

However, some services should be deployed in various geographic regions, thus over WAN, and therefore Mnesia mirrors could be useful.

Now, seeing Khepri, I’m thinking if I couldn’t be using it instead of Mnesia for these purposes.

3 Likes

Very nice code, documentation style and UI! Thanks for sharing this great library! When can we expect it to be published on Hex.pm? I see that a Git ref is used instead of a tag; it looks like this is because the latest changes in Ra are not released or published yet either.

3 Likes

I haven’t had a chance to play with large databases yet (whether large value sizes, a huge number of entries, or both). Therefore I can’t tell you today how Khepri would behave in these situations. I don’t expect it to be faster than Mnesia.

Here is a description of how Khepri stores databases in memory and on disk:

As a Ra state machine implementation, Khepri depends entirely on Ra to store the data safely. Khepri itself keeps everything in memory. When you call a Khepri function which modifies the database (a simple insert/delete or a full-blown transaction doing several updates), a Ra command (an Erlang record) is created. That record is stored on disk by Ra in its log. That log is fsync(2)'d and replicated to the other cluster members, who also write it to disk and fsync it. Raft, the algorithm, relies on this for its guarantees.

To allow Ra to get rid of old, already-applied commands, the state machine can take snapshots of its state. Khepri does that: every N commands, the state (i.e. the entire database) is stored on disk as well and the Ra log is truncated. These are full snapshots, not diffs from a previous snapshot.

When a Ra cluster member comes back up, Ra reads the previously snapshotted state on disk and uses it to initialize the state machine (Khepri). After that, all subsequent Ra commands are applied.
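
To make this concrete, here is a minimal, hypothetical Ra state machine. It is not Khepri’s actual implementation, and the command shapes and snapshot interval are made up, but it shows where apply and the release_cursor snapshot effect fit:

-module(toy_machine).
-behaviour(ra_machine).

%% apply/3 clashes with the auto-imported BIF, so suppress the import.
-compile({no_auto_import, [apply/3]}).

-export([init/1, apply/3]).

-define(SNAPSHOT_INTERVAL, 4096). %% "every N commands"; value made up

%% The whole machine state lives in memory; here, a plain map.
init(_Config) ->
    #{}.

%% Each command from the (fsync'd, replicated) Ra log is applied here.
apply(#{index := Idx}, {insert, Path, Value}, State0) ->
    State = State0#{Path => Value},
    %% Every N applied commands, hand Ra a full snapshot of the state
    %% so it can truncate the log up to this index.
    Effects = case Idx rem ?SNAPSHOT_INTERVAL of
                  0 -> [{release_cursor, Idx, State}];
                  _ -> []
              end,
    {State, ok, Effects};
apply(_Meta, {delete, Path}, State0) ->
    {maps:remove(Path, State0), ok, []}.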

This is definitely something I should add to the documentation, because understanding it helps one understand Khepri and what to expect from it. Thanks for the good questions!

Now to your specific questions:

Everything is in memory: metadata, keys, values, everything. The current implementation uses a tree of Erlang records. In other words, a root node record contains the entire database, and that root node is part of the state machine’s state record. So the memory impact will be larger than the sum of the keys’ and values’ memory footprints, given the metadata stored in addition to them.
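
To picture it, something like this deliberately simplified shape (the record and field names here are made up, not Khepri’s actual definitions):

-record(tree_node, {data :: term(),
                    child_nodes = #{} :: #{atom() => #tree_node{}}}).

%% The root node contains the entire database and is itself part of
%% the machine state:
-record(machine_state, {root = #tree_node{} :: #tree_node{}}).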

Reads are more efficient than writes: they don’t need to go through the Raft logic (i.e. write a command to disk, apply it to state machine instances, etc.). Instead, the read request is sent to the “leader” of the Ra cluster (which is elected internally by Ra). Therefore the read request is serviced by the node with the freshest state.

If speed is more important than consistency, it is also possible to service the read request from the local node if it is a member of the Ra cluster. However, this is not yet exposed by the Khepri API.
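
For the curious, at the Ra level the difference looks roughly like this (a sketch; 'my_store' is a hypothetical cluster member name and the exact return shapes may vary between Ra versions):

%% QueryFun runs against the state machine state on the queried node.
Me = {my_store, node()},
QueryFun = fun(MachineState) -> MachineState end,

%% Serviced by the elected leader: the freshest state.
{ok, {_IdxTerm1, _LeaderView}, _LeaderId} = ra:leader_query(Me, QueryFun),

%% Serviced by the local member: no extra network round trip, but the
%% local state may lag behind the leader's.
{ok, {_IdxTerm2, _LocalView}, _} = ra:local_query(Me, QueryFun).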

I hope it will, once we are confident enough to mark it as stable :slight_smile:

The testsuite is already extensive, but there is still plenty of work in that area. In particular, I need to play with a PropEr-like testsuite.

Also, if for whatever reason the code or the host crashes, the data on disk is safe (assuming the disk isn’t the culprit, of course). Ra has already been used in production in RabbitMQ for “quorum queues” for a few years now. Raft is designed to make sure the state machine’s state stays consistent.

3 Likes

Thank you! The documentation is plain EDoc, however, the output is patched to use a GitHub Markdown-like stylesheet and to add syntax highlighting.

Right, I planned to do that before announcing it to make it easy for people to play with it and I forgot… :slight_smile: Let me do that tomorrow! I will ask @kjnilsson if he can release a new version of Ra so Khepri can depend on it from Hex.pm as well.

4 Likes

Khepri looks great @dumbbell - I look forward to seeing how it develops :023:

4 Likes

Thank you very much!

3 Likes

Khepri 0.1.0 is now available on Hex.pm!

7 Likes

Cool! :star_struck:

3 Likes

@ciprian.craciun Ra was not specifically designed for WAN replication, which is not to say it wouldn’t work, but that isn’t its primary use case. All writes need to go through the leader, for example, and you can’t control where the leader runs; the Ra “cluster” will decide that for you.

There are most likely better ways to do WAN replication for your use case.

3 Likes

All writes need to go through the leader, for example, and you can’t control where the leader runs; the Ra “cluster” will decide that for you.

Indeed, this situation is not optimal for my use case; moreover, the fact that reads are also serviced by the leader is a major issue for WAN deployments.

However, given that at some point reads could be serviced locally, I think that Khepri is still a good option, at least for configuration or discovery.

2 Likes

Reads can be serviced locally, but if you do a write followed by a local read, you may not necessarily read your own write.

2 Likes

Great work, thank you! The pointer to Ra is also interesting to me.

This library looks excellent for distributing retained values in (MQTT) topic trees; I guess that is also one of the use cases in RabbitMQ.
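
For instance, a retained message could live at the tree node matching its topic path. Entirely speculative, with a made-up path layout and payload shape:

%% Retain the last payload published on sensors/garden/temperature:
khepri:insert([mqtt, retained, sensors, garden, temperature],
              #{payload => <<"21.5">>, qos => 1}),

%% A new subscriber fetches the retained value on subscription:
{ok, #{[mqtt, retained, sensors, garden, temperature] :=
           #{data := RetainedMsg}}} =
    khepri:get([mqtt, retained, sensors, garden, temperature]).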

Are you also using it for subscriptions?

2 Likes

Is Khepri suitable for embedded systems with unreliable power?
So can it handle a power loss at any time without losing data/integrity?

2 Likes

Ra will only confirm a command (e.g. a write in Khepri) after a majority of members have persisted and fsynced the command to disk. So as long as the filesystem/OS isn’t broken in how it implements fsync, you should have very good data-safety guarantees.

3 Likes

Many thanks for a really exciting library. The combination of Ra for consensus and Khepri for data / config is compelling.

This has inspired me to explore Ra in some detail. Is there a way for the leader (or indeed a follower) to be informed when a cluster ceases to be quorate (without issuing a read/write)? We manage certain assets where having 2 instances is (considerably!) worse than having none, and I was hoping Ra leadership would provide that, but my (simple) experiments don’t show a notification if a follower dies (they do if the leader dies).

2 Likes

A Raft leader can’t distinguish between a follower that has crashed and one that is merely slow, so it doesn’t try.

3 Likes

Interesting, I was thinking the same regarding retained messages in MQTT.

Especially for topics that only receive low-frequency publish frames, where it makes sense to ensure that every new subscriber gets the same retained message, because it could be the last message for a long time. For high-frequency topics, there is little to no benefit. The problem, as so often, is that you have to decide which use case you favour and implement it :slight_smile:

If you run any Khepri experiments, I’d be interested in your results.

3 Likes