How to use Khepri gracefully

leeyis · July 1, 2022, 4:36am

I have been following Khepri for a long time, but I don’t know how to apply it gracefully to my own projects

I’m building a chat service based on Cowboy’s WebSocket support. User sessions are stored in Mnesia; chat messages and user information are stored in MySQL.

Can you compare Khepri to MySQL and Mnesia and explain how to use it in my project?

dumbbell · July 4, 2022, 4:27pm

Hi!

I’m going to submit several posts to cover each aspect, because each post could be a bit too long.

Data model

Let’s start with how data is organized in all three databases.

Mnesia and MySQL (and any RDBMS) are quite similar. You have tables and in each table, you have rows. Each row is defined by a schema (using an SQL statement or an Erlang record) so that you and the database know how a row is structured:

There are N fields, all of them having a name, possibly enforced type and constraints, possibly default values, etc.
One or more fields form the key for that row and the key is used to build an index for quick lookup and search.

Khepri is closer to a key/value store except keys are organized in a tree. Each key can be assigned zero or one value. To create that tree, a key can have zero, one or more child keys. The tree’s root is an unnamed key (but it can still be assigned something). The value is unstructured from Khepri’s point of view and there is no notion of schema at this point. Therefore a value could be an integer, a record, a string or any complex Erlang term. Nothing is enforced or verified.

Other properties may differ:

Key uniqueness. In Mnesia/MySQL, a key must be unique for a given table. In Khepri, a key must be unique among sibling keys.
Key complexity. In Mnesia, a key can be a complex Erlang term. In MySQL, the key can span several fields to achieve about the same thing as the complex Erlang term in Mnesia. In Khepri, a key is an Erlang atom or an Erlang binary currently (I have plans to revisit this constraint in the future).

A quick diagram might be clearer:

Table name: people                        <root>
+-------------+-----+----------+          `-- people
| name        | age | role     |              |-- <<"Alice">> = #{age => 25,
+-------------+-----+----------+      VS      |                   role => manager}
| <<"Alice">> | 25  | manager  |              `-- <<"Bob">> = #{age => 41,
| <<"Bob">>   | 41  | engineer |                                role => engineer}
+-------------+-----+----------+
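
The tree side of the diagram can be modelled in a few lines of plain Erlang. This is only a sketch of the data model described above, not Khepri's API: the module name `tree_sketch` and the functions `new/0`, `store/3` and `fetch/2` are hypothetical names chosen for illustration. It shows that a key may have no value (`people`), and that a key only has to be unique among its siblings.

```erlang
-module(tree_sketch).
-export([new/0, store/3, fetch/2, demo/0]).

%% Hypothetical model, not the Khepri API.
%% A node is {Payload, Children}: Payload is `none` or {value, Term};
%% Children maps a child name (atom or binary) to a child node.
new() -> {none, #{}}.

%% Store Value at Path, creating intermediate nodes without a payload.
store({_Payload, Children}, [], Value) ->
    {{value, Value}, Children};
store({Payload, Children}, [Name | Rest], Value) ->
    Child = maps:get(Name, Children, new()),
    {Payload, Children#{Name => store(Child, Rest, Value)}}.

%% Return {ok, Value}, or `error` if the node is missing or has no payload.
fetch({{value, Value}, _Children}, []) -> {ok, Value};
fetch({none, _Children}, []) -> error;
fetch({_Payload, Children}, [Name | Rest]) ->
    case Children of
        #{Name := Child} -> fetch(Child, Rest);
        #{} -> error
    end.

%% Rebuild the `people` example from the diagram.
demo() ->
    T0 = new(),
    T1 = store(T0, [people, <<"Alice">>], #{age => 25, role => manager}),
    T2 = store(T1, [people, <<"Bob">>], #{age => 41, role => engineer}),
    {ok, #{age := 25}} = fetch(T2, [people, <<"Alice">>]),
    error = fetch(T2, [people]),  % the key exists but holds no value
    T2.
```

Note that nothing in `store/3` checks the shape of the value, which mirrors the absence of a schema.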

This organization changes how you will reason about and query the database. The more suitable model depends on your use case.

For a chat service, here is an example of how you could write your data:

<root>
|
|-- users
|   |-- <<"alice">> = #user{name = "Alice"}
|   |-- <<"charly42">> = #user{name = "Charly"}
|   `-- <<"thebob">> = #user{name = "Bob"}
|
`-- rooms
    `-- <<"uid-1234">> = #room{title = "General",
        |                      topic = "..."}
        |-- members
        |   |-- <<"alice">> = #member{role = admin}
        |   `-- <<"charly42">> = #member{role = guest}
        |
        `-- history
            |-- <<"2022-07-04 18:08:48Z">> = #msg{sender = <<"charly42">>,
            |                                     content = "Hello!"}
            `-- <<"2022-07-04 18:10:03Z">> = #msg{sender = <<"alice">>,
                                                  content = "Welcome!"}

You need to determine the right balance between keeping the tree very basic (close to a flat key/value store) and storing structured data inside a record vs. splitting every piece of data and storing the pieces in a more complex structured tree in Khepri. For instance, the history of a room could be a simple Erlang list assigned to the history key instead of breaking everything into several sub-keys.
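
To make the two options for the room history concrete, here is a small sketch in plain Erlang. It does not use Khepri: the store is modelled as a flat map from a path (a list of atoms and binaries) to a value, and the module and function names are made up for this example.

```erlang
-module(history_layouts).
-export([append_as_list/3, append_as_subkey/4, demo/0]).

%% Hypothetical model, not the Khepri API: a store is a map Path => Value.

%% Layout A: the `history` key holds the whole history as one list.
%% Appending a message replaces the entire value under that key.
append_as_list(Store, RoomId, Msg) ->
    Path = [rooms, RoomId, history],
    Old = maps:get(Path, Store, []),
    Store#{Path => Old ++ [Msg]}.

%% Layout B: one sub-key per message, named after its timestamp.
%% Appending a message only adds a new key.
append_as_subkey(Store, RoomId, Timestamp, Msg) ->
    Store#{[rooms, RoomId, history, Timestamp] => Msg}.

demo() ->
    Room = <<"uid-1234">>,
    Hello = {<<"charly42">>, "Hello!"},
    Welcome = {<<"alice">>, "Welcome!"},
    A1 = append_as_list(#{}, Room, Hello),
    A2 = append_as_list(A1, Room, Welcome),
    B1 = append_as_subkey(#{}, Room, <<"2022-07-04 18:08:48Z">>, Hello),
    B2 = append_as_subkey(B1, Room, <<"2022-07-04 18:10:03Z">>, Welcome),
    %% Number of keys used by each layout for the same two messages.
    {maps:size(A2), maps:size(B2)}.
```

Layout A keeps the tree small but every append rewrites the whole list; layout B makes each message individually addressable at the cost of a larger tree.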

In MySQL and Khepri, you can even configure triggers to automatically remove e.g. charly42 from every room he is a member of, as soon as you remove him from the users. In Mnesia, you need an explicit transaction going over every room.

dumbbell · July 4, 2022, 4:55pm

Memory usage

Khepri stores everything on disk but also loads everything into memory. That’s how the underlying Ra library, which is responsible for replication and consensus, works. In the future, we might change this to keep everything on disk and only maintain a cache in memory.

Mnesia uses ETS underneath, so I believe everything lives in memory too.

MySQL can do clever things here and will certainly have a lower memory footprint for a large set of data.

For your chat service, having the entire chat history for all rooms in memory may be a waste of resources, as users won’t need it that often I suppose. You may want to mix Khepri/Mnesia and MySQL (or some other service) to have the most important data at hand in Mnesia/Khepri but keep the historical data that is not accessed often elsewhere.

Network topology

MySQL is a full-featured standalone service and you communicate with it through a network connection or perhaps a Unix socket. It is easy to deploy separately from your Erlang application.

Mnesia and Khepri are Erlang libraries which must be started and managed from an Erlang application. If you want to host the database on a subset of your service’s Erlang nodes, you need to manage that yourself inside your Erlang application.

Clustering

W.r.t. clustering, I can’t tell for MySQL as I don’t know how it works.

For Mnesia, you cluster nodes for the entire set of tables, but you can tune which table is replicated and how on a per-table basis.

For Khepri, you cluster nodes for an entire store. Everything in that store is replicated to all clustered nodes and written on disk. However, you can configure multiple stores with a different directory to write data and a different set of clustered nodes (or no clustering at all).

When you stop and start nodes, Mnesia and Khepri will behave very differently.

Mnesia will still serve data as long as there is one node running in the cluster. When you stop the cluster and start it again, it will start to serve data again only after the last stopped node is back online.

Khepri, relying on Ra/Raft, will stop processing writes when fewer than a quorum of nodes are running. Reads will still be possible though. When a cluster is restarted, writes are possible again once a quorum of nodes is back online.
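
As a rough illustration of the quorum rule, here is the arithmetic in plain Erlang. In Raft, a quorum is a strict majority of the cluster members. The module and function names are hypothetical; this is not code from Khepri or Ra.

```erlang
-module(quorum_sketch).
-export([quorum/1, can_write/2]).

%% Smallest strict majority of a cluster with ClusterSize members.
quorum(ClusterSize) when is_integer(ClusterSize), ClusterSize > 0 ->
    ClusterSize div 2 + 1.

%% True if enough members are running for writes to be accepted.
can_write(Running, ClusterSize) ->
    Running >= quorum(ClusterSize).
```

For example, `quorum_sketch:quorum(3)` is 2 and `quorum_sketch:quorum(5)` is 3, so a 3-node cluster with a single node running cannot accept writes: `quorum_sketch:can_write(1, 3)` is `false`.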

Conflicts handling and network partition recovery

After a network partition, Mnesia and Khepri will behave differently for the same reason as in the paragraph above. I can’t tell for MySQL.

Mnesia usually leaves that responsibility to the caller. It emits some events to warn the application above that there was some network partition, but that’s about it.

Khepri follows Raft principles and there is no recovery to perform. Changes to the database can’t happen if there is no quorum in the cluster.

Let’s take an example of a 3-node cluster. There is a network partition where node A can’t reach nodes B and C.

Mnesia: An event is emitted to warn about the lost node(s) on both sides. Changes can still be made to all nodes during the network partition. When the network is repaired, another event is emitted and the application is responsible for solving any issues.
Khepri: Changes can still be made to nodes B and C, however, only (inconsistent) reads are allowed on node A. When the network is repaired, node A applies the backlog of changes it missed from nodes B and C. No intervention is required from the application, however the service was degraded on node A during the network partition.
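
The Khepri side of this example can be sketched in plain Erlang. The code only applies the majority rule to each side of the partition; it is a model with made-up names (`partition_sketch`, `classify/2`), not something Khepri or Ra exposes.

```erlang
-module(partition_sketch).
-export([classify/2, demo/0]).

%% Hypothetical model of a Raft-style cluster during a partition.
%% Cluster is the list of all members; Sides is a list of node lists,
%% one per group of nodes that can still reach each other.
classify(Cluster, Sides) ->
    Majority = length(Cluster) div 2 + 1,
    [{Side, mode(length(Side), Majority)} || Side <- Sides].

%% A side with a majority keeps accepting writes; the other side
%% can only serve (possibly stale) reads.
mode(Reachable, Majority) when Reachable >= Majority -> read_write;
mode(_Reachable, _Majority) -> read_only.

%% The 3-node example: node a is cut off from nodes b and c.
demo() ->
    classify([a, b, c], [[a], [b, c]]).
```

Calling `partition_sketch:demo()` returns `[{[a],read_only},{[b,c],read_write}]`, which matches the description: B and C keep accepting changes while A only serves reads until the network is repaired.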

It’s difficult to give advice here; it really depends on where you want to put the cursor between availability and consistency.

Conclusion

I think I covered several parts already. Does it help you understand which one or which combination might be best for your project?

lpil · July 4, 2022, 5:01pm

Wonderful posts, thanks for sharing. I think this would make a great blog post.

dumbbell · July 4, 2022, 5:09pm

I was thinking of adding something like that to the documentation, once the OP confirms it’s clear for him. We don’t have a blog for Khepri (it would go to the RabbitMQ one or we could create one) and I’m not sure GitHub discussions are a good fit.

And thank you for your feedback

lpil · July 4, 2022, 6:48pm

In the official docs sounds even better

leeyis · July 5, 2022, 1:03am

Your answer is very professional and meticulous. Thank you very much for your professional answer.
I need a little more time to digest this knowledge after my busy week at work.

AstonJ · July 5, 2022, 1:29am

We can make your thread in the libraries section a wiki if you like: “Khepri - a tree-like replicated on-disk database library for Erlang and Elixir (introduction & feedbacks)”. This will allow you to update it at any time. It will also allow anyone at TL1 to update the thread (but you will get a notification when that happens). Just let me know if you’d like us to make it a wiki.