CETS - clustered ETS replication

We had some (or rather a lot of) issues using Mnesia in MongooseIM clusters. We were using it as an in-memory cache (i.e. no persistent on-disk storage).
The first idea was to move to Redis, but it had its own issues.

Another idea was to write a bare-bones implementation of Mnesia's dirty operations, but in a way we could tweak if we ran into problems. The issues with Mnesia include:

  • Dirty operations could get stuck on other nodes if a remote node gets restarted quickly.
  • A joining node could crash the cluster if there are a lot of write operations.
  • Issues with mismatched Mnesia table cookies.

So, we wrote our own mini-library, which allows writing into several ETS tables across the cluster.
We also added pluggable discovery backends, so nodes can find each other. We use an RDBMS to get the list of nodes in the cluster, and there is also a backend that reads the list of nodes from a file. Theoretically, it should be easy to add more backends, like the k8s API or mDNS.
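To illustrate the core idea (this is a hypothetical sketch, not the actual CETS API): every write goes into the local ETS table and is then forwarded to the same named table on the other nodes.

    %% Hypothetical illustration of replicated ETS writes, not CETS itself.
    %% Assumes a public, named ETS table with the same name on every node.
    replicated_insert(Tab, Record, OtherNodes) ->
        true = ets:insert(Tab, Record),
        %% Fire-and-forget the same insert on all other nodes.
        [erpc:cast(Node, ets, insert, [Tab, Record]) || Node <- OtherNodes],
        ok.

The real library does this through a gen_server per table and adds the join and node-down handling on top.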

Performance is similar to Mnesia dirty operations. Instead of mnesia_tm, there is one gen_server per table handling the remote messages (so it should be less of a bottleneck).

There are still some issues with global and the prevent_overlapping_partitions=true setting in Erlang/OTP, though. Neither handles situations well where some nodes cannot easily establish connections to other nodes. So, discovery is not 100% bullet-proof.

An example of using the library is in MongooseIM :slight_smile:
The link to the library:


Hi @arcusfelis

Thanks for sharing and well done. I will try it this weekend.

How does it differ from Khepri?

Z.


Khepri is a tree-like replicated on-disk database library for Erlang and Elixir (and it uses Raft).

CETS is just in-memory ETS, but writes are sent to all nodes.
Plus some extra logic to join nodes and handle losing nodes.

Basically, we needed to store a list of connected clients (their sessions), so we can map a JID (username) to a Pid (process ID). When we lose a node, we remove the records with Pids from that node.
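A hypothetical helper (not the actual CETS code) for that cleanup could look like this, assuming session records of the form {Jid, Pid}:

    %% Hypothetical cleanup: remove all sessions whose Pid lives on the lost node.
    cleanup_node(Tab, DownNode) ->
        Gone = ets:foldl(fun({Jid, Pid}, Acc) when node(Pid) =:= DownNode ->
                                 [Jid | Acc];
                            (_, Acc) ->
                                 Acc
                         end, [], Tab),
        lists:foreach(fun(Jid) -> ets:delete(Tab, Jid) end, Gone).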

CETS is probably faster because it is simpler, but the stored data has to be simpler too.

Also, partition tolerance is still a tricky topic, especially with prevent_overlapping_partitions=true. It feels like global and prevent_overlapping_partitions are a bit broken in OTP: in some of our tests, unrelated nodes could get kicked from the cluster when a disconnect happens. And of course, DNS discovery is a bit broken with k8s, because sometimes a DNS name cannot be resolved right away from all nodes in the cluster when a new node appears… So, there are some tests and logic related to it (i.e. cets/src/cets_dist_blocker.erl at main · esl/cets · GitHub and MongooseIM/src/mongoose_epmd.erl at master · esl/MongooseIM · GitHub). Basically, we had some “fun” testing Erlang distribution with k8s :slight_smile: So, Mnesia is not the only source of grief; dist logic and global are another source of interesting race conditions and networking bugs :slight_smile:
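For reference, prevent_overlapping_partitions is a kernel application parameter (it defaults to true in recent OTP releases); if it causes more trouble than it prevents in a particular setup, it can be switched off in vm.args:

    -kernel prevent_overlapping_partitions false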

I guess I need a separate post describing k8s DNS issues and default full-mesh configurations :slight_smile:


On k8s, why not just use a static port for the Erlang distribution listener and go without epmd?


Probably, yes. Does the remote shell still work in this case?


Yes, it does. A remote shell no longer needs to bind to a listen port the way it used to.

Using the latest rebar3, you can simply set ERL_DIST_PORT when running the release; epmd will not be started, and that port will be used for the distribution listener.

This post has more details: Running Erlang Releases without EPMD on OTP 23.1+ · Erlware Blog, but everything is handled by the relx release start script that rebar3 gives you when you run rebar3 release.
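As a quick sketch (the release name and port here are made up), starting such a release with a fixed distribution port looks something like:

$ ERL_DIST_PORT=9100 _build/prod/rel/myapp/bin/myapp console

The distribution listener then binds to port 9100 and epmd is not started.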


@tsloughter would it be possible to enrich the README.md with a minimal 3-node example, please?

Sure, I can try :slight_smile:


Example in the PR:
Example in the README

Generally, you want to use a custom discovery backend, but cets_disco_file is good for an example: it reloads the file automatically when you add a new node.
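Wiring the file backend up looks roughly like this (the function and option names here are from memory and may have changed, so treat it as a sketch and check the README):

    %% Sketch only: start a discovery process that reads the node list from
    %% a file (and re-reads it when it changes), then let it manage a table.
    {ok, Disco} = cets_discovery:start(#{backend_module => cets_disco_file,
                                         disco_file => "/tmp/nodes.txt"}),
    ok = cets_discovery:add_table(Disco, my_tab).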


@arcusfelis awesome. Gonna play with the instructions now.

@arcusfelis works perfectly :tada:

Could you please fix the following:

$ echo "node1@localhost\nnode2@localhost\nnode3@localhost" > /tmp/nodes.txt

with

$ printf "node1@localhost\nnode2@localhost\nnode3@localhost" > /tmp/nodes.txt

Many thanks.

Thanks for publishing this library!

I’m curious about this issue. I read somewhere that dirty operations just send a message to another node and forget about it. What is the problem you’re referring to?