CETS - clustered ETS replication

We had some (or rather a lot of) issues using Mnesia in MongooseIM clusters. We were using it as an in-memory cache (i.e. no persistent on-disk storage).
The first idea was to move to Redis, but it had its own issues.

Another idea was to write a bare-bones implementation of Mnesia's dirty operations, but in a way we could tweak if we ran into problems. The issues with Mnesia include:

  • Dirty operations could get stuck on other nodes if a remote node gets restarted quickly.
  • A joining node could crash the cluster if there are a lot of write operations.
  • Issues with mismatched Mnesia table cookies.

So, we wrote our own mini-library, which allows writing into several ETS tables across the cluster.
We also added pluggable discovery backends, so nodes can find each other. We use an RDBMS to get the list of nodes in the cluster, and there is also a backend that reads the list of nodes from a file. Theoretically, it should be easy to add more backends, like the k8s API or mDNS.
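To illustrate the core idea (this is a hypothetical sketch, not the actual CETS API): every write goes into the local ETS table and is then forwarded to the same named table on the other nodes.

    %% Hypothetical illustration of replicated ETS writes, not CETS itself.
    %% Assumes a public, named ETS table with the same name on every node.
    replicated_insert(Tab, Record, OtherNodes) ->
        true = ets:insert(Tab, Record),
        %% Fire-and-forget the same insert on all other nodes.
        [erpc:cast(Node, ets, insert, [Tab, Record]) || Node <- OtherNodes],
        ok.

The real library does this through a gen_server per table and adds the join and node-down handling on top.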

Performance is similar to Mnesia dirty operations. Instead of mnesia_tm, there is one gen_server per table handling the remote messages (so it should be less of a bottleneck).

There are still some issues with global and the prevent_overlapping_partitions=true setting in Erlang/OTP, though. Neither handles situations well where some nodes cannot easily establish connections to other nodes. So, discovery is not 100% bullet-proof.

An example of using the library is in MongooseIM :slight_smile:
The link to the library:


Hi @arcusfelis

Thanks for sharing and well done. I will try it this weekend.

How does it differ from Khepri?

Z.


Khepri is a tree-like replicated on-disk database library for Erlang and Elixir (and it uses Raft).

CETS is just in-memory ETS, but writes are sent to all nodes.
Plus some extra logic to join nodes and handle losing nodes.

Basically, we needed to store a list of connected clients (their sessions), so we can map a JID (username) to a Pid (process ID). When we lose a node, we remove the records with Pids from that node.
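A hypothetical helper (not the actual CETS code) for that cleanup could look like this, assuming session records of the form {Jid, Pid}:

    %% Hypothetical cleanup: remove all sessions whose Pid lives on the lost node.
    cleanup_node(Tab, DownNode) ->
        Gone = ets:foldl(fun({Jid, Pid}, Acc) when node(Pid) =:= DownNode ->
                                 [Jid | Acc];
                            (_, Acc) ->
                                 Acc
                         end, [], Tab),
        lists:foreach(fun(Jid) -> ets:delete(Tab, Jid) end, Gone).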

CETS is probably faster because it is simpler, but the stored data has to be simpler too.

Also, partition tolerance is still a tricky topic, especially with prevent_overlapping_partitions=true. It feels like global and prevent_overlapping_partitions are a bit broken in OTP: in some of our tests, unrelated nodes could get kicked from the cluster when a disconnect happens. And of course, DNS discovery is a bit broken with k8s, because sometimes a DNS name cannot be resolved right away from all nodes in the cluster when a new node appears… So, there are some tests and logic related to it (i.e. cets/src/cets_dist_blocker.erl at main · esl/cets · GitHub and MongooseIM/src/mongoose_epmd.erl at master · esl/MongooseIM · GitHub). Basically, we had some “fun” testing Erlang distribution with k8s :slight_smile: So, Mnesia is not the only source of grief; dist logic and global are another source of interesting race conditions and networking bugs :slight_smile:
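For reference, prevent_overlapping_partitions is a kernel application parameter (it defaults to true in recent OTP releases); if it causes more trouble than it prevents in a particular setup, it can be switched off in vm.args:

    -kernel prevent_overlapping_partitions false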

I guess I need a separate post describing k8s DNS issues and default full-mesh configurations :slight_smile:


On k8s, why not just use a static port for the Erlang distribution listener and go without epmd?


Probably, yes. Does the remote shell still work in this case?


Yes, it does. A remote shell no longer needs to bind to a listen port the way it used to.

Using the latest rebar3, you can simply set ERL_DIST_PORT when running the release; epmd will not be started, and that port will be used for the distribution listener.

This post has more details: Running Erlang Releases without EPMD on OTP 23.1+ · Erlware Blog, but everything is handled by the relx release start script that rebar3 gives you when you run rebar3 release.
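As a quick sketch (the release name and port here are made up), starting such a release with a fixed distribution port looks something like:

$ ERL_DIST_PORT=9100 _build/prod/rel/myapp/bin/myapp console

The distribution listener then binds to port 9100 and epmd is not started.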


@tsloughter would it be possible to enrich the README.md with a minimal 3-node example, please?

Sure, I can try :slight_smile:


Example in the PR:
Example in the README

Generally, you want to use a custom discovery backend, but cets_disco_file is good for an example: it reloads the file automatically when you add a new node.
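Wiring the file backend up looks roughly like this (the function and option names here are from memory and may have changed, so treat it as a sketch and check the README):

    %% Sketch only: start a discovery process that reads the node list from
    %% a file (and re-reads it when it changes), then let it manage a table.
    {ok, Disco} = cets_discovery:start(#{backend_module => cets_disco_file,
                                         disco_file => "/tmp/nodes.txt"}),
    ok = cets_discovery:add_table(Disco, my_tab).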


@arcusfelis awesome. Gonna play with the instructions now.

@arcusfelis works perfectly :tada:

Could you please fix the following:

$ echo "node1@localhost\nnode2@localhost\nnode3@localhost" > /tmp/nodes.txt

with

$ printf "node1@localhost\nnode2@localhost\nnode3@localhost" > /tmp/nodes.txt

Many thanks.

Thanks for publishing this library!

I’m curious about this issue. I read somewhere that dirty operations just send a message to another node and forget about it. What is the problem you’re referring to?