How are you approaching distributed resilient systems? Any best practices to share?

I’m curious how many Erlang orgs have deployed systems that do live code upgrades and/or are truly distributed and resilient. If you’re doing this, are you rolling your own distribution model, using riak-core, using Raft, or something else? I’m trying to find out what the current “best practices” are in this space. We’ve got a CQRS architecture that we’re looking to scale out while eliminating all single points of failure (including data centers). Our backend is almost entirely Erlang, but we also have Elixir/LiveView for web-facing apps. I don’t find many recent conversations about this kind of resiliency and would definitely like to explore it more with the community. Appreciate any pointers/suggestions/conversation!

(I did see the recent post about using Raft to distribute a db. It reminded me to ask this question, which has been on my mind for a while, before we commit to something without understanding our options more fully.)


This is a very engaging talk on the subject:
