Framework for reliable processing

I am looking at adopting the actor model to implement a product, possibly in Erlang.

The area is financial services data integrity. My use case has high throughput and high volumes, which is why the actor model is of interest.

However, I also need to guarantee processing of each incoming entity and command; I don’t think that rules actors out, but it clearly involves additional work.

I need some guidance on how to implement this: really, I would like to be using a framework that offers guaranteed delivery on top of the base actor model, so that we can rely on it rather than rolling our own implementation in each service.

4 Likes

Distributed systems guarantees? That sounds more like a consensus problem, e.g. SMR (State Machine Replication).

I am not aware of any solution that guarantees “only once” delivery (if that’s what you were looking for).

1 Like

MQTT has exactly once delivery (QoS 2).

3 Likes

Thanks. To be clearer: I am building a system with high throughput, but I cannot lose any data or commands. I am using Kafka as a ‘message bus’ with either exactly-once or at-least-once delivery, depending on the particular area - MQTT would be similar, I think.

I am interested in Erlang and the BEAM for building my processing; however, to do so I need to understand how I can ensure I do not lose messages. My understanding is that Erlang itself does not attempt to provide any guarantee of message delivery to an actor, given that actors may crash. I can see why, and I guess that works for some use cases, but I am pretty sure there are many use cases like mine where it is not viable, so I assume there are many people implementing, or using common implementations of, designs that handle overall message-delivery guarantees.

2 Likes

An Erlang process only loses messages if your code crashes, which is quite similar to any other programming language. What makes Erlang different is that it has an established mechanism for isolating and recovering from these errors, while in other languages such as Python, Go, etc., the data loss and service interruption would likely be greater.

Sending a message within an instance of an Erlang program has more in common with calling an async function in another language than with sending a message over the network to a message broker or other service. We’re just a bit more up-front about how this, too, can fail.
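To make that concrete, here is a minimal sketch of how a sender can find out that the receiving process died before handling its message, using a monitor. The function name and message shape are made up for illustration; they are not from OTP or any library:

```erlang
%% Minimal sketch: send a message and wait for either an ack, a 'DOWN'
%% from the worker, or a timeout. The caller still holds Msg in every
%% failure branch, so it can retry, persist, or escalate.
send_and_confirm(Worker, Msg, Timeout) ->
    Ref = erlang:monitor(process, Worker),
    Worker ! {self(), Ref, Msg},
    receive
        {Ref, ok} ->
            erlang:demonitor(Ref, [flush]),
            ok;
        {'DOWN', Ref, process, Worker, Reason} ->
            %% Worker crashed before acknowledging.
            {error, {worker_died, Reason}}
    after Timeout ->
            erlang:demonitor(Ref, [flush]),
            {error, timeout}
    end.
```

The important part is that the sender keeps its copy until it sees the acknowledgement; that is the hook any delivery guarantee built on top of plain message passing relies on.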

1 Like

Also note that in error situations, being able to “lose” messages is very important. If a process always crashes on a particular message, it will never be able to recover as long as that message is kept, because restarting will just crash the process again. The message has to be moved out of the way for recovery to be possible.
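A rough sketch of that idea, assuming a named ETS table `dead_letters` already exists and `handle_one/1` is the real (hypothetical) processing function:

```erlang
%% Sketch only: a message that crashes the handler is set aside in a
%% dead-letter table instead of being retried forever.
process_message(Msg) ->
    try
        handle_one(Msg),    %% hypothetical business-logic function
        ok
    catch
        Class:Reason:Stack ->
            %% Move the poison message out of the way so a restarted
            %% process is not immediately crashed by it again.
            ets:insert(dead_letters,
                       {erlang:unique_integer([monotonic]), Msg, {Class, Reason}}),
            logger:error("dead-lettered message: ~p", [{Class, Reason, Stack}]),
            {error, dead_lettered}
    end.
```

Whether you later inspect, repair, or discard dead-lettered messages is a domain decision; the important part is that the process can keep making progress.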

2 Likes

Thanks both - I am digging into “Designing for Scalability with Erlang/OTP: Implement Robust, Fault-Tolerant Systems”. I think I need to look at how I can use persistence to ensure that we don’t lose anything, while at the same time processes can fail appropriately.
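In case it helps, here is a very rough sketch of the “persist first, process, then delete” pattern using dets. The table name, key scheme, and `do_work/1` are assumptions for illustration only, not a recommendation of dets over mnesia or an external store:

```erlang
%% Assumes dets:open_file(pending, [{file, "pending.dets"}]) was called
%% at startup, and that do_work/1 is the (hypothetical) real processing.
store_then_process(Key, Command) ->
    ok = dets:insert(pending, {Key, Command}),   %% durable before we start
    case do_work(Command) of
        ok    -> dets:delete(pending, Key);      %% forget it only once done
        Error -> Error                           %% entry stays for retry
    end.

%% After a crash and restart, replay whatever is still pending:
recover() ->
    dets:traverse(pending,
                  fun({Key, Command}) ->
                          store_then_process(Key, Command),
                          continue
                  end).
```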

1 Like

I think it always goes back to basics.
The basic fact here is that in practically any imaginable system on practically any imaginable hardware, messages in flight WILL be lost.
The only way to ensure that messages are not lost is to retain copies until you know they have been received.
I’m currently working on a grant proposal for a sensor network system where we KNOW that quite a high proportion of packets may be lost, and where part of the design calls for all reports to be held in flash memory for weeks if necessary.
Even when you know that a message has been received, you do not know that the consequences of that message have been accepted unless and until you hear from the ultimate destination. (This is an aspect of the end-to-end principle.) This is why the project I mentioned above needs to retain data for longer than you might think. Simply because the base station has received a report (possibly after several hops through peers) doesn’t mean the aggregator upstream from the base station has updated its models.
So suppose we have A --chan1--> B --chan2--> C, and channels chan1 and chan2 are absolutely perfect, and B acknowledges receipt to A and C acknowledges receipt to B. If A sends a message M perfectly to B, but processing goes wrong in B so that message M’ is never sent to C, data has been lost. A needs to retain data until it knows that C has got it. And of course one day everything goes perfectly until C is destroyed in a fire before any human being gets to see the results of C’s processing. So now you need geographically separated C1 and C2.

That’s why I say there will ALWAYS be lost messages. To reduce the impact of this, you can retain data, acknowledge receipt, and introduce replication. What exactly you do depends on the value of the data and the cost of the measures. And little of it is language-specific or middleware-specific. End-to-end principle.
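For what it’s worth, here is a tiny sketch of what that end-to-end retention looks like for the A --chan1--> B --chan2--> C example: A keeps its copy until the final consumer C confirms, not merely until the relay B has received it. All names and message shapes are illustrative:

```erlang
%% A keeps unacknowledged messages in a map keyed by MsgId.
a_send(B, MsgId, Msg, Pending) ->
    B ! {forward, self(), MsgId, Msg},
    maps:put(MsgId, Msg, Pending).         %% retained until end-to-end ack

a_handle({end_to_end_ack, MsgId}, Pending) ->
    maps:remove(MsgId, Pending).           %% only now is it safe to forget

a_resend(B, Pending) ->
    %% Run periodically: anything not yet acknowledged by C is sent again.
    maps:foreach(fun(MsgId, Msg) -> B ! {forward, self(), MsgId, Msg} end,
                 Pending).
```

Because unacknowledged messages get resent, C has to deduplicate on MsgId; at-least-once delivery plus idempotent handling is usually the practical form that “guaranteed delivery” takes.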

1 Like

What if “the ultimate destination” actually breaks (and loses the result)?

This is where we get to my original reply of “consensus problem, e.g. State Machine Replication”. Instead of one “ultimate destination”, you have a quorum of these (say, 2 of 3, or 3 of 5).
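A very rough sketch of the quorum idea follows. This is not a real SMR implementation, which also needs a leader and a replicated log (Raft, for example, which the `ra` library provides on the BEAM); the names and message shapes here are made up:

```erlang
%% Send the write to every replica and consider it committed once a
%% majority have acknowledged it.
replicate(Replicas, MsgId, Msg) ->
    [R ! {store, self(), MsgId, Msg} || R <- Replicas],
    Needed = length(Replicas) div 2 + 1,    %% majority, e.g. 2 of 3
    wait_for_acks(MsgId, Needed, 5000).

wait_for_acks(_MsgId, 0, _Timeout) ->
    committed;
wait_for_acks(MsgId, Needed, Timeout) ->
    receive
        {ack, MsgId} -> wait_for_acks(MsgId, Needed - 1, Timeout)
    after Timeout ->
            {error, no_quorum}              %% fewer than a majority confirmed
    end.
```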

1 Like

Thanks all.

I’m continuing to read into it - this is my first look at Erlang or actors, so there is a lot to learn. However, as always, I am in a hurry, as I need to work out the broad technology and architecture for the project.

To rephrase slightly:

I am looking to build a system composed of a number of services communicating asynchronously. My domain and use cases require data to be processed across several of these services, and we need to ensure that any inbound command/request is processed, either successfully or with a clear indication to users of where and how it failed.

I appreciate that Erlang and the actor model are very much about embracing failure, so while the base tools are there to put together a solution for such ‘loss-less’ processing, it is not part of the out-of-the-box baseline - and indeed many use cases will not want it, or will have specific needs for how failures are handled.

Really, what I was thinking was that this loss-less requirement must be common enough that there would be separate libraries/frameworks providing ‘patterns’ for achieving it on top of the Erlang actor model - and I was hoping to find and take advantage of one of these rather than build what seems like it would be a shareable artefact.

1 Like

I’d say you can find ready-to-use tools closer to what you want in Elixir. broadway is a good, reliable way to ingest data from message brokers in general, and oban would be a good way to have persistent background processing.

2 Likes