Modify the behavior of the ERTS, let the old processes live

Wow, I’m amazed by the contributions everybody has given to this discussion!
I really want to thank every one of you! I’ll have to take some time to analyze everything and collect my thoughts, though!

For the moment I’ll reassure you: I don’t have any particular or urgent use case where I need to apply this. I’m just a student working on his thesis about the limits of, and possible workarounds for, the infamous Hot Code Reloading in Erlang/Elixir!

This idea of having old (and yes, possibly very old, even months old in my case) processes running old versions of the code came to me because I’m working on a little reminder app, where a server creates processes that each have a timeout and represent an upcoming event.
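To make that concrete, here is a minimal sketch of what one of those event processes might look like (the module name, function names, and message format are hypothetical, not taken from the actual app):

```erlang
-module(reminder_event).
-export([start/2]).

%% Spawn a process that represents one upcoming event.
start(Event, DelayMs) ->
    spawn(fun() -> wait(Event, DelayMs) end).

%% Wait until the event is due, deliver the reminder, then terminate normally.
wait(Event, DelayMs) ->
    receive
    after DelayMs ->
        io:format("Reminder: ~p~n", [Event])
    end.
```

A process like this just sits in its receive until the timeout fires, which is exactly the kind of long-lived process that ends up stuck on old code after a reload.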

In my case, as @Maria-12648430 and @nzok said, I’m guilty of not having followed OTP and the good rules one should follow when working on a project in this language, always keeping in mind the idea of being able to hot code reload. Add to that some laziness and the idea that “if it works, don’t touch it”, and here you are!

This is why I came up with the idea of letting even the old processes live, as long as they can still “deliver” their job (in this case, waiting for the timeout) and then die “gracefully”, while coexisting with new processes that have new features or a different implementation of their state.

Again, it’s simply laziness: why should I have to work so “much” to upgrade the old processes and modify their state, or kill them and recreate them from the state they had when they died but now running the new version of the code, if in “theory” they could keep living and doing their job?

This OBVIOUSLY has a large number of problems; I can’t even imagine the mess it could create on a VM with tens of thousands of processes running all different versions and holding different states.

But I also think that, as @jhogberg said, there could be cases where this is actually useful, maybe because of errors made in the past that can’t be fixed now, or maybe because it is simply what the user wants or needs (it’s 3 a.m. and they want to go to sleep), and who are we to deny them that?

Again, thanks to everybody for such an interesting exchange of opinions. I’m very grateful, and it has helped me a lot!

4 Likes

Yup, I get it. I suppose the concern that has been raised is long-term drift. I think this depends on the architecture of the system in question and the domain it resides in. That is, optimally (IMHO), we don’t have to rely on hot code loading so much. If we can craft systems such that you can perform rolling deploys without blinking an eye AND design your application and its components so that you can reach for a scalpel sometimes (hot code reload), then you’re in a super nice place. In such a system, drift becomes less of a concern because your normal is rolling deploys, which will get entire nodes on the same page (I work on such a system :slight_smile: ), and sooner rather than later :slight_smile: .

That said, I myself have not come across a case yet where I needed to hot code reload and it would be okay to let old code linger. This is usually because of precisely what I stated above, namely, hot code reload is a scalpel to me :slight_smile: However, I can imagine instances where something terribly minor is going on in such a system, and the rationale could be: old processes can linger, but let’s not let any more damage happen in regard to new processes, and then we can perform a rolling deploy in the morning :slight_smile:

Thank you so much for sharing your thoughts, @yeger !!!

1 Like

One use case I’ve had in mind for some time is a project that could try to be self-healing / “self-fixing” when deploying an Erlang-based application on many mobile Android devices. The idea would be to have versioned modules deployed in a staged roll-out, and in case a regression impacts a specific set of devices running the latest modules, these devices would try to automatically revert to a previous combination of modules known to work well together (while raising an alarm that something went wrong, of course). For such a project, the constraint that only one “current” and one “old” version of a module are allowed to remain has always felt like an arbitrary limitation. Indeed, I could see cases needing to test module m1.1 next to module m1.2 next to m1.3, in parallel with module n2.1 next to n2.2, etc. Of course, the current defaults should be kept.

Thinking a bit more about it, maybe the longer-term goal discussed above about a “Safe Erlang” approach with virtual nodes would offer the kind of flexibility I’m looking for…

Just out of curiosity, since it has been brought up a few times on the forum recently, does anyone have a guesstimate of the effort it would require to introduce “Safe Erlang”?

3 Likes

It would become extremely simple to completely lose track of which processes are running which version of a module, so code updating could easily become a right mess.

4 Likes

This sums up my concerns pretty well. Over time, n processes may end up running n different versions of the same module.

In a perfect world (well, my perfect world at least), every process would be running the same (that is, the new) version of the code after a code upgrade. As I understand it, the whole reason why we even have old code hanging around beside the new is because there is a transition period needed in which the individual processes do the switch from old to new code. This transition period is assumed to be short, because if we can’t have a perfect world, we at least want to be as close to it as possible. Right? :sweat_smile:

1 Like

Isn’t that easily solved with proper tooling? For example, a function that lets you list all the versions of a particular module and the count/list of processes associated with each.
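Something along these lines should already work today for the current “new vs. old” distinction (the function name is just an assumption; with only one old version per module, it cannot tell different old versions apart):

```erlang
%% List the processes still executing old code of Module.
%% erlang:check_process_code/2 returns true if the process lingers in old code.
processes_on_old_code(Module) ->
    [P || P <- erlang:processes(), erlang:check_process_code(P, Module)].
```

With more than one old version allowed, such tooling would presumably need to report which version each process is on, not just “old or current”.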

2 Likes

That is assuming that you want to do a code upgrade of the running processes…

In the old code, the process loop must normally be written to avoid looping with fully qualified calls. When you want to do a code upgrade, there must be a way to make the process loop do a fully qualified call. That call enters the new code, and the new code must be prepared to handle arguments from all old code versions. That is about it.
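A minimal sketch of such a loop (the module and message names are made up for illustration):

```erlang
-module(upgradable).
-export([loop/1]).

loop(State) ->
    receive
        code_upgrade ->
            %% Fully qualified call: continues in the newest loaded version of this
            %% module, which must be able to cope with State in its old shape.
            ?MODULE:loop(State);
        Msg ->
            %% Local call: keeps the process in the code version it is already running.
            loop([Msg | State])
    end.
```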

So, for a code upgrade, the only requirement on the new (current) code is that it must be able to handle state from all possibly running old code versions.
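For example, the new code might normalize whatever state shape an old process hands it before continuing (the tuple layouts below are invented for illustration):

```erlang
%% Called at the top of the new loop: upgrade any older state layout on entry.
convert_state({state_v1, Name})            -> {state_v3, Name, [], 0};
convert_state({state_v2, Name, Items})     -> {state_v3, Name, Items, 0};
convert_state({state_v3, _, _, _} = State) -> State.
```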

Another way is to not code upgrade at all, i.e. you instead wait for all old process loops to exit before purging the old code. I guess it is this option that would in some cases become easier to manage by removing the limitation of at most one old code version.
The open question is how useful this is…
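For reference, with today’s two-version limit that “wait, then purge” strategy can be approximated with something like this (a sketch only; a real version would probably bound the number of retries):

```erlang
%% After loading the new version: wait until no process runs the old code, then purge it.
wait_and_purge(Module) ->
    case code:soft_purge(Module) of
        true  -> ok;                   % no process was lingering in old code; it is purged
        false -> timer:sleep(1000),    % old code still in use; check again in a second
                 wait_and_purge(Module)
    end.
```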

2 Likes

At one time we were toying with having a module datatype in which you could call its exported functions. It was never developed, and one of the many reasons it was never done was that we worried you might end up with a deluge of modules.

3 Likes

I think we can all agree that

  • being able to upgrade a running system without a complete shutdown and restart is a Good Thing
  • the current Erlang approach tries to make it straightforward to build an upgradable system
  • but it is not foolproof
  • in fact a foolproof approach might not even exist
  • there are several PhDs to be had in the theory of live upgrades
  • speculation about alternatives is always warranted
  • proposals to actually change the system are something else where we need to know that the proposal will make things better and will have a favourable benefit/cost ratio

I wish I were in a position to supervise a PhD on live update. There are obvious links to things like the theory of patches, the theory of belief revision, firmware update in distributed IoT applications, reflective systems, … To be brutally honest, what Erlang has strikes me as a hack devised by people who needed the feature too urgently to wait for the theory, which works well enough that we can mostly rely on it, but is just awkward enough to ensure that it’s not overused.

Given that an Erlang “program” may run on multiple nodes, possibly connected by a non-“native” network protocol, there’s an obvious possibility of M nodes running version X and N-M nodes running version X+1 of some module, and that situation might persist for some time. So we already have the possibility of module version skew within a multi-node “program”. Module consistency in a multi-node environment seems as though it might be subject to the same limitations as distributed database consistency. Which suggests that Joe Armstrong was right that the focus should be on protocols between nodes, not on what’s happening inside the nodes. (Which in turn suggests that Safe Erlang would really REALLY be nice to have.)

What would it take for a node to “know” that an upgrade was safe?
What would it take for a node to “know” that keeping old processes alive was safe?
What would it take for anyone to “know” these things and certify them to the system?