Modify the behavior of the ERTS, let the old processes live


I’ve been working with Erlang and Elixir for quite a while, and I constantly deal with recompilation of modules that are currently executing, and occasionally with hot code reloading.

I find it a fascinating feature, but extremely time-consuming and test-driven, because it is so important to keep the state of the process coherent. As we all know, when you recompile a module the old version is marked as “old”, and the processes running it, if not updated, keep running it, but only through their local calls. The new version is marked as “current”, and all calls to exported functions, as well as all newly created processes, are directed to it.

If the same module is recompiled a third time, the oldest version of the code is purged from the system and every process still running it is killed. I’m interested in modifying that behavior: instead of killing those processes, letting them live until their “natural” death.

It might also be interesting to implement a stack of versions rather than the current “old” and “current” limitation, but I’m mostly interested in the idea of letting the processes live.

Any ideas or suggestions on what to look into?


You could “just” disable the deallocation of the old module when it’s about to be purged. There will be a memory leak, but that’s what you asked for.

It is however not the Erlang Way™. Every code path in a module is supposed to either be relatively short-lived (well, shorter than your module update interval anyway), or to tail-recursively call back into itself via an exported function (which will always enter the newest version).
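As a minimal sketch of that convention (the module and message names here are invented for illustration), a hand-rolled server loop re-enters itself through a fully qualified call, so every iteration picks up the current version of the module:

```erlang
-module(ticker).
-export([start/0, loop/1]).

start() ->
    spawn(fun() -> loop(0) end).

loop(N) ->
    receive
        {count, From} ->
            From ! {count, N},
            %% Fully qualified (remote) call: after a reload, this
            %% enters the *current* version of the module.
            ?MODULE:loop(N + 1);
        stop ->
            ok
    after 1000 ->
        %% A plain `loop(N)` here would be a local call and would pin
        %% the process to the old version, so it would be killed on
        %% the second reload.
        ?MODULE:loop(N)
    end.
```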

So I think you’re fighting a losing battle here. But hey, it’s open source so hack away :slight_smile:


The BEAM and previous Erlang VMs have only allowed 2 versions of a module: the current (or new) one and the old one. Changing that would require a major rewrite and also affect a lot of code which knows about this. E.g. the module code and the code server would require a major change, which could affect most code which knows about modules.


Maybe with some creative thinking you can make use of gen_statem’s {change,push,pop}_callback_module.

You will need to amend things so that new versions of the module actually get compiled with a unique name (append the version, maybe) and then message the processes to migrate over.

Of course this works only for gen’s.

A trick I used was pushing the compiled bytecode to the nodes directly and having them load it in which may help here.
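A rough sketch of that idea (the module names my_fsm_v1 and my_fsm_v2 and the migrate message are hypothetical): on a migration message, the state callback swaps in the newly loaded, uniquely named module via a change_callback_module transition action, keeping the state and data as-is:

```erlang
-module(my_fsm_v1).
-behaviour(gen_statem).
-export([start_link/0, init/1, callback_mode/0, running/3]).

start_link() ->
    gen_statem:start_link(?MODULE, [], []).

init([]) ->
    {ok, running, #{}}.

callback_mode() ->
    state_functions.

running(info, {migrate, NewModule}, Data) ->
    %% Keep the state name and data, but have all further events
    %% handled by NewModule (e.g. my_fsm_v2, compiled under a
    %% version-suffixed name as suggested above).
    {keep_state, Data, [{change_callback_module, NewModule}]};
running(_Type, _Event, Data) ->
    {keep_state, Data}.
```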


I’ve played around with the idea and think it can be done without much hassle or breaking existing code.

I’m interested in hearing more about the use case. Is there a particular project that would be helped by this?

Couldn’t that lead to long-running processes running old code for ages? :sweat_smile:


One of the key ideas in Erlang is that you are supposed to design for hot-loading. Amongst other things, every loop (whether explicit, via data base, or via higher-order function) in a module should either be obviously bounded so that it completes in rather less time than the interval between updates or it visibly must go through a remote call. It also means that if you want to replace version X with version Y, the entry point for the main loop in version Y when entered from version X must convert version X’s state data to version Y’s state data before proceeding with version Y’s main loop.
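In a hand-rolled loop, that conversion can live at the exported entry point. A sketch, with the state layouts invented for illustration: version Y’s loop/1 recognises version X’s state tuple and upgrades it before continuing with the new main loop.

```erlang
%% Version Y of the module. Version X represented its state as
%% {state, Count}; version Y adds a timestamp field.
-module(counter).
-export([loop/1]).

loop({state, Count}) ->
    %% Entered from version X via a remote call: convert the old
    %% state to the new layout before proceeding.
    loop({state, Count, erlang:monotonic_time()});
loop({state, _Count, _Since} = State) ->
    receive
        Msg ->
            handle(Msg, State)
    end.

handle(_Msg, State) ->
    ?MODULE:loop(State).
```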

Presumably one hot-loads version Y because version X is now in some sense WRONG or INADEQUATE – somehow it fails to correctly handle all the cases version Y needs to handle. That being the case, why would you want lots of old versions hanging around all continuing to do the wrong thing?

I note that nothing stops version Y containing a verbatim copy of everything in version X, plus a new main loop running the improved code, so that when a process running in version X does a remote call, it’s now in version Y but still running version X’s code.

Way back when I met Erlang’s “replace a module while the system is running”, I too thought “why don’t they allow any number of old versions?” Then I thought about it. Even allowing one version to temporarily survive is potentially dangerous. One old version is allowed to remain, but it should not remain for long . Just long enough to let existing clients switch over to the new version safely. We’re talking milliseconds here, not days.


What if that is what the user wants? I see no problems with it as long as it’s deliberate.

I didn’t have changing the defaults in mind, but rather letting users explicitly say that “yes, I want to load new code even though there’s already old code that hasn’t been purged.” Code that doesn’t supply said option would fail to load new code with a not_purged error just like before. As said error is the only way to observe the current limit of two generations, and having several entities fighting over a single module is a massive disaster to begin with, it seems like a safe enough change.

On the other hand, what if the oldest code cannot be upgraded? Saying that one should design for hot code upgrades doesn’t help those who didn’t (or did it wrong), and are now stuck between running buggy current code or loading new code and killing all the processes that are running old code. I don’t see many problems with allowing deliberate uses of this while keeping the old two-versions convention as the default and blessed way to do things.

Whether we should do this is a different matter. Making the VM and code server support this wouldn’t be a huge task, so if we put such arguments aside, would this be a helpful feature?


If the old code cannot be upgraded, it needs to be terminated with extreme prejudice.

I already explained how you can organise the new version so that the old code doesn’t KNOW it has been replaced. As long as it does that one thing right: no long-running loop that doesn’t go through a remote call.

Look, why do you WANT to replace a module?
To add new features? Fine. That’s the case where the old process doesn’t need to know that the module has changed.
To fix errors? Not good. The last thing you want is to let that code run any longer than you can help.

So you are asking me to believe in code

  • that was not designed with Erlang principles in mind,
  • that does not use any of the Erlang behaviours that make it easy to write code that does the right thing,
  • that cannot be upgraded using Erlang’s normal means,
  • that, DESPITE these design failures, is so close to working that it MUST be allowed to continue operating indefinitely,
  • yet at the same time is so broken that it must be replaced repeatedly within a short time.

I have a good imagination. I can imagine square circles. (Use the L_1 or L_∞ metric.) I can believe that 1+1+1 = 1 (which is TRUE in the field Z_2). I’ve contemplated finite processes taking infinite time. (Think about black holes.) I’ve even thought that some of my code wasn’t rubbish (testing cures that). But I just cannot believe in code that must be replaced but must not be replaced (because it can’t be upgraded).

I want to hear about REAL examples. I want DETAIL about why some particular module needs multiple replacements in quick succession but can’t follow the usual upgrade process.

By the way, if there are real examples, this is yet another argument for Safe Erlang.
To the extent that an Erlang system can be partitioned into virtual nodes, a virtual node can be left running old module versions as long as you want, because in that node the module is not replaced.


Looking at your final sentence, NO. It would not be a helpful feature. We’ve already had a comment from someone whose message I’ve deleted to the effect that it WOULD be a huge task.

Implementing this feature, to support a corner case that I argue probably doesn’t really exist, would divert resources from implementing something we see more and more clearly we’ve always needed, which is Safe Erlang/virtual nodes. Do that, and this questionably real corner case gets handled without a special feature. Implement, and test, and document, this feature, and that pushes Safe Erlang off further into never-never land.

Once you perceive the module name space as a collection of global mutable variables, you realise that we need to break that space up into smaller ones which can be independently updated (and not just modules, but the whole pid/name system). Logically isolated mini-nodes physically sharing such resources as they can: that’s the way to go. Security, maintainability, manageability, resilience all point the same way. And it’s NOT “let buggy code live forever.”

We’re not in disagreement: I too want to hear good reasons for why we need this. I just don’t want people to shy away from speaking up just because some people believe it would be a difficult change from a purely technical perspective, as it doesn’t look like that’s the case.

I simply want to hear more from the other side. I don’t think it’s fair to reject it without listening to what they have to say.

Yes, a comment from someone who is not on the OTP VM team like I am. I co-authored the current loader, and am pretty sure that I’ve spent more time replying to this thread than it would take me to make the proposed change.

I do not see technical hurdles as the problem here. I want to hear reasons why this should be added. If there are good reasons for making this change I’d be happy to do it, and if not, then I’m just as happy to leave things as they are.


I would argue that if someone wants an unbounded number of versions of some module to be available for processes to run in, they could do so today by including those versions in their module names, and then have a proxy that routes new calls to the appropriate one.
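A sketch of that workaround (all names hypothetical): each version is compiled under a suffixed name (m_v1, m_v2, …), and a thin proxy routes new calls to whichever module is currently registered as latest, while processes already executing m_v1 code are untouched by loading m_v2:

```erlang
-module(m_proxy).
-export([set_latest/1, call/2]).

%% Remember which versioned module (m_v1, m_v2, ...) is current.
set_latest(Module) ->
    persistent_term:put({?MODULE, latest}, Module).

%% New calls go to the latest version; old processes keep running
%% whatever versioned module they were spawned into.
call(Function, Args) ->
    Module = persistent_term:get({?MODULE, latest}),
    apply(Module, Function, Args).
```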

But I have yet to see a concrete use case for this.


I never said that I want that, quite the opposite :sweat_smile:

My reasoning is this.

If it is a long-running process, it is likely important, a crucial gear in the system that other things depend on, and I would want it to always execute the latest code, not run old code for decades. Also, if I can’t upgrade that component, the system itself is more or less stuck at that version; other components that depend on it must use it the old way.
OTOH, if it is a short-lived process, it is likely unimportant and expected to fail for any reason. Being short-lived, such processes should exit by themselves long before another code upgrade purges the old code they may be running.

Well, I’d argue that those who are running such code now will have that problem no matter what, even if you make that change. To be able to use it, they will have to upgrade OTP, for which they will have to restart anyway. And if that is the case, they might as well redesign their system and do it properly.
OTOH, if you make that change to ease life for people who did it wrong, you’re also removing a compelling incentive to do it right, or to set wrong things right. “Yeah, you did it wrong, but you can get away with it”, you know?
Also, I would argue that a system that made it to production at a scale where it must not be stopped but hot upgraded is likely to have gone through a period where the developer(s) realized that they didn’t follow proper design and set that right.

“It isn’t hard to do - famous last words” is a quote that has served me well as a rule of thumb :wink: (just kidding)

One of the annoying parts of operating high-availability systems is that they inevitably fail in ways no one expected. Arguments about incentives and proper design sound hollow to the poor operator that has to deal with the mess at three in the morning, they just want to get things working with as small of an impact as possible. I’ve been there countless times. It’s awful, and you can cuss over the design decisions all you want (someone else’s more often than not), but that doesn’t change the fact that it’s your responsibility to fix things now.

Right now, an operator cannot make a quick fix and let the day shift build a more complete solution, so to speak, because we only allow components that weren’t designed with upgrades in mind to be upgraded once without risking data loss. They could do so without blinking if this limitation was lifted.

Do I think that is reason enough to add this? No, I’m just pointing out that reality isn’t as clean as we’d like and I can imagine this being useful in some cases.

I would like to hear what @yeger has to say. I don’t want to reject this because of a lack of imagination on our part.


I don’t disagree, I’d sure hate to be that poor person :cold_sweat:

Anyway, yeah, we definitely need a use case. Without it, we are like donkeys arguing over the best way for laying eggs (is that a proverb outside of Germany?)


I am a big fan of Systemantics.

“Complicated systems produce unexpected outcomes (Generalized Uncertainty Principle)”
So how can we make life better for the poor schlub at 3am?
Well, not making systems so complicated would be a help.

“A complex system cannot be “made” to work. It either works or it does not.”
Again, the cybernetic principle is “SUFFICIENT variety”.
The system has to have enough complexity to recognise the situation it is in and select the appropriate response.

“The Fundamental Failure-Mode Theorem (F.F.T.): complex systems usually operate in a failure mode.”

“Loose systems last longer and work better. (Efficient systems are dangerous to themselves and to others.)”

What imaginary problem is our 3am working stiff faced with?

  • A complex system is broken. The breakage is traced (rightly or wrongly) to version X of module M.
  • Module M is replaced with version X+1. At least one critical process continues to run in version X.
  • Before that process has time to switch over to the new version, via an inter-module call, it is discovered that version X+1 is even more broken.
  • Module M is now replaced with version X+2 very very soon after the first replacement. But the critical process(es) is(are) still running in version X.

That’s what we are talking about, remember? Existing practice is that at this point Erlang would kill the process(es) using version X.

  • Processes built using Erlang principles like supervision are straightforwardly restartable, but this critical process is not. State that needs to be retained for long periods has been held as the process state instead of being periodically saved to ETS, DETS, or some data base. It is critical to the function of the system that the process not be killed, but must continue to run in a module that is known to be buggy. It is better to be mad and bad than dead.

I have been present when the following occurred:

  • someone leaned on a keyboard, shutting down a key server
  • a cleaner unplugged a server in order to plug in a vacuum cleaner
  • a technician doing preventive maintenance dropped a screwdriver and the flash lit up the machine hall, taking an entire bank of discs offline
  • a head crash dug a groove into a hard drive (of course the data were lost)
  • a live update to Windows rendered WSL non-functional
  • an iOS update disabled sound on an iPad
  • another iOS update bricked another iPad
  • a software update in macOS X deleted a crucial performance tool
  • a software update in Linux rendered backtrace() nonfunctional
  • (as a remote client) an entire cloud service went down
  • a software update rendered Ctrl-P and Ctrl-S in Firefox (and the same functions accessed through menus) nonfunctional – still doesn’t work
  • a Linux upgrade rendered the machine non-bootable, requiring a complete disc wipe and reinstall
  • a memory stick got bent in half
  • a child discovered that she could pull the DVD drive right out of a laptop, then “ooh, what’s this shiny thing? what happens when I rub it on the floor?”
  • some weird incompatibility between Zoom on Windows, macOS, and Ubuntu
  • a laptop battery expanded and rendered the track pad unusable
  • a still unexplained glitch wiped Linux off a dual-boot Inspiron; Windows survived. Later the battery on that machine swelled and euchred the trackpad
  • overrunning the atom table of a Macintosh Prolog interpreter caused it to not only crash the OS but to wipe the floppy it came on
  • a street of University buildings lost power abruptly
  • a demonstration failed because the web site it linked to had been decommissioned

We are not promised that we will be free of problems. (That is by no means a complete list. I don’t want to remember some of the others.) Erlang does not promise a trouble-free life. It CAN’T. Sometimes a system that’s supposed to be always on WILL shut down because of human error or hardware error. (In this country we have to worry about MTBE as well as MTBF – Mean Time Between Earthquakes.)

Given all the existing support Erlang has for maintaining and debugging live systems, just how often do people get stuck at 3 AM because Erlang doesn’t allow 3 versions of a module to be in use at once? Compared with how often there’s a fire, or a back-hoe, or rats in the wiring, or a battery fire, or a cyber-warfare attack, or a Windows update that bricked your machine(s), or a LOGICAL_BACKUP that produces a corrupt DB image?

If this is a problem that Erlang users meet often, then yes, put the effort into addressing it.

Perhaps by having erl_lint report ‘possible infinite loop with no remote call’.


If we just look at the current and possible semantics, then:

Current semantics:

  • code:load_*(...) loads new code for a module. Local function calls within the old module version still work. Even older module versions are auto-purged, so local function calls within them crash.
  • code:purge(Module) purges the old module version. Local function calls within it crash.

Possible new semantics, with an explicit flag to code:load_*/*:

  • code:load_*(...) loads new code for a module. Local function calls within old module versions still work.
  • code:purge(Module) purges old module versions. Local function calls within them crash.

That and the following were incorrect. See @jhogberg’s answer below.

Now we can argue about which is cleaner and simpler: that there can only be one old module version, or that only code:purge/1 can cause an old module version’s local function calls to crash.

An upgrade still works the same. Load new code. Make your server do a qualified call to the new module version. Purge the old code.

The only thing changed is that it becomes possible to repeat the first two steps.

Having spent eight years as said “schlub,” wishing for simplicity all the while, you can’t always get what you want.

Existing practice is to kill all processes using version X, including those that may be completely unaffected by the problem we’re trying to address and it’s rarely fun to throw the baby out with the bathwater. Only a small subset may be “mad and bad” and they can be killed manually if they don’t already crash on their own.
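For the manual-killing part, the VM already exposes what’s needed. A sketch in the shell, assuming the module in question is called m: list the processes still executing old code of m, then decide per process whether to kill it or leave it be.

```erlang
%% Processes still executing old code of m:
1> [P || P <- erlang:processes(), erlang:check_process_code(P, m)].
```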

All I’m saying is that I can see this being useful. Whether it’s useful enough to do something about this is another matter.

The current semantics is not to automatically purge old code, but to return a not_purged error: the old code must first be purged through code:purge/1 or code:soft_purge/1 (the latter cancelling the operation if there’s a process running old code, instead of killing it).
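A shell sketch of that behaviour, assuming a compiled module m on the code path:

```erlang
1> code:load_file(m).   %% first load: m becomes current
{module,m}
2> code:load_file(m).   %% reload: the previous version becomes old
{module,m}
3> code:load_file(m).   %% a second old generation is not allowed
{error,not_purged}
4> code:soft_purge(m).  %% succeeds (killing nothing) if no process runs old code
true
5> code:load_file(m).
{module,m}
```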


Aaalright! My bad.

The change is that code:load_* would no longer return not_purged, which relieves the user from handling that, often by purging. Auto-purging is a common pattern implemented by the user.

Still, what is simpler and clearer?
To be able to have only one current and one old module version, or to always be able to load a new current version and treat all old ones the same?

I think this was very well put and I’m completely in support of this approach to new ideas. We can’t simply shut down ideas because of notions we’ve long held, especially without hearing and understanding the why. I agree with a lot of the rationales expressed around not adding this functionality, but without understanding the potential use cases, I think it’s quite difficult to have an informed response.

Likewise, we need to be welcoming to new people within this community, so please, all, let us pause the arguments against this; those have been made quite clear. Let’s hear from the OP on the why.

Please @yeger, I’d love to hear from you on this.