For those unfamiliar, Erlang archives are a mechanism where Erlang can load .beam and .app files from ZIP files (with the .ez extension). Archives are expected to have an ebin directory inside them, and Erlang does a trick where, if you specify the wrong ebin directory, it will try to guess the correct one.
For example, imagine you have a foo.ez archive. The ebin might be at either of these two directories:
foo.ez/foo/ebin
foo.ez/ebin
Then, regardless of whether you pass -pa foo.ez/ebin or -pz foo.ez/foo/ebin, Erlang will try to find the correct ebin.
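For instance, with hypothetical paths, both of the following should succeed in the default relaxed mode (a sketch, not an exhaustive description of the corner cases):

```erlang
%% Hypothetical paths: in the default "relaxed" -code_path_choice
%% mode, either spelling should be accepted and resolved to the ebin
%% directory that actually exists inside the archive.
true = code:add_patha("/deps/foo.ez/ebin"),
true = code:add_patha("/deps/foo.ez/foo/ebin").
```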
Are you using archives?
1.a. If yes, are you passing the proper paths to the archives on calls to -pa, -pz or code:add_path?
1.b. If yes, does your app (and its release) still work if you set the -code_path_choice strict flag?
It is possible to load NIFs: store the NIF code inside the archive, and write it to a temporary location on disk before calling load_nif.
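Something along these lines should work; a minimal sketch, with all module, application, and file names made up:

```erlang
-module(foo_nif).
-on_load(init/0).
-export([]).

%% Sketch: the .so is assumed to be shipped inside foo.ez under priv/.
%% The regular file module cannot read inside archives, but
%% erl_prim_loader can.
init() ->
    SoInArchive = filename:join(code:priv_dir(foo), "foo_nif.so"),
    {ok, Bin, _FullName} = erl_prim_loader:get_file(SoInArchive),
    %% NIF libraries must exist on the real file system before they
    %% can be dlopen()ed, so copy the bytes out to a cache directory.
    SoPath = filename:join(filename:basedir(user_cache, "foo"),
                           "foo_nif.so"),
    ok = filelib:ensure_dir(SoPath),
    ok = file:write_file(SoPath, Bin),
    %% load_nif/2 takes the path without the .so extension.
    erlang:load_nif(filename:rootname(SoPath), 0).
```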
The problem I see with archives is that they still require an escript interpreter somewhere. It would be a lot more convenient if there were a way to embed a small ERTS instance into the archive as well, making it a portable archive. This is, however, not an easy task if more than one OS has to be supported.
I agree with @josevalim that the guesswork (checking both the ebin and app/ebin folders) may be confusing. However, I think the whole code loading story should be thought through, because right now, with 300+ apps in a release, we had to reinvent various wheels to speed up the linear search the code_server does when trying to load files.
In a nutshell, the issue is that if you prepend 100 dependencies/apps to your code path and you need to look up a module in Erlang/OTP, you have to do 100 lookups, one per app, to find out whether any of them provides such a module before falling back to Erlang/OTP. This makes loading modules more expensive the more dependencies you have.
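To make the cost concrete, here is an illustrative shell session (the numbers and paths are invented):

```erlang
1> length(code:get_path()).
103
2> code:which(lists).
"/usr/lib/erlang/lib/stdlib-5.2/ebin/lists.beam"
```

In interactive mode, that second call had to probe the ~100 dependency ebin directories that precede stdlib in the code path before the file was found.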
There were previous attempts at solving this by hashing all of the code paths and building a map from Module => File, but that was hard to use in practice: adding or removing code paths meant the hash had to be updated upfront, and some code paths could still be writable and receive new .beam files, which would not be possible once hashed.
My approach is to make the linear lookup faster by marking some directories as cached. If a directory is cached, the first time you attempt to look something up in it, we perform an ls and store all module names. Then a cache miss, which is the most common operation, is much faster.
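In sketch form, the idea looks roughly like this (not the actual code_server change):

```erlang
%% The first lookup in a cached directory lists it once and remembers
%% the module names; subsequent lookups (and, importantly, misses)
%% become a set membership test instead of a file system probe.
lookup(Mod, Dir, Cache) ->
    case Cache of
        #{Dir := Mods} ->
            case sets:is_element(Mod, Mods) of
                true ->
                    {{found, filename:join(Dir, atom_to_list(Mod) ++ ".beam")},
                     Cache};
                false ->
                    {miss, Cache}
            end;
        #{} ->
            {ok, Files} = file:list_dir(Dir),
            Mods = sets:from_list([list_to_atom(filename:rootname(F))
                                   || F <- Files,
                                      filename:extension(F) =:= ".beam"]),
            lookup(Mod, Dir, Cache#{Dir => Mods})
    end.
```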
This won’t be applied to all paths but a project has several paths that could benefit from this:
the paths from Erlang/OTP won’t be written to by most projects
same for the paths from Elixir
dependencies that come from Hex/Git are unlikely to be changed locally (doing so is generally frowned upon)
Even if you want to write to ebin directories afterwards, my proof of concept shows that enabling caching during boot itself and then immediately disabling it should make boot reasonably faster.
A future optimization is to remove parts of the linear lookup. For example, Erlang/OTP directories are typically stored sequentially in the code path. If you cache all of them, instead of having one cache per application, we can bundle them into a single cache, bringing performance characteristics similar to the previous cache implementation in OTP while being more granular and also lazy.
Probably unrelated, but would this potentially help with identifying module name conflicts early?
This is something I’m sure everyone has stumbled over at some point since the compilation process does not identify conflicts across applications without using a tool like xref.
That was my suggestion, and a patch I used before: whenever application:load_application is called, it loads the list of modules from the *.app file and then sends this info to the code_server. I did not submit the patch because it lacked proper release upgrade handling (release_handler needs to wipe the cache from the code_server).
However, we are no longer using it, in favour of explicit module loading triggered in our application code. It reads the list of all apps to start, finds the corresponding *.app files, and loads all modules from these files, already knowing the necessary paths. It also does this concurrently, leveraging the file I/O NIFs. Essentially, it’s a variant of the “embedded” code loading mode, but executed concurrently.
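For illustration, here is a heavily simplified sketch of that kind of loader (not our actual code; error handling omitted):

```erlang
%% Read the app's modules list from its .app file, fetch the object
%% code for each module concurrently, then load everything in one go.
load_app_modules(App) ->
    AppFile = filename:join([code:lib_dir(App), "ebin",
                             atom_to_list(App) ++ ".app"]),
    {ok, [{application, App, Props}]} = file:consult(AppFile),
    Mods = proplists:get_value(modules, Props, []),
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() ->
                               {Mod, Bin, File} = code:get_object_code(Mod),
                               Parent ! {Ref, {Mod, File, Bin}}
                           end),
                Ref
            end || Mod <- Mods],
    Prepared = [receive {Ref, MFB} -> MFB end || Ref <- Refs],
    ok = code:atomic_load(Prepared).
```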
The issue with preloading all modules upfront is that it can be expensive and bloat memory usage, especially on embedded devices. This is an issue we see in Livebook: you can use it on embedded devices, but you have to choose between embedded mode (fast boot, high memory) and interactive mode (slower boot, low memory). Frank told me that on one device it takes 40s for embedded and 70s for interactive, but I am hoping that with caching, interactive mode will be faster and use less memory.
However, I think the parallel loading you described would be super cool and useful for the boot instructions themselves. Today we are doing things such as:
Frankly, I always favoured embedded mode for embedded devices, simply stripping unused modules (there are plenty!) from applications during the release post-processing phase. I never automated the discovery of such modules, but it should not be hard to do by listing the modules actually loaded when running tests in CI.
I think the parallel loading you described would be super cool
Here is the hack I’m using. It works as a post-processing step after the release is generated, keeping only the preloaded apps and then inserting itself as the only app starting, with a list of apps that should be started. It then reads all *.app files and builds a DAG to enable concurrent startup of all the applications, honouring their dependencies (it also does an even grosser hack to do concurrent shutdown). I was hoping to convince my employer to let me work on a more powerful upstream patch, but wasn’t successful. If there is anything that you or OTP can salvage from my hacks, I’m happy to contribute.
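To give an idea, here is a loose sketch of just the DAG part (a hypothetical helper, far simpler than the real hack):

```erlang
%% Build a dependency graph from the .app metadata and derive a valid
%% start order. A real implementation would start all currently
%% unblocked apps in parallel and unblock their dependents as each
%% start completes, rather than walking the topological order.
start_order(Apps) ->
    G = digraph:new([acyclic]),
    [digraph:add_vertex(G, App) || App <- Apps],
    lists:foreach(
      fun(App) ->
              _ = application:load(App),
              {ok, Deps} = application:get_key(App, applications),
              [digraph:add_edge(G, Dep, App)
               || Dep <- Deps, lists:member(Dep, Apps)]
      end, Apps),
    Order = digraph_utils:topsort(G),
    digraph:delete(G),
    Order.
```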
What do you think?
This solution is too narrowly scoped and does not help with concurrent application startup (which has a much higher impact for us). I’d rather rewrite application_controller to allow non-blocking application startup. That would automatically result in modules being loaded concurrently, both within an application (the modules of the application itself) and across applications (since many of them would start concurrently).
I may be extremely naive here, but making application_controller start applications concurrently seems doable! The part I don’t have the chops to change is the release scripts, but I think other people will be glad to jump in there.
It also most likely means this needs to be opt-in behaviour, because some applications may implicitly rely on the boot order. Erlang/OTP 25+ (or 24+?) supports optional dependencies, so there should be no reason not to list all dependencies, but not all code may be using that yet.
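For reference, this is what declaring an optional dependency looks like in a hypothetical .app file:

```erlang
%% other_app must be started before myapp *if* it is included in the
%% release, but its absence is not an error.
{application, myapp,
 [{vsn, "1.0.0"},
  {modules, [myapp_app, myapp_sup]},
  {registered, []},
  {applications, [kernel, stdlib, other_app]},
  {optional_applications, [other_app]},
  {mod, {myapp_app, []}}]}.
```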
Apologies, but I don’t get this part. In a release, the loading of the modules happens via path + primLoad instructions per application. I have just learned that fetching the modules for loading happens in parallel, but loading the .beam into the VM does not.
So I don’t see how concurrent application startup will help with loading modules. We can potentially parallelize and speed up the module fetching bit, but I don’t think the code loading itself can run in parallel. This also means my previous suggestion for loading code in parallel won’t help much either, but, as you said, perhaps this is not the part to focus on right now.
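For reference, this is roughly what those instructions look like in a generated release .script file (a fragment; versions and module lists are made up):

```erlang
%% Each application contributes a path instruction followed by a
%% primLoad instruction listing its modules.
{path, ["$ROOT/lib/stdlib-5.2/ebin"]},
{primLoad, [gen_server, gen_statem, lists, maps]},
{path, ["$ROOT/lib/myapp-1.0.0/ebin"]},
{primLoad, [myapp_app, myapp_sup, myapp_worker]},
```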
There is a more radical approach to speed up loading: don’t do it at all.
I did a proof of concept of statically linked BEAM files. In short: during the build, load the files and stop after all transformations on the BEAM code, just before relocation. Then dump the result as an ELF file with a relocation table (a normal *.o file). These files can then be linked in as an additional library when building erts. This results in no loading at all; everything is runnable right after erts plus the BEAM file library is mapped into memory.
Smaller details: the indices into the atom table are hard-coded into the loaded code, and the index values depend on load order. This can be resolved by grouping all the converted ELF files with an additional one for the atom table (also including a lookup table for the BEAM files) and bundling them together into a library.
I stopped the project for two reasons: 1. no one was interested enough back then to pay for its completion and 2. the JIT came out.
The JIT doesn’t obviate the approach; it would actually improve it even more, since we could stick the JITed code into the ELF files. But our own use case was on the GRiSP platform, and for that we would need an ARM32 JIT first. So the ARM32 JIT became a prerequisite for us implementing this without third-party payment.
We will probably be working on the ARM32 JIT when we can secure funding for it (possibly EEF + Kickstarter; the reasons for that are beyond the scope of this thread).
As soon as we have the ARM32 JIT, we have an incentive to make the static linking work.
OR someone is interested enough in having that today for the i386 and ARM64 JITs to fund its development (or push the EEF to do so; there needs to be sufficient community interest in it).
The gain would be a much reduced startup time, plus executables that include the Erlang runtime and a release, and can just be run somewhere.
For some details and an early stage of the work, see the talk attached here.
This is definitely an option worth exploring, but I believe it is not a suitable technique during development/testing, right? The goal of my improvements is for when you have to run in interactive mode, for one reason or another.
I believe you can force some applications to boot earlier in a release. But that behaviour is already opt-in, so I think you have a point that order is not guaranteed unless the user explicitly says otherwise. I’m not sure though, as reltool is not my strong suit.
It is already concurrent at startup, but for a different reason. The boot (release) script, however, is executed synchronously, blocking any concurrency; hence the hack of removing sequential steps from the boot script. If there are applications relying on the boot order, they’d need to be updated with the correct dependencies list, which, as you mentioned, is now possible.
I have just learned that fetching the modules for loading happens in parallel but loading the .beam into the VM does not.
Some time ago I asked @garazdawi whether the JIT can leverage multiple CPU cores, and I vaguely recall that code:atomic_load/1 does something in that vein. That’s something I haven’t verified yet.
Concurrent application startup may also include the code loading for these applications. That is, instead of explicit primLoad instructions, it could be done as part of the application_controller code.