Prim_file speedup

nato · August 8, 2023, 3:53am

I swapped out some file:read_file and file:write_file calls today for their prim_file equivalents, and I noticed an astounding speedup. Other than the standard ‘don’t use this module’ reply, can I get some context on this module and why it’s there (along with the other erts prim modules)?

Appreciate the info!!

mikpe · August 8, 2023, 9:13am

prim_file is a building block for the so-called “file server”. The proper question to ask is why does Erlang need the file server. (I don’t know the answer to that, and I prefer not to speculate.)

raimo · August 8, 2023, 2:47pm

The purpose of this indirection is to be able to run on a diskless machine, so a node can be configured to run with a remote file server. Therefore file calls a file server process for all its operations. The file server uses prim_file as backend to its local file system.

Applications then do not need to know on which machine the file system is, as long as they are using file to access it.

There is a raw file mode option that can be used for most file operations, e.g. for file:write_file/3. Try it and see if the performance is close enough to prim_file:write_file/3.
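A minimal sketch of that comparison, for anyone who wants to try it: it times repeated file:write_file/3 calls with and without the raw option using timer:tc/1. The module name, the /tmp path, and the default count and size are assumptions made for illustration, and the numbers will depend heavily on the machine and file system.

```erlang
-module(raw_write_bench).
-export([run/0, run/2]).

%% Hypothetical helper, not part of OTP.
%% Writes a Size-byte binary Count times, first through the file server
%% (no modes) and then with the raw option, and returns both timings.
run() ->
    run(1000, 4096).

run(Count, Size) ->
    %% Assumed scratch location; change it to suit your system.
    Path = filename:join("/tmp", "raw_write_bench.tmp"),
    Bytes = binary:copy(<<0>>, Size),
    {ViaServerUs, ok} = timer:tc(fun() -> loop(Count, Path, Bytes, []) end),
    {RawUs, ok} = timer:tc(fun() -> loop(Count, Path, Bytes, [raw]) end),
    ok = file:delete(Path),
    #{via_file_server_us => ViaServerUs, raw_us => RawUs}.

%% Calls file:write_file/3 N times with the given modes.
loop(0, _Path, _Bytes, _Modes) ->
    ok;
loop(N, Path, Bytes, Modes) ->
    ok = file:write_file(Path, Bytes, Modes),
    loop(N - 1, Path, Bytes, Modes).
```

Calling raw_write_bench:run() returns a map with both timings in microseconds. It only measures sequential calls from a single process, so it says nothing about contention when many processes hit the file server at once.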

Unfortunately, file:read_file/1 does not today have a variant that takes a Modes argument…

mikpe · August 8, 2023, 3:00pm

It’s one thing to have an indirection to add flexibility, it’s another to indirect through a single (*) server process. Is the latter still needed, and if so why?

(*) It used to be a single process per node, please correct me if that’s no longer the case.

raimo · August 8, 2023, 3:42pm

It is still a single server process per node, called by a registered name.

It needs to have a registered name so clients know who to call, and therefore it has to be a single process.

The process is a dispatcher, though, so file:open/* creates a new process that becomes the file handle, that is accessed by pid(), so operations on different open files can be done concurrently.

Other operations are considered “fast”, such as delete and rename, so they are handled by the single file server, sequentially, to get a well defined order between the operations.

Unfortunately there are a few operations that are not that “fast”, such as read_file, write_file and copy. Maybe they could do well to spawn off in a worker process to get more concurrency…
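The arrangement above (one registered dispatcher, plus one process per open file) can be observed with a small sketch like the following. The module name and /tmp path are hypothetical, and file_server_2 is, as far as I know, the registered name used in current OTP releases; treat it as an assumption and check your own release.

```erlang
-module(file_handle_demo).
-export([run/0]).

%% Hypothetical helper, not part of OTP.
run() ->
    %% Assumed scratch location; change it to suit your system.
    Path = filename:join("/tmp", "file_handle_demo.tmp"),
    ok = file:write_file(Path, <<"hello">>),
    %% Assumption: the node-wide file server is registered as file_server_2.
    ServerPid = whereis(file_server_2),
    %% Without raw, file:open/2 returns a pid: a separate process per open file.
    {ok, IoDevice} = file:open(Path, [read, binary]),
    %% With raw, the handle is a descriptor term owned by the caller, not a pid.
    {ok, RawFd} = file:open(Path, [read, binary, raw]),
    Result = #{server_is_pid => is_pid(ServerPid),
               handle_is_pid => is_pid(IoDevice),
               handle_is_not_server => IoDevice =/= ServerPid,
               raw_handle_is_pid => is_pid(RawFd)},
    ok = file:close(IoDevice),
    ok = file:close(RawFd),
    ok = file:delete(Path),
    Result.
```

Reads and writes on IoDevice go to that per-file process, which is why operations on different open files can proceed concurrently, while whole-file calls such as read_file and write_file without raw still go through the single server.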

Led · August 8, 2023, 7:07pm

And of course you used file:read_file and file:write_file with the ‘raw’ option?

nato · August 8, 2023, 11:50pm

Thanks for all the info. On my short-list was to try the raw option which I suspect will be comparable. If it’s not, I will certainly report back, here.

raimo · August 9, 2023, 6:39am

@Led: As I said, unfortunately there is no file:read_file/1 variant that takes a Modes argument. Therefore it is not possible to give it a raw mode option. But file:write_file(Filename, Bytes, Modes) does have one.

Led · August 9, 2023, 8:12am

1> file:read_file("/tmp/erlang.mk", [raw]).
{ok,<<"# Copyright (c) 2013-2016, Loïc Hoguin <essen@ninenines.eu>\n#\n# Permission to use, copy, modify, and/or dist"/utf8...>>}

UPD: Sorry, my mistake, that’s b4230bea4d743f189c2e23aee1651f4ffd26de74 backported to my personal build.

TD5 · August 9, 2023, 8:22am

Looks like the function exists, but is undocumented.

jhogberg · August 9, 2023, 8:22am

It both exists and is documented, but in OTP 27 which hasn’t been released yet.

TD5 · August 9, 2023, 9:42am

“in OTP 27 which hasn’t been released yet”

Count me excited, then

josevalim · August 11, 2023, 3:50pm

I wonder if it could be replaced by persistent_term, which would store which file_server module to invoke? Otherwise you end up in this split scenario where some may want to use diskless machines but a lot of existing code is hardcoding [raw] for performance, making the diskless mode less interoperable.
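A rough sketch of what that idea might look like, purely as an illustration and not how OTP works today: the backend module is looked up from persistent_term on each call, defaulting to file. The module name file_backend_sketch, the key, and set_backend/1 are all hypothetical.

```erlang
-module(file_backend_sketch).
-export([set_backend/1, read_file/1]).

%% Hypothetical key under which the chosen backend module is stored.
-define(KEY, {?MODULE, backend}).

%% Stores the module that should serve file operations on this node,
%% e.g. file (via the file server) or prim_file (local, direct).
set_backend(Mod) when is_atom(Mod) ->
    persistent_term:put(?KEY, Mod).

%% Dispatches to the configured backend; falls back to file if none is set.
read_file(Path) ->
    Mod = persistent_term:get(?KEY, file),
    Mod:read_file(Path).
```

With something like this, callers would not hardcode [raw]; a diskless node would leave the default in place, and a node with a local disk could call file_backend_sketch:set_backend(prim_file) once at startup. Note that persistent_term:put/2 is expensive when it changes an existing value, so this only suits a setting that is written rarely.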