Thoughts on the default timeout for gen_server:call/*

MononcQc · August 23, 2023, 3:06pm

My kneejerk reaction to anyone asking for infinity timeouts is to just say “why not set it to 15 years?”

People inevitably go “that’s a ridiculous time” for me to just say “15 years is much shorter than infinity. Maybe we want to pick a number we think is reasonable then” and have a discussion about what’s the actual expectation for performance.

I don’t feel strongly either way, because there’s no way for me to go around and have that chat with everyone who wants infinite timeouts. But the idea of having a timeout, any at all, isn’t a bad one if the cost of setting up the timers/receive counters is low.

erszcz · August 23, 2023, 4:10pm

I second that explicit timeouts are better than relying on a default, whatever it is.

The value of a timeout in a gen_server:call is that it leads to a visible error logged somewhere. It might be daunting having to extend these wait times, but at least the errors are loud and easy to spot.
Imagine having to hunt down a process that silently waits with an infinite wait time instead. What tools would we use to spot that easily? Or what tools easier to employ than just browsing the logs?

When choosing a timeout value I default to something orders of magnitude bigger than the expected duration of the operation. The request should finish in a second? Fine, let’s give it a timeout of 1 hour. It should be ready in a minute? Sure, 24 hours likely is “infinite-enough”.

This has two benefits:

the wait times can be adjusted in tests, so tests run (fail) faster if stuff doesn’t work (in tests there’s usually no load, so no need to wait long)
in the worst case if something doesn’t work in staging / prod, you learn from the logs the next day, because a request will ultimately fail instead of hanging forever (or until hunted down)

Infinity is the only sane default.

I’m afraid it might still lead to a situation worse than the current one, i.e. when paying attention we would use an explicit timeout, but in anger or in a hurry we would use the default. Ultimately, we would end up worse than now, i.e. with hanging requests instead of fail-fast errors.

I think the “to default or not to default” can be approached from a different angle. It’s simple to do a linter check which checks if it’s a gen_server:call/2 or gen_server:call/3 called and warn on the former, i.e. teach and remind ourselves to use explicit timeouts.

starbelly · August 23, 2023, 11:09pm

I hear the point and it’s hard for me to push back on it. We can say that we’ll design carefully, test, QA, stage, etc. thoroughly, but still you drive a point home (one that I’m picking up anyway) : bugs are going to happen and it wouldn’t be great if this happened in a way that a proc spends infinity waiting. This of course leads me to the questions you asked, and one I have no answer for… but I feel like it should be, even if we don’t ever have infinity as a timeout, there should be something… don’t say timeouts other than infinity There has to be something else

^ this is where the conversation is invaluable regardless of any action items that might (most likely will not) fall out of it. It’s thought provoking

I think that’s an excellent practice. Not unlike what I said above, at the very least, this thread can help people in the future I do think developers struggle with this as it’s not intuitive

^ terribly good idea. This could be put in place regardless of the original discussion.

elbrujohalcon · August 24, 2023, 4:31am

I think the “to default or not to default” can be approached from a different angle. It’s simple to do a linter check which checks if it’s a gen_server:call/2 or gen_server:call/3 called and warn on the former, i.e. teach and remind ourselves to use explicit timeouts.

You can already do this with Elvis, just add the following rule to your elvis.config:

{elvis_style, no_call, #{ no_call_functions => [{gen_server, call, 2}] }}

erszcz · August 24, 2023, 8:13am

Of course there is, we have great introspection tools available on the BEAM! The basic thing we could leverage is erlang:process_info/2:

> P = some_pid().
> erlang:process_info(P, current_function).
{current_function,{prim_eval,'receive',2}}

How do we know the suspected P in the first place, though?
We can see the same info in Observer or use tracing to find a hanging process, so there are tools that help in this case.

My point was that those are high-effort tools for active troubleshooting. With a default of infinity it stops being a matter of just looking out for the red colour or an error pattern when browsing logs.

eproxus · August 24, 2023, 8:58am

One big difference with changing the default to infinity is that it will be much harder to detect dead locks, e.g. making a mistake where two gen_servers happen to call each other and both are just waiting forever for a reply.

starbelly · August 24, 2023, 11:37am

What you and others have said give me pause. I still feel like infinity is generally a nice way to go, but it depends on the design. Maybe forcing people into a particular way of designing isn’t great. Then there’s of course what you and @erszcz have alluded to : mistakes happen. I’d like to dig more into that.

All that said and going back to the original post (from 13 years ago) the suggestion was to remove gen_server:call/2 and replace it with gen_server:insert_new_name/2 and/or steer people towards gen_server:call/3 with a suggestion of infinity.

What if instead we gave up a convenience and soft deprecated gen_server:call/2 and never looked back. We could better document gen_server:call/3 with suggestions and considerations. I wonder what if that would be a nice middle ground without the problem of worrying about backwards compat?

It would at least give people pause to think about the timeout value they’re going to put in place

asabil · August 24, 2023, 5:06pm

Don’t know if it’s worth the trouble, but maybe it’s time for a potential gen_service to replace gen_server, the same way gen_statem replaced gen_fsm?

starbelly · August 24, 2023, 5:31pm

I’d go for that, but I’m not sure there’s going to be consensus around infinity, but you may also be suggesting just in general?

ingela · August 25, 2023, 6:27am

Making a call is saying that you want synchronous communication and the whole point is that you want the process to hang/block until you get the answer (or fails due to node going down or the process crashes). Most of the time if it is worth having a timeout it is worth making it server side. And if you still want a client side timeout, you can, and you need to consider what it means an set an appropriate value. What is waiting long enough is very application specific, and hence there is no integer value that makes sense in all cases. If you do not need communications to be synchronous do not use call, use cast that is what it is for! Also you need to consider your applications process structure as a whole and have supervision structures that may kill and restart processes. Having two gen_servers that call each other so that you end up in deadlock is of course troublesome, but it is design flaw (and yes people make mistakes) but we can never fool proof the language. But the gen_behaviours are not here so that designers can be oblivious to how processes work and communicate. They are here for facilitating to concentrate on the problem at hand and being able to reuse code and make the processes fit into the supervisor tree, however we still need to understand what we are doing!

asabil · August 25, 2023, 6:36am

Remove call/2, add support for proper server side timeouts/deadlines, figure out a way to handle an overloaded services…

I think there are quite a few improvements that could be brought to gen_server, but are being restricted for backward compatibility reasons. Thus the suggestion of considering a new gen_service.

starbelly · August 25, 2023, 2:56pm

FWIW I agree with everything you said to the letter. That said, if there was such push back in a myriad of directions, I’d also be happy with a compromise. It is of course in the end up to the OTP team

max-au · August 25, 2023, 6:19pm

My suggestion would be to simply deprecate call/2, and require developers to specify the timeout explicitly. It would’ve saved me from a couple of nasty bugs, and made me think more about timings in my system. While having the default is convenient, I feel it is safer not to have one, and be explicit about it.

starbelly · August 25, 2023, 6:25pm

Yeah, that’s the middle ground to me. I completely agree with @ingela’s rationale of course, but I’d take the middle ground over the current setup

jj1bdx · August 29, 2023, 4:32am

I would like to state that I do appreciate all the participants of this topic and thread for a very much informative discussion. I don’t have a personal preference or a tip to contribute, but now I understand there are many things to consider and to be taken care about.