Thoughts on the default timeout for gen_server:call/*

starbelly · August 21, 2023, 6:40pm

Something I’ve gone back and forth on over the years is the default timeout for gen_server:call/{2,3}. It’s currently 5 seconds. I dug into the history of this: at one point there was a discussion, and I believe @kennethL was involved in it, where it was agreed that the timeout should be infinity. The rationale was that we should be crafting our gen_servers and such with care and caution (i.e., ensure that your process always returns), which in turn lets the caller side be less defensive (i.e., you don’t have to worry about catching exits). I agree with that rationale, but perhaps someone can talk me out of it. I assume this never happened because of backwards compatibility (changing it to infinity would result in huge upgrade challenges for existing systems).
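A minimal sketch of the two behaviours being compared, assuming a throwaway server (the module name `call_timeout_demo` and the `{sleep, Ms}` request are made up for illustration). `gen_server:call/2` behaves like `gen_server:call/3` with a 5000 ms timeout; to keep the example fast, a 50 ms timeout stands in for that default being exceeded.

```erlang
-module(call_timeout_demo).
-behaviour(gen_server).

-export([demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

init([]) ->
    {ok, no_state}.

%% Hypothetical request: sleep for Ms milliseconds, then reply.
handle_call({sleep, Ms}, _From, State) ->
    timer:sleep(Ms),
    {reply, done, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

demo() ->
    %% Unlinked start, so the caller's timeout exit cannot affect the server.
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    %% gen_server:call/2 is gen_server:call/3 with Timeout = 5000.
    %% Here a 50 ms timeout stands in for that default being exceeded.
    First =
        try gen_server:call(Pid, {sleep, 200}, 50)
        catch exit:{timeout, {gen_server, call, _}} -> timed_out
        end,
    %% With infinity the caller simply waits until the server answers.
    Second = gen_server:call(Pid, {sleep, 200}, infinity),
    ok = gen_server:stop(Pid),
    {First, Second}.
```

Note that the server still sleeps through the first request even though the caller gave up on it: the timeout exists only on the caller side.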

There are of course exceptions to this: we don’t want to make a call to another node with infinity as the timeout (in the case where we’re not using rpc), and there will be exceptions within a node too, but in general this feels right to me.

It can be argued that having a very short timeout may also lead you to careful design, but I have doubts about that.

Quite curious to hear from all!

Inforista · August 21, 2023, 8:20pm

Interesting topic

I am not very experienced with gen_server, but this comes to mind:

I think if the goal of infinity is to guarantee that the caller ALWAYS receives an answer (eventually), then this case will be troublesome:

Caller sends message to gen_server
gen_server crashes before result is sent to caller

Thus the caller will be blocked forever.

I also think that it is the sole responsibility of the calling process to handle the case where a response is delayed or no response arrives.

I think the infinity solution (if the goal is just to make life easier for the calling process) will be confusing: sometimes you don’t have to handle the possibility of no answer, and for other, non-gen_server calls you do.

starbelly · August 21, 2023, 8:32pm

gen should take care of this.

See the Erlang gen_server documentation, specifically the reasons why an exit may happen:

The server process exited during the call, with reason Reason. Either by returning {stop,Reason,_} from its Module:handle_call/3 callback (without replying), by raising an exception, or due to getting an exit signal it did not trap.
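To make the quoted documentation concrete, here is a small sketch (hypothetical module `stop_reply_demo`) of the first case: the server returns the 3-tuple `{stop, Reason, State}` from `handle_call/3`, which sends no reply. Even with `infinity` as the timeout, the caller is not left waiting; it exits with `{Reason, {gen_server, call, Args}}`. The server is started unlinked with `gen_server:start/3`, so only the call itself is affected.

```erlang
-module(stop_reply_demo).
-behaviour(gen_server).

-export([demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

init([]) ->
    {ok, no_state}.

%% {stop, Reason, State} (no Reply element) terminates without replying.
%% A {shutdown, _} reason avoids a crash report in the logs.
handle_call(stop_without_reply, _From, State) ->
    {stop, {shutdown, no_reply}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

demo() ->
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    try gen_server:call(Pid, stop_without_reply, infinity) of
        Reply -> {unexpected_reply, Reply}
    catch
        %% The caller exits with the server's exit reason, despite infinity.
        exit:{{shutdown, no_reply}, {gen_server, call, _}} -> caller_exited
    end.
```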

Inforista · August 21, 2023, 8:45pm

But if the point of introducing infinity is to have the caller not worrying about catching exits… then I still have to catch exits in the case where the gen_server breaks down?

starbelly · August 21, 2023, 11:13pm

That’s an excellent point. I would say in this case, let it crash. In fact, we should let it crash in most cases (timeouts aside). There are exceptions to this (look at gen_server.erl itself), namely if we’re operating within our error kernel (or an error kernel).

I think the point to make here is about trust. A short timeout (implicit or explicit) can be said to express: “I don’t trust the process I’m calling.” If there’s some truth to that, then it’s defensive.

There are other problems with relying on timeouts in general (remember, we do need timeouts in some places), such as managing timeouts in a complex system. A calls B with a timeout of 10 seconds, B calls C with a timeout of 15 seconds, but C calls D with a timeout of 5 seconds, so A may give up while B is still prepared to keep waiting on C. And that’s a simplified example. Put another way: the problem of managing cascading timeouts in complex systems.

So we can take care to avoid these situations, but what if by default we didn’t have to think about it so much? What if by default you knew you had to craft your gen_* with great care (which you should be doing anyway)?

Then there’s the “I forgot” situation. Let’s say you have a process responsible for receiving some data and shipping it out to some external service. The amount of time that will take is non-deterministic, and on the server side there is a timeout to deal with that constraint, yet when writing the caller code you forgot about the 5-second default, and all of a sudden your external service is taking 60 seconds to respond. Not great, as what usually ensues is the client hitting the server again while the server is still doing the work. Rinse, wash, repeat, and you have a nasty situation on your hands. However, there are at least two ways to look at it (which is why I’m interested in your and other people’s thoughts). Still, we should be trusting our server to do the right thing; if we can’t, then we have a bug.

Understand, I get your points, and they are ones that I’ve leaned on in the past, but lately I’m leaning towards infinity.

In the end, it’s not a huge deal; I can always just write code that specifies infinity. But I’ve pondered this and wonder what the pros and cons would be if we just went with infinity as a default. Even more specifically, would this help send people down a good path with regard to taking care when crafting their processes?

ingela · August 22, 2023, 7:34am

Infinity is the only sane default; we would change it if it were not for the huge amount of code that might start working differently than its designers expect. I always use infinity when I use calls. The big problem with having a timeout is that the timeout is client-side only; the server knows nothing about it. So the server-side function will most likely be performed anyway, and historically a late answer would still have arrived at the client (if the client catches the gen_server:call exit and survives it). I think this is actually handled nowadays, but the server is still oblivious to the timeout. When I need a timeout, I make it server-side. A well-designed server will not block itself, will stay responsive, and can implement server-side timeouts. Also, a server that crashes or an Erlang node that goes down will result in the process doing the call receiving an exit signal.
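One way to read "make the timeout server-side" is sketched below, assuming a hypothetical `server_timeout_demo` module whose per-request budget is fixed when it starts. The client calls with `infinity`; the server returns `{noreply, ...}` from `handle_call/3`, hands the slow work to a spawned process so the server loop stays responsive, and answers later with `gen_server:reply/2`: either with the result or with `{error, timeout}` when its own timer fires first.

```erlang
-module(server_timeout_demo).
-behaviour(gen_server).

-export([start/1, request/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% TimeoutMs is a hypothetical server-side budget applied to every request.
start(TimeoutMs) ->
    gen_server:start(?MODULE, TimeoutMs, []).

%% The client trusts the server to always answer, so it waits with infinity.
request(Pid, WorkMs) ->
    gen_server:call(Pid, {work, WorkMs}, infinity).

init(TimeoutMs) ->
    {ok, #{timeout => TimeoutMs, pending => #{}}}.

handle_call({work, WorkMs}, From, #{timeout := T, pending := P} = S) ->
    Ref = make_ref(),
    Server = self(),
    %% Do the slow work outside the server loop; report back to the server.
    spawn(fun() -> timer:sleep(WorkMs), Server ! {done, Ref, {ok, WorkMs}} end),
    TRef = erlang:send_after(T, Server, {expired, Ref}),
    {noreply, S#{pending := P#{Ref => {From, TRef}}}}.

handle_cast(_Msg, S) ->
    {noreply, S}.

handle_info({done, Ref, Result}, #{pending := P} = S) ->
    case maps:take(Ref, P) of
        {{From, TRef}, P2} ->
            erlang:cancel_timer(TRef),
            gen_server:reply(From, Result),
            {noreply, S#{pending := P2}};
        error ->
            %% Already answered with {error, timeout}: drop the late result.
            {noreply, S}
    end;
handle_info({expired, Ref}, #{pending := P} = S) ->
    case maps:take(Ref, P) of
        {{From, _TRef}, P2} ->
            gen_server:reply(From, {error, timeout}),
            {noreply, S#{pending := P2}};
        error ->
            %% The result won the race; this timer message is stale.
            {noreply, S}
    end.
```

The worker here is a plain unlinked `spawn`; if it crashed, the timer would still produce an `{error, timeout}` answer, but a production version would more likely monitor the worker and reply with the failure.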

Maria-12648430 · August 22, 2023, 7:55am

FWIW, gen_statem:call/2 uses infinity as the default timeout, which deviates from gen_fsm:sync_send_event/2, which used 5 seconds.
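For comparison, a minimal `gen_statem` (hypothetical module `statem_call_demo`, using the `state_functions` callback mode) where the client calls `gen_statem:call/2` without a timeout argument, which means it waits with `infinity`.

```erlang
-module(statem_call_demo).
-behaviour(gen_statem).

-export([start/0, ping/1]).
-export([init/1, callback_mode/0, idle/3]).

start() ->
    gen_statem:start(?MODULE, [], []).

%% No timeout argument: gen_statem:call/2 waits with infinity.
ping(Pid) ->
    gen_statem:call(Pid, ping).

callback_mode() ->
    state_functions.

init([]) ->
    {ok, idle, no_data}.

%% Synchronous calls arrive as {call, From} events; reply via an action.
idle({call, From}, ping, Data) ->
    {keep_state, Data, [{reply, From, pong}]}.
```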

starbelly · August 22, 2023, 11:05am

Thank you for your precise response. I think there was originally some discussion around introducing a new variation of call back then; I wouldn’t even seek that.

I wonder if the OTP team would be open to documenting this, though?

starbelly · August 22, 2023, 12:38pm

I do believe gen_statem is underutilized as well; thanks for the callout!

ingela · August 22, 2023, 2:03pm

This is because gen_statem is fairly new and did not need to adhere to legacy behavior.

ingela · August 22, 2023, 2:07pm

There are a lot of legacy applications that would benefit from using gen_statem instead, but doing the rewrite is not always going to be prioritized! But I do hope it will catch on for new applications!

ingela · August 22, 2023, 2:10pm

I think we are open to improving the documentation. It might however not be obvious exactly what to write and where.

scherrey · August 22, 2023, 2:52pm

Agreed: if only the client knows there’s a timeout, then that’s a Bad Thing™. It would be nice if a “deadline” could optionally be sent along to the server, so it would know whether it should process the request or drop it.

caravan_muffin · August 22, 2023, 3:19pm

I personally believe there should not be any default, since no single value is universally good. The caller must provide a timeout value every time, and can then decide which set of tradeoffs works best for them. Otherwise it is all too easy to forget what the implied value is.
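A sketch of the "no default" position, assuming a hypothetical project-local wrapper module `explicit_call`: it deliberately exports no 2-arity call, so every call site has to spell out its timeout. The guard only accepts a non-negative integer number of milliseconds or `infinity`.

```erlang
-module(explicit_call).

-export([call/3]).

%% Hypothetical wrapper: there is intentionally no call/2, so the timeout
%% can never be left implicit. Anything other than a non-negative integer
%% or infinity fails with function_clause.
call(ServerRef, Request, Timeout)
  when Timeout =:= infinity;
       is_integer(Timeout), Timeout >= 0 ->
    gen_server:call(ServerRef, Request, Timeout).
```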

ingela · August 22, 2023, 3:30pm

That is how you implement a server-side timeout: let the client send the desired timeout value to the server and have the server handle it.
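A minimal sketch combining the deadline idea with sending the timeout to the server (hypothetical module `deadline_demo`). The client turns its budget into an absolute deadline and sends it with the request; a request that sat in the server's mailbox past its deadline is answered with `{error, deadline_exceeded}` instead of being processed. `erlang:monotonic_time/1` is node-local, so this assumes client and server run on the same node.

```erlang
-module(deadline_demo).
-behaviour(gen_server).

-export([start/0, call_with_budget/3, demo/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start() ->
    gen_server:start(?MODULE, [], []).

%% Client side: convert a relative budget into an absolute deadline.
%% Monotonic time is only comparable within one node.
call_with_budget(Pid, WorkMs, BudgetMs) ->
    Deadline = erlang:monotonic_time(millisecond) + BudgetMs,
    gen_server:call(Pid, {work, WorkMs, Deadline}, infinity).

init([]) ->
    {ok, no_state}.

handle_call({work, WorkMs, Deadline}, _From, State) ->
    case erlang:monotonic_time(millisecond) > Deadline of
        true ->
            %% Waited in the mailbox too long: drop it without doing the work.
            {reply, {error, deadline_exceeded}, State};
        false ->
            timer:sleep(WorkMs),
            {reply, {ok, WorkMs}, State}
    end.

handle_cast(_Msg, State) ->
    {noreply, State}.

demo() ->
    {ok, Pid} = start(),
    Self = self(),
    %% Keep the server busy for 200 ms from another process.
    spawn(fun() -> Self ! {first, call_with_budget(Pid, 200, 1000)} end),
    timer:sleep(20),
    %% Queued behind the first request, this one outlives its 50 ms budget.
    Second = call_with_budget(Pid, 10, 50),
    First = receive {first, R} -> R end,
    {First, Second}.
```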

ingela · August 22, 2023, 3:33pm

I would argue that having infinity as the default is, conceptually, not having a default in this respect. The problem is that changing the behavior of gen_server:call affects a lot of legacy code out there!

starbelly · August 22, 2023, 7:54pm

I had a thought about the backwards compatibility problem… let me preface this by saying it may be a bad idea, but I’d be remiss not to share an idea even if I think it may be bad.

What we could do is introduce this as an opt-in feature using Erlang’s features mechanism.

rebar, mix, etc. could perhaps be updated to enable the new feature by default. Old projects don’t have to worry about it unless they choose to, and new projects get it for free via the app and lib templates of the project management tools.

After a VERY long period of time, it could maybe be switched on for good, perhaps accompanied by a way to still turn it off.

It could work, but there might be issues that I’m not thinking of. Nonetheless, the idea has been shared, for good or for bad.

ingela · August 23, 2023, 5:20am

Well, yes, that might be feasible. @rickard, what do you think?

rvirding · August 23, 2023, 7:28am

There is one thing you have to be aware of: when you do gen_server:call, it automatically monitors the server process. This means that if the server process dies while you are waiting for a reply, the call function will receive a 'DOWN' message and exit. So you will never block waiting for a crashed server.

Check the gen.erl module which does the core work and the function do_call/4 which monitors, sends the call request and waits for the reply.
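The monitor/send/receive pattern described above can be sketched as follows. This is a deliberately simplified stand-in (hypothetical module `mini_call` and message tag `'$mini_call'`), not the real `gen:do_call/4`, which has more to it (for example, recent OTP releases use process aliases for the reply). It shows why a crashed server cannot leave the caller blocked even with `infinity`: the monitor turns the crash into a `'DOWN'` message that ends the wait.

```erlang
-module(mini_call).

-export([call/3, demo/0]).

%% Simplified sketch: monitor, send, then wait for a reply or 'DOWN'.
call(Pid, Request, Timeout) ->
    Mref = erlang:monitor(process, Pid),
    Pid ! {'$mini_call', {self(), Mref}, Request},
    receive
        {Mref, Reply} ->
            erlang:demonitor(Mref, [flush]),
            Reply;
        {'DOWN', Mref, process, _Pid, Reason} ->
            %% The server died: exit instead of blocking forever.
            exit(Reason)
    after Timeout ->
        erlang:demonitor(Mref, [flush]),
        exit(timeout)
    end.

demo() ->
    %% A "server" that dies on receiving a request, without replying.
    Server = spawn(fun() ->
                           receive {'$mini_call', _From, _Req} -> exit(crashed) end
                   end),
    try call(Server, hello, infinity)
    catch exit:crashed -> caller_was_not_blocked
    end.
```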

Inforista · August 23, 2023, 7:54am

Thank you for the clarification @rvirding

I am still quite new to Erlang, but am so happy to have discovered it for my work on an extended version of the concept of distributed shared memory (Tuplespace).

It is really a well designed language with some amazing possibilities.

Especially intriguing is the way lightweight, independent processes can easily be used to model real-world scenarios; it makes many development tasks so much easier.

Even full-stack web development can be done purely with Erlang and the Nitrogen framework: fantastic!