Gen_statem: looping with {next_event, internal, _} blocks calls

I have a gen_statem that needs to loop.

I wrote something like the following:

init({}) ->
    {ok, no_state, no_state_data, [{next_event, internal, loop}]}.

handle_event(internal, loop, _, _) ->
    timer:sleep(500), % do some simulated work.
    {keep_state_and_data, [{next_event, internal, loop}]}.

But, because next_event is higher priority than call or system messages, my process never sees the stop request and continues looping. The call to gen_statem:stop/1 blocks indefinitely.

How should I implement a polling loop in a gen_statem? How should I tell it to stop?

As a workaround, I’ve inserted a generic timeout (loop → timeout → loop → …), which seems to resolve the problem, but it’s not particularly pleasing.
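For reference, the timeout workaround looks roughly like this (a sketch; the sleep stands in for the real work):

```erlang
%% Each iteration arms a zero-length generic timeout. Unlike
%% next_event, the timeout event is inserted at the back of the
%% internal queue, so already-received calls and system messages
%% get handled first.
init({}) ->
    {ok, no_state, no_state_data, [{timeout, 0, loop}]}.

handle_event(timeout, loop, _State, _Data) ->
    timer:sleep(500), % do some simulated work
    {keep_state_and_data, [{timeout, 0, loop}]}.
```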


I’d say it depends on what the simulated work is really doing here. If it is something heavy, it could be outsourced to a separate process that tells the main one when it’s done its work, and then the main one can poll again :thinking:


The actual work is something that takes 500ms, which is why I felt comfortable simulating it with a sleep :slight_smile:

And, in fact, it’s a synchronous call to another process. If I replace it with send_request/check_response, everything’s fine. It’s just harder to read the code – I quite liked the simplicity of the synchronous version.
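For anyone reading along, the asynchronous version looks roughly like this (a sketch: `worker` and `do_work` are made-up names, `Data` is assumed to be a map, and the worker is assumed to stay alive between requests):

```erlang
%% Issue the request and return to the gen_statem loop immediately;
%% the reply arrives later as an ordinary info message, so calls and
%% system messages are served in between.
handle_event(internal, loop, _State, Data) ->
    ReqId = gen_server:send_request(worker, do_work),
    {keep_state, Data#{req_id => ReqId}};
handle_event(info, Msg, _State, #{req_id := ReqId} = Data) ->
    case gen_server:check_response(Msg, ReqId) of
        {reply, _Result} ->
            %% reply received: start the next iteration
            {keep_state, maps:remove(req_id, Data),
             [{next_event, internal, loop}]};
        no_reply ->
            %% some other message ('EXIT', etc.) – handle as needed
            keep_state_and_data
    end.
```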


You can get it working by using 2 states together with event postponing. no_state means “do work”, while s2 means “handle other calls”. Example:

init(_) ->
    {ok, no_state, no_state_data, [{next_event, internal, loop}]}.

handle_event(internal, loop, no_state, Data) ->
    timer:sleep(500), % do some simulated work.
    {next_state, s2, Data, [{next_event, internal, loop}]};
handle_event(internal, loop, s2, Data) ->
    {keep_state, Data, [postpone]}.

You can handle all other events while in state s2.


The problem with this, unless I’m missing something, is that it doesn’t loop: there’s nothing to trigger a transition back to the ‘working’ state.

init(_) ->
    {ok, working, no_state_data, [{next_event, internal, loop}]}.

handle_event(internal, loop, working, StateData) ->
    {next_state, idle, StateData, [{next_event, internal, loop}]};
handle_event(internal, loop, idle, StateData) ->
    {keep_state, StateData, [postpone]}.

It stays in the ‘idle’ state.

If I try to flip back to the other state immediately, I get the original blocking, “won’t stop” behaviour:

handle_event(internal, loop, working, StateData) ->
    {next_state, idle, StateData, [{next_event, internal, loop}]};
handle_event(internal, loop, idle, StateData) ->
    {next_state, working, StateData, [postpone]}.

That is true, I rushed a bit.

If I understood correctly, logic should be something like:

loop() ->
    receive
        X ->
            %% any "event"
            handle(X)
    after 0 ->
        %% work
        work()
    end,
    loop().

However, I can’t simulate that after 0 with gen_statem. There are zero timeouts in gen_statem, but they don’t behave like after 0. I’m not sure if this is how it should behave or if it is a bug.

Yes, that.

Yeah; I couldn’t come up with anything either.

Here’s a hack:

handle_event(cast, undefined, _State, _Data) ->
    gen_statem:cast(self(), undefined),
    {keep_state_and_data, []};
handle_event(_, _, _State, _Data) ->
    %% anything else
    keep_state_and_data.

The thing is, next_event would almost work in this case, but it inserts an event at the front of the queue, while we need something that appends an event to the back of the queue. So we can use cast: it returns immediately and appends a cast event to the event queue.

Not sure if this is more convenient than the solution with a timeout :smiley:


Since you want to receive messages (system messages in particular) you have to be prepared for any message. And you have to insert your loop message after other messages, so you have to send it to your own mailbox.

You may use gen_statem:cast/2 as suggested; it is intended for client code, but it works here since it doesn’t block. You can also use self() ! whatever.
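A minimal sketch of the self-message loop:

```erlang
%% The loop message goes through the ordinary mailbox, so any calls
%% or system messages that arrived before it are handled first.
init(_) ->
    self() ! loop,
    {ok, no_state, no_state_data}.

handle_event(info, loop, _State, _Data) ->
    timer:sleep(500), % do some simulated work
    self() ! loop,
    keep_state_and_data.
```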

To protect against anyone accidentally sending you whatever, you can (since you are using handle_event/4) send a ref and keep it in the state:

handle_event(info, Ref, {loop, Ref}, Data) ->
    NewRef = make_ref(),
    self() ! NewRef,
    {next_state, {loop, NewRef}, Data};

That is assuming that you want postponed messages to be repeated the next loop round.


It turns out that everything is not fine.

My linked process returns a response and then dies. If I call check_response and then immediately return a {next_event, internal, loop} action, I miss the 'EXIT' message and try to call the now-dead process.

So, essentially, what’s missing here is a low-priority next_event. Gonna have to cast/send myself a message that’ll go to the back of the queue.

The purpose of the next_event and timeout,0 actions is to bypass the inbox so you can be sure to process pseudo events undisturbed.

You need to be responsive to incoming messages, hence the only option is to send yourself a message (Bang or gen_statem:cast/2), and then deal with whatever messages that may come…

Yeah, I get that. I was responding to my earlier assertion that using a separate process (or async calls) solved the problem – it doesn’t.

However: as mentioned up-thread, the timeout,0 behaviour is kinda surprising – in a receive...after 0, it lets you process incoming messages before continuing, so I think a lot of people think it should do the same here.
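For comparison, this is the receive ... after 0 pattern in plain Erlang (`handle/2` and `do_work/1` are placeholders):

```erlang
%% A message already in the mailbox is handled; only an empty mailbox
%% falls through to the 'after' branch, so incoming messages always
%% win over the next unit of work.
loop(State) ->
    receive
        Msg ->
            loop(handle(Msg, State))
    after 0 ->
        loop(do_work(State))
    end.
```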

It would be nice if (say) for OTP-28, there was a way to queue actions for after incoming messages had been dealt with; this would remove the need for posting oneself a message. Or, more specifically, a way to write a looping gen_statem.

Well, there is already a way to queue an action for after incoming messages, and that is to send yourself a message :wink: So do we really need another way to do it? What would the point be?

There are actually already two ways, gen_statem:cast/2 is the other. It maybe feels more like a generated action than a random message.


next_event inserts a message first in the queue. timeout,0 was chosen to insert last in the queue, but before any not yet received message. They complement each other. I agree that it somewhat contradicts the behaviour of receive ... after 0.

As you say; maybe a new action for that might be useful… One that effectively does receive ... after 0 in the gen_statem internal receive and either produces an event or a timeout, immediately, while handling system messages.


It is intended. But it is not entirely clear that the intention was correct.
I think, however, that we cannot change this since that might introduce surprising bugs.

The reason is that the gen_server time-outs are flawed since they utilize receive ... after Timeout. When a system message arrives it restarts the time-out so you can’t actually know what time-out time you will get (in the presence of system messages).

Therefore the gen_statem time-outs became timer based. Then the question was how to handle time-out 0: it should not start a timer but instead just insert an event. The logical place to insert it would be after all messages in the internal queue, since the event happens “now” as in after the events that are already queued. Sending a message via the process mailbox would be unpredictable since that would depend on what was already in the process mailbox.

So next_event inserts at the front, and timeout,0 at the back of the gen_statem internal queue. Great, that should cover all cases!

… Not!

receive ... after 0 is a case that was missed.

I can see 2 possible new actions (names are up for debate):

  1. {after_0, Msg} that would insert an event after_0,Msg unless the receive ... after 0 statement in gen_statem’s engine gets a message. Either you immediately get an external event, or an after_0,Msg event.
  2. {send_to_self, Msg} that unlike self() ! Msg would prioritize self messages in a fair way as in interleave external messages with self messages. This would require maintaining a queue of self messages in the gen_statem engine. If there are queued self messages the gen_statem engine would use receive ... after 0 and if a message is received it processes that plus one message from the queue of self messages. Otherwise it just processes one from the queue of self messages.

Either will have to co-exist with the hibernate_after_timeout that today utilizes receive ... after Timeout.

{after_0, Msg} should be smaller to implement, and the minimal solution to plug this hole.

{send_to_self, Msg} might be better for this use case, but is maybe over-engineering and might give the users too much rope to hang themselves. Or it may have interesting use cases I cannot think of right now.

One odd thing with {send_to_self, Msg} would be that you use it to send Msg1 and Msg2, then handle one external event and in that handler do self() ! Msg3. Now you might get Msg3 as the next event, before Msg2.

This might be seen as violating message ordering, or just a pedagogical problem…

One could also have a send to self without a queue, as in a queue of 1. But that would be a strange send operation if later sends overwrite earlier. That might be just a naming problem… It would effectively be like {after_0, Msg}, but you will always get the Msg event, after zero or one external event.


It feels like we’re getting a little way into the weeds here. These suggestions are great for the general case, but – as you say – might be a bit too much rope.

My use case is simpler (but possibly too specific) – I want to make a network call (to Kafka, as it happens). I’m fetching messages. If I get a successful fetch, I need to issue another one. If I get an error, I need to do something specific (e.g. refresh metadata, get new message offsets, etc.). Once that’s complete, I want to issue another fetch.

So I’m effectively going round a loop, with either a fetch or (occasionally) a list_offsets on each iteration. While doing this, I need to respond to system messages (gen_statem:stop/1, sys:get_state/1, etc.), the occasional call (kafka_consumer:info/1, e.g.) and ‘EXIT’ messages (so I know when my linked kafka_connection process has died).

This doesn’t – with hindsight – seem like a good fit for gen_statem, as it currently exists. If we can come up with something that fits, that’d be great.

In the interim, however, I’m thinking I might wind it back and build something that fits better – starting with proc_lib :slight_smile:

I was thinking something like {append_event, Type, Content} that would behave just like next_event but would put the event on the back of the queue instead of on the front. That would enable the user to handle those events like any other events, without having to manually craft references or implement unnecessary handling of info events or any other events. Seems simple to me – are there any cases that this solution can’t cover?

send_to_self sounds complex to me at first glance

…with the understanding that “back of the queue” includes receiving and processing normal messages first…?

So could your loop not have two states for the different situations that you have? That is really what state machines are: loops where you do not always handle all messages in all states. If you get an event (message) that you are not ready to handle, postpone it and go back to the gen_statem loop, maybe staying in the same state but handling system messages.

Thinking about it, yes, probably.

Now that I’m not doing a synchronous call to the connection process, and I’m using send_request/check_response, I can probably go back to a proper state machine, rather than trying to wedge a synchronous loop in there.
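Something like the following could be a starting point (illustrative only: it assumes callback_mode() -> [handle_event_function, state_enter], that Data is a map holding the connection pid, and made-up state and message names):

```erlang
%% 'fetching': issue an async fetch on state entry, then sit in the
%% gen_statem loop, so calls and system messages are served meanwhile.
handle_event(enter, _OldState, fetching, #{conn := Conn} = Data) ->
    ReqId = gen_statem:send_request(Conn, fetch),
    {keep_state, Data#{req_id => ReqId}};
handle_event(info, Msg, fetching, #{req_id := ReqId} = Data) ->
    case gen_statem:check_response(Msg, ReqId) of
        {reply, {ok, _Messages}} ->
            %% successful fetch: re-enter 'fetching' to issue the next one
            {repeat_state, maps:remove(req_id, Data)};
        {reply, {error, _Reason}} ->
            %% error: go refresh metadata / offsets in another state
            {next_state, recovering, maps:remove(req_id, Data)};
        no_reply ->
            %% some other info message, e.g. 'EXIT' from the connection
            keep_state_and_data
    end.
```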