Race condition for logger:remove_handler

Hi all,

is there a known race condition for logger:remove_handler()?
I want to start some short-lived (several minutes) processes, where each of them has a dedicated logger handler and logs to a dedicated file. The mapping from a process's log messages to its handler is done via a domain filter.

When the process ends, the handler shall be removed.
I started several of these processes in a unit test, and they end at nearly the same time.
After they end, some handler IDs remain in logger:get_handler_ids(), and logger:i() crashes because
some of those handlers are in limbo.
If I serialize the handler removal in a top-level process, all is well.

I can provide a code example, if this is of interest.

Environment: it happens on OTP 25 and 26, on Ubuntu 22.


Code always helps :slight_smile:

sure. :wink:
here we go:

In this example I use the logger_std_h handler for debugging instead of writing to files.

-module(tseq_logger).

-include_lib("kernel/include/logger.hrl").

%% API
-export([start/2]).

get_id(Number) ->
    list_to_atom("proc" ++ integer_to_list(Number)).

start(From, To) ->    
    lists:foreach(
      fun(I) ->
              spawn(
                fun() ->
                        launcher(get_id(I))
                end)
      end,
      lists:seq(From,To)).


launcher(Id) ->
    ok = logger:set_process_metadata(#{domain => [Id]}),
    Parent = self(),
    ok = add_tseq_handler(Id),
    %% start actual worker
    Pid = spawn(
            fun() ->
                    ok = logger:set_process_metadata(#{domain => [Id]}),
                    %% working
                    ?LOG_NOTICE("the job is done, tell parent ~p", [Parent]),
                    timer:sleep(200),
                    Parent ! {done, self()}
            end),
    ?LOG_NOTICE("waiting for ~p",[Pid]),
    receive
        {done, Pid} -> 
            ?LOG_NOTICE("removing handler ~p",[Id]),
            ok = logger:remove_handler(Id)
    end.

add_tseq_handler(TseqId) ->
    logger:add_handler(
      TseqId, 
      logger_std_h, 
      #{config => #{type => standard_io},
        
        filter_default => stop,
        filters => 
            [{tseq_filter, 
              {fun logger_filters:domain/2, {log, equal, [TseqId]}}}],
        formatter =>
            {logger_formatter,
             #{
               legacy_header => false,
               single_line   => true,
               template      => [domain, ">", time, " ",level," ", pid,
                                 " ", mfa,":",line,"|",tseq," ",
                                 msg,"\n"]}
            },
        level  => all}).



A launcher process creates a logger handler and starts a worker process which uses this handler via the domain. When the worker is done, the handler is removed.
start() can create multiple launchers. Since they share the same code, they end at nearly the same time,
hence the logger:remove_handler() calls seem to run into a race condition.

An example output in my environment looks like this:

ok
2> 
[proc2]>2023-10-15T22:37:45.605769+02:00 notice <0.554.0> tseq_logger:launcher/1:35| waiting for <0.565.0>
[proc3]>2023-10-15T22:37:45.605975+02:00 notice <0.555.0> tseq_logger:launcher/1:35| waiting for <0.568.0>
[proc1]>2023-10-15T22:37:45.605165+02:00 notice <0.561.0> tseq_logger:launcher/1:31| the job is done, tell parent <0.553.0>
[proc2]>2023-10-15T22:37:45.605815+02:00 notice <0.565.0> tseq_logger:launcher/1:31| the job is done, tell parent <0.554.0>
[proc3]>2023-10-15T22:37:45.606003+02:00 notice <0.568.0> tseq_logger:launcher/1:31| the job is done, tell parent <0.555.0>
[proc1]>2023-10-15T22:37:45.605300+02:00 notice <0.553.0> tseq_logger:launcher/1:35| waiting for <0.561.0>
[proc3]>2023-10-15T22:37:45.812213+02:00 notice <0.555.0> tseq_logger:launcher/1:38| removing handler proc3
[proc1]>2023-10-15T22:37:45.812192+02:00 notice <0.553.0> tseq_logger:launcher/1:38| removing handler proc1
[proc2]>2023-10-15T22:37:45.812484+02:00 notice <0.554.0> tseq_logger:launcher/1:38| removing handler proc2

2> logger:get_handler_ids().
[proc3,proc1,default,ssl_handler]
3> 
3> logger:i().              
** exception error: no match of right hand side value {error,{not_found,proc3}}
     in function  logger:'-get_handler_config/0-lc$^0/1-0-'/1 (logger.erl, line 520)
     in call from logger:get_config/0 (logger.erl, line 660)
     in call from logger:i/0 (logger.erl, line 669)

The problem is interesting, so I was happy to dig in :slight_smile: There is no race condition; rather, an element in a bit of config is not being deleted. This may or may not be a bug, but it’s surprising at the least and should be documented if it’s intended behaviour.

Let’s say you run your code above a few times; the logger table will end up with this:

...
 {'$primary_config$',#{filter_default => log,filters => [],
                       handlers =>
                           [proc2,proc3,proc1,proc3,proc2,default,ssl_handler],
                       level => notice,metadata => #{}}},

Notice how you have duplicates in there. Now when you call logger:i/0 or logger:reconfigure/0, it will iterate over this data and attempt to get handler information which no longer exists in another table (I believe it’s another table), because per your code the handlers were removed.
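You can spot the leftover ids with just the public API as well; any id from logger:get_handler_ids/0 whose config lookup comes back as {error,{not_found,Id}} is one of them. A small sketch, nothing more:

%% ids still listed by logger but whose handler config is already gone
[Id || Id <- logger:get_handler_ids(),
       element(1, logger:get_handler_config(Id)) =:= error].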

This happens because of the lists:delete/2 call here, and because there are multiple elements in the list with the same id, the delete operation will find the first one and call it a day.
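You can see the first-match-only behaviour of lists:delete/2 in a quick shell check with the duplicated list from above:

1> lists:delete(proc3, [proc2,proc3,proc1,proc3,proc2,default,ssl_handler]).
[proc2,proc1,proc3,proc2,default,ssl_handler]

The second proc3 stays behind.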

This may be because when this was designed and implemented, the intention was that every handler added would have a truly unique id, or it could just be a bug :slight_smile:

The fix here would be either to deduplicate the list via lists:uniq/1 before or after calling lists:delete/2, or to use a set, a map, etc.
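A minimal sketch of what such a fix could look like (illustration only, not the actual OTP code; lists:uniq/1 needs OTP 25+, and remove_handler_id is just a made-up name):

%% Deduplicate first, so lists:delete/2 only has one occurrence to remove;
%% dropping every occurrence with a list comprehension would work just as well.
remove_handler_id(Id, Handlers) ->
    lists:delete(Id, lists:uniq(Handlers)).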

Could you open an issue for this here?

I would add the output of ets:tab2list(logger) to the report, after you’ve shown what happens when you run your code 2 to 3 times consecutively.
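Something like this pulls out just the relevant row (it pokes at the internal logger table, so it’s meant for the report only, not for production code):

%% just the '$primary_config$' row from the internal logger table
lists:keyfind('$primary_config$', 1, ets:tab2list(logger)).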

Good morning,

many thanks for your analysis! I will be happy to open an issue about this.

But I am still not completely convinced that there is not a bit of a race condition involved.
If I serialize the removal, so that only one at a time is executed, no errors occur:

start(From, To) ->
    Remover = spawn(fun remover/0),
    lists:foreach(
      fun(I) ->
              spawn(
                fun() ->
                        launcher(get_id(I), Remover)
                end)
      end,
      lists:seq(From,To)).


launcher(Id, Remover) ->
    ok = logger:set_process_metadata(#{domain => [Id]}),
    Parent = self(),
    ok = add_tseq_handler(Id),
    %% start actual worker
    Pid = spawn(
            fun() ->
                    ok = logger:set_process_metadata(#{domain => [Id]}),
                    %% working
                    ?LOG_NOTICE("the job is done, tell parent ~p", [Parent]),
                    timer:sleep(200),
                    Parent ! {done, self()}
            end),
    ?LOG_NOTICE("waiting for ~p",[Pid]),
    receive
        {done, Pid} -> Remover ! Id
    end.

remover() ->
    receive
        Id -> 
            ok = logger:set_process_metadata(#{domain => [Id]}),
            ?LOG_NOTICE("Remove ~p", [Id]),
            ok = logger:remove_handler(Id),
            remover()
    end.

Yeah, I’m not able to replicate it; that is, merely running this code many, many times does not yield a crash due to a bad match. Perhaps I’m missing something, though.

I remember running into a similar problem with one of my projects.

Like you, I noticed that funnelling it through a singleton solved the problem, though my solution was to use global:{set,del}_lock() (later with global:trans()) pinned to [node()].
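From memory, the workaround looked roughly like this (a sketch; remove_handler_locked and the lock resource id are made up here, and it assumes node-local mutual exclusion is enough):

%% serialize handler removal through a node-local global lock
remove_handler_locked(Id) ->
    global:trans({logger_handler_removal, self()},
                 fun() -> logger:remove_handler(Id) end,
                 [node()]).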

Glancing through the history, it looks like I solved the problem ‘properly’ by making sure I was doing all the hairy work inside the handler callback functions (i.e. HModule:adding_handler, changing_config and removing_handler), as (glancing through the OTP source) this work is routed through logger’s singleton process instead.
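A bare-bones sketch of what I mean (hypothetical module name tseq_h; it only formats to standard output from the caller’s process, the point is merely where the setup/teardown hooks live):

-module(tseq_h).

-export([adding_handler/1, removing_handler/1, log/2]).

%% any per-handler setup runs here, inside logger's own configuration handling
adding_handler(Config) ->
    {ok, Config}.

%% per-handler teardown (e.g. closing a file) runs here as well
removing_handler(_Config) ->
    ok.

log(LogEvent, #{formatter := {FModule, FConfig}}) ->
    io:put_chars(FModule:format(LogEvent, FConfig)).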

Sorry for the ‘fuzzy’ answer; the commits to my projects are from June 2021, so it’s not 100% clear whether that actually fixed it or whether I reworked the problem to avoid it anyway.

So our solutions have some similarities. I’m glad that my way of thinking seems to align with the Erlang way.
In my application it was not critical; it happens when the worker process has done its job and is about to remove the handler. So I could avoid the problem altogether and delegate it to a “log removal collector process”.
I think Joe said something in this spirit: it is not always necessary to tackle a problem head-on if you can circumvent it.