`public_key` mayhem (or something worse?)

I have a release that is emitting a serious flood of terms, either as part of an error or something else entirely. This is going to be hard to communicate, so bear with me.

In development, I have some `httpc` requests that work just fine, passing along some headers and such for successful OAuth2 requests.

Other than the requesting host (one is 127.0.0.1:9004; the troubled one is https://example.com), there are no differences in what is being sent on the Erlang side. Both boxes are OpenBSD, both running OTP 26 (usually just fine).

Anyway, when I perform this in-production request, at some point my server locks up and I get an endless barrage of terms that knocks my wee production server out of commission. The shape of these terms in my logs is:

    ...
    {cert,
      <<48,130,3,119,48,130,2... ,123,167,157>>,
    {'OTPCertificate',
      {'OTPTBSCertificate',v3,664...
        {'SignatureAlgorithm',{1,2,...11},'NULL'},
        {rdnSequence,
    ...

I was wondering if anyone could tell which libs/apps may be squawking!? I have a terrible hack to halt the VM after a few seconds, before the IO gets so intense that I can't interact with the REPL, and my logs are overwritten so quickly that I can't tell which calling process is failing. To be honest, I can't even tell yet whether this is a failure due to an error. Can anyone point me in the right direction?

That looks like something you'd get when calling httpc:ssl_verify_host_options/1. You're probably calling it when doing the request, to pass ssl opts. That said, it's probably not part of the error itself; it's just a term you passed as an argument to httpc:request/X that failed, and is therefore part of the stacktrace. Do you have any other clues, like undef, badmatch, or function_clause?
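For reference, that helper typically shows up in a request like this (a minimal sketch, assuming OTP 25+ where httpc:ssl_verify_host_options/1 exists; the URL is a placeholder, and the inets and ssl applications are assumed started):

    %% Hedged sketch: verify options from httpc:ssl_verify_host_options/1
    %% passed into an https request as the ssl http option.
    SslOpts = httpc:ssl_verify_host_options(true),
    httpc:request(get,
                  {"https://example.com/oauth2/token", []},
                  [{ssl, SslOpts}],
                  []).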

Is the ssl application started? “Works on localhost” problems are usually connected with ssl stuff.

About the error loop: try lowering intensity and increasing period in your supervisors to make them crash too. You can also set your app as temporary so it doesn't get restarted, which lets you at least inspect where the problem is.
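Roughly, that looks like this (a minimal sketch; the limits are only examples, and my_app is a placeholder name):

    %% Hedged sketch: tight restart limits so a crashing child takes the
    %% supervisor down quickly instead of restarting in a loop.
    init([]) ->
        SupFlags = #{strategy => one_for_one,
                     intensity => 1,   %% allow at most 1 restart
                     period => 60},    %% within 60 seconds
        ChildSpecs = [],               %% your child specs here
        {ok, {SupFlags, ChildSpecs}}.

    %% Starting the application as temporary means it is not restarted
    %% when it dies, so the state is left around for inspection:
    %% application:ensure_all_started(my_app, temporary).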

Is the ssl application started? “Works on localhost” problems are usually connected with ssl stuff.

It is. I did want to also mention that the “loop” of terms shows new content with every iteration; specifically, it cycles through new terms with new values for `printableString`, e.g.:

{'OTPCertificate',
                        {'OTPTBSCertificate',v3,
                         188833316161142517227353805653483829216,
                         {'SignatureAlgorithm',{1,2,840,113549,1,1,11},'NULL'},
                         {rdnSequence,
                          [[{'AttributeTypeAndValue',{2,5,4,6},"US"}],
                           [{'AttributeTypeAndValue',
                             {2,5,4,10},
                             {printableString,"Certainly"}}],
                           [{'AttributeTypeAndValue',
                             {2,5,4,3},
                             {printableString,"Certainly Root R1"}}]]},

Followed by …

                        {'OTPTBSCertificate',v3,33554617,
                         {'SignatureAlgorithm',{1,2,840,113549,1,1,5},'NULL'},
                         {rdnSequence,
                          [[{'AttributeTypeAndValue',{2,5,4,6},"IE"}],
                           [{'AttributeTypeAndValue',
                             {2,5,4,10},
                             {printableString,"Baltimore"}}],
                           [{'AttributeTypeAndValue',
                             {2,5,4,11},
                             {printableString,"CyberTrust"}}],
                           [{'AttributeTypeAndValue',
                             {2,5,4,3},
                             {printableString,"Baltimore CyberTrust Root"}}]]},

I will look into somehow capturing some remnants of the error being thrown. I do see how this is just the state/args being barfed out, but this IO is what kills my server via resource starvation.

Thanks for chiming in!

Maybe you could try calling logger:update_formatter_config/2 at startup and limit max_size, chars_limit, and depth to prevent the whole term from being logged?
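Something along these lines, on the default handler (a sketch; the limits are arbitrary examples):

    %% Hedged sketch: cap how much of a term the default logger handler
    %% formats, so a huge value (like a cacerts list) can't flood output.
    logger:update_formatter_config(default,
                                   #{max_size => 4096,    %% max chars per event
                                     chars_limit => 2048, %% formatting limit
                                     depth => 20}).       %% max term nesting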

Also, isn’t the error reason mentioned before the crashing call, i.e. at the top of your log?

Well, I am using SASL logs (for better or worse), and the terms are so large that the logs get corrupted. With the report browser (rb), there is a line item that reads:

rb: reading report...Error: {unknown,"Incomplete erlang term in log",[],8706}

In other news, this comes up once in a while when I am attached to the REPL and try to kill it during the barrage:

=PROGRESS REPORT==== 6-Jan-2024::14:54:17.404489 ===
    supervisor: {local,inet_gethost_native_sup}
    started: [{pid,<0.1045.0>},{mfa,{inet_gethost_native,init,[[]]}}]

=PROGRESS REPORT==== 6-Jan-2024::14:54:17.404746 ===
    supervisor: {local,kernel_safe_sup}
    started: [{pid,<0.1044.0>},
              {id,inet_gethost_native_sup},
              {mfargs,{inet_gethost_native,start_link,[]}},
              {restart_type,temporary},
              {significant,false},
              {shutdown,1000},
              {child_type,worker}]
=PROGRESS REPORT==== 6-Jan-2024::13:44:35.908784 ===
    supervisor: {<0.1953.0>,tls_dyn_connection_sup}
    started: [{pid,<0.1954.0>},
              {id,sender},
              {mfargs,{tls_sender,start_link,[[{spawn_opt,[]}]]}},
              {restart_type,temporary},
              {significant,false},
              {shutdown,5000},
              {child_type,worker}]

But to answer your question: I lose the first crash/error info as it gets buried under all the other noise. With the logs getting corrupted by the huge terms, I can never investigate the error. I'm still trying to find some workaround to get access to that; this is why this is such a unique issue!

OK. So, my progress report has an entry containing a term that is killing my VM, and then my server. It's the cacerts key-value that is doing it; since I am logging progress, errors, etc., all that IO starves my server. So, no errors this time, but this supervisor combined with SASL is the culprit:

=PROGRESS REPORT==== 6-Jan-2024::15:02:11.513897 ===
    supervisor: {<0.1021.0>,tls_dyn_connection_sup}
    started: [{pid,<0.1023.0>},
              {id,receiver},
              {mfargs,
               {ssl_gen_statem,start_link,
                [client,<0.1022.0>,"www.reddit.com",443,#Port<0.14>,
                 {#{alpn_advertised_protocols => undefined,
                    customize_hostname_check =>
                     [{match_fun,#Fun<public_key.6.11965134>}],
                    signature_algs_cert => undefined,
                    partial_chain => #Fun<ssl.5.74402199>,protocol => tls,
                    crl_cache => {ssl_crl_cache,{internal,[]}},
                    key_update_at => 388736063997,verify_fun => undefined,
                    signature_algs =>
                     [{sha512,ecdsa},
                      rsa_pss_pss_sha512,rsa_pss_rsae_sha512,
                      {sha512,rsa},
                      {sha384,ecdsa},
                      rsa_pss_pss_sha384,rsa_pss_rsae_sha384,
                      {sha384,rsa},
                      {sha256,ecdsa},
                      rsa_pss_pss_sha256,rsa_pss_rsae_sha256,
                      {sha256,rsa}],
                    early_data => undefined,log_level => notice,
                    max_fragment_length => undefined,
                    versions => [{3,3}],
                    secure_renegotiate => true,verify => verify_peer,
                    cacerts => ...

The final cacerts key points to the endless barrage of terms. That value is the whopper.
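For a sense of scale, on OTP 25+ (where public_key:cacerts_get/0 exists) something like this, run in the shell, shows how big that value is (a sketch):

    %% Hedged sketch: rough size of the decoded CA bundle that ends up
    %% in the cacerts ssl option.
    Certs = public_key:cacerts_get(),
    io:format("~b certs, ~b bytes as an external term~n",
              [length(Certs), byte_size(term_to_binary(Certs))]).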

If anyone can advise how to `shush` certain progress reports, that would be amazing (though unlikely). I don't think I want to disable them outright, so a best practice here would be great!!
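For what it's worth, on the modern logger (rather than the legacy SASL handlers) there is a stock filter for exactly this; a sketch:

    %% Hedged sketch (modern logger only, not the old SASL handlers):
    %% drop all progress reports reaching the default handler.
    logger:add_handler_filter(default, no_progress,
                              {fun logger_filters:progress/2, stop}).

Passing `stop` as the filter argument discards every progress report on that handler while leaving error and crash reports alone.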

EDIT: This is a legacy project, and the issue turned out NOT to be that such a large term was being written to the logfile, but that these gigantic progress reports were being written as IO to the tty.

Sadly, these humongous terms coming from the progress reports are still corrupting my SASL logfiles; if I find a solution to that, I'll be happy. But I fixed the thread's main issue by setting {errlog_type, error}, which stops the tty from getting pounded (shouldn't this also prevent the logfiles from getting progress reports? I thought it should, and nothing in the man page says otherwise…).
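For the record, that setting lives in the sasl application environment, e.g. in a sys.config like this (a sketch; the log path is a placeholder):

    %% Hedged sketch of a sys.config: restrict SASL logging to error
    %% reports only, so progress reports stop hammering the output.
    [{sasl,
      [{errlog_type, error},
       {sasl_error_logger, {file, "log/sasl.log"}}]}].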

An example of how to avoid the tls_dyn_connection_sup logging spam is in CloudI, with its configuration and the source code that consumes it.

Thank you. I am still in SASL-land, but when I port this project over to newer tools, I will use this as you have it. BTW, when did the convention move from `sys.config` to `vm.config`? Rebar3?

CloudI uses vm.config to keep the main .config file separate from other configuration (i.e., the cloudi.conf file). The Erlang documentation for config prefers the sys.config name and has other details regarding the use of additional .config files.