Possible causes for infrequent exceptions that don't make sense

We have a lot of IoT devices running the BEAM. We log exceptions and periodically get ones that don’t make sense. These are very low in frequency (probably less than 1 per 100K devices per week), and the code that hits them recovers immediately. I don’t currently have a good reason to investigate these other than curiosity.

Here’s one example:

arg0: 1727566608171023868
arg1: :native
arg2: :microsecond

ArgumentError: argument error
  Module "erlang", in :erlang.convert_time_unit/3
  File "lib/calendar/iso.ex", line 1760, in Calendar.ISO.from_unix/2
  File "lib/calendar/naive_datetime.ex", line 115, in NaiveDateTime.utc_now/1
  File "lib/quantum/clock_broadcaster.ex", line 83, in Quantum.ClockBroadcaster.handle_tick/1
  File "lib/gen_stage.ex", line 2206, in GenStage.noreply_callback/3
...
(3 additional frame(s) were not displayed)

This works, though:

1> erlang:convert_time_unit(1727566608171023868, native, microsecond).
1727566608171023

In fact, it works on the device that reported the exception, in the same instance of the BEAM (no BEAM restarts or device reboots) that caught it in the first place.

The other strange errors have also been argument errors, but in different places and seemingly just as impossible. I picked erlang:convert_time_unit/3 as the example since the source for that function is easy to find. These errors are happening on different devices.

This example was on a 4-core 32-bit ARM device running Erlang 27.0.1. The others have been on similar hardware running both 27.0.1 and older versions. We lean toward using ports rather than NIFs, so I don’t think this is due to a misbehaving NIF. We also aren’t seeing C-based programs crashing or other hints of memory corruption. Admittedly, we have much more visibility into the BEAM than into the native side. There’s plenty of memory on the device and no other signs of strangeness.

I obviously can’t reproduce these issues. I’m really not sure what I’m asking for, but if this rings any bells for anyone, I’d be curious to hear thoughts or possible things to try.

The function catches everything and converts it to error:badarg; see erts/preloaded/src/erlang.erl in the erlang/otp repo at tag OTP-27.0.1 on GitHub. This means in particular that it will also convert exits, e.g. from a dying linked process or its supervisor, to badarg if one happens to occur while this function is inside its try block.

IMHO this is a bug in the function; it should only catch errors (i.e. change that catch clause to error:_).
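
For illustration, here is a minimal sketch of that shape. It is not the actual OTP source; do_convert/3 and factor/1 are hypothetical stand-ins for the real conversion arithmetic:

-module(convert_sketch).
-export([convert_time_unit/3]).

convert_time_unit(Time, FromUnit, ToUnit) ->
    try
        do_convert(Time, FromUnit, ToUnit)
    catch
        %% catches the error, throw and exit classes alike and rethrows badarg;
        %% the suggested fix is to match only error:_ here
        _:_ ->
            erlang:error(badarg, [Time, FromUnit, ToUnit])
    end.

%% hypothetical stand-in for the real conversion (fixed units only)
do_convert(Time, FromUnit, ToUnit) ->
    Time * factor(ToUnit) div factor(FromUnit).

factor(second)      -> 1;
factor(millisecond) -> 1000;
factor(microsecond) -> 1000000;
factor(nanosecond)  -> 1000000000.

With the catch-all clause, anything raised inside the try (a function_clause error, a throw, or an exit raised in the same process) surfaces to the caller as badarg with the original arguments.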

Are the other cases also guarded by exit-catching blocks?

1 per 100K devices per week does sound a lot like a hardware issue, and it doesn’t necessarily mean the hardware is busted: the memory could be fine but a single read could’ve been bad.

The only things I can think to check are very general. Do these errors coincide with power supply issues and/or sudden changes in workload? Is there a geographical component to the errors (area with frequent brownouts, high altitude, two or more unrelated devices failing in short order in the same area, etc)? You’ve probably thought to check this already though.

No, exit signals cannot be caught as an exception. You can only catch an exit raised by the current process.
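
For example, in a shell: an exit raised with exit/1 in the current process is a catchable exception, while an exit signal sent with exit/2 to a process that is not trapping exits simply terminates it, so its try/catch never runs (the snippet below is illustrative):

1> try exit(whatever) catch exit:Reason -> {caught, Reason} end.
{caught,whatever}
2> Pid = spawn(fun() -> try timer:sleep(infinity) catch C:R -> io:format("caught ~p:~p~n", [C, R]) end end), erlang:exit(Pid, whatever), timer:sleep(100), is_process_alive(Pid).
false

The spawned process never prints anything; the exit signal kills it without ever reaching the catch clause.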


Hah, thank you very much for the correction. I found the relevant section in the docs again (which I’m sure I’ve read before):

The functions erlang:exit/1 and erlang:exit/2 are named similarly but provide very different functionalities. The erlang:exit/1 function should be used when the intent is to stop the current process while erlang:exit/2 should be used when the intent is to send an exit signal to another process. Note also that erlang:exit/1 raises an exception that can be caught while erlang:exit/2 does not cause any exception to be raised.

This is the note on erlang:exit/2.


Wow, of course it would turn out that the one exception report I chose as an example is one I hadn’t checked. The timestamp of that report corresponds, within about 10 minutes, to the peak wind gusts from Hurricane Helene as logged on wunderground.com!

Sadly, the other devices I’ve checked so far in the same town didn’t reboot or do anything remarkable that evening.

I checked another recent report again (a variable that is only ever set to 0 magically became an empty list when checked in a NIF). That device is located far from the southeastern US, and the weather there was decent. It’s at 1500 m elevation, and I hadn’t considered elevation before you mentioned it.

I did some googling to find historical electrical service status. There’s some really interesting real-time data, but nothing I can use, and sadly I don’t have a way to catch power supply issues on these devices unless they result in a reboot.

I’ll keep looking, since it fascinates me that I see these effects at such a high level in the software. I think you’re right, though, that the signs point to hardware-related causes.

Thanks for your quick response!
