None of them is ready to use without some glue code, but they are a good start. In the past, ‘eva’ was part of standard Erlang/OTP in some form, and it implements some good ideas.
I use elarm to handle alarms and send them to upper layers.
I also know that alarms are not very common in non-telecom applications, which is probably why there are not many libraries out there.
What about you: do you use an alarm system? How do you manage system alarms (and events as well)?
We have an internal API, in the appropriately named alarm module, which forwards alarms to:
Log files, which are forwarded by an external agent to ChaosSearch (how I miss Splunk!)
Sentry, which in turn optionally forwards to the team’s email and Slack
Optionally for critical ones to Opsgenie, which then forwards to the on-call person’s phone and the team’s email and Slack.
The Sentry integration uses raven-erlang plus some of our own code on top of it.
The Opsgenie integration uses an internally-developed Erlang library, which for some reason hasn’t been open-sourced. It’s not big.
That’s for the alarms we generate from within Erlang. Then there are external monitors that feed into Datadog which can trigger Opsgenie as above. Various health checks sit here.
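The fan-out described above (one alarm API forwarding every alarm to logs and Sentry, and only critical ones to a pager) can be sketched roughly as follows. This is a minimal illustration in Python, not the internal Erlang module: `Alarm`, `AlarmRouter`, `add_sink` and `raise_alarm` are hypothetical names, and the sinks are plain callables standing in for the real log, Sentry and Opsgenie integrations.

```python
import logging
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Alarm:
    name: str
    severity: str  # "warning" or "critical" in this sketch
    message: str


Sink = Callable[[Alarm], None]
Predicate = Callable[[Alarm], bool]


class AlarmRouter:
    """Forwards each alarm to every registered sink whose predicate accepts it."""

    def __init__(self) -> None:
        self._routes: List[Tuple[Predicate, Sink]] = []

    def add_sink(self, sink: Sink, only_if: Predicate = lambda alarm: True) -> None:
        self._routes.append((only_if, sink))

    def raise_alarm(self, alarm: Alarm) -> int:
        """Deliver the alarm and return the number of sinks that accepted it."""
        delivered = 0
        for only_if, sink in self._routes:
            if not only_if(alarm):
                continue
            try:
                sink(alarm)
                delivered += 1
            except Exception:
                # A failing sink must not stop delivery to the remaining sinks.
                logging.exception("alarm sink failed for %s", alarm.name)
        return delivered


# Usage with stand-in sinks (lists instead of real integrations).
log_lines: List[Alarm] = []
pages: List[Alarm] = []
router = AlarmRouter()
router.add_sink(log_lines.append)
router.add_sink(pages.append, only_if=lambda a: a.severity == "critical")
router.raise_alarm(Alarm("disk_almost_full", "warning", "85% used"))
router.raise_alarm(Alarm("db_unreachable", "critical", "no connection"))
```

Keeping the "critical only" decision in a per-sink predicate means the code that raises an alarm does not need to know who ends up being paged.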
We have been using elarm in one of our server applications. Around 8 years ago the customer wanted to see a few alarms in Nagios, so we implemented these alarms with elarm.
However, apart from this one case, neither this customer nor any other customer has been interested in alarms coming from our applications. The operations teams use metrics, and alarms based on those metrics. So if we want them to have an alarm for something, we need to implement a metric and provide them with an alarm rule based on that metric.
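To make the "metric plus alarm rule" idea concrete, here is a minimal sketch of what such a rule boils down to: the alarm fires once the metric has stayed above a threshold for a number of consecutive samples. `ThresholdRule`, `evaluate`, the metric name and the numbers are all made up for illustration; in practice the rule would be written in whatever rule language the operations team's monitoring system uses.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ThresholdRule:
    metric: str        # name of the metric the application exports
    threshold: float   # the alarm condition is "value > threshold"
    for_samples: int   # consecutive breaching samples needed before firing


def evaluate(rule: ThresholdRule, samples: Iterable[float]) -> List[bool]:
    """Return the alarm state (True = firing) after each sample."""
    states: List[bool] = []
    streak = 0
    for value in samples:
        # Count consecutive breaches; any sample at or below the threshold resets.
        streak = streak + 1 if value > rule.threshold else 0
        states.append(streak >= rule.for_samples)
    return states


rule = ThresholdRule(metric="queue_length", threshold=100, for_samples=3)
states = evaluate(rule, [50, 120, 130, 140, 90, 150])
# states == [False, False, False, True, False, False]
```

Requiring several consecutive samples is a common way to avoid paging on a single spike.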
In the end, it is hard to come up with a one-size-fits-all solution for this subject; it is too use-case specific. On top of that, some fields have a lot of standards.
We have an application, written in Erlang, which produces the TM Forum Open API for Alarm Management (TMF642). It manages CRUD for a collection of alarms and handles ack/unack, clear, group/ungroup operations and includes a Web Components PWA for NOC staff.
On the southbound side it collects VNF Event Streaming (VES) events, and we have an SNMP TRAP collector (github.com/sigscale/snmp-collector) which normalizes to the information model of ITU-T X.721/X.733 and 3GPP Alarm Integration Reference Point (IRP) 32.111-2 and adapts to the VES API. We reused what we could of OTP’s SNMP manager but had to extend it for high-volume production use, including a NIF for SNMPv3 crypto.
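As a rough illustration of what "normalizing" a trap means, the sketch below maps already-decoded trap varbinds onto X.733-style alarm fields. It is not the snmp-collector's code: the varbind keys (`severity`, `event_type`, and so on) and the output field names are hypothetical, and a real collector resolves OIDs through the vendor MIBs first. The severity numbering is intended to follow the `ItuPerceivedSeverity` textual convention from RFC 3877 and should be checked against the MIB.

```python
from enum import IntEnum
from typing import Any, Dict, Mapping


class ItuPerceivedSeverity(IntEnum):
    # Assumed to match ItuPerceivedSeverity in RFC 3877; verify before relying on it.
    CLEARED = 1
    INDETERMINATE = 2
    CRITICAL = 3
    MAJOR = 4
    MINOR = 5
    WARNING = 6


def normalize_trap(source: str, varbinds: Mapping[str, Any]) -> Dict[str, Any]:
    """Map decoded trap varbinds (hypothetical keys) to X.733-style alarm fields."""
    severity = ItuPerceivedSeverity(int(varbinds["severity"]))
    return {
        "managed_object": source,
        "event_type": varbinds.get("event_type", "unknown"),
        "probable_cause": varbinds.get("probable_cause", "unknown"),
        "specific_problem": varbinds.get("specific_problem", ""),
        "perceived_severity": severity.name.lower(),
        # A "cleared" severity ends an earlier alarm instead of raising a new one.
        "cleared": severity is ItuPerceivedSeverity.CLEARED,
    }


alarm = normalize_trap("ne-1", {"severity": 3, "probable_cause": "linkDown"})
```

Once every source is reduced to the same handful of fields, the northbound side only has to deal with one alarm shape, whichever vendor sent the trap.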
We also have a load generator (github.com/sigscale/snmp-simulator) which supports supplying an alarm model with the Alarm Management Information Base (MIB) (RFC3877) for simulation.
Our fault management application should be open source too, but we never got around to releasing it. The solution was used by a large national mobile operator as part of an umbrella network management project.