BeamTrail - an OTP-native durable step runner using PostgreSQL event logs

Hi,

I am building BeamTrail, an Erlang/OTP library for durable business process execution. I would like design feedback from people who have operated OTP systems, PostgreSQL-backed queues, workflow engines, or event-sourced systems.

Repository: sherry255/BeamTrail on GitHub (durable workflow runtime for Erlang/OTP backed by PostgreSQL)

BeamTrail is not a Temporal clone and not a job queue. The current goal is smaller: an embedded OTP runtime for durable step/process execution in an Erlang application that already runs PostgreSQL.

The basic model:

  • Active runs are supervised OTP processes.
  • Business progress is stored as an append-only PostgreSQL event stream.
  • State is rebuilt by reducing events.
  • Snapshots and indexed projections are optimizations only.
  • Leases, fencing tokens, and expected sequence checks protect concurrent writers.
  • A scanner recovers unfinished runs whose lease is missing or expired.
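
To make the "state is rebuilt by reducing events" point concrete, here is a minimal sketch of a reducer as a fold over the stream. The event tuples, the state map, and the module name are hypothetical and chosen only for illustration; they are not BeamTrail's actual schema or API.

```erlang
-module(reduce_sketch).
-export([rebuild/1, rebuild_from/2, apply_event/2]).

%% Hypothetical event shapes and state map, for illustration only.
initial() ->
    #{status => pending, completed => [], seq => 0}.

%% Rebuild run state by folding the whole event stream in append order.
rebuild(Events) ->
    lists:foldl(fun apply_event/2, initial(), Events).

%% A snapshot is only a saved intermediate state: folding the remaining
%% events onto it must give the same result as a full rebuild.
rebuild_from(Snapshot, EventsAfterSnapshot) ->
    lists:foldl(fun apply_event/2, Snapshot, EventsAfterSnapshot).

apply_event({run_started, _Input}, S) ->
    bump(S#{status := running});
apply_event({step_completed, Step, _Result}, S = #{completed := Done}) ->
    bump(S#{completed := Done ++ [Step]});
apply_event({run_finished, _Output}, S) ->
    bump(S#{status := finished});
apply_event(_Unknown, S) ->
    %% Unknown events still advance seq so it matches the log length.
    bump(S).

bump(S = #{seq := N}) ->
    S#{seq := N + 1}.
```

Because a snapshot is just a prefix of the fold, dropping all snapshots should never change the result, only the cost of recovery.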

The split is intentional:

  • OTP owns live execution, timers, process isolation, and cleanup.
  • PostgreSQL owns durable history, append ordering, leases, fencing, and cross-process recovery.

The project started as a linear durable step runner. It now also has an optional deterministic decider callback for one-command-at-a-time orchestration. The decider path supports explicit step inputs, durable signals, durable timers, and {wait, Reason}. There is an executable approval-deadline pattern built from signals, timers, and waits.
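
As a rough illustration of the approval-deadline shape, a decider can be written as a pure function from reduced state to a single next command. In the sketch below only `{wait, Reason}` comes from the description above; the state keys, the other command tuples, and the module name are assumptions and do not reflect the real callback signature.

```erlang
-module(approval_decider_sketch).
-export([decide/1]).

%% Hypothetical state keys: shipped, approval, deadline.
%% Hypothetical commands: run_step, start_timer, complete.
%% Clause order encodes priority: a recorded approval wins over a fired deadline.

decide(#{shipped := true}) ->
    {complete, shipped};
decide(#{approval := approved}) ->
    %% Explicit step input is passed with the command.
    {run_step, ship_order, #{priority => normal}};
decide(#{approval := rejected}) ->
    {complete, rejected};
decide(#{deadline := fired}) ->
    {complete, timed_out};
decide(#{deadline := unset}) ->
    %% Arm a durable timer first (24 hours, in milliseconds).
    {start_timer, approval_deadline, 86400000};
decide(#{deadline := armed}) ->
    %% Nothing to do until a signal or the timer event is appended.
    {wait, approval_or_deadline}.
```

Since the function depends only on its argument, replaying the same events yields the same command, which is what makes the one-command-at-a-time model deterministic.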

Current implemented pieces:

  • PostgreSQL durable adapter and memory test adapter.
  • Append-only event log and reducer.
  • Expected-sequence append checks.
  • Per-run PostgreSQL append lock.
  • Lease renewal and fencing tokens.
  • Supervised per-run gen_statem active runner.
  • Scanner recovery with indexed recoverable-run projections.
  • Retries, timeouts, and crash-atomic failure decisions.
  • Cancel, park/resume, and manual requeue.
  • Version mismatch gates for in-flight attempt replay and decider workflows.
  • Durable signals and scanner-driven durable timers.
  • Crash recovery and PostgreSQL stress examples.
  • EUnit, PostgreSQL integration tests, xref, Dialyzer, and secret scan in CI.
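
For reviewers who want the fencing and expected-sequence checks spelled out, here is a minimal in-memory sketch of the two guards. The function names, error terms, and log representation are hypothetical; a PostgreSQL adapter would need to perform the same checks atomically with the insert, which this sketch does not model.

```erlang
-module(append_guard_sketch).
-export([new/0, acquire/2, append/4]).

%% Hypothetical in-memory model of one run's log.
new() ->
    #{events => [], seq => 0, token => 0, owner => none}.

%% Taking the lease bumps the fencing token, so any earlier holder
%% is left with a stale token.
acquire(Owner, Log = #{token := T}) ->
    NewToken = T + 1,
    {NewToken, Log#{token := NewToken, owner := Owner}}.

%% An append succeeds only if the writer holds the current token and
%% has seen the whole log (ExpectedSeq equals the current sequence).
append(Token, _ExpectedSeq, _Event, #{token := Current})
  when Token =/= Current ->
    {error, stale_fencing_token};
append(_Token, ExpectedSeq, _Event, #{seq := Seq})
  when ExpectedSeq =/= Seq ->
    {error, {sequence_conflict, Seq}};
append(_Token, _ExpectedSeq, Event, Log = #{events := Es, seq := Seq}) ->
    {ok, Log#{events := Es ++ [Event], seq := Seq + 1}}.
```

The two guards cover different failures: the token rejects a writer that lost its lease, while the sequence check rejects a current writer acting on an outdated view of the log.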

Current limits:

  • No DAG, fan-out/fan-in, child workflows, or parallel command batches yet.
  • No first-class human task assignment UI.
  • No HTTP API or browser UI.
  • No SQL-native JSON payload inspection.
  • No built-in external side-effect deduplication.
  • No exactly-once execution of user callbacks.

The at-least-once boundary is explicit. BeamTrail provides stable idempotency keys, but workflow code must use them when calling external systems. If a VM dies after a callback performs an external side effect but before BeamTrail records the outcome, the same attempt can be re-entered with the same idempotency key.
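
A small sketch of why the key matters: the external system is modelled here as a map from idempotency key to result, so a re-entered attempt gets the earlier receipt instead of causing a second charge. The key format, the `charge/3` function, and the ledger map are hypothetical; a real payment or messaging API would do this deduplication on its own side.

```erlang
-module(idempotency_sketch).
-export([charge/3]).

%% Hypothetical external system: Ledger maps idempotency key => receipt.
charge(Key, Amount, Ledger) ->
    case maps:find(Key, Ledger) of
        {ok, Receipt} ->
            %% Same key seen before: return the earlier result, no second charge.
            {Receipt, Ledger};
        error ->
            Receipt = {charged, Amount},
            {Receipt, Ledger#{Key => Receipt}}
    end.
```

Calling `charge/3` twice with the same key, as would happen after a VM crash between the side effect and the recorded outcome, leaves exactly one entry in the ledger.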

What I would especially like reviewed:

  1. Is the OTP/PostgreSQL split reasonable?
  2. Is the lease/fencing model strong enough?
  3. Is the active runner/scanner recovery shape idiomatic enough?
  4. Is scanner-driven timer materialization acceptable?
  5. Is the public API shaped naturally for Erlang?
  6. Is the MVP scope honest enough?

Useful docs: the repository README covers the decider, timers, the approval pattern, the roadmap, and community review notes.

I am most interested in failure-model criticism, OTP design criticism, and whether the current scope would be useful to real Erlang/Elixir teams.

Thanks.
