Hi,
I am building BeamTrail, an Erlang/OTP library for durable business process execution. I would like design feedback from people who have operated OTP systems, PostgreSQL-backed queues, workflow engines, or event-sourced systems.
Repository: sherry255/BeamTrail on GitHub (durable workflow runtime for Erlang/OTP backed by PostgreSQL)
BeamTrail is not a Temporal clone and not a job queue. The current goal is smaller: an embedded OTP runtime for durable step/process execution inside an Erlang application that already uses PostgreSQL.
The basic model:
- Active runs are supervised OTP processes.
- Business progress is stored as an append-only PostgreSQL event stream.
- State is rebuilt by reducing events.
- Snapshots and indexed projections are optimizations only.
- Leases, fencing tokens, and expected sequence checks protect concurrent writers.
- A scanner recovers unfinished runs whose lease is missing or expired.
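To make "state is rebuilt by reducing events" concrete, here is a minimal sketch of the idea. It is not BeamTrail's code: the module name, the {Seq, Type, Payload} event shape, and the state map are assumptions made up for illustration.

```erlang
-module(reduce_sketch).
-export([rebuild/1, apply_event/2]).

%% Hypothetical event shape for this sketch: {Seq, Type, Payload},
%% delivered in ascending Seq order. Not BeamTrail's real schema.

%% Rebuild the run state from scratch by folding over the event stream.
rebuild(Events) ->
    Initial = #{seq => 0, status => pending, completed => []},
    lists:foldl(fun apply_event/2, Initial, Events).

apply_event({Seq, run_started, _}, State) ->
    State#{seq := Seq, status := running};
apply_event({Seq, step_completed, Step}, #{completed := Done} = State) ->
    State#{seq := Seq, completed := Done ++ [Step]};
apply_event({Seq, run_finished, _}, State) ->
    State#{seq := Seq, status := finished};
apply_event({Seq, _Unknown, _}, State) ->
    %% Unknown event types only advance the sequence number.
    State#{seq := Seq}.
```

Because the state is a pure function of the events, a snapshot in this picture is just a cached fold result: folding the remaining events over it should give the same state as folding the full stream.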
The split is intentional:
- OTP owns live execution, timers, process isolation, and cleanup.
- PostgreSQL owns durable history, append ordering, leases, fencing, and cross-process recovery.
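The write-side checks can be sketched roughly as follows. This is an in-memory stand-in, not BeamTrail's implementation: in the real library these checks happen against PostgreSQL, and the module name, function names, and error terms here are all assumptions for illustration.

```erlang
-module(append_sketch).
-export([new/0, acquire_lease/1, append/4]).

%% In-memory stand-in for one run's durable record (hypothetical layout).
new() ->
    #{events => [], last_seq => 0, fence => 0}.

%% Taking the lease bumps the fencing token, so any older holder
%% is rejected on its next write.
acquire_lease(#{fence := Fence} = Run) ->
    Token = Fence + 1,
    {Token, Run#{fence := Token}}.

%% An append succeeds only for the current lease holder writing
%% at the sequence number it expects to follow.
append(#{fence := Fence}, Token, _ExpectedSeq, _Event) when Token =/= Fence ->
    {error, stale_fencing_token};
append(#{last_seq := Last}, _Token, ExpectedSeq, _Event) when ExpectedSeq =/= Last ->
    {error, {sequence_conflict, Last}};
append(#{events := Events, last_seq := Last} = Run, _Token, _ExpectedSeq, Event) ->
    Seq = Last + 1,
    {ok, Run#{events := Events ++ [{Seq, Event}], last_seq := Seq}}.
```

In this sketch a runner that lost its lease fails on the fencing check even if its expected sequence happens to be current, and a current holder working from a stale view fails on the sequence check.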
The project started as a linear durable step runner. It now also has an optional deterministic decider callback for one-command-at-a-time orchestration. The decider path supports explicit step inputs, durable signals, durable timers, and {wait, Reason}. There is an executable approval-deadline pattern built from signals, timers, and waits.
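For readers who want a picture of the approval-deadline shape, here is a rough sketch of a deterministic decider that returns one command per call. Only {wait, Reason} comes from the description above; the module name, the state map, and the start_timer and complete commands are hypothetical and are not BeamTrail's actual callback signature.

```erlang
-module(approval_decider_sketch).
-export([decide/1]).

%% Hypothetical reduced state:
%%   approval :: none | approved | rejected   (set by a durable signal)
%%   timer    :: unset | pending | fired      (set by a durable timer)
%% The function is pure, so replaying the same state yields the same command.

%% An approval signal wins even if the deadline has also fired.
decide(#{approval := approved}) ->
    {complete, approved};
decide(#{approval := rejected}) ->
    {complete, rejected};
decide(#{approval := none, timer := unset}) ->
    %% No deadline yet: ask for a 24-hour durable timer (milliseconds).
    {start_timer, approval_deadline, 24 * 60 * 60 * 1000};
decide(#{approval := none, timer := fired}) ->
    {complete, timed_out};
decide(#{approval := none, timer := pending}) ->
    %% Nothing to do until a signal or the timer arrives.
    {wait, approval_or_deadline}.
```

Whether a late approval should beat an already-fired deadline is a policy choice; the clause order above is just one option.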
Current implemented pieces:
- PostgreSQL durable adapter and memory test adapter.
- Append-only event log and reducer.
- Expected-sequence append checks.
- Per-run PostgreSQL append lock.
- Lease renewal and fencing tokens.
- Supervised per-run gen_statem active runner.
- Scanner recovery with indexed recoverable-run projections.
- Retries, timeouts, and crash-atomic failure decisions.
- Cancel, park/resume, and manual requeue.
- Version mismatch gates for in-flight attempt replay and decider workflows.
- Durable signals and scanner-driven durable timers.
- Crash recovery and PostgreSQL stress examples.
- EUnit, PostgreSQL integration tests, xref, Dialyzer, and secret scan in CI.
Current limits:
- No DAG, fan-out/fan-in, child workflows, or parallel command batches yet.
- No first-class human task assignment UI.
- No HTTP API or browser UI.
- No SQL-native JSON payload inspection.
- No built-in external side-effect deduplication.
- No exactly-once execution of user callbacks.
The at-least-once boundary is explicit. BeamTrail provides stable idempotency keys, but workflow code must use them when calling external systems. If a VM dies after a callback performs an external side effect but before BeamTrail records the outcome, the same attempt can be re-entered with the same idempotency key.
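As an illustration of what "workflow code must use them" means in practice, here is a minimal sketch of the receiving side. The in-memory gateway stands in for an external system that deduplicates by key; the module name, function names, and key format are assumptions and are not part of BeamTrail.

```erlang
-module(idempotent_charge_sketch).
-export([new_gateway/0, charge/3]).

%% In-memory stand-in for an external payment system that remembers
%% the result of every idempotency key it has seen (hypothetical).
new_gateway() ->
    #{seen => #{}, charges => 0}.

%% What a step callback would call. A replay with the same key returns
%% the stored result and does not perform a second charge.
charge(#{seen := Seen, charges := N} = Gateway, Key, Amount) ->
    case maps:find(Key, Seen) of
        {ok, Result} ->
            {Result, Gateway};
        error ->
            Result = {charged, Amount},
            {Result, Gateway#{seen := Seen#{Key => Result}, charges := N + 1}}
    end.
```

If the VM dies after the first charge call but before the outcome is recorded, the re-entered attempt calls charge/3 again with the same key, and the gateway in this sketch still counts only one charge.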
What I would especially like reviewed:
- Is the OTP/PostgreSQL split reasonable?
- Is the lease/fencing model strong enough?
- Is the active runner/scanner recovery shape idiomatic enough?
- Is scanner-driven timer materialization acceptable?
- Is the public API shaped naturally for Erlang?
- Is the MVP scope honest enough?
Useful docs: for the decider, timers, the approval pattern, the roadmap, and community review notes, please see the repository README.
I am most interested in failure-model criticism, OTP design criticism, and whether the current scope would be useful to real Erlang/Elixir teams.
Thanks.