I would like to request comments on the ideas behind a prototype I recently put together for node-specific control command filtering in the Erlang Distribution Protocol.
The full description and code for the prototype are available here: RFC: Erlang Dist Security Filtering Prototype by potatosalad · Pull Request #1 · potatosalad/otp · GitHub
Summary
In short: it’s a firewall for Erlang Dist.
Example:
- Node B fully trusts Node A. All control commands sent over Erlang Dist from Node A to Node B will be accepted.
- Node A partially trusts Node B, but will only accept control commands sent to a registered process node_b_gateway. All other control commands will be rejected.
Motivation
Consider the following scenario where 3 nodes are halted as a result of 1 line entered into an initially unconnected node:
erl -name foo@127.0.0.1
erl -name bar@127.0.0.1
erl -name baz@127.0.0.1
On bar@127.0.0.1:
1> net_adm:ping('baz@127.0.0.1').
pong
On foo@127.0.0.1:
1> erpc:cast('bar@127.0.0.1', fun Halt() -> erpc:multicast(nodes(), Halt), erlang:halt(1) end).
% all 3 nodes are halted
What can be done to prevent foo@127.0.0.1 from recursively halting all other nodes?
Guide-level explanation
NOTE: The API below is not finalized and will likely change in the future. The concepts related to coarse-grained filters, fine-grained filters, combination of filters, and spawn_request handlers on top of Erlang Dist are the main focus of this proposal.
Filters (or rules) are configured per-node connection using either coarse-grained or fine-grained BIF calls that modify the dist_entry table within dist.c for the given connection.
Coarse-grained filters may be configured with either accept or reject (with accept being the default):
net_kernel:set_filter(Node, link, reject).
net_kernel:set_filter(Node, reg_send, reject).
net_kernel:set_filter(Node, group_leader, reject).
net_kernel:set_filter(Node, monitor, reject).
net_kernel:set_filter(Node, send, reject).
net_kernel:set_filter(Node, spawn_request, reject).
net_kernel:set_filter(Node, alias_send, reject).
Fine-grained filters may be configured for reg_send and spawn_request command types:
net_kernel:add_filter(Node, reg_send, my_secret_process, reject).
net_kernel:add_filter(Node, reg_send, my_public_process, accept).
net_kernel:add_filter(Node, spawn_request, {my_mod, my_fun, 0}, accept).
net_kernel:add_filter(Node, spawn_request, {my_mod, my_fun, 1}, reject).
Fine-grained filters may also be removed for reg_send and spawn_request command types:
net_kernel:del_filter(Node, reg_send, my_secret_process).
net_kernel:del_filter(Node, reg_send, my_public_process).
net_kernel:del_filter(Node, spawn_request, {my_mod, my_fun, 0}).
net_kernel:del_filter(Node, spawn_request, {my_mod, my_fun, 1}).
Combined filtering may be tested to see whether a given command from a node will result in an accept or reject:
1> net_kernel:test_filter(Node, reg_send, my_secret_process).
accept
2> net_kernel:add_filter(Node, reg_send, my_secret_process, reject).
accept
3> net_kernel:test_filter(Node, reg_send, my_secret_process).
reject
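How the coarse-grained and fine-grained filters combine can be sketched as a small pure-Erlang model that runs on a stock Erlang/OTP install. This is not the prototype's implementation, which lives in dist.c. The module name dist_filter_model and all of its functions are hypothetical, the filter state is passed explicitly instead of being looked up by node, and the model assumes that a fine-grained entry takes precedence over the coarse-grained setting for the same command type, with accept as the fallback. That precedence rule is an assumption inferred from the shell session above, not something the post states outright.

```erlang
-module(dist_filter_model).
-export([new/0, set_filter/3, add_filter/4, del_filter/3, test_filter/3,
         summary_example/0]).

%% Hypothetical model of the filter state for ONE node connection.
%% Assumption: a fine-grained entry wins over the coarse-grained setting,
%% and anything not configured falls back to accept.

%% Empty state: no coarse-grained and no fine-grained filters configured.
new() ->
    #{coarse => #{}, fine => #{}}.

%% Coarse-grained filter for a whole command type (link, reg_send, ...).
set_filter(Type, Action, State = #{coarse := Coarse})
  when Action =:= accept; Action =:= reject ->
    State#{coarse := Coarse#{Type => Action}}.

%% Fine-grained filter for one key (registered name or MFA) of a type.
add_filter(Type, Key, Action, State = #{fine := Fine})
  when Action =:= accept; Action =:= reject ->
    State#{fine := Fine#{{Type, Key} => Action}}.

%% Remove a fine-grained filter; the coarse-grained setting applies again.
del_filter(Type, Key, State = #{fine := Fine}) ->
    State#{fine := maps:remove({Type, Key}, Fine)}.

%% Combined decision: fine-grained entry first, then coarse, then accept.
test_filter(Type, Key, #{coarse := Coarse, fine := Fine}) ->
    case maps:find({Type, Key}, Fine) of
        {ok, Action} -> Action;
        error -> maps:get(Type, Coarse, accept)
    end.

%% Node A's view of Node B from the Summary: reject reg_send in general,
%% but accept it for the registered process node_b_gateway.
summary_example() ->
    S0 = set_filter(reg_send, reject, new()),
    S1 = add_filter(reg_send, node_b_gateway, accept, S0),
    {test_filter(reg_send, node_b_gateway, S1),
     test_filter(reg_send, some_other_process, S1)}.
```

Under these assumptions, dist_filter_model:summary_example() evaluates to {accept, reject}: the gateway process stays reachable while every other registered name is refused.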
In addition, there is an option to enable Erlang-based filtering for spawn_request commands:
-module(my_spawn_request_handler).
-export([dist_spawn_init/4]).
-spec dist_spawn_init(Node, Module, Function, Arguments) -> Result when
    Node :: node(),
    Module :: module(),
    Function :: atom(),
    Arguments :: [term()],
    Result :: any().
dist_spawn_init(Node, Module, Function, Arguments) ->
    % Perform any extra filtering based on `Node' here...
    erlang:apply(Module, Function, Arguments).
Our my_spawn_request_handler module can be enabled per-node with:
4> net_kernel:set_handler(Node, spawn_request, my_spawn_request_handler).
undefined
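As an illustration of what the extra filtering inside dist_spawn_init/4 might look like, here is a minimal sketch of a handler that only applies calls found on a per-node allowlist. The module name allowlist_spawn_handler, the allowed/1 helper, the allowlist contents, and the exit reason are all hypothetical. It also assumes that exiting is an acceptable way for a handler to refuse a request; the post does not say how a handler should signal rejection.

```erlang
-module(allowlist_spawn_handler).
-export([dist_spawn_init/4]).

%% Hypothetical allowlist of {Module, Function, Arity} per remote node.
%% Nodes that are not listed are allowed nothing.
allowed('foo@127.0.0.1') -> [{lists, reverse, 1}];
allowed(_Node) -> [].

%% Same callback shape as my_spawn_request_handler above, with a check
%% against the allowlist before the requested function is applied.
dist_spawn_init(Node, Module, Function, Arguments) ->
    MFA = {Module, Function, length(Arguments)},
    case lists:member(MFA, allowed(Node)) of
        true ->
            erlang:apply(Module, Function, Arguments);
        false ->
            %% Assumption: exiting is how this sketch refuses the request.
            erlang:exit({spawn_request_rejected, Node, MFA})
    end.
```

Because the callback is an ordinary function, the decision logic can be exercised locally without a second node: calling it with 'foo@127.0.0.1', lists, reverse, and [[1,2,3]] applies the function, while a request such as erlang:halt/1 from any node exits with the spawn_request_rejected reason instead of being run.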
Reference-level explanation
See the following for more details: RFC: Erlang Dist Security Filtering Prototype by potatosalad · Pull Request #1 · potatosalad/otp · GitHub
Drawbacks
- Why should we not do this?
  - Does this actually improve security at all? Node-level control command filtering may not be fine-grained enough in practice to sufficiently increase security between nodes within an Erlang Dist cluster.
  - Extra memory used as part of the Erlang Dist entry table in order to store filters may prevent larger clusters.
  - Silently rejecting messages on the receiving side is a bad idea and something like the capability system proposed in SafeErlang sent over Erlang Dist would be better here.
  - What’s the point? Just use something other than Erlang Dist between nodes of different service types.
Rationale and alternatives
- Why is this design the best in the space of possible designs?
  - I think this is a pragmatic solution that solves a real problem today where Erlang Dist is used between fully trusted and partially trusted nodes.
  - The filtering table uses the same hash table technique used by the dist entries themselves.
  - Preliminary performance testing suggests that checking the filters in the hash table has a negligible effect on the throughput and latency of Erlang Dist communication between nodes with filtering enabled.
  - This actually provides a solution to prevent the problem mentioned above where all 3 nodes were halted remotely.
- What other designs have been considered and what is the rationale for not choosing them?
  - Don’t use Erlang Distribution
    - This is probably the most common solution I have seen implemented.
    - For nodes of different service types, a different protocol is used: gRPC, Thrift, REST, GraphQL, BERT-RPC, etc.
    - Problems: Support for process-level message routing is lost or must be manually implemented (for example: RemotePid ! foo). Other dist features, like monitoring, linking, etc., are no longer available out-of-the-box.
  - SafeErlang: Access control with capabilities, resources, gates, rights, etc.
    - See prior art below for more information about the SafeErlang papers.
    - Problems: To quote Rickard Green, this approach “require(s) quite a lot of work.”
  - Pure Erlang proto_dist implementation
    - This would allow a filtering solution per-node with roughly equivalent functionality without requiring a change to Erlang/OTP itself.
    - Problems: I initially started with this approach, but it required having duplicate decoding, encoding, and state tracking both internally by dist.c and by the proto_dist pure Erlang module. Keeping dist.c and the proto_dist implementation in sync in the future would likely be expensive from a maintenance perspective and the resulting performance over the distribution protocol was sub-optimal.
  - Client-Side Restricted Shell
    - See shell: Restricted Shell for more information.
    - This is roughly the idea that inspired the prototype, but focusing on server-side instead of client-side. The spawn_request handler has similar functionality to the non_local_allowed/3 callback used by shell:start_restricted/1, but it is instead executed on the server-side.
    - Problems: It’s client-side and not server-side. Manually starting a non-restricted shell, where all restrictions may be bypassed, is fairly trivial.
  - Filter at the network layer instead
    - Why not just use existing firewall solutions to filter the traffic at the network layer?
    - Problems: No support for encryption (like TLS). The atom cache table will need to be maintained and updated for every node connection pair. Difficult to keep in sync with upstream dist.c changes in the future.
- What is the impact of not doing this?
  - Certain use-cases for Erlang Dist in the future will continue to be impossible due to security constraints.
  - I will be sad
Prior art
- (1997) “SafeErlang” by Gustaf Naeser
  - Introduces a few new concepts: capabilities, resources, namespaces, and gates.
  - The prototype effectively borrows the concept of lightweight “gates” as the initial filtering point for inbound dist control commands.
- (1999) “Extending Erlang for Safe Mobile Code Execution” by Lawrie Brown and Dan Sahlin
  - Interesting ideas related to local versus remote sourcing of code/modules and a proposed solution for capabilities from the earlier SafeErlang paper.
- (2000) “Enhancing Security in Distributed Erlang by Integrating Access Control: Approaching a Real SafeErlang Implementation” by Rickard Green
  - Excellent overview of Erlang dist features available at the time.
  - Discusses forging of capabilities, remote code loading, secure (encrypted) communication, access control, and potential solutions.
  - My favorite line from the paper: “The problems identified in the previous section require quite a lot of work.”
  - The paper focuses on access control and capability lists, referencing related implementations found in Java and Safe-Tcl.
  - SafeErlang Capability Rights:
    - link: Permission to link to the referred process.
    - monitor: Permission to monitor the referred process.
    - Process-level, port-level, and node-level “rights” are defined, like halt: Permission to halt the referred node.
    - And more. These “rights” are similar to the filter types for control commands in the prototype. However, the “rights” in this paper are much more fine-grained and tied to a given process, port, or node.
- (2000) “Secure Distributed Communication in SafeErlang” by Bertil Karlsson
  - Describes an encryption scheme for the dist protocol and additional ideas related to capability exchange between nodes.
  - While the prototype does not introduce encryption, I think that “encryption by default” is still a good goal to have here. Whether the default transport happens to be TLS or something more lightweight (like Noise), I think this is worth figuring out as part of a future effort.
Unresolved questions
- How to configure node-specific filters for a given connection before the connection is established?
- Do we want to support pid, port, and alias-level filtering? For example: right now, we can disable all sending to processes, but what if we want to allow selective sending? What use-cases exist for this kind of setup?
- Does this actually improve security? If not, is it possible to build on top of this idea?
- Does this increase the probability that Erlang Dist may be used in future services versus replacing it with another protocol between service types?
Future possibilities
- Pid, port, and alias-level filtering.
- spawn_request pid filtering. For example: Node A can only send messages to processes on Node B that it directly spawned with spawn_request.
- Tracing integration and counters for profiling and statistics related to accepted and rejected control commands.
- Preflight checking to find out whether a remote node is going to reject a request before it is sent.
- Integration with the restricted shell so that allowed code execution may be defined remotely.
- Possibly more integration of ideas from SafeErlang related to capabilities that might enable a form of “sandbox checking” prior to evaluation of remote commands.