How do I replace ct_slave:start with ?CT_PEER() or the peer module?

fkrause98 · May 30, 2022, 9:01pm

Hi!
I’m updating some tests that were originally written for OTP 21 to OTP 25 that I run like this:
rebar3 ct --name test@127.0.0.1.

The idea behind the tests is to have 3 nodes running, and I start them like this:

init_per_suite(Config) ->
  Node1 = 'node1@127.0.0.1',
  Node2 = 'node2@127.0.0.1',
  Node3 = 'node3@127.0.0.1',
  start_node(Node1, 8198, 8199),
  start_node(Node2, 8298, 8299),
  start_node(Node3, 8398, 8399),

  build_cluster(Node1, Node2, Node3),

  [{node1, Node1},
   {node2, Node2},
   {node3, Node3} | Config].

And the start_node implementation is as follows:
start_node(NodeName, WebPort, HandoffPort) ->

 %% need to set the code path so the same modules are available in the slave
  CodePath = code:get_path(),
  PathFlag = "-pa " ++ lists:concat(lists:join(" ", CodePath)),
  {ok, _} = ct_slave:start(NodeName, [{erl_flags, PathFlag}]),

  %% set the required environment for riak core
  DataDir = "./data/" ++ atom_to_list(NodeName),
  rpc:call(NodeName, application, load, [riak_core]),
  rpc:call(NodeName, application, set_env, [riak_core, ring_state_dir, DataDir]),
  rpc:call(NodeName, application, set_env, [riak_core, platform_data_dir, DataDir]),
  rpc:call(NodeName, application, set_env, [riak_core, web_port, WebPort]),
  rpc:call(NodeName, application, set_env, [riak_core, handoff_port, HandoffPort]),
  rpc:call(NodeName, application, set_env, [riak_core, schema_dirs, ["../../lib/rc_example/priv"]]),

  %% start the rc_example app
  {ok, _} = rpc:call(NodeName, application, ensure_all_started, [rc_example]),

  ok.

Everything works fine like this, but I get the following warning:

test/key_value_SUITE.erl:86:13: Warning: ct_slave:start/2 is deprecated and will be removed in OTP 27; use ?CT_PEER(), or the 'peer' module instead
test/key_value_SUITE.erl:103:3: Warning: ct_slave:stop/1 is deprecated and will be removed in OTP 27; use ?CT_PEER(), or the 'peer' module instead

I have tried an alternative implementation with both ?CT_PEER and peer:start, like this:

start_node(NodeName, Host, WebPort, HandoffPort) ->
  %% need to set the code path so the same modules are available in the slave
  CodePath = code:get_path(),
  PathFlag = "-pa " ++ lists:concat(lists:join(" ", CodePath)),

  {ok, _Peer, Node} = ?CT_PEER(["-name " ++ NodeName ++ "@"  ++ Host, PathFlag]),

  %% set the required environment for riak core
  DataDir = "./data/" ++ NodeName,

  %% Check the node is running
  ok = rpc:call(Node, application, load, [riak_core]),
  ok = rpc:call(Node, application, set_env, [riak_core, ring_state_dir, DataDir]),
  ok = rpc:call(Node, application, set_env, [riak_core, platform_data_dir, DataDir]),
  ok = rpc:call(Node, application, set_env, [riak_core, web_port, WebPort]),
  ok = rpc:call(Node, application, set_env, [riak_core, handoff_port, HandoffPort]),
  ok = rpc:call(Node, application, set_env, [riak_core, schema_dirs, ["../../lib/rc_example/priv"]]),

  %% start the rc_example app
  {ok, _} = rpc:call(Node, application, ensure_all_started, [rc_example]),

  Node.
  %%

And change init to:

init_per_suite(Config) ->
  %% Node1 = 'node1@127.0.0.1',
  %% Node2 = 'node2@127.0.0.1',
  %% Node3 = 'node3@127.0.0.1',
  Host = "127.0.0.1",
  Node1 = start_node("node1", Host, 8198, 8199),
  Node2 = start_node("node2", Host, 8298, 8299),
  Node3 = start_node("node3", Host, 8398, 8399),

  build_cluster(Node1, Node2, Node3),

  [{node1, Node1},
   {node2, Node2},
   {node3, Node3} | Config].

But then the test fails with a timeout. What could I be doing wrong?

Edit:
Here’s the code, you should be able to reproduce the error by:

Cloning.
git checkout non_working_tests
make test

starbelly · May 31, 2022, 12:32am

I don’t think you want to use ?CT_PEER(), as it looks like this macro will define a node name for you and go through the ct test_server. What’s more, both ?CT_PEER() and peer:start* take a map for options. See Erlang -- peer for details.

If you are still interested in using ?CT_PEER(), I would take a look at the macro definition and at some OTP test suites for examples (i.e., grep -rl '\?CT_PEER' from the otp source dir).

fkrause98 · May 31, 2022, 12:48am

Thanks for the answer, I’ll look into it, although I have fixed it by using test_server:start_node
instead and the tests run fine without the warnings!

starbelly · May 31, 2022, 1:04am

Great!

One correction, ?CT_PEER() will take a list, but it puts it in args in a map. Seems like your example above should work, but haven’t looked deeper than that. So quite interested in your finding. Also, there are examples of using CT_PEER in the docs here.

garazdawi · May 31, 2022, 7:26am

If you want to specify the node names you need to use the map syntax for ?CT_PEER and not the list. For example:

{ok, _Peer, Node} = ?CT_PEER(#{name => NodeName, host => Host, args => ["-pa"|code:get_path()]}),

The different options you can pass are exactly the same as to peer:start_link/1.

However, I would let peer just take care of the naming and call ?CT_PEER like this:

{ok, _Peer, Node} = ?CT_PEER(["-pa"|code:get_path()]),

Please do not use the test_server API as it is an internal undocumented API. I was surprised to not find any documentation for ?CT_PEER, do you know if there is any @max-au ?

fkrause98 · May 31, 2022, 2:19pm

Ok, I’m trying to use ?CT_PEER like in your example:
{ok, _Peer, Node} = ?CT_PEER(["-pa"|code:get_path()])

But when the tests actually run, for example, like in this snippet:
{pong, _Partition1} = rc_command(Node1, ping)

where rc_command is defined as:

rc_command(Node, Command) ->
  rc_command(Node, Command, []).
rc_command(Node, Command, Arguments) ->
  rpc:call(Node, rc_example, Command, Arguments).

I get this error:
{badrpc,nodedown}

garazdawi · May 31, 2022, 4:48pm

The peer node is linked to the process that started it, so when the init_per_suite process terminates, so will the node. For the node to not terminate you can either:

unlink(Peer)
Start the node in the test case (or init_per_testcase as that is also guaranteed to run in the same process).
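The second option can be sketched as below. This is only a minimal illustration, not the original suite: the module name peer_per_testcase_SUITE and the ping test case are made up, and it assumes the test runner node is itself distributed (for example, rebar3 ct run with --sname or --name).

```erlang
-module(peer_per_testcase_SUITE).
-include_lib("common_test/include/ct.hrl").
-export([all/0, init_per_testcase/2, end_per_testcase/2, ping/1]).

all() -> [ping].

%% Runs in the same process as the test case, so the linked peer
%% stays up for the duration of the test case.
init_per_testcase(_TestCase, Config) ->
    {ok, Peer, Node} = ?CT_PEER(["-pa" | code:get_path()]),
    [{peer, Peer}, {node, Node} | Config].

%% Stop the peer explicitly. The catch covers the case where the
%% test case process crashed and took the linked peer down already.
end_per_testcase(_TestCase, Config) ->
    catch peer:stop(?config(peer, Config)),
    ok.

%% Hypothetical test case: only checks that the peer is reachable.
ping(Config) ->
    Node = ?config(node, Config),
    pong = net_adm:ping(Node).
```

Starting one peer per test case costs a node boot each time, but it keeps test cases isolated from each other.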

max-au · May 31, 2022, 5:03pm

There is some amount of Erlang/OTP documentation here: Erlang -- peer

It lists all variants:

?CT_PEER - to start a basic peer
?CT_PEER(["-emu_flavor", "smp"]) to start a peer with some command line arguments
?CT_PEER(#{name => ?CT_PEER_NAME(ActualTestCase)}), to start a peer with specific node name

It also has the example that is requested in the starting message, that is, how to start several nodes:

 The next example demonstrates how to start multiple nodes concurrently:

      multi_node(Config) when is_list(Config) ->
          Peers = [?CT_PEER(#{wait_boot => {self(), tag}})
              || _ <- lists:seq(1, 4)],
          %% wait for all nodes to complete boot process, get their names:
          _Nodes = [receive {tag, {started, Node, Peer}} -> Node end
              || {ok, Peer} <- Peers],
          [peer:stop(Peer) || {ok, Peer} <- Peers].

This way nodes start concurrently, reducing test time.

The very same documentation contains examples of starting nodes inside docker containers. Hope that covers most use-cases.
@garazdawi is there a way to document a macro using OTP docs? The only way I found was through the examples (that are embedded into the peer documentation).

If you’re starting your nodes in init_per_suite, it runs in a separate disposable process, and therefore your node shuts down before the test case starts. I have a few words about it there in “Stopping extra nodes” section.

If that is not the case, then it might be a problem with OTP 25 and the implications of the new incompatible global behaviour, see here: Otp 25.0 - Erlang/OTP for OTP-17911:

As of OTP 25, global will by default prevent overlapping partitions due to network issues by actively disconnecting from nodes that reports that they have lost connections to other nodes. This will cause fully connected partitions to form instead of leaving the network in a state with overlapping partitions.

Essentially if you have 4 nodes (test runner and 3 extra), and then one node goes down, all other nodes decide to disconnect. If you aren’t using global in your tests, you might want to use ?CT_PEER(["-connect_all", "false"]) that won’t start global. See the extended discussion here: Prevent global inconsistency by preventing overlapping partitions by rickard-green · Pull Request #5611 · erlang/otp · GitHub

You may also want to take a look at the blogpost I put up for peer. It discusses a few more features that might be used to debug. For example, using standard_io alternative connection to debug the peer node startup (this way crash dump message is going to be printed to the origin node console).

garazdawi · May 31, 2022, 5:12pm

Since the macro is defined in common_test, I was assuming the docs were also in CT. The new docsearch found the right place though, so I should have used that.

I think that the way that assert.hrl is documented is probably the best way.

fkrause98 · May 31, 2022, 5:28pm

Thanks, the first option worked!!
I’ve got it working like this:

{ok, Peer, Node} = ?CT_PEER(["-pa"|code:get_path()]),
unlink(Peer),

I have also tried the second option, but I’m faced with the same error. Weird, maybe I’m doing something wrong. But never mind, and again, thanks!

fkrause98 · May 31, 2022, 5:29pm

Thanks for the extra info and links, I’ll check them out, especially the
peer: distributed application testing one.

max-au · May 31, 2022, 9:18pm

Nice, I didn’t know of that option. That’s definitely PR-worthy, ct.hrl has macros that can benefit from being properly documented.

This error means that the node shut down before you made the RPC call to it.

One important thing: if you unlink the controlling process, your extra peer node keeps running (forever) and may introduce unexpected effects for the subsequent tests (hence the default behaviour is to shut it down when the linked process shuts down).
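One way to keep unlinked peers from leaking into later tests is to remember the controlling pids and stop them in end_per_suite. The following is a rough sketch under that assumption; the module name unlinked_peers_SUITE, the helper start_unlinked_peer/0 and the test case are hypothetical, and the runner node is assumed to be distributed.

```erlang
-module(unlinked_peers_SUITE).
-include_lib("common_test/include/ct.hrl").
-export([all/0, init_per_suite/1, end_per_suite/1, all_reachable/1]).

all() -> [all_reachable].

%% init_per_suite runs in a short-lived process, so each peer is
%% unlinked to survive it. The pids are kept in Config for cleanup.
init_per_suite(Config) ->
    Peers = [start_unlinked_peer() || _ <- lists:seq(1, 3)],
    [{peers, Peers} | Config].

%% Stop every peer explicitly, since nothing else will do it now.
end_per_suite(Config) ->
    [catch peer:stop(Peer) || {Peer, _Node} <- ?config(peers, Config)],
    ok.

%% Hypothetical helper: start one peer and detach it from the caller.
start_unlinked_peer() ->
    {ok, Peer, Node} = ?CT_PEER(["-pa" | code:get_path()]),
    unlink(Peer),
    {Peer, Node}.

%% Hypothetical test case: every peer should answer a ping.
all_reachable(Config) ->
    [pong = net_adm:ping(Node) || {_Peer, Node} <- ?config(peers, Config)],
    ok.
```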

gonzalobf · June 1, 2022, 11:12am

I’m not sure if it is something related, but I found that rebar3 ct only works when #{connection => standard_io} is given to ?CT_PEER.

In a new rebar3 project, the following test fails for me:

-module(my_SUITE).
-behaviour(ct_suite).
-include_lib("common_test/include/ct.hrl").
-export([all/0]).
-export([basic/1]).


all() ->
    [basic].

basic(Config) when is_list(Config) ->
    {ok, Peer, _Node} = ?CT_PEER(),
    peer:stop(Peer).

$ rebar3 ct                                                                         
===> Verifying dependencies...
===> Analyzing applications...
===> Compiling erlangct
===> Running Common Test suites...
%%% my_SUITE:
%%% my_SUITE ==> basic: FAILED
%%% my_SUITE ==> {not_alive,[{peer,verify_args,1,[{file,"peer.erl"},{line,529}]},
            {peer,start_it,2,[{file,"peer.erl"},{line,582}]},
            {my_SUITE,basic,1,
                      [{file,"/tmp/example/apps/example/test/my_SUITE.erl"},
                       {line,12}]},
            {test_server,ts_tc,3,[{file,"test_server.erl"},{line,1782}]},
            {test_server,run_test_case_eval1,6,
                         [{file,"test_server.erl"},{line,1291}]},
            {test_server,run_test_case_eval,9,
                         [{file,"test_server.erl"},{line,1223}]}]}

EXPERIMENTAL: Writing retry specification at /tmp/example/_build/test/logs/retry.spec
              call rebar3 ct with '--retry' to re-run failing cases.
Failed 1 tests. Passed 0 tests.
Results written to "/tmp/example/_build/test/logs/index.html".
===> Failures occurred running tests: 1

But it works when running with ct_run

$ ct_run -suite my_SUITE.erl                                     
Erlang/OTP 25 [erts-13.0] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit:ns]

Common Test v1.23 starting (cwd is /tmp/example/apps/example/test)

Eshell V13.0  (abort with ^G)
(ct@localhost)1>
Common Test: Running make in test directories...
Recompile: my_SUITE

CWD set to: "/tmp/example/apps/example/test/ct_run.ct@localhost.2022-06-01_12.09.04"

TEST INFO: 1 test(s), 1 case(s) in 1 suite(s)

Testing apps.erlangct.my_SUITE: Starting test, 1 test cases
Testing apps.erlangct.my_SUITE: TEST COMPLETE, 1 ok, 0 failed of 1 test cases

garazdawi · June 1, 2022, 11:19am

The rebar3 test node does not seem to be alive. According to rebar3 docs you need to pass --sname nodename or --name nodename@host to rebar3 ct to make it alive.
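To make this failure mode easier to recognise, a suite could check up front whether the runner is distributed and skip with a readable reason otherwise. This is just a sketch of that idea; the module name alive_guard_SUITE and the skip message are invented for illustration.

```erlang
-module(alive_guard_SUITE).
-include_lib("common_test/include/ct.hrl").
-export([all/0, init_per_suite/1, end_per_suite/1, basic/1]).

all() -> [basic].

%% ?CT_PEER() with the default connection needs a distributed origin
%% node, so skip the whole suite with a clear reason if it is not.
init_per_suite(Config) ->
    case erlang:is_alive() of
        true ->
            Config;
        false ->
            {skip, "test node is not distributed; "
                   "run rebar3 ct with --sname or --name"}
    end.

end_per_suite(_Config) ->
    ok.

basic(Config) when is_list(Config) ->
    {ok, Peer, _Node} = ?CT_PEER(),
    peer:stop(Peer).
```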

gonzalobf · June 1, 2022, 11:34am

Thank you, I got the same behaviour when I tried to run ct:run_test([{suite, my_SUITE}]).. Adding --sname to rebar3 ct fixed the problem

max-au · June 1, 2022, 2:57pm

To explain this behaviour: by default, ?CT_PEER starts a node that is connected via Erlang Distribution. Which is only possible when the origin is alive, that is, distributed.

Using standard_io (or tcp) does not require the peer or the origin to be distributed. This allows testing the distribution itself, or running tests that simulate net splits.
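As a small illustration of the alternative connection, the test case below talks to the peer with peer:call/4 instead of rpc, so it does not rely on an Erlang Distribution link between origin and peer. The module name stdio_peer_SUITE and the choice of erlang:system_info/1 as the remote call are arbitrary; treat it as a sketch rather than a recommended pattern.

```erlang
-module(stdio_peer_SUITE).
-include_lib("common_test/include/ct.hrl").
-export([all/0, basic/1]).

all() -> [basic].

basic(Config) when is_list(Config) ->
    %% Control the peer over its standard I/O, not over distribution.
    {ok, Peer, _Node} = ?CT_PEER(#{connection => standard_io}),
    %% peer:call/4 goes through the controlling process, so it works
    %% even though the origin may not be connected to the peer.
    Release = peer:call(Peer, erlang, system_info, [otp_release]),
    true = is_list(Release),
    peer:stop(Peer).
```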

mmin · August 28, 2022, 7:55pm

Could this be added to the documentation, perhaps in the Examples section? IMHO it is not that trivial to deduce what not_alive means. When I got that error, my first thought was that there was some error with EPMD. This could make the transition from slave to peer cleaner.

garazdawi · August 29, 2022, 5:51am

A PR with any improvement would be most welcome! The peer module is documented here.