Why does Erlang's DNS caching sort host IPs?

We use a round-robin(ish) load-balancing strategy at work, and I work on a service that makes HTTP requests to other load-balanced services. The company’s internal DNS server returns a set of several IPs for a given host lookup and shuffles them so the list is in a random order. The expectation is that clients look up a given host and pull an IP off the top (for example), which should theoretically distribute load across each instance, because the IPs come back in a random order on every query.

Recently it was discovered that although a DNS lookup of, say, example.internal-service.net returns a set of several randomized IP addresses, our Elixir service was consistently choosing only one of them. We generate a lot of traffic, so the imbalance was noticed. (Although we’re using an Elixir service, I think this is really an Erlang question, so that’s why I’m posting here :slight_smile:, I hope that’s ok!)

Our service does use Erlang’s DNS caching. I noticed that if caching is on, the IP addresses returned by inet:getaddrs('example.internal-service.net', inet) are always sorted from lowest to highest by octet.

If caching is off, the function returns a randomized list of IPs as we had initially assumed it would.

So I find myself with a few questions:

  • Does anyone know why Erlang is sorting the list when we have caching enabled?
  • Is there a way to turn off the sorting behavior? Reviewing the docs I don’t see a way (which didn’t surprise me).
    • I also suspect the sorting is occurring for a very good functional reason so this isn’t something you just turn off. I always assume that Erlang is smarter than me.
  • Does this seem like ideal behavior?
    • My understanding is that this kind of round-robin DNS strategy is somewhat common, even if it’s not without flaws, but this sorting of IPs breaks it. If the IPs were cached in the order they were received we might still benefit from a low-TTL cache while also (mostly) respecting the IP randomization being provided by the DNS server. It would at least allow us to explore if there’s a valid balance to be struck between the two.
    • Note: I know there are well-reasoned arguments against round-robin DNS as a load balancing strategy! Caching, in fact, is one of them. But this is not a choice I personally have control over in my specific circumstance and I’m hoping we can set that debate to the side since I’m mostly looking to satisfy my curiosity?

If anyone wants an easy example, Docker actually has a built-in round-robin DNS server, which makes it easy to demonstrate locally without a lot of setup. This example uses the Elixir shell, but I could do the same thing with an Erlang container and erl if that’s preferred, I’m just less fluent with the syntax :slight_smile:

Example
# Create a docker network named 'frontend'
docker network create frontend

# Start several aliased containers that will be part of the round-robin
docker run -d --net frontend --net-alias example.test nginx:alpine
docker run -d --net frontend --net-alias example.test nginx:alpine
docker run -d --net frontend --net-alias example.test nginx:alpine
docker run -d --net frontend --net-alias example.test nginx:alpine
docker run -d --net frontend --net-alias example.test nginx:alpine

# Run an elixir container on the network without caching enabled
# (the inetrc file is mounted but not loaded here)
docker run --rm -it -v $(pwd)/inetrc:/inetrc --net frontend elixir iex
iex(1)> Enum.map(0..4, fn _ -> :inet.getaddrs('example.test', :inet) end) 
[
  ok: [
    {172, 18, 0, 4},
    {172, 18, 0, 3},
    {172, 18, 0, 5},
    {172, 18, 0, 2},
    {172, 18, 0, 6}
  ],
  ok: [
    {172, 18, 0, 4},
    {172, 18, 0, 6},
    {172, 18, 0, 3},
    {172, 18, 0, 5},
    {172, 18, 0, 2}
  ],
  ok: [
    {172, 18, 0, 2},
    {172, 18, 0, 3},
    {172, 18, 0, 6},
    {172, 18, 0, 4},
    {172, 18, 0, 5}
  ],
  ok: [
    {172, 18, 0, 4},
    {172, 18, 0, 6},
    {172, 18, 0, 3},
    {172, 18, 0, 5},
    {172, 18, 0, 2}
  ],
  ok: [
    {172, 18, 0, 4},
    {172, 18, 0, 3},
    {172, 18, 0, 6},
    {172, 18, 0, 2},
    {172, 18, 0, 5}
  ]
]

# Run an elixir container on the network with caching enabled
# After the initial lookup subsequent lookups return the list sorted
docker run --rm -it -v $(pwd)/inetrc:/inetrc --net frontend elixir iex --erl "-kernel inetrc '/inetrc'" 
iex(1)> Enum.map(0..4, fn _ -> :inet.getaddrs('example.test', :inet) end) 
[
  ok: [
    {172, 18, 0, 2},
    {172, 18, 0, 3},
    {172, 18, 0, 4},
    {172, 18, 0, 6},
    {172, 18, 0, 5}
  ],
  ok: [
    {172, 18, 0, 2},
    {172, 18, 0, 3},
    {172, 18, 0, 4},
    {172, 18, 0, 5},
    {172, 18, 0, 6}
  ],
  ok: [
    {172, 18, 0, 2},
    {172, 18, 0, 3},
    {172, 18, 0, 4},
    {172, 18, 0, 5},
    {172, 18, 0, 6}
  ],
  ok: [
    {172, 18, 0, 2},
    {172, 18, 0, 3},
    {172, 18, 0, 4},
    {172, 18, 0, 5},
    {172, 18, 0, 6}
  ],
  ok: [
    {172, 18, 0, 2},
    {172, 18, 0, 3},
    {172, 18, 0, 4},
    {172, 18, 0, 5},
    {172, 18, 0, 6}
  ]
]

Where inetrc is:

{lookup, [dns, native]}.
{cache_size, 100}.
{cache_refresh, 60}.

Phew! I know that was a long one, but I’m really curious to learn more if anyone has thoughts!

2 Likes

Firstly, thanks for providing a simple way to replicate the issue.

Now on to the issue. This looks like an artifact of how the cache is managed by creation time and access time. I looked at some code to get a basic idea of what’s going on in inet_db (basic is the keyword here).

Next, I took your setup instructions and did the following:

iex(1)> :inet_res.gethostbyname('example.test')
{:ok,
 {:hostent, ~c"example.test", [], :inet, 4,
  [
    {172, 18, 0, 6},
    {172, 18, 0, 3},
    {172, 18, 0, 4},
    {172, 18, 0, 5},
    {172, 18, 0, 2}
  ]}}
iex(2)> :ets.tab2list(:inet_cache)
[
  {:dns_rr, ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 6},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 3},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 4},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 5},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 2},
   -576460721, ~c"example.test", false}
]

:point_up: Good so far. Note the 4th element (counting from zero), the record’s access time.

:point_down: Second lookup, and when things go “bad”:

iex(3)> :inet_res.gethostbyname('example.test')
{:ok,
 {:hostent, ~c"example.test", [], :inet, 4,
  [
    {172, 18, 0, 2},
    {172, 18, 0, 3},
    {172, 18, 0, 4},
    {172, 18, 0, 5},
    {172, 18, 0, 6}
  ]}}
iex(4)> :ets.tab2list(:inet_cache)
[
  {:dns_rr, ~c"example.test", :a, :in, -576460704, 600, {172, 18, 0, 2},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460704, 600, {172, 18, 0, 3},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460704, 600, {172, 18, 0, 4},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460704, 600, {172, 18, 0, 5},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460704, 600, {172, 18, 0, 6},
   -576460721, ~c"example.test", false}
]

Now we’re where you don’t want to be :slight_smile:

I haven’t dug any further to pinpoint exactly where the access time is getting updated (it could be a few places based on a glance), but I feel like this is going to be a solvable problem.

Hope this helps.

Cheers.

Edit:

FWIW it’s worth looking closely here

Edit ( :see_no_evil: ) :

I linked you to the wrong place, go here instead

Also FWIW, if you wait for the cache time to expire and then lookup again, you’ll see your randomized records again :wink:

1 Like

I walked away and said “Ahhh”. The problem here is the use of maps to accumulate for re-insertion, etc.

A list may work better here, but I’d have to understand why a map was chosen. We could still stick with a map for “small cases” (i.e., a host that doesn’t have more than 32 IP addresses). This could be done by inserting a counter into the tuple used as a key here. The original insertion order would then be preserved in the map and in the call to maps:values/1.

A list is probably the better way to go here though, just in case :wink:. I’d be happy to do a PR, but I’d like to understand why maps were chosen.

To illustrate what I mean, I’ll do it in Elixir since I had that open already:

iex> recs = [
  {:dns_rr, ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 6},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 3},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 4},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 5},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 2},
   -576460721, ~c"example.test", false}
]

iex> Enum.reduce(recs, %{}, fn({_, domain, class, type, _, _, data, _, _, _} = rr, acc) -> Map.put(acc, {domain, class, type, data}, rr) end)
%{
  {~c"example.test", :a, :in, {172, 18, 0, 2}} => {:dns_rr, ~c"example.test",
   :a, :in, -576460721, 600, {172, 18, 0, 2}, -576460721, ~c"example.test",
   false},
  {~c"example.test", :a, :in, {172, 18, 0, 3}} => {:dns_rr, ~c"example.test",
   :a, :in, -576460721, 600, {172, 18, 0, 3}, -576460721, ~c"example.test",
   false},
  {~c"example.test", :a, :in, {172, 18, 0, 4}} => {:dns_rr, ~c"example.test",
   :a, :in, -576460721, 600, {172, 18, 0, 4}, -576460721, ~c"example.test",
   false},
  {~c"example.test", :a, :in, {172, 18, 0, 5}} => {:dns_rr, ~c"example.test",
   :a, :in, -576460721, 600, {172, 18, 0, 5}, -576460721, ~c"example.test",
   false},
  {~c"example.test", :a, :in, {172, 18, 0, 6}} => {:dns_rr, ~c"example.test",
   :a, :in, -576460721, 600, {172, 18, 0, 6}, -576460721, ~c"example.test",
   false}
}

:point_up: we lost our original order :frowning:

iex> {recs1, _} = Enum.reduce(recs, {%{}, 0}, fn({_, domain, class, type, _, _, data, _, _, _} = rr, {map_acc, n}) -> {Map.put(map_acc, {n, domain, class, type, data}, rr), n + 1} end)
{%{
   {0, ~c"example.test", :a, :in, {172, 18, 0, 6}} => {:dns_rr,
    ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 6}, -576460721,
    ~c"example.test", false},
   {1, ~c"example.test", :a, :in, {172, 18, 0, 3}} => {:dns_rr,
    ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 3}, -576460721,
    ~c"example.test", false},
   {2, ~c"example.test", :a, :in, {172, 18, 0, 4}} => {:dns_rr,
    ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 4}, -576460721,
    ~c"example.test", false},
   {3, ~c"example.test", :a, :in, {172, 18, 0, 5}} => {:dns_rr,
    ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 5}, -576460721,
    ~c"example.test", false},
   {4, ~c"example.test", :a, :in, {172, 18, 0, 2}} => {:dns_rr,
    ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 2}, -576460721,
    ~c"example.test", false}
 }, 5}

iex> Map.values(recs1)
[
  {:dns_rr, ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 6},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 3},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 4},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 5},
   -576460721, ~c"example.test", false},
  {:dns_rr, ~c"example.test", :a, :in, -576460721, 600, {172, 18, 0, 2},
   -576460721, ~c"example.test", false}
]

:point_up: order preserved :smile:

I guess we could still use maps and just do a sort at the end to ensure order is preserved (requires the ephemeral counter still).

Edit (again):

Right, maps were used to de-duplicate; using the ephemeral counter here would break that. We could de-duplicate towards the end though.
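To sketch that “de-duplicate towards the end” idea (illustrative Elixir only, not the actual inet_db code): keep the records in a list so arrival order is free, then drop duplicates on the RR identity as a final pass.

```elixir
# Illustrative only: Enum.uniq_by/2 keeps the first occurrence and
# preserves list order, so it de-duplicates on the same
# {domain, type, class, data} identity the map keys were providing,
# without losing the order the records arrived in.
dedup_ordered = fn rrs ->
  Enum.uniq_by(rrs, fn {:dns_rr, domain, type, class, _, _, data, _, _, _} ->
    {domain, type, class, data}
  end)
end

rrs = [
  {:dns_rr, ~c"example.test", :a, :in, 0, 600, {172, 18, 0, 6}, 0, [], false},
  {:dns_rr, ~c"example.test", :a, :in, 0, 600, {172, 18, 0, 3}, 0, [], false},
  {:dns_rr, ~c"example.test", :a, :in, 0, 600, {172, 18, 0, 6}, 0, [], false}
]

dedup_ordered.(rrs)
# keeps {172,18,0,6} then {172,18,0,3}; the duplicate is dropped
```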

2 Likes

Wow! Ok, I just learned so much. First off, thank you for this analysis and explanation. I spent a lot of time going through inet_db on my own to see if I could find the source of the issue, but I couldn’t for the life of me figure out how things ended up sorted.

So am I right that what you’re alluding to is this aspect of maps in Erlang? I think what I’m understanding from that post and your reply is: the fact that these values end up sorted was likely unintentional, and it is just a side effect of using maps, which officially have no order but internally in some cases (<= size 32?) will actually be ordered by key, hence the surprise.
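A quick iex check of that small-map behaviour (this leans on unspecified implementation details, so it’s an observation rather than a guarantee):

```elixir
# Observation, not a documented guarantee: "small" maps (the flatmap
# representation, currently up to 32 keys) store their keys in term
# order, so the officially-unordered Map.keys/1 comes back sorted
# no matter what order we inserted in.
m = 1..10 |> Enum.shuffle() |> Map.new(fn k -> {k, k * 100} end)

Map.keys(m) == Enum.sort(Map.keys(m))
# true, despite the shuffled insertion order, but nothing
# in the docs promises this
```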

Do you think this qualifies as a bug? I was thinking I could report it with the example plus what I learned from you; together that amounts to a pretty solid bug report.

I also considered trying to fix the issue, but to be honest it would be my first time writing more than a line or two of Erlang and I’m not sure how it would go :joy:.

1 Like

It’s always the subtle problems that are the trickiest :smile: and you’re welcome! And thank you for raising this, otherwise it might have sat latent so to speak for a while :heart:

That is correct, and good call-out on the unofficial part :slight_smile:. The docs, IIRC, make no guarantee about ordering in maps.

I definitely think this qualifies as a bug; it was introduced via Rewrite to use bag table · erlang/otp@b2e6a28 · GitHub, and I think a bug report would be very much appreciated.

I think if you write Elixir you already know Erlang :slight_smile:, but the OTP team will probably know precisely how to solve the issue while keeping the performance gains (the reason the change was introduced in the first place, and for great good!), and in an expedited fashion. I can see a few different ways this could be solved (different structures, post-processing, etc.), and I’ll probably come back and look again tomorrow just for fun :slight_smile: Happy to update this post with my findings (if I do).

Cheers.

1 Like

Awesome I will file the report! What a lovely first forum experience!

Happy to update this post with my findings here

Yes if you explore more please do! I’m enjoying learning.

Also in that PR I saw @sneako talking about overriding the inet_tcp module with a shim that randomized the results of getaddrs…

We use a custom :inet_tcp module by passing the module name to gen_tcp:connect which randomizes the order of the list of IP addresses returned by inet_tcp:getaddrs/1,2 so that we use them all in our HTTP connection pools.

This may be a great way for us to achieve the behavior we desire right away without having to turn off caching. Worth a little science experiment if nothing else :slight_smile:
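For the experiment I’m picturing something like this (very much a sketch: ShuffledTcp is a made-up name, and a real tcp_module replacement would need to forward every callback gen_tcp uses, not just the ones shown):

```elixir
defmodule ShuffledTcp do
  # Hypothetical shim: behave like :inet_tcp, but shuffle the address
  # list that getaddrs returns, so cached (sorted) entries still get
  # spread across new connections.
  def getaddrs(address), do: shuffle(:inet_tcp.getaddrs(address))
  def getaddrs(address, timer), do: shuffle(:inet_tcp.getaddrs(address, timer))

  defp shuffle({:ok, addrs}), do: {:ok, Enum.shuffle(addrs)}
  defp shuffle(other), do: other

  # Everything else passes straight through to :inet_tcp...
  defdelegate family(), to: :inet_tcp
  defdelegate getserv(port), to: :inet_tcp
  defdelegate connect(address, port, opts, timeout), to: :inet_tcp
  # ...plus the remaining :inet_tcp callbacks (listen, accept, etc.).
end
```

If it pans out, the module would get passed down via the {:tcp_module, ShuffledTcp} inet option (assuming Finch/Mint let the option through), though I still need to verify the complete callback list.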

Thanks again!

Edit — issue logged here: DNS caching causes host records to be returned in sorted order · Issue #7577 · erlang/otp · GitHub

1 Like

I think that workaround will of course work, assuming you’re in control of the code that is affected by the unexpected order of entries, or can pass options down such that it uses the provided tcp_module. If it were me, and it’s not more work than what you suggested, I’d do it as close as possible to the code that needs this guarantee (i.e., a shuffle on the return of the call vs. a shuffle in the call).

The interesting thing about this issue is that for your case a shuffle at some point in your code is sufficient, assuming the shuffle that Docker does is not based on load, etc. However, there are services out there (Route 53, for example) that are more intelligent about what they return (i.e., the entries returned are based on custom-defined policies, load, etc.), so the problem should still be fixed upstream :slight_smile:
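To make “shuffle on the return of the call” concrete (hypothetical helper; your lookup calls stay exactly as they are):

```elixir
# Shuffle at the call site that needs the guarantee, rather than
# inside a swapped-in tcp module. Whatever order the resolver or
# cache hands back, each caller picks an address uniformly.
shuffle_pick = fn
  {:ok, [_ | _] = addrs} -> {:ok, addrs |> Enum.shuffle() |> hd()}
  other -> other
end

shuffle_pick.({:ok, [{172, 18, 0, 2}, {172, 18, 0, 3}, {172, 18, 0, 4}]})
# returns {:ok, ip} with ip drawn at random from the list
```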

1 Like

For HTTP requests we’re using Finch (provides pooling) which uses Mint (HTTP client) which then finally uses gen_tcp. I did some tracing of option arguments last night and it does look like I can get the option all the way down the layers of abstraction, though I won’t know for sure until I try.

I agree about shuffling on the return of the call, I’d definitely prefer a shuffle in our own code instead of trying to supply this sort of “override” of behavior inside the tcp module. But, alas, that call to getaddrs/2 is abstracted away from us so overriding the tcp module looks like our best shot. On the plus side, it would let us avoid running a fork of Mint or something, so I’m hoping it works out.

However, there are services out there (route53 as an example) that are more intelligent in what they return (i.e., the entries returned are based on custom defined policies, load, etc), thus the problem should be fixed

I did not know this! That’s very interesting, definitely another point in favor of addressing the issue, I’m glad you brought that up. I guess I’m lucky our system is doing something as simple as just randomizing the return values.

1 Like

Based on what you said, using Finch and Mint, I agree that passing down the option for an alternative tcp module implementation is the way to go :+1:

1 Like

A fix that retains the order of the cached RR records has been merged in PR-7578 to be released in OTP-26.1.

5 Likes