We use a round-robin(ish) load balancing strategy at work, and I work on a service that makes HTTP requests to other load-balanced services. The company’s internal DNS server returns a set of several IPs for a given host lookup, and it shuffles those IPs so the list is in a random order. The expectation is that clients look up a given host and pull an IP off the top (for example), which should theoretically distribute load across the instances because the IPs come back from the DNS query in random order every time.
Recently it was discovered that although a DNS lookup of, say, example.internal-service.net returns a set of several randomized IP addresses, our Elixir service was consistently choosing only one of the IPs. We generate a lot of traffic, so the imbalance was noticed. (Although we’re using an Elixir service, I think this is really an Erlang question, so that’s why I’m posting here :slight_smile: I hope that’s ok!)
Our service does use Erlang’s DNS caching. I noticed that if caching is on, the IP addresses returned by inet:getaddrs('example.internal-service.net', inet) are always sorted from lowest to highest by octet.
If caching is off, the function returns a randomized list of IPs as we had initially assumed it would.
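To make "sorted from lowest to highest by octet" concrete, here is a minimal sketch of the check I have in mind. The module name AddrOrder is hypothetical (it is not part of Erlang/OTP or our service), and the two lists are copied from the Docker example further down. It relies on the fact that same-size tuples compare element by element, so sorting address tuples orders them octet by octet.

```elixir
defmodule AddrOrder do
  # Hypothetical helper, not part of Erlang/OTP.
  # Returns true when the address tuples are already in ascending order.
  # Same-size tuples compare element by element, i.e. octet by octet.
  def sorted?(addrs), do: addrs == Enum.sort(addrs)
end

# One result from the uncached run and one from the cached run below
uncached = [{172, 18, 0, 4}, {172, 18, 0, 3}, {172, 18, 0, 5}, {172, 18, 0, 2}, {172, 18, 0, 6}]
cached = [{172, 18, 0, 2}, {172, 18, 0, 3}, {172, 18, 0, 4}, {172, 18, 0, 5}, {172, 18, 0, 6}]

IO.inspect(AddrOrder.sorted?(uncached)) # prints false
IO.inspect(AddrOrder.sorted?(cached))   # prints true
```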
So I find myself with a few questions:
- Does anyone know why Erlang is sorting the list when we have caching enabled?
- Is there a way to turn off the sorting behavior? Reviewing the docs I don’t see a way (which didn’t surprise me).
  - I also suspect the sorting is occurring for a very good functional reason so this isn’t something you just turn off. I always assume that Erlang is smarter than me.
- Does this seem like ideal behavior?
  - My understanding is that this kind of round-robin DNS strategy is somewhat common, even if it’s not without flaws, but this sorting of IPs breaks it. If the IPs were cached in the order they were received, we might still benefit from a low-TTL cache while also (mostly) respecting the IP randomization provided by the DNS server. It would at least allow us to explore whether there’s a valid balance to be struck between the two.
  - Note: I know there are well-reasoned arguments against round-robin DNS as a load balancing strategy! Caching, in fact, is one of them. But this is not a choice I personally have control over in my specific circumstance, and I’m hoping we can set that debate to the side since I’m mostly looking to satisfy my curiosity.
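For context, the obvious client-side workaround would be to stop taking the head of the list and pick an address at random instead. A rough sketch, assuming the caller controls which address is used (PickAddr is a hypothetical module name, not something in OTP or in our service):

```elixir
defmodule PickAddr do
  # Hypothetical helper: choose one address at random
  # instead of always taking the first element of the list.
  def pick([_ | _] = addrs), do: Enum.random(addrs)

  # Resolve a host (given as a charlist) and pick one of its IPv4 addresses.
  # Any {:error, reason} from :inet.getaddrs/2 is passed through unchanged.
  def resolve(host) do
    case :inet.getaddrs(host, :inet) do
      {:ok, addrs} -> {:ok, pick(addrs)}
      {:error, _reason} = error -> error
    end
  end
end
```

This only sidesteps the sorting. It spreads requests across the cached addresses, but it does not preserve whatever ordering the DNS server intended, and it doesn't answer why the cache sorts in the first place.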
If anyone wants an easy example, Docker actually has a built-in round-robin DNS server, which makes it easy to demonstrate locally without a lot of setup. This example uses the Elixir shell, but I could do the same thing with an Erlang container and erl if that’s preferred; I’m just less fluent with the syntax.
Example
# Create a docker network named 'frontend'
docker network create frontend
# Start several aliased containers that will be part of the round-robin
docker run -d --net frontend --net-alias example.test nginx:alpine
docker run -d --net frontend --net-alias example.test nginx:alpine
docker run -d --net frontend --net-alias example.test nginx:alpine
docker run -d --net frontend --net-alias example.test nginx:alpine
docker run -d --net frontend --net-alias example.test nginx:alpine
# Run an Elixir container on the network without caching enabled (inetrc is mounted but not loaded)
docker run --rm -it -v $(pwd)/inetrc:/inetrc --net frontend elixir iex
iex(1)> Enum.map(0..4, fn _ -> :inet.getaddrs('example.test', :inet) end)
[
  ok: [
    {172, 18, 0, 4},
    {172, 18, 0, 3},
    {172, 18, 0, 5},
    {172, 18, 0, 2},
    {172, 18, 0, 6}
  ],
  ok: [
    {172, 18, 0, 4},
    {172, 18, 0, 6},
    {172, 18, 0, 3},
    {172, 18, 0, 5},
    {172, 18, 0, 2}
  ],
  ok: [
    {172, 18, 0, 2},
    {172, 18, 0, 3},
    {172, 18, 0, 6},
    {172, 18, 0, 4},
    {172, 18, 0, 5}
  ],
  ok: [
    {172, 18, 0, 4},
    {172, 18, 0, 6},
    {172, 18, 0, 3},
    {172, 18, 0, 5},
    {172, 18, 0, 2}
  ],
  ok: [
    {172, 18, 0, 4},
    {172, 18, 0, 3},
    {172, 18, 0, 6},
    {172, 18, 0, 2},
    {172, 18, 0, 5}
  ]
]
# Run an Elixir container on the network with caching enabled
# After the initial lookup, subsequent lookups return the list sorted
docker run --rm -it -v $(pwd)/inetrc:/inetrc --net frontend elixir iex --erl "-kernel inetrc '/inetrc'"
iex(1)> Enum.map(0..4, fn _ -> :inet.getaddrs('example.test', :inet) end)
[
  ok: [
    {172, 18, 0, 2},
    {172, 18, 0, 3},
    {172, 18, 0, 4},
    {172, 18, 0, 6},
    {172, 18, 0, 5}
  ],
  ok: [
    {172, 18, 0, 2},
    {172, 18, 0, 3},
    {172, 18, 0, 4},
    {172, 18, 0, 5},
    {172, 18, 0, 6}
  ],
  ok: [
    {172, 18, 0, 2},
    {172, 18, 0, 3},
    {172, 18, 0, 4},
    {172, 18, 0, 5},
    {172, 18, 0, 6}
  ],
  ok: [
    {172, 18, 0, 2},
    {172, 18, 0, 3},
    {172, 18, 0, 4},
    {172, 18, 0, 5},
    {172, 18, 0, 6}
  ],
  ok: [
    {172, 18, 0, 2},
    {172, 18, 0, 3},
    {172, 18, 0, 4},
    {172, 18, 0, 5},
    {172, 18, 0, 6}
  ]
]
Where inetrc is:
{lookup, [dns, native]}.
{cache_size, 100}.
{cache_refresh, 60}.
Phew! I know that was a long one, but I’m really curious to learn more if anyone has thoughts!