Cowboy 2.13.0 performance bottleneck at 8.7K RPS on Erlang 28.0.2

Hi Folks,

I’m seeing unexpectedly poor performance with Cowboy 2.13.0 on Erlang 28.0.2.

Load testing shows only 8.7K RPS with 1-second average response times for simple GET requests serving static content (a small 33-byte file). This seems far below what Erlang should handle.

Setup

  • Erlang: 28.0.2
  • Cowboy: 2.13.0
  • Platform: macOS M1

Cowboy Configuration

active_n: 10
request_timeout: 5000
idle_timeout: 5000
inactivity_timeout: 10000
max_keepalive: 100
http10_keepalive: true
dynamic_buffer: true
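
For context, roughly how options like these end up as Cowboy protocol options when starting the listener (the listener name and the single catch-all route below are placeholders, not my exact setup):

Dispatch = cowboy_router:compile([{'_', [{'_', cowboy_static,
    {file, "/tmp/33b.txt", [{mimetypes, {<<"text">>, <<"html">>, []}}]}}]}]),
{ok, _} = cowboy:start_clear(bench_listener,
    [{port, 8080}],
    #{env => #{dispatch => Dispatch},
      active_n => 10,
      request_timeout => 5000,
      idle_timeout => 5000,
      inactivity_timeout => 10000,
      max_keepalive => 100,
      http10_keepalive => true
      %% dynamic_buffer left at its (enabled) default here
     }).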

Load Test Command (hey)

$ ulimit -n 65536
$ sudo sysctl net.inet.ip.portrange.first=1024
$ sudo sysctl net.inet.ip.portrange.last=65535

$ brew install hey
$ hey -n 10000 -c 10000 -z 10s http://localhost:8080/
Summary:
  Total:	10.8376 secs
  Slowest:	2.9020 secs
  Fastest:	0.0497 secs
  Average:	1.1103 secs
  Requests/sec:	8710.7259

  Total data:	3115299 bytes
  Size/request:	33 bytes

Response time histogram:
  0.050 [1]	|
  0.335 [1018]	|■
  0.620 [1787]	|■
  0.905 [3316]	|■■
  1.191 [76507]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  1.476 [5195]	|■■■
  1.761 [1803]	|■
  2.046 [598]	|
  2.332 [1133]	|■
  2.617 [1465]	|■
  2.902 [1580]	|■

Latency distribution:
  10% in 0.9844 secs
  25% in 1.0072 secs
  50% in 1.0420 secs
  75% in 1.0818 secs
  90% in 1.3080 secs
  95% in 1.8275 secs
  99% in 2.8010 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0162 secs, 0.0497 secs, 2.9020 secs
  DNS-lookup:	0.0021 secs, 0.0000 secs, 0.0607 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0254 secs
  resp wait:	1.0227 secs, 0.0288 secs, 1.4419 secs
  resp read:	0.0002 secs, 0.0000 secs, 0.1046 secs

Status code distribution:
  [200]	94403 responses

Question

Is 8.7K RPS with 1-second latencies normal for Cowboy serving static content, or should I expect much higher throughput? What’s typically the bottleneck at these low numbers?

Note: I’m specifically interested in raw Cowboy/Erlang performance. I want to understand the baseline performance characteristics (not putting Cowboy behind a proxy like Nginx or other workarounds). HTTP/2 testing showed no performance improvement over HTTP/1.1, so I’m focusing on HTTP/1.1 for this issue.

Link: Survey of Cowboy Webserver Performance

Thanks

1 Like

Hey! That level of performance definitely isn’t the ceiling for Cowboy. On a simple static response you should normally see much higher throughput with latencies in the millisecond range. The numbers you’re seeing look more like the result of the way the test is being run rather than a limitation of Cowboy or Erlang itself.

The main clue is that almost all the time is showing up in “response wait,” which means the requests are sitting in a queue before being handled. Running both the load generator and Cowboy on the same Mac, and especially pushing 10,000 concurrent connections with hey, tends to saturate the client and the kernel networking stack long before Cowboy itself is under stress. If you try a more moderate concurrency, switch to wrk, or run the client from a separate machine, you’ll see the latency drop and the RPS go up significantly. On a tuned Linux setup it’s normal to get tens or even hundreds of thousands of requests per second from Cowboy with tiny static responses.

1 Like

I think you need to post the source code of the benchmark (unless you did and I missed it). Without that it is impossible to tell what exactly the problem could be.

1 Like

Hi @vkatsuba,

Thank you for the guidance! I was able to identify the bottleneck.

The issue was static file serving through cowboy_static. When I switched to serving a simple 100-byte string from memory (also lowering the number of connections), performance increased dramatically:

$ hey -c 1000 -n 100000 -z 10s "http://localhost:8080/100b"

Summary:
  Total:	10.0403 secs
  Slowest:	1.0799 secs
  Fastest:	0.0001 secs
  Average:	0.0165 secs
  Requests/sec:	60336.9905

  Total data:	60580400 bytes
  Size/request:	100 bytes

Response time histogram:
  0.000 [1]	|
  0.108 [599691]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.216 [1160]	|
  0.324 [1192]	|
  0.432 [1625]	|
  0.540 [1462]	|
  0.648 [234]	|
  0.756 [134]	|
  0.864 [255]	|
  0.972 [15]	|
  1.080 [35]	|


Latency distribution:
  10% in 0.0039 secs
  25% in 0.0066 secs
  50% in 0.0096 secs
  75% in 0.0141 secs
  90% in 0.0278 secs
  95% in 0.0386 secs
  99% in 0.1100 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0000 secs, 0.0001 secs, 1.0799 secs
  DNS-lookup:	0.0002 secs, 0.0000 secs, 0.0713 secs
  req write:	0.0000 secs, 0.0000 secs, 0.0591 secs
  resp wait:	0.0159 secs, 0.0000 secs, 1.0798 secs
  resp read:	0.0002 secs, 0.0000 secs, 0.1143 secs

Status code distribution:
  [200]	605804 responses

Result: 60K RPS with 16ms (0.0165 secs) average latency, compared to the previous 8.7K RPS with 1+ second latencies.

The bottleneck was file I/O operations going through Erlang’s file_server_2 process, which was serializing all file system access.

> recon:proc_count(message_queue_len, 10).
...
{<0.53.0>,98,
 [file_server_2,
  {current_function,{prim_file,get_cwd_nif,0}},
  {initial_call,{proc_lib,init_p,5}}]}
...
> recon:proc_count(message_queue_len, 10).
[{<0.53.0>,196,
  [file_server_2,
   {current_function,{prim_file,read_info_nif,2}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.42323.0>,1,
  [{current_function,{prim_file,open_nif,2}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.42309.0>,1,
  [{current_function,{erts_internal,dirty_nif_finalizer,1}},
   {initial_call,{proc_lib,init_p,5}}]},

Cowboy itself performs excellently when not constrained by disk I/O.

One follow-up question: What’s the best approach to profile file_server_2 and identify which specific file system operations or syscalls are causing the serialization bottleneck?

Could it be this prim_file:get_cwd_nif call?

Thanks again for pointing me in the right direction!

1 Like

Hi @eproxus,

Nothing special - just basic Cowboy setup with the configuration from my original post. Here’s the minimal test case:

Cowboy routes:

Dispatch = [{'_', [
    {"/100b", test_100b_handler, []},
    {'_', cowboy_static, {file, "/tmp/33b.txt", [{mimetypes, {<<"text">>, <<"html">>, []}}]}}
]}].

Test handler:

-module(test_100b_handler).
-export([init/2]).

init(Req, State) ->
    Req2 = cowboy_req:reply(200, #{
        <<"content-type">> => <<"text/plain">>
    }, <<"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx">>, Req),
    {ok, Req2, State}.

The 60K RPS result was from hitting /100b (in-memory handler), while the 8.7K RPS bottleneck occurred when requests went through the cowboy_static catch-all route, which caused file_server_2 message queue buildup.

Context

I am building a minimalistic CDN-like system as part of a much bigger Erlang service. I have a large number of small files ranging from a few bytes to 100KB maximum that rarely change but cannot all fit in RAM. These files contain sensitive information and can’t be placed on an external CDN or shared storage.

Goal: How can I serve these files with Cowboy (HTTP GET) as fast as possible?

Given the findings about the file_server_2 bottleneck, I’m looking for proven approaches to serve large numbers of small files without hitting this serialization bottleneck.

Has anyone solved similar high-throughput file serving challenges in Erlang? I’m particularly interested in:

  • Hybrid simple caching strategies (hot files in memory, cold files on disk). Caching is hard!!!
  • Ways to bypass file_server_2 for file operations
  • Minimizing filesystem metadata calls like the get_cwd_nif, read_info_nif, etc. I observed

Any battle-tested patterns would be greatly appreciated!

1 Like

Hey @zabrane, I know you want to do this with Erlang/Cowboy, and I don’t want to be “that guy” :grimacing: but if it’s hitting hard limits on static files that definitely aren’t down to OS tuning (right?), and the reason is that you want to dynamically select the appropriate file and then spit it out, you could maybe think about putting it behind nginx: let the proxied Cowboy deal with any and all auth/app stuff and just use X-Accel-Redirect to tell nginx which of the static files to serve.

It’s a fairly simple and very solid pattern and could take out a whole bunch of tuning pain given nginx’s nuts static file performance especially with sendfile on. You could also use OpenResty to write more complex routing/proxying/balancing stuff in Lua, in-process to nginx, if you need even tighter integration with what your cowboy app is doing.
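
Roughly what I mean, as a sketch (the handler name and internal nginx location are made up): Cowboy does the auth/app work and then names the file for nginx to serve in the X-Accel-Redirect header, so nginx handles the actual transfer.

-module(accel_redirect_example).
-export([init/2]).

%% nginx needs a matching 'internal' location that maps /protected/ to the
%% directory holding the static files; it intercepts this response header
%% and serves the named file itself.
init(Req, State) ->
    %% ... auth / file-selection logic would go here ...
    Req2 = cowboy_req:reply(200, #{
        <<"x-accel-redirect">> => <<"/protected/33b.txt">>
    }, <<>>, Req),
    {ok, Req2, State}.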

Obviously if you have other reasons not to want to do this (including “not wanting to” :smiling_face:) then cool - just might help if the static file issue proves insurmountable or just to need too much effort.

1 Like

@igorclark Thank you for the nginx suggestion. However, this CDN component must integrate tightly with our larger Erlang service’s security model and authentication pipeline, making a pure Erlang solution necessary.

I will explore approaches within the Erlang ecosystem first, though I will certainly keep your nginx recommendation as a viable fallback option should performance requirements prove unattainable through pure Erlang solutions :+1:

@igorclark the OpenResty resource you shared is excellent for performance analysis: Pinpointing the hottest Erlang code paths with high CPU usage (using OpenResty XRay).

1 Like

It seems Cowboy does not pass the raw option when calling file:read_file_info/2, a call it makes to determine the size of the file.

The absence of this option is causing all such requests to go through the file server process, resulting in the bottleneck.

I suppose you could just copy the code in cowboy_static.erl, pass the option and compare the performance?
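
Something along these lines, as a quick sketch (module name made up), just to see whether the raw stat alone removes the queueing:

-module(raw_stat_example).
-export([file_size/1]).

-include_lib("kernel/include/file.hrl").

%% With the 'raw' option the stat is done by the calling process via the
%% prim_file NIFs rather than being funnelled through file_server_2.
file_size(Path) ->
    {ok, #file_info{size = Size}} = file:read_file_info(Path, [raw]),
    Size.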

4 Likes

This large number of small files, what are they like? How well do they compress? Could application-specific compression (like XMill did for XML, or like Juice did for Oberon) compress even better? Are there patterns across files that a smart compressor could exploit?
As an example of what I’m talking about, there’s another programming language I’m interested in where I wondered if it would be practical to include all the library files in the interpreter executable and decompress them if and when needed. I wrote a very simple token-by-token compressor.
2598324 characters of source code token-by-token compressed to
503808 characters (19.4%) which gzip -9 then reduced to
230655 characters (8.9%).
I keep meaning to try the idea on Erlang. The point is that application-specific compression can do very well (better than gzip for this corpus) and that general-purpose compression can take it further still.

Another suggestion is “what if you put the files, as binaries, in one or more DETS tables”?
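
As a rough sketch of the DETS idea (table and file names made up): store each small file as a binary keyed by its path, then serve lookups from the table rather than touching the filesystem per request.

{ok, files} = dets:open_file(files, [{file, "files.dets"}]),
ok = dets:insert(files, {<<"/33b.txt">>, <<"contents go here">>}),
[{<<"/33b.txt">>, Bin}] = dets:lookup(files, <<"/33b.txt">>).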

1 Like

Nice, glad you tracked it down! What you’re seeing with file_server_2 is normal: all file operations that go through the regular file API get funneled through that single process, so under load it becomes the bottleneck.

If you want to dig deeper, you can trace file_server_2 with recon_trace:calls/2 or erlang:trace/3 to see exactly which functions are hit, and at the OS level tools like strace, dtrace, or perf will show you the syscalls. The prim_file:get_cwd_nif you saw isn’t really the culprit; the heavy hitters are usually prim_file:open_nif/2 or the file:read_file_info calls that cowboy_static makes. In practice, the way around it is to use sendfile for static files or put a proxy/CDN in front, but for raw profiling recon plus OS tracing will give you the clear picture.
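
For example, something like this (match specs are just illustrative, and it assumes recon is available on the node):

%% Trace the next 50 calls into the prim_file NIF wrappers and into the
%% file server callback module, printing who calls what.
recon_trace:calls([{prim_file, '_', '_'}, {file_server, '_', '_'}], 50).

%% Quick check of how backed up the file server currently is:
erlang:process_info(whereis(file_server_2), message_queue_len).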

2 Likes

@ausimian Spot on! Thanks for the clever insight about the file server bottleneck - adding raw was exactly what was needed. Now hitting 13,500+ req/sec. Great catch on Cowboy’s missed optimization opportunity there.

4 Likes

I always find benchmarks reading the exact same file thousands of times per second a bit silly. Just cache the whole thing in memory and serve it from there. Same for the file info.

In the end you are just benchmarking the file I/O system, which is not very interesting.

As an example, in Zotonic we have a cache with the most essential file information (like modification time), and use that cached information instead of checking the file system on every call. We also cache CSS/JavaScript in a process, together with their gzip variants (compressing data costs quite a bit of CPU time…).
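
To make that pattern concrete, a rough sketch of that kind of cache (made-up module name, not the actual Zotonic code):

-module(file_cache_sketch).
-export([init/0, fetch/1]).

-include_lib("kernel/include/file.hrl").

%% Public ETS table keyed by path, storing {mtime, body}; 'raw' file calls
%% on a miss so nothing is funnelled through file_server_2. Assumes small,
%% non-empty files.
init() ->
    ets:new(?MODULE, [named_table, public, set, {read_concurrency, true}]),
    ok.

fetch(Path) ->
    {ok, #file_info{size = Size, mtime = MTime}} =
        file:read_file_info(Path, [raw]),
    case ets:lookup(?MODULE, Path) of
        [{_, MTime, Body}] ->
            {ok, Body};                         % hit and still fresh
        _ ->
            {ok, Fd} = file:open(Path, [read, raw, binary]),
            {ok, Body} = file:read(Fd, Size),   % read the whole file
            ok = file:close(Fd),
            ets:insert(?MODULE, {Path, MTime, Body}),
            {ok, Body}
    end.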

3 Likes

You should open an issue or submit a PR at the Cowboy repository on GitHub (ninenines/cowboy).

2 Likes

@Maria-12648430 thanks. I will.

1 Like

@Maria-12648430 PR submitted.
Thanks all for your precious help.

1 Like