Handling High Parallel Requests and Large Bodies in Cowboy HTTP Server

onex-rahul · September 3, 2024, 7:25am

Hello,

I’m currently working with the Cowboy HTTP server in an Erlang application, and I’m encountering a performance issue. My setup involves handling approximately 2,000 parallel requests, each with a body size of 0.2 MB. I have observed that cowboy_req:read_body/1 is taking more than 15 seconds to process each request.

Given the high volume of concurrent requests and the sizable body data, I’m seeking advice on optimizing the performance of my server. Specifically:

Are there best practices or configuration options in Cowboy for efficiently handling a large number of parallel requests with sizable body content?
Is there a recommended approach for improving the throughput and reducing the latency of body reading in such scenarios?
Are there any Erlang or Cowboy-specific techniques for better managing and scaling resource usage under these conditions?

Any insights or suggestions would be greatly appreciated.

Thank you!

eproxus · September 5, 2024, 10:36am

It would be interesting to know where the bottleneck is in this case. That cowboy_req:read_body/1 takes 15 seconds sounds really out of the ordinary (especially for 200 KiB). Can you determine where the delay is coming from (i.e. the network or the code)?

Handling e.g. 2000 requests/s shouldn’t be an issue on modern hardware, and reading 200 KiB completely into memory before doing something with it would only consume ~400 MB of RAM across 2,000 concurrent requests (2,000 × 200 KiB ≈ 390 MiB, plus some overhead).

That being said, you probably want to look into streaming the body into whatever processing you do so that you don’t have to keep the whole body in memory all the time if possible.
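A minimal sketch of what streaming the body could look like, assuming the processing can be expressed as a fold over chunks. The module name `body_stream`, the function names, and the `length`/`period` values are illustrative assumptions; `cowboy_req:read_body/2` is the Cowboy call that accepts those options. The read function is passed in as an argument so the loop can be exercised without a running Cowboy listener.

```erlang
-module(body_stream).
-export([fold_body/4, byte_count/1]).

%% Fold Fun over the request body chunk by chunk instead of
%% concatenating everything into one binary first.
%% Read is expected to behave like cowboy_req:read_body/2: it returns
%% {ok, Data, Req} for the final chunk and {more, Data, Req} otherwise.
fold_body(Read, Req0, Fun, Acc0) ->
    %% The option values are assumptions for illustration, not tuned numbers.
    case Read(Req0, #{length => 65536, period => 5000}) of
        {ok, Data, Req} ->
            {Fun(Data, Acc0), Req};
        {more, Data, Req} ->
            fold_body(Read, Req, Fun, Fun(Data, Acc0))
    end.

%% Example consumer: count the body bytes without keeping them around.
byte_count(Req0) ->
    fold_body(fun cowboy_req:read_body/2, Req0,
              fun(Data, N) -> N + byte_size(Data) end, 0).
```

In a handler, `byte_count(Req0)` would return `{Size, Req}`; a real consumer would replace the counting fun with whatever incremental processing the application does (hashing, parsing, writing to disk).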

The other thing you can do if the processing involves some kind of bottleneck (e.g. a singleton process somewhere or some other shared resource) is to cap the number of requests that can be processed in parallel.
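One way to cap the number of requests processed in parallel is a small counting limiter built on the OTP `atomics` module (OTP 21.2 or later). This is only a sketch under that assumption; the module name `req_limiter` and its functions are hypothetical and not part of Cowboy.

```erlang
-module(req_limiter).
-export([new/1, with_slot/2]).

%% Create a limiter that allows at most Max concurrent executions.
new(Max) when is_integer(Max), Max > 0 ->
    {atomics:new(1, [{signed, true}]), Max}.

%% Run Fun if a slot is free, otherwise return {error, busy}.
%% The slot is always released, even if Fun raises.
with_slot({Ref, Max}, Fun) ->
    case atomics:add_get(Ref, 1, 1) of
        N when N =< Max ->
            try
                {ok, Fun()}
            after
                atomics:sub(Ref, 1, 1)
            end;
        _TooMany ->
            %% Undo our increment; we did not get a slot.
            atomics:sub(Ref, 1, 1),
            {error, busy}
    end.
```

A handler could wrap its expensive part in `with_slot/2` and answer with something like `cowboy_req:reply(503, Req)` when it gets `{error, busy}`, so that excess load is rejected quickly instead of queueing behind the shared resource.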

But, it’s hard to give more specific tips without seeing the code.

onex-rahul · September 6, 2024, 4:23am

Hi,
Thank you for the insightful feedback. I’ve gathered some additional information and insights based on your suggestions:

I’ve reviewed the code responsible for reading the body. Here’s the relevant snippet:

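%% Accumulate the whole request body into one binary.
read_body(Req0, Acc) ->
    case cowboy_req:read_body(Req0) of
        {ok, Data, Req} ->
            %% Last chunk: return the complete body.
            {<<Acc/binary, Data/binary>>, Req};
        {more, Data, Req} ->
            %% More data pending: keep reading.
            read_body(Req, <<Acc/binary, Data/binary>>)
    end.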

System Information

CPU

lscpu
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      48 bits physical, 48 bits virtual
CPU(s):                             32

Memory

free -h
              total        used        free      shared  buff/cache   available
Mem:           62Gi        14Gi       623Mi        16Gi        47Gi        32Gi
Swap:         4.0Gi       610Mi       3.4Gi

dch · September 6, 2024, 11:29am

Normally with Cowboy, you’d increase the number of acceptors, which IIRC defaults to something like 10. Can you try that and have a look?

https://ninenines.eu/docs/en/ranch/2.1/guide/listeners/

Also use netstat -ALan or whatever is appropriate for your OS, to see if it’s the kernel holding off on handing connections to Erlang, or if it’s inside Erlang (i.e. Ranch / Cowboy) taking the time.
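For reference, raising the acceptor count when starting a Cowboy listener might look roughly like the sketch below. It assumes Cowboy 2.x with map-style Ranch transport options; the listener name, port, route, `upload_handler` module, and the value 100 are all placeholders for illustration.

```erlang
-module(listener_sketch).
-export([start/0]).

start() ->
    %% upload_handler is a hypothetical handler module.
    Dispatch = cowboy_router:compile([
        {'_', [{"/upload", upload_handler, []}]}
    ]),
    TransportOpts = #{
        socket_opts => [{port, 8080}],
        %% Number of processes accepting connections on the listen socket.
        num_acceptors => 100
    },
    ProtocolOpts = #{env => #{dispatch => Dispatch}},
    cowboy:start_clear(http_listener, TransportOpts, ProtocolOpts).
```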

juhlig · September 10, 2024, 11:10am

@onex-rahul are you sure that it is the cowboy_req:read_body/1 call that is taking the 15s to complete? Given that you had to dig into the code to find the part where it is called, I somehow doubt it. Also, the snippet you posted runs a loop over the call, so is every call to cowboy_req:read_body/1 in that loop taking 15s, or is it the entire loop consisting of multiple such calls taking 15s?
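One way to tell the two cases apart is to time every individual read as well as the whole loop. The sketch below is an assumption about how that could be done with `timer:tc/1`; the module name `body_timing` is hypothetical, and the read function is passed in (in a handler it would be `fun cowboy_req:read_body/1`) so the loop can be tried without Cowboy.

```erlang
-module(body_timing).
-export([read_body_timed/2]).

%% Read the whole body and return {Body, Req, CallTimes}, where CallTimes
%% lists the duration of each individual read call in microseconds.
%% Read is expected to behave like cowboy_req:read_body/1.
read_body_timed(Read, Req0) ->
    loop(Read, Req0, [], []).

loop(Read, Req0, Chunks, Times) ->
    {Micros, Result} = timer:tc(fun() -> Read(Req0) end),
    case Result of
        {ok, Data, Req} ->
            Body = iolist_to_binary(lists:reverse([Data | Chunks])),
            {Body, Req, lists:reverse([Micros | Times])};
        {more, Data, Req} ->
            loop(Read, Req, [Data | Chunks], [Micros | Times])
    end.
```

If the sum of the per-call times is far below 15 s, the delay sits outside the reads; if a single call accounts for most of it, the process is most likely waiting for data to arrive on the socket.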

If indeed the read_body call is slow, then increasing the number of acceptors won’t help much. In fact, 10 should be plenty.
All that a ranch acceptor does is just accept connections from a listen socket, tell (one of the) connection supervisors to start a connection handling process (a ranch_protocol implementation), and hand control over to that process. Then it will loop around to accepting the next connection.
Interacting with that connection (like, reading from it) is no concern of the acceptors but of the connection handling process.
The only scenario where the number of acceptors could be relevant to performance is when starting a connection handling process is slow, but then again, this is not the fault of Cowboy or the underlying Ranch.
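The division of labour described above can be illustrated with plain `gen_tcp`. This is a deliberately simplified sketch, not Ranch's actual implementation: there are no supervisors, the handler is a trivial echo loop, and the module name `acceptor_sketch` and the pool size of 10 are assumptions.

```erlang
-module(acceptor_sketch).
-export([start/1, acceptor_loop/1]).

%% Open a listen socket and start a small pool of acceptor processes.
start(Port) ->
    {ok, LSock} = gen_tcp:listen(Port, [binary, {active, false},
                                        {reuseaddr, true}]),
    [spawn(?MODULE, acceptor_loop, [LSock]) || _ <- lists:seq(1, 10)],
    {ok, LSock}.

%% An acceptor only accepts, hands the socket to a fresh connection
%% process, and goes straight back to accepting.
acceptor_loop(LSock) ->
    case gen_tcp:accept(LSock) of
        {ok, Sock} ->
            Pid = spawn(fun() -> receive go -> handle(Sock) end end),
            ok = gen_tcp:controlling_process(Sock, Pid),
            Pid ! go,
            acceptor_loop(LSock);
        {error, _Reason} ->
            %% Listen socket closed or failed: stop this acceptor.
            ok
    end.

%% All reading happens in the connection process, never in the acceptor.
handle(Sock) ->
    case gen_tcp:recv(Sock, 0) of
        {ok, Data} ->
            _ = gen_tcp:send(Sock, Data),
            handle(Sock);
        {error, _Reason} ->
            gen_tcp:close(Sock)
    end.
```

Because the acceptor never touches the connection after the handover, a slow read in `handle/1` delays only that one connection, which is why adding acceptors does not speed up body reads.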

So the question remains @onex-rahul: What are you really measuring?