Performance problems with CRL verification

rlipscombe · June 12, 2024, 11:06am

We recently ran into a problem with the CPU and memory use of one of our services. This service makes HTTPS calls (using hackney, but I don’t think that’s relevant) to a small number of backend services hosted in AWS and with Amazon-issued TLS certificates.

After some preliminary investigation, we discovered that it was spending all of its time doing CRL checks (enabled as per the recommendations in the EEF Security WG guide “Erlang standard library: ssl”), so we turned those off temporarily, which greatly improved performance.

Further investigation revealed two things:

The relevant CRLs ballooned in size recently. The “Amazon RSA 2048 M02” and “Amazon RSA 2048 M03” CRLs each had approximately 100K revoked serial numbers added at the start of June 2024, growing from about 11K certs to about 111K certs each, and the files grew to about 4MiB each.
The ssl_crl_cache doesn’t, well, cache lookups. This results in a fresh download (4MiB!) of the CRL every time a TLS connection is established. See lib/ssl/src/ssl_crl_cache.erl in erlang/otp at tag OTP-25.3.2.7, where the insert is commented out. This is unchanged as of OTP-27.0.
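
For context, a sketch of the kind of client options that enable this behaviour. This is not the poster's actual configuration: the 1000 ms HTTP timeout and the use of the OS trust store are assumptions for illustration.

```erlang
%% Sketch only: TLS client options that turn on CRL checking with the
%% default cache module. Values marked "assumed" are illustrative.
[
    {verify, verify_peer},
    {cacerts, public_key:cacerts_get()},   % assumed: OS trust store (OTP 25+)
    {crl_check, true},
    %% {http, Timeout} allows ssl_crl_cache to fetch a CRL over HTTP from the
    %% certificate's distribution point when it is not found in the cache.
    {crl_cache, {ssl_crl_cache, {internal, [{http, 1000}]}}}   % assumed timeout
]
```

Because the downloaded CRL is never inserted into the cache, every handshake with these options takes the HTTP path again.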

Questions:

Are there any plans to resolve the lack of caching in ssl_crl_cache?
Are there any third-party libraries that address this?
- On this note, the ssl_crl_cache_api lacks documentation and seems to be extremely specific to ssl_crl_cache and ssl_crl_hash_dir. What’s the difference between lookup/3 and select/2, for example? When will fresh_crl/2 actually be called, and what should it do?
Are there any other recommendations to mitigate the lack of caching?
Has anyone looked at the performance of public_key:pkix_crls_validate/3 when given a CRL this large?

afa · June 12, 2024, 11:19am

Just seen the same today, also on AWS, for an older VerneMQ release. (I think for OTP 24, and we’re also using Hackney)

github.com/vernemq/vernemq

[Bug]: Downloading too many/big CRLs leads to broker crash

opened 08:15AM - 12 Jun 24 UTC

pellepelster

bug

### Environment

- VerneMQ Version: 1.13.0
- OS: official VerneMQ docker image
- Cluster size/standalone: 2

### Current Behavior

We are using the broker with multiple webhooks for authorization purposes:

```
- name: DOCKER_VERNEMQ_vmq_webhooks.webhook1.endpoint
  value: "https://<some_aws_lb>/auth-on-register"
- name: DOCKER_VERNEMQ_vmq_webhooks.webhook1.hook
  value: auth_on_register
- name: DOCKER_VERNEMQ_vmq_webhooks.webhook2.endpoint
  value: "https://<some_aws_lb>/auth-on-publish"
- name: DOCKER_VERNEMQ_vmq_webhooks.webhook2.hook
  value: auth_on_publish
- name: DOCKER_VERNEMQ_vmq_webhooks.webhook3.endpoint
  value: "https://<some_aws_lb>/auth-on-subscribe"
- name: DOCKER_VERNEMQ_vmq_webhooks.webhook3.hook
```

A few days back we noticed that after restarts due to re-deployments, freshly started broker nodes would randomly die after a few seconds. Cutting off external traffic allowed the broker to start normally, but it would immediately die if traffic was ramped up again. A tcpdump revealed that the broker seems to download, in parallel, the CRLs for the certificates terminating the TLS connections for the webhooks, where each CRL is roughly 4 MB.

![image](https://github.com/vernemq/vernemq/assets/624069/cf97101c-d461-48e8-ab31-00effe1b990c)

We traced ~70 MQTT connects that caused ~200 MB of downloaded CRLs before the broker died with

```
[os_mon] memory supervisor port (memsup): Erlang has closed
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
```

Setting `vmq_webhooks.use_crls` to `off` fixed the issue, but we feel that this should be handled more gracefully.

### Expected behaviour

A graceful warning if the CRL is too big and/or resource-friendly handling of many CRL downloads.

### Configuration, logs, error output, etc.

```markdown
-
```

### Code of Conduct

- [X] I agree to follow the VerneMQ's Code of Conduct

I guess compiling with newer OTP releases does not help then.

kuba · June 13, 2024, 12:10pm

I don’t think this is planned. ssl_crl_cache is documented as “simple default implementation of a CRL cache”. Maybe @ingela can provide more detailed information.

Have you checked the ssl_crl_cache_api documentation (ssl v11.2)?
I guess your questions are at least partly addressed there … what is missing?

rlipscombe · June 13, 2024, 12:34pm

Many things. For example: what should the various functions return? Why is Issuer passed as an argument to lookup/3? What should I do with it? What, exactly, does public_key:pkix_crls_validate/3 do with the update_crl option? What should that return? “unless the value is a list of so called general names” – huh? And so on.

I’ve spent a day reading the source code to ssl_handshake and both ssl_crl_cache and ssl_crl_hash_dir – and various related RFCs – and it’s no clearer to me how one would write a conforming implementation of ssl_crl_cache_api.

ingela · June 13, 2024, 2:44pm

I guess that we just forgot about improving the default cache, and so far nobody complained, maybe because they provide their own or do not care about CRLs!? We can look into enhancing the documentation. The PKIX standard is not a trivial thing. We can also make a ticket for fixing the default cache, but I do not know when it might be prioritized, so if you’re in a hurry a PR is welcome.

lukebakken · June 13, 2024, 10:17pm

I looked into CRL caching for a RabbitMQ user quite a while ago, that turned into this PR because (if I remember correctly) the fetched CRLs weren’t actually being cached:

Anyway, maybe that code could be revisited.

rlipscombe · June 14, 2024, 7:27am

It also occurs to me that the cache lookup API could be improved. Currently, it returns a list of CRLs (each of which is a list of serial numbers) for a given distribution point and issuer. What I think ssl_handshake actually needs is a simple yes/no: given this {Issuer, SerialNumber} pair, has it been revoked?

That could lead to better performance for large (cached) CRLs, because implementations could replace the linear scan through the lists of serial numbers with a set lookup (or even a bloom filter, if we wanted to get exotic).
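
A minimal sketch of that idea, assuming a hypothetical `revocation_index` module that is not part of OTP. It only illustrates the constant-time membership test keyed on `{Issuer, SerialNumber}`; it does none of the signature, validity-period or scope checks that public_key:pkix_crls_validate/3 performs.

```erlang
-module(revocation_index).
-export([new/0, add_crl/3, is_revoked/3]).

%% Hypothetical helper, not an OTP API. The index is a map whose keys are
%% {Issuer, Serial} pairs; the values are unused.
new() ->
    #{}.

%% Issuer: any term identifying the CRL issuer.
%% Serials: the revoked serial numbers taken from one CRL.
%% Intended to run once per CRL refresh, not once per handshake.
add_crl(Issuer, Serials, Index) when is_list(Serials), is_map(Index) ->
    lists:foldl(fun(Serial, Acc) -> Acc#{{Issuer, Serial} => true} end,
                Index, Serials).

%% Hashed key lookup instead of a linear scan over the serial list.
is_revoked(Issuer, Serial, Index) ->
    is_map_key({Issuer, Serial}, Index).
```

With roughly 111K entries per CRL, the cost moves from scanning the list on every handshake to building the map once per refresh.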

ingela · June 14, 2024, 3:17pm

What version of the documentation were you reading? It occurred to me that it might not have been the OTP-27 one, so if you have opinions on things that need enhancing, please base your comments on the latest version.

public_key:pkix_crls_validate/3 is the function giving you the yes/no answer (well, in practice it gives you a yes/no/undetermined answer), all according to RFC 5280. The CRL mechanism is known to have problems with growing into very large data. There is something called delta CRLs to mitigate the problem; I do not know how commonly they are used.

Also, you do not have to create a callback unless you want to use your own storage; you may also manage the cache yourself by calling the functions insert and delete in ssl_crl_cache (documented API). I have a vague memory that the line was commented out because we did not want to automatically allocate potentially lots of memory, and it was postponed and forgotten.
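
A minimal sketch of managing the cache by hand, assuming a hypothetical wrapper module `crl_cache_admin` and a CRL that has already been saved locally as a PEM file. The URI and file path are supplied by the caller; eviction via ssl_crl_cache:delete/1 is not shown.

```erlang
-module(crl_cache_admin).
-export([load/2]).

%% Hypothetical wrapper around the documented ssl_crl_cache:insert/2.
%% DistPointUri: the HTTP URI of the CRL distribution point (a string),
%%               so that the entry can be found for that distribution point.
%% PemFile:      local PEM file holding one or more CRLs.
%% The ssl application must be running for the insert itself to succeed.
%% Returns ok, or {error, Reason} (for example when the file is missing).
load(DistPointUri, PemFile) when is_list(DistPointUri) ->
    ssl_crl_cache:insert(DistPointUri, {file, PemFile}).
```

Anything inserted this way stays in memory until it is removed, so bounding the cache size and replacing stale CRLs remain the caller's job.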

An alternative to CRLs is OCSP; we are currently adding support for the flavor called OCSP stapling on the client side of things.

ingela · June 14, 2024, 3:29pm

I guess what is needed to be able to uncomment the insert is some kind of option to set a max size for the cache; if the cache is managed by the user, this will be their responsibility. What is reasonable is, I think, very application specific. I do not think your PR addressed that.

rlipscombe · June 14, 2024, 8:09pm

The latest, at ssl_crl_cache_api — ssl v11.2; I still find it unclear in places.

Yeah; I see that. My concern is that – even if allowing for delta CRLs (which is a network bandwidth thing…?) – it’s a linear search through the list. That’s obviously a compatibility concern, since there’s a baked-in assumption that the cache callback will return that list, which makes it hard to replace with a set.

Sure, assuming I know what CRLs need adding to the cache. I’m assuming that needs doing a priori, at which point I might as well include the hashed CRLs in my release and use the ssl_crl_hash_dir variant – which is something we’re considering, since in our use case, we do know which servers we’re going to be talking to.
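
A sketch of what the ssl_crl_hash_dir variant looks like in the TLS options. The directory is a placeholder, and the files in it are assumed to be named by issuer hash (for example, as prepared by OpenSSL's rehash tooling).

```erlang
%% Sketch only: check revocation against CRLs shipped on disk instead of
%% downloading them per connection. "/path/to/crls" is a placeholder.
[
    {verify, verify_peer},
    {crl_check, true},
    {crl_cache, {ssl_crl_hash_dir, {internal, [{dir, "/path/to/crls"}]}}}
]
```

The trade-off is freshness: the CRLs are only as current as the files in the release, so the directory needs its own update process.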

I guess, depending on whether hackney exposes it, I could insert arbitrary CRLs after the connection is established. That leads straight back to managing the cache size/expiry, though, and that’s probably harder to achieve from the outside…

ingela · June 18, 2024, 2:00pm

Well, it really depends on how you build that. You can have some service that updates the cache at an interval that you feel is acceptable but that does not perform a download for each TLS connection. You might even collect information from the TLS handshakes to help know which CRLs you are interested in.
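
One way such a service might look, as a rough sketch. The module name `crl_refresher` is hypothetical, the distribution point URIs are supplied by the caller, and it assumes that the inets and ssl applications are already started and that the CRLs are served DER-encoded. It does not bound memory use or remove superseded CRLs.

```erlang
-module(crl_refresher).
-export([start/2, refresh/1]).

%% Uris:       CRL distribution point URIs (strings) to keep warm.
%% IntervalMs: refresh period in milliseconds.
start(Uris, IntervalMs) when is_list(Uris), is_integer(IntervalMs) ->
    spawn(fun() -> loop(Uris, IntervalMs) end).

%% Refresh every URI, sleep, repeat. Failures are returned by refresh/1
%% and ignored here; a real service would log or retry them.
loop(Uris, IntervalMs) ->
    lists:foreach(fun refresh/1, Uris),
    timer:sleep(IntervalMs),
    loop(Uris, IntervalMs).

%% Download one CRL and store it under its distribution point URI using
%% the documented ssl_crl_cache:insert/2.
refresh(Uri) ->
    case httpc:request(get, {Uri, []}, [{timeout, 10000}],
                       [{body_format, binary}]) of
        {ok, {{_Version, 200, _Phrase}, _Headers, Der}} ->
            ssl_crl_cache:insert(Uri, {der, [Der]});
        Other ->
            {error, {crl_download_failed, Uri, Other}}
    end.
```

Started once at boot, for example with `crl_refresher:start(Uris, timer:hours(1))`, this replaces one 4MiB download per connection with one per CRL per interval.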

I am not saying we should not solve this problem for the internal cache, just that it was not prioritized. And if you have a good suggestion, PRs are always welcome.

rlipscombe · June 18, 2024, 2:44pm

But it’s scary in there…

ingela · June 19, 2024, 6:19am

I put it on our to-do list; hopefully it can happen during the upcoming release timespan.