Performance problems with CRL verification

We recently ran into a problem with the CPU and memory use of one of our services. This service makes HTTPS calls (using hackney, but I don’t think that’s relevant) to a small number of backend services hosted in AWS and with Amazon-issued TLS certificates.

After some preliminary investigation, we discovered that it was spending all of its time doing CRL checks (enabled as per the recommendations at Erlang standard library: ssl | EEF Security WG), so we turned those off temporarily, which greatly improved performance.

Further investigation revealed two things:

  1. The relevant CRLs ballooned in size recently. The “Amazon RSA 2048 M02” and “Amazon RSA 2048 M03” CRLs each had approximately 100K revoked serial numbers added at the start of June 2024, growing from about 11K entries to about 111K entries each, and the file sizes grew to about 4 MiB each.
  2. The ssl_crl_cache doesn’t, well, cache lookups. This results in a fresh download (4MiB!) of the CRL every time a TLS connection is established. See otp/lib/ssl/src/ssl_crl_cache.erl at OTP-25.3.2.7 · erlang/otp · GitHub where the insert is commented out. This is unchanged as of OTP-27.0.
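
For context, this is roughly the client configuration we were running with (hostname is illustrative); with `ssl_crl_cache` in `{internal, ...}` mode and the insert commented out, every handshake re-downloads the full CRL:

```erlang
%% Sketch of the kind of client options that enable CRL checking,
%% per the EEF Security WG recommendations referenced above.
%% The hostname is a placeholder; cacerts_get/0 requires OTP 25+.
{ok, Sock} = ssl:connect("backend.example.com", 443, [
    {verify, verify_peer},
    {cacerts, public_key:cacerts_get()},
    {crl_check, true},
    %% 5000 ms HTTP timeout for fetching CRLs from distribution points
    {crl_cache, {ssl_crl_cache, {internal, [{http, 5000}]}}}
]).
```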

Questions:

  • Are there any plans to resolve the lack of caching in ssl_crl_cache?
  • Are there any third-party libraries that address this?
    • On this note, the ssl_crl_cache_api lacks documentation and seems to be extremely specific to ssl_crl_cache and ssl_crl_hash_dir. What’s the difference between lookup/3 and select/2, for example? When will fresh_crl/2 actually be called, and what should it do?
  • Are there any other recommendations to mitigate the lack of caching?
  • Has anyone looked at the performance of public_key:pkix_crls_validate/3 when given a CRL this large?
3 Likes

Just seen the same today, also on AWS, for an older VerneMQ release. (I think for OTP 24, and we’re also using Hackney)

I guess compiling with newer OTP releases does not help then.

1 Like

I don’t think this is planned. ssl_crl_cache is documented as “simple default implementation of a CRL cache”. Maybe @ingela can provide more detailed information.

Have you checked ssl_crl_cache_api — ssl v11.2 ?
I guess your questions are somehow addressed there … what is missing?

2 Likes

Many things. For example: what should the various functions return? Why is Issuer passed as an argument to lookup/3? What should I do with it? What, exactly, does public_key:pkix_crls_validate/3 do with the update_crl option? What should that return? “unless the value is a list of so called general names” – huh? And so on.

I’ve spent a day reading the source code to ssl_handshake and both ssl_crl_cache and ssl_crl_hash_dir – and various related RFCs – and it’s no clearer to me how one would write a conforming implementation of ssl_crl_cache_api.
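
For concreteness, this is the skeleton I ended up with after that reading; the comments mark where I’m guessing at the semantics (module name is mine, and the guesses may well be wrong):

```erlang
-module(my_crl_cache).
-behaviour(ssl_crl_cache_api).
-export([lookup/3, select/2, fresh_crl/2]).

%% Called with a CRL distribution point from the certificate.
%% Presumably should return DER-encoded CRLs for that DP, or
%% not_available to fall back to... something? (unclear to me)
lookup(_DistributionPoint, _Issuer, _DbHandle) ->
    not_available.

%% Seemingly called when no usable distribution point exists;
%% presumably should return all cached CRLs for this Issuer.
select(_Issuer, _DbHandle) ->
    [].

%% Apparently handed to public_key:pkix_crls_validate/3 via the
%% update_crl option; presumably should return a refreshed CRL.
fresh_crl(_DistributionPoint, CRL) ->
    CRL.
```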

2 Likes

I guess that we just forgot about improving the default cache and so far nobody complained, maybe because they provide their own or do not care about CRLs !? We can look into enhancing the documentation. PKIX standard is not a trivial thing :wink: We can also make a ticket for fixing the default cache but I do not know when it might be prioritized so if your in a hurry a PR is welcome.

4 Likes

I looked into CRL caching for a RabbitMQ user quite a while ago, that turned into this PR because (if I remember correctly) the fetched CRLs weren’t actually being cached:

Anyway, maybe that code could be revisited.

3 Likes

It also occurs to me that the cache lookup API could be improved. Currently, it returns a list of CRLs (each of which is a list of serial numbers) for a given distribution point and issuer. What I think ssl_handshake actually needs is a simple yes/no: given this {Issuer, SerialNumber} pair, has it been revoked?

That could lead to better performance for large (cached) CRLs, because implementations could replace the linear scan through the lists of serial numbers, by using a set (or even a bloom filter, if we wanted to get exotic).
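
As a sketch of what I mean (module and function names are hypothetical, not any OTP API): pre-index the revoked serials into a map once, then answer the revocation question in effectively constant time instead of scanning lists:

```erlang
%% Hypothetical sketch: answer "is {Issuer, Serial} revoked?" via a
%% map keyed on {Issuer, SerialNumber}, built once per CRL refresh.
-module(crl_index).
-export([build/1, is_revoked/3]).

%% Build the index from {Issuer, [SerialNumber]} pairs, e.g.
%% extracted from parsed CRLs. maps:from_keys/2 needs OTP 24+.
build(IssuerSerials) ->
    maps:from_keys(
      [{Issuer, Serial}
       || {Issuer, Serials} <- IssuerSerials, Serial <- Serials],
      revoked).

%% O(1)-ish membership test, replacing the linear list scan.
is_revoked(Index, Issuer, Serial) ->
    maps:is_key({Issuer, Serial}, Index).
```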

3 Likes

What version of the documentation were you reading? It occurred to me that it might not have been the OTP-27 one, so if you have opinions on things needing enhancement, please base your comments on the latest version.

public_key:pkix_crls_validate/3 is the function giving you the yes/no answer (well, in practice it gives you a yes/no/undetermined answer), all according to RFC 5280. The CRL mechanism is known to have problems with growing into very large data. There is something called Delta CRLs to mitigate the problem; I do not know how commonly it is used.

Also, you do not have to create a callback unless you want to use your own storage; you may also manage the cache yourself by calling the functions insert and delete in ssl_crl_cache (documented API). I have a vague memory that the commented-out line was because we did not want to automatically allocate potentially lots of memory, and it was postponed and forgotten.
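
Managing the cache yourself along those lines might look something like this (a sketch, assuming a periodically scheduled job; the distribution-point URI is just an example, and inets must be started for httpc):

```erlang
%% Sketch: pre-populate the default cache out-of-band so that
%% handshakes find the CRL already present instead of fetching it.
refresh_crl() ->
    %% Example distribution-point URI; use the ones from your certs.
    Uri = "http://crl.r2m02.amazontrust.com/r2m02.crl",
    {ok, {{_, 200, _}, _Headers, Der}} =
        httpc:request(get, {Uri, []}, [], [{body_format, binary}]),
    %% Insert the DER-encoded CRL under its distribution point.
    ok = ssl_crl_cache:insert(Uri, {der, [Der]}).
```

Expiry and size limits would still be your responsibility; you would delete stale entries with ssl_crl_cache:delete/1 before re-inserting.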

An alternative to CRLs is OCSP; we are currently adding support for the flavor of that called OCSP stapling, for the client side of things.

1 Like

I guess what is needed to be able to uncomment the insert is some kind of option to set a max size for the cache; if the cache is managed by the user, this will be their responsibility. And what is reasonable is, I think, very application specific. I do not think your PR addressed that.

1 Like

The latest, at ssl_crl_cache_api — ssl v11.2; I still find it unclear in places.

Yeah; I see that. My concern is that – even allowing for delta CRLs (which is a network-bandwidth thing…?) – it’s a linear search through the list. Changing that is a compatibility concern, since there’s a baked-in assumption that the cache callback will return that list, which makes it hard to replace with a set.

Sure, assuming I know which CRLs need adding to the cache. I’m assuming that needs doing a priori, at which point I might as well include the hashed CRLs in my release and use the ssl_crl_hash_dir variant – which is something we’re considering, since in our use case we do know which servers we’re going to be talking to.

I guess, depending on whether hackney exposes it, I could insert arbitrary CRLs after the connection is established. That leads straight back to managing the cache size/expiry, though, and that’s probably harder to achieve from the outside…
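
The ssl_crl_hash_dir route mentioned above would look roughly like this (a sketch; the directory path is illustrative, and the directory must contain CRL files named by issuer-name hash, e.g. as produced by `openssl rehash`):

```erlang
%% Sketch: read CRLs from a local hashed directory shipped with the
%% release, so no handshake ever downloads a CRL.
Opts = [
    {verify, verify_peer},
    {cacerts, public_key:cacerts_get()},
    {crl_check, true},
    {crl_cache, {ssl_crl_hash_dir, {internal, [{dir, "priv/crls"}]}}}
].
```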

1 Like

Well, it really depends on how you build that. You could have some service that updates the cache at whatever interval you feel is acceptable, so that a download is not performed for each TLS connection. You might even collect information from the TLS handshakes to help figure out which CRLs you are interested in.

I am not saying we should not solve this problem for the internal cache, just that it was not prioritized. And if you have a good suggestion, PRs are always welcome.

1 Like

But it’s scary in there… :wink:

1 Like

I put it on our todo list; hopefully it can happen during the upcoming release timespan.

3 Likes