!MthpOIxqJhTgrMNxDS:nixos.org

NixOS ACME / LetsEncrypt

103 Members
Another day, another cert renewal
44 Servers

Sender Message Time
9 Nov 2024
@m1cr0man:m1cr0man.com m1cr0man
ok fair enough
22:24:59
@k900:0upti.me K900
I think it's just ordering being off again
22:28:34
@k900:0upti.me K900
But I don't have a good mental model
22:28:39
@m1cr0man:m1cr0man.com m1cr0man
Just looking over the maxConcurrentRenewals implementation, and the options/discussion from last year. I'm really starting to feel that the systemd dependency approach would have been more straightforward. I couldn't convince folks at the time, and there was lengthy discussion in which it really needed feedback from another maintainer. I went with the current solution because I felt there wasn't much between them and that it may lend to more ACME contributors, but I'm seeing now that it's a heavy bit of complexity and we're still limited on maintainers.
22:29:59
@m1cr0man:m1cr0man.com m1cr0man
In reply to @k900:0upti.me
Or the units need to also wants the server

An idea for fixing this: I could add more targets in the ACME module to simplify the config and dependencies in the webserver + other downstream modules, and potentially help resolve this issue also:

  • Add an acme-renewal-http01.target which requires, and is ordered after, the relevant acme services.
  • For each web server listening on port 80 or configured to serve the acme-challenge directory (either is possible and logic already exists to discover these cases), add a requires and before rule on acme-renewal-http01.target

Honestly, I'm trying to think of reasons I haven't done this until now. I could add targets for other renewal types with the intention to allow DNS server startups in the same way. I could even go further and add targets for acme-selfsigned.target and acme-renewal.target so that downstream services generally don't need to worry about what certs to wait on. I would hazard a guess that the complement of certs being waited on is not significant in 90% of system configurations out there, and using these general targets wouldn't cause much more slowdown.

23:09:17 (edited 23:23:15)
10 Nov 2024
@m1cr0man:m1cr0man.com m1cr0man
In reply to @m1cr0man:m1cr0man.com
Just looking over the maxConcurrentRenewals implementation, and the options/discussion from last year. I'm really starting to feel that the systemd dependency approach would have been more straightforward. I couldn't convince folks at the time, and there was lengthy discussion in which it really needed feedback from another maintainer. I went with the current solution because I felt there wasn't much between them and that it may lend to more ACME contributors, but I'm seeing now that it's a heavy bit of complexity and we're still limited on maintainers.
Another thing about this - we already use systemd dependency ordering to do something very similar with how we handle account creation, where one cert is elected as a leader. It just feels unnecessary to have locks implemented on disk for this other use case.
00:08:54
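For readers unfamiliar with the "locks implemented on disk" approach mentioned above, here is a minimal Python sketch of the general technique: a fixed number of lock files act as slots, and a job may only run while it holds one. This is not the module's actual code; the function names, the `N.lock` file naming, and the use of `flock` are assumptions made for illustration only.

```python
import fcntl
import os
import tempfile


def acquire_slot(lock_dir, max_concurrent):
    """Try each slot file in turn; return an open handle holding a lock, or None."""
    for slot in range(max_concurrent):
        handle = open(os.path.join(lock_dir, f"{slot}.lock"), "w")
        try:
            # Non-blocking exclusive lock: fails at once if the slot is taken.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return handle
        except BlockingIOError:
            handle.close()
    return None  # every slot is busy; the caller has to wait and try again


def release_slot(handle):
    """Release the slot so another job can take it."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()


if __name__ == "__main__":
    lock_dir = tempfile.mkdtemp()  # hypothetical lock directory
    first = acquire_slot(lock_dir, max_concurrent=1)
    second = acquire_slot(lock_dir, max_concurrent=1)
    print(first is not None, second is None)
    release_slot(first)
```

The systemd alternative discussed in the thread would express the same limit through unit ordering instead, so no lock files or polling would be needed.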
@emilazy:matrix.org emily
I forget what side I was on but we should go with that one 😂
00:41:44
@emilazy:matrix.org emily
I recall being against the complexity one of the PRs like that introduced
00:43:00
@emilazy:matrix.org emily
the biggest things I have come to dislike about our ACME implementation - and I don't hold this against you at all, it's evolved organically under the pressure of being expected to support arbitrarily complex features and integrate with arbitrarily complex setups on top of a program that isn't quite fit for purpose - are how much we reinvent the wheel of both ACME and systemd and how coupled everything is
00:44:51
@m1cr0man:m1cr0man.com m1cr0man

Totally agree. I am taking some time this weekend to do refactors, and figure out our dependency chains. I don't see switching from lego being on the agenda for a good while. In fact, I still want to try and upstream some of the complicated logic we have around offline renewal checks. It would be pretty trivial to add behind a flag on the lego side, and remove a good chunk of custom scripting we have done.

This evening, I have simplified the setup process substantially: I have merged acme-selfsigned-ca, acme-fixperms and acme-lockfiles into a single acme-setup.service. In turn, I removed all use of tmpfiles, and it made the unit dependencies much clearer.

The biggest thing we are working around with systemd in general is the fact that lego must be invoked per certificate. This is why I'm now thinking we should refactor downstream services to rely on a single target instead of individual services. I'm happy with how efficient + robust it all is when it works - the single account per config, and the fact that one cert failure does not break all certs, are all good features to have. There are pros and cons to the architecture.

02:16:30
@emilazy:matrix.org emily

I don't see switching from lego being on the agenda for a good while.

I don't think anyone is planning to put in the work to make it happen, but I do think that we're very much at the point where our certificate management lifecycle just wants to be an autonomous always-running daemon that communicates with the rest of the system via systemd

02:17:36
@emilazy:matrix.org emily
like, whether we can get there or not is a separate question
02:17:41
@emilazy:matrix.org emily
but I think we have to acknowledge that we have basically constructed the equivalent of this out of a morass of shell scripts, services, and targets wrapped around a tiny core of lego, and that it's hurting us
02:18:08
@emilazy:matrix.org emily
since, well, that does not make a very good programming language for a complex lifecycle management service :)
02:18:51
@m1cr0man:m1cr0man.com m1cr0man

Personally, I don't hate the fact that we've used systemd to achieve this. It is an always running daemon that communicates with the rest of the system 😉 and it integrates very nicely with the lifecycle of services which depend on acme certs. However we are definitely pushing (and actually exceeding) its limits in terms of what it can achieve.

As you said, it has been an organic evolution over many years for many use cases. I want to give refactoring one good go before investigating alternative solutions/replacement to the stack we have today. Perhaps there is something that would make life easier as maintainers, but from what feedback I've heard, people are generally happy with cert management today.

02:28:27
@emilazy:matrix.org emily
we are ultimately gluing two tools together, neither of which was designed for what we're doing with it :/
02:31:26
@emilazy:matrix.org emily
which IMO went okay until the drift between what they're capable of and the model they're designed for, and what we actually need, became clear and we had to work around that
02:32:06
@emilazy:matrix.org emily
all I can say is that I understand why Caddy gave up on LEGO, and they didn't even have the penalty of trying to express all the lifecycle logic and rate limiting in terms of a Unix service manager 😅
02:33:01
@emilazy:matrix.org emily
I don't think we should have a target that represents all certificate renewals and gate every use of certificates on all certificates, if that's what you mean
02:33:31
@emilazy:matrix.org emily
that'll scale pretty badly when you have a ton of certs
02:33:35
@m1cr0man:m1cr0man.com m1cr0man
I understand what you're saying yeah. Wrt the target thing - it's not so much that I want to put up a gate, but I want to provide a simpler method for resolving the dependency chain. In deployments where many certs are in use that I have observed, almost all of them are a dependency of the service(s) they are attached to. In practice, I don't think there would be a significant difference between dependencies per cert vs generalized targets. At the very least, a selfsigned target would go a long way.
02:38:56
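The disagreement above (per-certificate dependencies versus one general target) can be made concrete with a toy model. The Python sketch below compares how long each service waits under the two schemes. It assumes all certificate units start together and run concurrently; the service names, hostnames, and durations are made up for illustration and do not come from any real deployment.

```python
def ready_after(service_certs, cert_seconds, aggregate_target):
    """Return how many seconds each service waits before it may start.

    service_certs: service name -> list of certificate names it uses.
    cert_seconds: certificate name -> seconds until that cert is ready.
    aggregate_target: if True, every service waits on ALL certificates.
    """
    all_done = max(cert_seconds.values())
    waits = {}
    for service, certs in service_certs.items():
        if aggregate_target:
            waits[service] = all_done
        else:
            waits[service] = max(cert_seconds[c] for c in certs)
    return waits


if __name__ == "__main__":
    # Hypothetical numbers: one slow certificate among fast ones.
    cert_seconds = {"a.example": 2, "b.example": 3, "c.example": 30}
    service_certs = {
        "webserver": ["a.example", "b.example"],
        "mailserver": ["c.example"],
    }
    print(ready_after(service_certs, cert_seconds, aggregate_target=False))
    print(ready_after(service_certs, cert_seconds, aggregate_target=True))
```

In this model the two schemes only differ for services that do not depend on the slowest certificate, which is the case emily raises; if nearly every certificate is needed by the services that start anyway, as m1cr0man argues, the difference shrinks.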
@m1cr0man:m1cr0man.com m1cr0man
https://github.com/NixOS/nixpkgs/pull/355087 first big refactoring PR
22:32:19
@m1cr0man:m1cr0man.com m1cr0man
If there's any maintainers about, I think this PR is good to merge also https://github.com/NixOS/nixpkgs/pull/348344
22:34:17
@k900:0upti.me K900
In reply to @m1cr0man:m1cr0man.com
If there's any maintainers about, I think this PR is good to merge also https://github.com/NixOS/nixpkgs/pull/348344
Merged that
22:44:00
@k900:0upti.me K900
In reply to @m1cr0man:m1cr0man.com
https://github.com/NixOS/nixpkgs/pull/355087 first big refactoring PR
This scares me but in a good way
22:44:06
@m1cr0man:m1cr0man.com m1cr0man
In reply to @k900:0upti.me
This scares me but in a good way
What I will do next will surely terrify and amaze 🧛‍♂️
22:45:04
@m1cr0man:m1cr0man.com m1cr0man
In reply to @k900:0upti.me
Merged that
Thanks a mil
22:45:19
11 Nov 2024
@m1cr0man:m1cr0man.com m1cr0man

This open ticket from 2020 is peak ACME module: https://github.com/NixOS/nixpkgs/issues/106862

This is actually a variant of what K900 saw yesterday wrt webserver startup ordering. It duplicates/affects this recent ticket too. I also found the reason why we check cert expiry ourselves (I recall there being an issue with container startup also? But I can't find any reference to it. Please link if you know of it.).

I see two ways to frame this issue more generally, each with very different solutions:

  1. "ACME renewal does not reliably wait on external dependencies"

In this case, we need to have a reliable mechanism for configuring services which may affect cert renewability to be started before renewal is attempted. One solution is to add an acme-renewal-dependencies.target and add service modules to it as required on a best effort basis. I'm sure issues will be opened if we miss something, as they have been historically.

Sadly this only half-solves the problem. Running != listening, and systemd only accounts for the former in most cases (non-notify services). Socket units were suggested before (I do understand them now 😉) but that is a monumental task for all the dependencies.
We could do some naive tests and delays like curling the configured ACME server and checking for a listener on port 80, but that just feels wrong.

  2. "ACME renewal does not gracefully handle failures"

This is actually untrue - we do have systemd Restart directives configured on the units. The problem is that this causes start jobs to fail, and dependent services to not start or work. What we really want(ed) is a way to gracefully retry, where we don't fail the job. We could add some sort of retry logic to the script and do away with systemd's retry logic, but again that feels like reinventing the systemd retry logic in a crude way.
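As an illustration of the "retry logic in the script" option, here is a minimal Python sketch of retrying an action with exponential backoff inside one process, so the caller only sees a failure once all attempts are used up. The function name, the attempt count, and the delays are hypothetical and not taken from the module.

```python
import time


def retry(action, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds or attempts run out.

    The delay doubles after each failure (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller still sees the error.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_renewal():
        # Stand-in for a renewal that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise RuntimeError("challenge endpoint not reachable yet")
        return "renewed"

    print(retry(flaky_renewal, base_delay=0.1))
```

The trade-off described in the message still applies: the start job stays "activating" during the retries instead of failing, but the backoff policy now lives in a script instead of in systemd's Restart settings.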

I would love to reduce flakiness and remove scripts (the logic for checking renewal date) at the same time, but we're stuck with a limited toolset to solve this.

My feeling right now is that we should at least implement solution 1 and get dependency ordering right for non-failure scenarios. Despite the caveats, I still think this would be a significant improvement. I wish I could say with confidence that this would solve test flakiness, but it probably won't.

01:04:00
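The "running != listening" point in the message above is about a process being started versus its socket actually accepting connections. A naive check of the kind m1cr0man describes (and calls "wrong"-feeling) could look like the Python sketch below, which polls a TCP port until a connection succeeds. The function name and the timeout values are assumptions for illustration; the module does not necessarily do anything like this.

```python
import socket
import time


def wait_for_listener(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True once something accepts the connection, or False if
    nothing is listening by the time the timeout has elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Check whether anything is listening on local port 80 right now.
    print(wait_for_listener("127.0.0.1", 80, timeout=2.0))
```

A successful connect only shows that the port is open, not that the server will answer the ACME challenge correctly, which is part of why such checks are a partial fix at best.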
@arianvp:matrix.org Arian
It's peak because we opened it ourselves and then ignored it for 5 years
07:51:12

Room Version: 6