NixOS ACME / LetsEncrypt - Public Room Timeline

	NixOS ACME / LetsEncrypt	104 Members
	Another day, another cert renewal	44 Servers


Sender	Message	Time
9 Nov 2024
K900	https://hydra.nixos.org/eval/1809873	21:57:20
K900	More ordering nonsense	21:57:24
K900	If anyone wants to look into it	21:57:30
K900	It's funny how adding more synchronization uncovers more and more weird behaviors	21:58:31
K900	Just because tests that used to insta-fail on slow machines don't anymore	21:58:43
m1cr0man	Did I miss it? Looks like it passed	22:20:42
K900	I restarted it	22:21:12
K900	Without really thinking	22:21:43
K900	My bad	22:21:44
m1cr0man	No worries. What was the gist of it?	22:24:00
K900	Two different failures	22:24:33
K900	On aarch64 and x86_64	22:24:41
K900	Did not look closely	22:24:44
m1cr0man	ok fair enough	22:24:59
K900	I think it's just ordering being off again	22:28:34
K900	But I don't have a good mental model	22:28:39
m1cr0man	Just looking over the maxConcurrentRenewals implementation, and the options/discussion from last year. I'm really starting to feel that the systemd dependency approach would have been more straightforward. I couldn't convince folks at the time, and there was a lengthy discussion which really needed feedback from another maintainer. I went with the current solution because I felt there wasn't much between them and that it might attract more ACME contributors, but I'm seeing now that it's a heavy bit of complexity and we're still limited on maintainers.	22:29:59
m1cr0man	In reply to @k900:0upti.me Or the units need to also wants the server An idea for fixing this: I could add more targets in the ACME module to simplify the config and dependencies in the webserver + other downstream modules, and potentially help resolve this issue also: Add an `acme-renewal-http01.target` which `requires` and `after` the relevant acme services. For each web server listening on port 80 or configured to serve the acme-challenge directory (either is possible and logic already exists to discover these cases), add a `requires` and `before` rule on `acme-renewal-http01.target`. Honestly, I'm trying to think of reasons I haven't done this until now. I could add targets for other renewal types with the intention to allow DNS server startups in the same way. I could even go further and add targets for `acme-selfsigned.target` and `acme-renewal.target` so that downstream services generally don't need to worry about what certs to wait on. I would hazard a guess that the complement of certs being waited on is not significant in 90% of system configurations out there, and using these general targets wouldn't cause much more slowdown.	23:23:15
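A minimal sketch of the target idea in the message above, written as a NixOS module fragment. It is not taken from the chat or from nixpkgs: the certificate list, the `acme-<cert>.service` unit naming, and the choice of nginx as the web server are assumptions for illustration, and the `requires`/`before` wiring is one literal reading of the proposal rather than a tested design.

```nix
{ ... }:
let
  # Hypothetical list of certificates validated via HTTP-01.
  http01Certs = [ "example.com" ];

  # Assumption: each cert is renewed by a unit named acme-<cert>.service.
  acmeUnits = map (name: "acme-${name}.service") http01Certs;
in
{
  # The proposed target: it pulls in the relevant ACME services and is
  # only reached once they have been started.
  systemd.targets."acme-renewal-http01" = {
    description = "HTTP-01 ACME renewals (illustrative sketch)";
    requires = acmeUnits;
    after = acmeUnits;
  };

  # A web server that answers the challenge depends on the single target
  # instead of on each per-certificate service.
  systemd.services.nginx = {
    requires = [ "acme-renewal-http01.target" ];
    before = [ "acme-renewal-http01.target" ];
  };
}
```

The point of the shape is that a downstream module names one target and does not need to know which certificates exist.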
10 Nov 2024
m1cr0man	In reply to @m1cr0man:m1cr0man.com Just looking over the maxConcurrentRenewals implementation, and the options/discussion from last year. I'm really starting to feel that the systemd dependency approach would have been more straightforward. I couldn't convince folks at the time, and there was a lengthy discussion which really needed feedback from another maintainer. I went with the current solution because I felt there wasn't much between them and that it might attract more ACME contributors, but I'm seeing now that it's a heavy bit of complexity and we're still limited on maintainers. Another thing about this - we already use systemd dependency ordering to do something very similar with how we handle account creation, where one cert is elected as a leader. It just feels unnecessary to have locks implemented on disk for this other use case.	00:08:54
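The leader-election ordering mentioned above could look roughly like the following NixOS module fragment. This is a sketch under stated assumptions, not the nixpkgs implementation: it assumes every configured certificate shares one ACME account, that units are named `acme-<cert>.service`, and that picking the first name in sorted order is an acceptable way to elect the leader.

```nix
{ config, lib, ... }:
let
  # Assumption: all certs in security.acme.certs share a single account.
  certNames = lib.attrNames config.security.acme.certs;

  # Guard against an empty cert set; lib.head/lib.tail fail on [ ].
  hasCerts = certNames != [ ];
  leader = lib.head certNames;
  followers = lib.optionals hasCerts (lib.tail certNames);

  # Assumption: per-certificate units are named acme-<cert>.
  unit = name: "acme-${name}";
in
{
  # Every non-leader cert is ordered after the leader, so the leader is
  # the only unit that can create the account. No on-disk lock is used;
  # systemd's own ordering does the serialization.
  systemd.services = lib.genAttrs (map unit followers) (_: {
    after = [ "${unit leader}.service" ];
  });
}
```

Note that `after` only orders units that are started in the same transaction; it does not by itself pull the leader in.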
emily	I forget what side I was on but we should go with that one 😂	00:41:44
emily	I recall being against the complexity one of the PRs like that introduced	00:43:00
emily	the biggest things I have come to dislike about our ACME implementation - and I don't hold this against you at all, it's evolved organically under the pressure of being expected to support arbitrarily complex features and integrate with arbitrarily complex setups on top of a program that isn't quite fit for purpose - are how much we reinvent the wheel of both ACME and systemd and how coupled everything is	00:44:51
m1cr0man	Totally agree. I am taking some time this weekend to do refactors, and figure out our dependency chains. I don't see switching from lego being on the agenda for a good while. In fact, I still want to try and upstream some of the complicated logic we have around offline renewal checks. It would be pretty trivial to add behind a flag on the lego side, and remove a good chunk of custom scripting we have done. This evening, I have simplified the setup process substantially: I have merged acme-selfsigned-ca, acme-fixperms and acme-lockfiles into a single acme-setup.service. In turn, I removed all use of tmpfiles, and it made the unit dependencies much clearer. The biggest thing we are working around with systemd in general is the fact that lego must be invoked per certificate. This is why I'm now thinking we should refactor downstream services to rely on a single target instead of individual services. I'm happy with how efficient + robust it all is when it works - the single account per config, and the fact that one cert failure does not break all certs, are all good features to have. There are pros and cons to the architecture.	02:16:30
emily	"I don't see switching from lego being on the agenda for a good while." I don't think anyone is planning to put in the work to make it happen, but I do think that we're very much at the point where our certificate management lifecycle just wants to be an autonomous always-running daemon that communicates with the rest of the system via systemd	02:17:36
emily	like, whether we can get there or not is a separate question	02:17:41
emily	but I think we have to acknowledge that we have basically constructed the equivalent of this out of a morass of shell scripts, services, and targets wrapped around a tiny core of lego, and that it's hurting us	02:18:08
emily	since, well, that does not make a very good programming language for a complex lifecycle management service :)	02:18:51
m1cr0man	Personally, I don't hate the fact that we've used systemd to achieve this. It is an always running daemon that communicates with the rest of the system 😉 and it integrates very nicely with the lifecycle of services which depend on acme certs. However, we are definitely pushing (and actually exceeding) its limits in terms of what it can achieve. As you said, it has been an organic evolution over many years for many use cases. I want to give refactoring one good go before investigating alternative solutions/replacements for the stack we have today. Perhaps there is something that would make life easier as maintainers, but from what feedback I've heard, people are generally happy with cert management today.	02:28:27
emily	we are ultimately gluing two tools together, neither of which was designed for what we're doing with it :/	02:31:26
