Sender | Message | Time |
---|---|---|
13 Jun 2023 | ||
m1cr0man | Hello again :) Busy few weeks... looking into https://github.com/NixOS/nixpkgs/issues/232505 again. I just had a notion - could we chain all the certs together with an After= condition? We would still need to avoid auto-starting the services for each cert (otherwise config switch would take a REALLY long time) but that might be easy to solve with a target. | 19:44:52 |
m1cr0man | In reply to @emilazy:matrix.org: Oh wow.. that's spooky. At least if we were using that, our systemd services for renewal are hardened like steel | 19:45:33 |
emily | only so much hardening you can do when the process has access to private keys :( | 20:03:36 |
emily | (ideally you have privilege separation so that the process that talks to the ACME server doesn't have access to the keys but I don't think even lego does that) | 20:05:44 |
emily | In reply to @m1cr0man:m1cr0man.com: honestly I don't know if there's a one-size-fits-all solution to this. we can randomize renewal time because it fundamentally doesn't matter when renewal happens as long as it's sufficiently far in advance. some users will want their sites accessible as soon as possible after setting up a new box or activating a new configuration; some will be worried about load and rate limits. i don't see how we can satisfy both out of the box | 20:06:49 |
emily | the "This will cause the timer to start; and after 1 second start all the services with a randomised delay." idea sounds nice enough - but then we're talking about, your sites have broken SSL for up to an entire day? | 20:07:23 |
emily | I'm curious how Caddy/certmagic handles this since it has pretty sophisticated logic for cert issue timing | 20:08:08 |
m1cr0man | Could you let me know what you find from that? But to your point about one size fits all, it seems like we will need to introduce an option for users to decide what they want. We can default to the current situation, but provide an option like renewOnActivate for other situations? | 20:09:44 |
emily | I'm tempted to say that people can just poke at the systemd.* options themselves if they really want rate limiting, but I'm biased :p | 20:10:27 |
emily | I would consider it acceptable to do something out of the box if we found a solution that leads to large numbers of certs being activated in minutes rather than hours/days though | 20:10:48 |
emily | if you have dozens/hundreds of certs then you're probably expecting initial setup to take about that long | 20:11:28 |
emily | I don't want to significantly penalize the common case of just a few domains for that though, or stretch it out to "without manual intervention migrating your NixOS box will result in your sites being offline for the next day" | 20:11:54 |
emily | fundamentally if you want your sites running with TLS you have to spend a certain amount of compute, memory and network to get there | 20:12:15 |
m1cr0man | yep, I'm in full agreement with all of that. I might explore the chained services option to see how it performs and if there's a way to work around the activation delay, with the thought that this solution would be an optional (default off) feature of the module | 20:14:49 |
emily | FWIW, relevant LE rate limits: "The main limit is Certificates per Registered Domain (50 per week)." "You can create a maximum of 300 New Orders per account per 3 hours." "You can have a maximum of 300 Pending Authorizations on your account." | 20:17:11 |
emily | for #1, probably people with tons of certs mostly have them on different domains | 20:17:31 |
emily | #2 means that someone with >300 domains would currently run into rate limits with our existing setup | 20:17:52 |
emily | #3 could theoretically happen if the system chugs enough that the ACME client starts issuing a bunch of certs but doesn't run to completion before more spawn up | 20:18:17 |
emily | of course people with this many certs should probably apply for an exemption anyway, but I think it's good to note the magnitude/timeframe of the upstream limits | 20:18:43 |
m1cr0man | okay yeah, so these are pretty lenient for most people. I think I was only concerned about the concurrent one that the ticket opener mentioned:
Right now this one is very easy to do | 20:19:53 |
emily | ah I missed that one. never skim read! | 20:20:30 |
emily | so yeah my inclination is that it would be good to have something default that ensures we're not issuing certificates at a rate that would surpass that. but preferably not full serialization since that's quite a lot further than that | 20:21:15 |
emily | I feel like there should be a good way to rate limit these services starting without fussing with CPU quotas or whatever. | 20:21:44 |
emily | okay there is | 20:22:08 |
emily | we have StartLimitIntervalSec/StartLimitBurst/StartLimitAction which look perfect. however, I'm guessing that we would need to switch over to @ units to use it - because otherwise all our services are entirely separate | 20:22:45 |
emily | unless it counts the bit after the @ as part of the unit for rate limiting and it's just for making restarts not spam :/ | 20:23:03 |
emily | we need a systemd expert :) | 20:23:22 |
m1cr0man | afaik StartLimit* only applies to services which would enter the failed state? I did consider suggesting that :) however the docs imply it's only for failure. You would need to pair it with Condition/Assert* directives in the unit section, which would be evaluated en masse and actually wouldn't stop concurrency at activation at all | 20:23:50 |
emily | it does say "Configure unit start rate limiting. Units which are started more than burst times within an interval time span are not permitted to start any more." but yeah I'm not sure if it would work | 20:24:32 |
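
The open question above is whether a start rate limit would throttle many separate per-cert services or only repeated starts of one unit. The sketch below is a toy model of the quoted wording ("started more than burst times within an interval time span are not permitted to start any more"), assuming the counter is kept per unit name. It is not systemd's actual implementation, and the unit names are hypothetical. Under that assumption, separate units never share a budget, which matches the concern that "all our services are entirely separate".

```python
from dataclasses import dataclass, field


@dataclass
class StartLimiter:
    """Toy per-unit start rate limit (an assumption, not systemd's real code)."""

    interval_sec: float
    burst: int
    # unit name -> (window start time, starts counted in that window)
    _windows: dict = field(default_factory=dict)

    def try_start(self, unit: str, now: float) -> bool:
        begin, count = self._windows.get(unit, (None, 0))
        # No window yet, or the previous window has expired: open a new one.
        if begin is None or now - begin >= self.interval_sec:
            self._windows[unit] = (now, 1)
            return True
        # Still inside the window: allow only while under the burst budget.
        if count < self.burst:
            self._windows[unit] = (begin, count + 1)
            return True
        return False


limiter = StartLimiter(interval_sec=10.0, burst=2)

# Five separate (hypothetical) per-cert units, each started once at t=0.
separate = [
    limiter.try_start(f"acme-example{i}.service", now=0.0) for i in range(5)
]

# One (hypothetical) unit started five times at t=0.
repeated = [limiter.try_start("acme-shared.service", now=0.0) for _ in range(5)]

print(separate)  # every separate unit is allowed
print(repeated)  # only the first `burst` starts are allowed
```

In this model the five separate units all start at once, while the single unit is cut off after two starts. Whether systemd counts template instances (the part after the `@`) together or separately is exactly what the chat leaves unresolved, so the model does not try to answer it.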