
NixOS ACME / LetsEncrypt

107 Members · 45 Servers
Topic: Another day, another cert renewal



22 Apr 2025
hexa: for 6 days that changes of course (23:09:45)
28 Apr 2025
m1cr0man: https://github.com/NixOS/nixpkgs/pull/376334#pullrequestreview-2801003367 this is ready to go. I tested it too. (21:26:09)
29 Apr 2025
@ygt:matrix.org left the room. (23:42:45)
5 May 2025
netpleb: (17:59:16)

hi everyone, does anybody have a workaround that fixes this pesky DNS resolution issue when acme.certs... and BIND are running in a declarative NixOS container?

    Could not create client: get directory at 'https://acme-v02.api.letsencrypt.org/directory': Get "https://acme-v02.api.letsencrypt.org/directory": GET https://acme-v02.api.letsencrypt.org/directory giving up after 6 attempt(s): Get "https://acme-v02.api.letsencrypt.org/directory": dial tcp: lookup acme-v02.api.letsencrypt.org: Temporary failure in name resolution
netpleb: what seems to be happening is that ACME starts so early that the container cannot route anything yet. Maybe the host has not installed routes yet? But somehow ACME blocks until it times out. (18:00:34)
netpleb: once it times out, everything works fine. I have whittled it down to ACME, because when I remove all ACME config the container boots up just fine and is able to route/ping quite quickly (18:01:46)
netpleb: (18:04:57)

so, by that, I mean that the issue does not seem to pertain to (in my case) networking.wireguard or networking.bind, which are both used and operate completely fine in the container. It is only after adding something like (per the manual):

      security.acme.certs."<redacted>" = {
        domain = "*<redacted>";
        dnsProvider = "rfc2136"; # allows us to do DNS ACME validation with a local DNS server
        environmentFile = "/var/lib/secrets/<redacted>.certs.secret";
        # We don't need to wait for propagation since this is a local DNS server
        dnsPropagationCheck = false;
      };

that the behavior occurs.
netpleb: what I am most confused about (and why I am posting here) is why the call to lego --accept-tos --path . -d '*.<redacted>' --email <redacted> --key-type ec256 --dns rfc2136 --dns.propagation-disable-ans --dns.resolvers 127.0.0.1:53 --server https://acme-v02.api.letsencrypt.org/directory renew --no-random-sleep --days 30 seems to block all network traffic, even for other services (like wireguard, bind, etc.), until it times out. There must be something I do not understand about how systemd works or calls this, but I would like to learn ;) (18:08:04)
netpleb: in essence though, as soon as I comment out the security.acme.certs... config above, the container boots up in a couple of seconds and can ping various IPs and even resolve hostnames with the local BIND instance, whereas with the ACME config in place it takes a couple of minutes to boot, since it has to wait for ACME to time out. In the interim, no pinging or hostname lookups work at all. I have tried for days to figure out how to move the ACME renewal process much later, but nothing seems to work. (18:48:20)
netpleb: (sorry for so many messages) I have continued to investigate, and it seems that the root cause is that the host machine does not provide the network/routes to the container until late in (possibly even after?) the container's boot. Because of this, ACME stalls the boot process. So far the only thing that has sort of worked, though it is very not-clean, is to just put serviceConfig.TimeoutStartSec = "20s"; on the various acme-<domain>.service units. (20:18:57)
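That stopgap can be sketched in NixOS config roughly like this ("example.com" is a placeholder certificate name; the override targets the acme-<domain>.service unit that the ACME module generates):

```nix
{
  # Stopgap, not a fix: cap how long the ACME unit may block boot.
  # Replace "example.com" with the actual certificate name.
  systemd.services."acme-example.com".serviceConfig = {
    TimeoutStartSec = "20s"; # give up quickly instead of waiting out lego's retries
  };
}
```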
6 May 2025
m1cr0man: Sorry - only seeing your messages now. I believe a fix for this does exist in the wild; I vaguely remember running into it a few years ago. Let me do some digging (20:36:18)
m1cr0man: (20:40:38)

In the meantime, netpleb - can you provide the following info from within the container:

  • Logs of acme-$cert.service, redacted as necessary
  • Output of systemctl list-dependencies acme-$cert.service
  • Output of systemctl list-dependencies bind.service
m1cr0man: Ah, I see you already found the relevant ticket on GitHub. Did you try this fix? (20:42:27)
netpleb: Thanks for your reply and for helping to figure this out. I did try the fix you mentioned, as well as this one, but neither has done the trick. I will get the output of those commands for you now. (21:03:08)
netpleb: (21:09:35)

here's the redacted output (first time using a local instance of ollama to do the redacting!):

[root@hostname:~]# systemctl list-dependencies acme-example.com.service
acme-example.com.service
○ ├─acme-selfsigned-example.com.service
● ├─acme-setup.service
● ├─bind.service
○ ├─dns-rfc2136-conf.service
○ ├─nginx-config-reload.service
● ├─system.slice
● ├─acme-account-8bbd8b2b5078a14c2103.target
○ │ └─acme-jitsi.example.com.service
● ├─network-online.target
● ├─nss-lookup.target
● │ └─nscd.service
● └─sysinit.target
●   ├─dev-hugepages.mount
●   ├─dev-mqueue.mount
●   ├─firewall.service
○   ├─kmod-static-nodes.service
○   ├─suid-sgid-wrappers.service
○   ├─sys-fs-fuse-connections.mount
○   ├─sys-kernel-debug.mount
○   ├─sys-kernel-tracing.mount
●   ├─systemd-ask-password-console.path
○   ├─systemd-boot-random-seed.service
○   ├─systemd-hibernate-clear.service
○   ├─systemd-journal-catalog-update.service
●   ├─systemd-journal-flush.service
●   ├─systemd-journald.service
○   ├─systemd-machine-id-commit.service
○   ├─systemd-modules-load.service
○   ├─systemd-pstore.service
○   ├─systemd-random-seed.service
●   ├─systemd-resolved.service
●   ├─systemd-sysctl.service
●   ├─systemd-tmpfiles-setup-dev-early.service
●   ├─systemd-tmpfiles-setup-dev.service
●   ├─systemd-tmpfiles-setup.service
○   ├─systemd-tpm2-setup-early.service
○   ├─systemd-tpm2-setup.service
○   ├─systemd-udevd.service
○   ├─systemd-update-done.service
●   ├─systemd-update-utmp.service
●   ├─cryptsetup.target
●   ├─local-fs.target
○   │ └─systemd-remount-fs.service
●   └─swap.target
[root@hostname:~]# systemctl list-dependencies bind.service
bind.service
○ ├─dns-fix-zone-perms.service
○ ├─dns-rfc2136-conf.service
● ├─system.slice
● └─sysinit.target
●   ├─dev-hugepages.mount
●   ├─dev-mqueue.mount
●   ├─firewall.service
○   ├─kmod-static-nodes.service
○   ├─suid-sgid-wrappers.service
○   ├─sys-fs-fuse-connections.mount
○   ├─sys-kernel-debug.mount
○   ├─sys-kernel-tracing.mount
●   ├─systemd-ask-password-console.path
○   ├─systemd-boot-random-seed.service
○   ├─systemd-hibernate-clear.service
○   ├─systemd-journal-catalog-update.service
●   ├─systemd-journal-flush.service
●   ├─systemd-journald.service
○   ├─systemd-machine-id-commit.service
○   ├─systemd-modules-load.service
○   ├─systemd-pstore.service
○   ├─systemd-random-seed.service
●   ├─systemd-resolved.service
●   ├─systemd-sysctl.service
●   ├─systemd-tmpfiles-setup-dev-early.service
●   ├─systemd-tmpfiles-setup-dev.service
●   ├─systemd-tmpfiles-setup.service
○   ├─systemd-tpm2-setup-early.service
○   ├─systemd-tpm2-setup.service
○   ├─systemd-udevd.service
○   ├─systemd-update-done.service
●   ├─systemd-update-utmp.service
●   ├─cryptsetup.target
●   ├─local-fs.target
○   │ └─systemd-remount-fs.service
●   └─swap.target
[root@hostname:~]#
m1cr0man: Interesting. FWIW, I personally used to use BIND + RFC2136 for renewals, though not in a container. The service ordering looks correct, with bind listed as a dependency of acme-example.com.service. (21:27:43)
m1cr0man: What error is lego itself throwing during renewal? (21:28:22)
netpleb: (21:34:17)

Thanks. Yes, I think it probably works fine when not in a container, but alas my use case is within a container :-/. I will get the exact error for you in a moment, but in essence it is something like this: Could not create client: get directory at 'https://acme-v02.api.letsencrypt.org/directory': Get "https://acme-v02.api.letsencrypt.org/directory": dial tcp: lookup acme-v02.api.letsencrypt.org: Temporary failure in name resolution
It tries that 6 times, I think, before timing out. Interestingly, during this process I cannot ping anything (much less look up hostnames).
netpleb: But this is why it seems to be something weird about how the host deals with the container. I think that when the ACME config is present in the container, the boot process is drawn out far longer than it should be (hence why we are discussing here), and because the boot process is drawn out, the container never reaches whatever stage it is supposed to reach before the host installs the routes. (21:36:05)
m1cr0man: I'll see if I can put together a test suite for this when I next get a moment to investigate. Not sure what the problem is right now; sorry I can't be more help (22:08:20)
netpleb: Thanks for looking into it. It was driving me mad, so I stopped yesterday after putting the non-clean 20s start timeout on the ACME services. (22:11:19)
m1cr0man: It might be worth poking around with resolvectl/systemd-resolved to see if something fishy is happening. The nspawn containers do funky things with the hosts file and nameserver setup, which could be conflicting with bind (22:14:41)
netpleb: thanks, I've been poking at that a bit. Will let you know if anything comes of it. (22:36:06)
9 May 2025
netpleb: I have good news!! The issue is finally resolved. It turned out to be a much different problem than originally expected: IPv6 link-local addressing was the culprit. Even though I had networking.enableIPv6 = false on both the host and the container, systemd-networkd-wait-online was not reaching its target because systemd-networkd was trying to assign link-local IPv6 addresses. Setting systemd.network.networks."eth0".networkConfig.LinkLocalAddressing = "no"; in my container config seemed to do the trick. (21:54:28)
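The fix above, as a minimal container-side sketch (assuming the container's networkd-managed interface is named eth0):

```nix
{
  # Disable IPv6 link-local addressing on the container interface so
  # systemd-networkd stops waiting for an address that will never arrive,
  # letting systemd-networkd-wait-online reach network-online.target.
  systemd.network.networks."eth0".networkConfig = {
    LinkLocalAddressing = "no";
  };
}
```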
10 May 2025
Arian: you can also configure systemd-networkd-wait-online to wait for either IPv4 or IPv6 (07:19:36)
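One way to express Arian's suggestion in NixOS (a sketch; systemd.network.wait-online.extraArgs passes flags straight through to systemd-networkd-wait-online, here requiring only an IPv4 address):

```nix
{
  # Consider the network "online" once an IPv4 address is configured,
  # without also waiting on IPv6.
  systemd.network.wait-online.extraArgs = [ "--ipv4" ];
}
```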


