| 20 Jan 2025 |
@elvishjerricco:matrix.org | which is exactly what I wanted to avoid :P | 04:13:50 |
@elvishjerricco:matrix.org | * Ok. I have things that work now. I'm going to summarize the problems now if only to make sure I've got them all in line in my head :P
- systemd's switch-root is frustrating
  - It will only serialize state and hand it over if the new PID1 is the builtin path (/run/current-system/systemd/lib/systemd/systemd, and the empty string counts as equivalent).
  - It will check for the existence of the new PID1 binary before it does its switch_root function.
  - This switch_root function is the one that bind-mounts /run, meaning if the new init is in /run then you just won't be allowed to switch-root, because the previous step will have failed before getting here.
  - Also when switch_root does the bind mounts, it only does them if there's not already a mount there.
  - We (inadvertently, I think) resolved this by bind mounting /run ourselves for initrd-nixos-activation.service.
- Now, the reason credentials weren't being "imported" in stage 2 is because systemd expects them to have already been imported in stage 1. Stage 1 was importing, but we weren't bind mounting /run recursively.
  - Because switch_root skips already-mounted mounts, it also skipped this.
  - Result, imported credentials are killed.
- All this means that we have to set up /sysroot/run before we switch-root, even though we want switch-root to be the thing setting up /sysroot/run for us.
UGH
| 04:52:00 |
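The ordering problem in that summary can be sketched as a toy model. This is not systemd's code: `switch_root_model`, its parameters, and the `/init` path are hypothetical names made up for illustration. It only mirrors the two steps as described above (existence check first, then a bind mount of /run that is skipped when something is already mounted there).

```python
# Toy model of the switch-root ordering described in the summary above.
# NOT systemd's implementation; every name here is hypothetical.

INIT_IN_RUN = "/run/current-system/systemd/lib/systemd/systemd"


def switch_root_model(sysroot_files, initrd_run, init, run_premounted):
    """Simulate the two steps in the order the summary describes.

    sysroot_files:  paths that exist on the new root itself
    initrd_run:     paths that only exist in the initrd's /run
    init:           path of the new PID1, as seen inside the new root
    run_premounted: True if /run was bind-mounted onto the new root beforehand
    """
    # What the new root looks like before switch_root touches anything.
    visible = set(sysroot_files)
    if run_premounted:
        visible |= set(initrd_run)

    # Step 1: the existence check on the new PID1 runs before any bind mounts.
    if init not in visible:
        return "refused"

    # Step 2: bind-mount /run, skipped when something is already mounted there.
    if run_premounted:
        return "switched, /run mount left untouched"
    return "switched, /run bind-mounted by switch_root"


if __name__ == "__main__":
    for premounted in (False, True):
        print(premounted, switch_root_model({"/init"}, {INIT_IN_RUN}, INIT_IN_RUN, premounted))
```

In this model an init under /run is only accepted when /run was mounted by hand first, which is the "we have to set up /sysroot/run before we switch-root" conclusion from the summary.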
@elvishjerricco:matrix.org | Additionally, there's a related problem for trying to eliminate specialFileSystems. Activation expects that some things in /sys and /proc are mounted too, not just /run, so now we also have to set up those! Now, I think those can be temporary, but it's still something I wish was just handled by systemd. | 04:55:43 |
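A quick way to see which of these are actually in place before activation runs might look like the sketch below. The helper name `missing_mounts` is hypothetical, and checking only the top-level proc, sys and run directories is an assumption: the message above does not say which paths under /sys and /proc activation needs.

```python
import os


def missing_mounts(root, expected=("proc", "sys", "run")):
    """Return the expected directories under root that are not mount points.

    Assumption: only the top-level directories are checked, not the specific
    paths inside them that activation may rely on.
    """
    return [
        name
        for name in expected
        if not os.path.ismount(os.path.join(root, name))
    ]
```

From the initrd this could be called as `missing_mounts("/sysroot")`; an empty list would mean all three are mounted under the new root.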
@elvishjerricco:matrix.org | Oh I have a bad idea. A really bad idea. We could solve all of this by switch-rooting into a system that's almost completely unconfigured, except for one unit that runs activation in the real, current root, and then does a soft-reboot into the real system | 04:58:31 |
phaer | Wait, why would you need a soft-reboot here? Shouldn't you end up with a working system after activation - similar to if you just run activation in an already booted system, i.e. during nixos-rebuild? What am I missing? | 16:57:46 |
phaer | Might actually give this a try later today/tomorrow to find out :D | 16:58:00 |
@elvishjerricco:matrix.org | phaer: You'd do the soft-reboot just to avoid the complex process of switch-to-configuration | 18:43:21 |
@elvishjerricco:matrix.org | it's a beast that really shouldn't be part of bootup | 18:43:28 |
@elvishjerricco:matrix.org | funny thought I just had about that idea. The intermediate phase would be like a stage 1.5, except it's more like stage 2 because it actually exists in the stage 2 rootfs, so maybe more like stage 2.-5? :P | 18:45:08 |
@elvishjerricco:matrix.org | Imagine trying to explain to upstream systemd "yea, this error happens on nixos during stage two and a negative half" | 18:45:54 |
iridium | @elvishjerricco:matrix.org: My notebook just crashed during upgrade - again, systemd restart failed. I now have a system in defunct state, but do have a shell. What useful things should I look at to collect more data for debugging? 🙂 | 18:59:35 |
@elvishjerricco:matrix.org | oh gosh | 18:59:58 |
@elvishjerricco:matrix.org | I need to remind myself of your exact issue again | 19:00:06 |
iridium | https://discourse.nixos.org/t/system-inoperable-after-automatic-upgrades/50197/2 | 19:02:16 |
@elvishjerricco:matrix.org | iridium: Are you able to open journalctl -e? | 19:03:05 |
iridium | Yes: https://pastebin.com/raw/7Yz2drXP | 19:05:09 |
@elvishjerricco:matrix.org | ok good. And you would expect that downgrading and redoing the upgrade would trigger it again, right? | 19:05:25 |
iridium | Not sure if relevant: https://pastebin.com/raw/GSkBauCk | 19:05:58 |
iridium | I have to admit I never tried, but would guess so, yes | 19:06:14 |
@elvishjerricco:matrix.org | oh if you know how to use gdb productively, that could be useful :P I am at level zero with that stuff | 19:06:51 |
iridium | that specific stacktrace couldn't be less interesting tbh | 19:07:16 |
@elvishjerricco:matrix.org | if it does, it'd be good to try booting the old generation, but adding systemd.log_level=debug to the kernel params from your boot menu, and then doing the upgrade | 19:07:39 |
@elvishjerricco:matrix.org | those journal logs could be much more useful | 19:07:46 |
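After booting the old generation with the extra parameter, it may be worth confirming it actually made it onto the kernel command line before redoing the upgrade. A minimal sketch, where the helper name `has_debug_logging` is hypothetical:

```python
def has_debug_logging(cmdline):
    """True if the kernel command line contains systemd.log_level=debug."""
    return "systemd.log_level=debug" in cmdline.split()


# Usage on the running system:
# with open("/proc/cmdline") as f:
#     print(has_debug_logging(f.read()))
```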
iridium | Anything I should do with the machine right now, in case I don't manage to get it into exactly the same state again afterwards? | 19:08:11 |
@elvishjerricco:matrix.org | not that I can think of unfortunately. | 19:08:33 |
@elvishjerricco:matrix.org | iridium: last time you said you could reproduce it if you had an NFS and/or an SSHFS mounted. Was that the case this time? | 19:11:00 |
iridium | NFS yes, sshfs no | 19:11:09 |
iridium | NFS over wireguard, to be specific | 19:11:16 |
@elvishjerricco:matrix.org | where is the NFS mounted? | 19:11:58 |