Message | Time |
---|---|
13 Jan 2025 | |
i've been seeing odd behavior with containers on non-LTS. they'll get into a stuck state on reboot, or won't start, and now they're failing to communicate across my bridge. anybody else using non-LTS and seeing problems? | 13:49:08 |
it's bad enough i'm considering rebuilding the host i have on non-LTS. unfortunately the DB has been upgraded, so i can't downgrade | 13:50:08 |
when the first behavior happens (stuck containers), restarting incus yields a defunct process that I've only figured out how to clear with a host reboot | 13:54:59 |
well i guess i'll roll back to a previous generation on the host to fix the bridge issue until i can set up for bisecting what changed | 14:38:37 |
traffic from the containers or VMs just disappears when trying to flow across the vlan-aware bridge | 14:39:10 |
17 Jan 2025 | |
I suspect the stuck-container behavior is an incus+zfs issue | 13:55:53 |
discovered last night that i couldn't delete some ephemeral containers, even when stopped, as zfs was complaining about the dataset being busy | 13:56:32 |
The ZFS issue also exists on LTS. We have an app that spawns/orchestrates ephemeral containers via Incus on ZFS and hit this regularly. It seems like some open FDs linger after container shutdown; given time they usually go away. | 14:00:48 |
A workaround is to stop the container, then wait a few seconds and retry the delete until it works. | 14:01:35 |
A reboot of the host will also help, but that's quite disruptive for ops. | 14:02:24 |
We have only encountered this on our main prod server (under load, not on dev servers). We couldn't replicate it consistently, but it sounds like an upstream bug. | 14:10:36 |
Hmm, that's good to know. I'm going to update our nixos test to see if I can reproduce it there | 15:02:44 |
I really regret not trying harder to make a python package out of these nixos tests when i refactored them | 15:27:44 |
😢 i'm not seeing the issue in our nixos test | 15:46:22 |
Have you tried rapidly starting/stopping containers? It might also help to have a long-running, compute-intensive process inside the containers. | 15:47:40 |
not yet, i'll keep trying to throw things at it | 15:48:12 |
22 Jan 2025 | |
re-ran my upgrade on my non-LTS host, but to 6.12 this time, and i'm no longer seeing the bridge issue. could have been a kernel bug, but i'm not looking into it any further since i only have so many spoons. | 13:11:43 |
still want to diagnose the ZFS issue, but the only reliable reproducer I've found is on my running host, where running switch-to-configuration boot on another generation breaks during the restart | 13:14:30 |
Cobalt: have you tried creating a simple reproducer? | 13:15:03 |
In reply to @adam:robins.wtf: No, didn't have the time for it yet. Our testing setup is not really made for this kind of error, so it might not be easily doable. The farthest I got was to let our software start a bunch (20) of ephemeral container instances and then issue a shutdown. This led to 1-2 containers having the ZFS issue sometimes. | 13:19:07 |
have you tried any different kernels, or zfs 2.3? | 13:22:27 |
No, we are on NixOS stable with a hardened 6.12 kernel. As mentioned previously, we have a workaround in place, so it hasn't come up (for us) again | 22:26:46 |
As far as I know we also aren't logging this, because it didn't have observable side effects after a retry. I will look into it, but our prod environment is a somewhat bad place to test with kernels etc. I can't really justify downtime/debugging in production at the OS level for a resolved issue. | 22:27:57 |
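The stop-wait-retry workaround described at 14:01:35 is easy to script. Below is a minimal sketch, not the poster's actual tooling: it assumes the incus CLI is on PATH and uses a hypothetical container name.

```python
import subprocess
import time

def delete_with_retry(name: str, attempts: int = 10, delay: float = 3.0) -> bool:
    """Stop an Incus container, then retry deletion until ZFS releases the dataset."""
    # Stopping an already-stopped container returns non-zero, so ignore failures here.
    subprocess.run(["incus", "stop", name], check=False)
    for _ in range(attempts):
        result = subprocess.run(["incus", "delete", name], capture_output=True, text=True)
        if result.returncode == 0:
            return True
        # Typical failure mode from the log: ZFS reports the dataset as busy.
        time.sleep(delay)
    return False

if __name__ == "__main__":
    delete_with_retry("test-container")  # hypothetical container name
```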
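One way to test the lingering-open-FD theory from 14:00:48 is to scan /proc for processes that still hold files open under the container's dataset mountpoint. A rough sketch; the mountpoint below is an assumption about a default Incus/ZFS pool layout, and reading other processes' fd tables usually requires root.

```python
import os

def procs_holding(path: str) -> dict[int, list[str]]:
    """Return {pid: [open paths]} for processes with file descriptors under `path`."""
    prefix = os.path.realpath(path)
    holders: dict[int, list[str]] = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        fd_dir = f"/proc/{entry}/fd"
        try:
            fds = os.listdir(fd_dir)
        except (PermissionError, FileNotFoundError):
            continue  # process exited, or not readable without root
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target.startswith(prefix):
                holders.setdefault(int(entry), []).append(target)
    return holders

if __name__ == "__main__":
    # Assumed mountpoint; adjust to whatever dataset incus/zfs reports as busy.
    for pid, paths in procs_holding("/var/lib/incus/storage-pools/default/containers").items():
        print(pid, paths[:3])
```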
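The rapid start/stop idea from 15:47:40 could be driven by a script along these lines, assuming incus is already initialised with a storage pool; the image alias, container name, and loop counts are placeholders.

```python
import subprocess

IMAGE = "images:alpine/3.20"   # placeholder image alias
NAME = "churn0"                # placeholder container name

def incus(*args: str, check: bool = True) -> None:
    subprocess.run(["incus", *args], check=check)

incus("launch", IMAGE, NAME)
for _ in range(30):
    # Burn CPU inside the container for a few seconds (timeout exits non-zero,
    # so don't treat that as a failure), then force a stop/start cycle.
    incus("exec", NAME, "--", "timeout", "5", "sh", "-c", "while :; do :; done",
          check=False)
    incus("stop", "--force", NAME, check=False)
    incus("start", NAME)
incus("delete", "--force", NAME, check=False)
```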
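And the reproducer described at 13:19:07 (start roughly 20 ephemeral instances, then shut them down) could be approximated like this. Since an ephemeral container is deleted when it stops, a busy ZFS dataset should surface here as a failed stop; the image alias is again a placeholder.

```python
import subprocess

IMAGE = "images:alpine/3.20"   # placeholder image alias
NAMES = [f"eph{i}" for i in range(20)]

# Launch a batch of ephemeral containers, then stop them all and record
# which stops (and therefore deletions) fail.
for name in NAMES:
    subprocess.run(["incus", "launch", IMAGE, name, "--ephemeral"], check=True)

failures = []
for name in NAMES:
    result = subprocess.run(["incus", "stop", name], capture_output=True, text=True)
    if result.returncode != 0:
        failures.append((name, result.stderr.strip()))

for name, err in failures:
    print(f"{name}: {err}")
```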