!VhzbGHamdfMiGxpXyg:robins.wtf

NixOS LXC

36 Members
lxc, lxd, incus discussions related to NixOS
15 Servers

9 Jan 2025
@kristijan.zic:matrix.org KristijanZic joined the room. 21:40:59
13 Jan 2025
@adam:robins.wtf adamcstephens: i've been seeing odd behavior with containers on non-LTS. they'll get in a stuck state on reboot, or won't start, and now they're failing to communicate across my bridge. anybody else using non-LTS and seeing problems? 13:49:08
@adam:robins.wtf adamcstephens: it's bad enough i'm considering rebuilding the host i have on non-LTS. unfortunately the DB has been upgraded so can't downgrade 13:50:08
@adam:robins.wtf adamcstephens: when the first behavior happens (stuck containers), restarting incus yields a defunct process that I've only figured out how to clear with a host reboot 13:54:59
@adam:robins.wtf adamcstephens: well i guess i'll rollback to a previous generation on the host to fix the bridge issue until i can setup for bisecting what changed 14:38:37
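A rollback like the one described above is typically just a matter of reactivating the previous system generation; a minimal sketch, assuming the standard NixOS system profile:

# list system generations to see what is available to roll back to
sudo nix-env --list-generations --profile /nix/var/nix/profiles/system
# reactivate the previous generation
sudo nixos-rebuild switch --rollback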
@adam:robins.wtf adamcstephens: traffic from the containers or VMs just disappears when trying to flow across the vlan aware bridge 14:39:10
@adam:robins.wtf adamcstephens:
09:35:31.532885 fast0 P   IP deck > guide1: ICMP echo request, id 7, seq 8, length 64
09:35:31.532885 bond0 P   IP deck > guide1: ICMP echo request, id 7, seq 8, length 64
09:35:31.532893 tap2f7e02f4 Out IP deck > guide1: ICMP echo request, id 7, seq 8, length 64
09:35:31.533010 tap2f7e02f4 P   IP guide1 > deck: ICMP echo reply, id 7, seq 8, length 64
09:35:32.556903 fast0 P   IP deck > guide1: ICMP echo request, id 7, seq 9, length 64
09:35:32.556903 bond0 P   IP deck > guide1: ICMP echo request, id 7, seq 9, length 64
09:35:32.556909 tap2f7e02f4 Out IP deck > guide1: ICMP echo request, id 7, seq 9, length 64
09:35:32.557017 tap2f7e02f4 P   IP guide1 > deck: ICMP echo reply, id 7, seq 9, length 64
14:41:05
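The interface and direction columns in the capture above are what tcpdump prints when capturing on the "any" pseudo-interface; a minimal sketch of a comparable capture, reusing the host name from the trace (a reasonably recent tcpdump is assumed for the per-interface columns):

# watch ICMP to/from the container on every host interface at once
sudo tcpdump -i any icmp and host guide1

Read that way, the echo replies arrive on the container's tap device but never reappear on bond0/fast0, which lines up with traffic "just disappearing" across the vlan-aware bridge.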
17 Jan 2025
@adam:robins.wtf adamcstephens: I suspect the stuck container instance is an incus+zfs issue 13:55:53
@adam:robins.wtf adamcstephens: discovered last night that i couldn't delete some ephemeral containers, even when stopped, as zfs is complaining about dataset being busy 13:56:32
@c0ba1t:matrix.org Cobalt: The ZFS issue also exists on LTS. We have an app that spawns/orchestrates ephemeral containers via incus on ZFS+Incus and hit this regularly 14:00:48
@c0ba1t:matrix.org Cobalt: A workaround is to stop the container, then wait a few seconds and retry delete until it works. 14:01:35
@c0ba1t:matrix.org Cobalt: A reboot of the host also will help but that's quite disruptive. 14:02:24
@c0ba1t:matrix.org Cobalt: * The ZFS issue also exists on LTS. We have an app that spawns/orchestrates ephemeral containers via incus on ZFS+Incus and hit this regularly. It seems like some open FDs are lingering after container shutdown, given time they usually go away. 14:04:26
@c0ba1t:matrix.org Cobalt: * A reboot of the host also will help but that's quite disruptive for ops. 14:05:12
@c0ba1t:matrix.org Cobalt: We have only encountered this on our main prod server (under load, not on dev servers). We couldn't replicate it repeatedly but it sounds like an upstream bug. 14:10:36
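The stop/wait/retry workaround described above is easy to script; a minimal sketch, assuming the stock incus CLI and an instance name passed as the first argument:

#!/usr/bin/env bash
# stop the instance, then keep retrying the delete until the
# underlying ZFS dataset stops reporting busy
instance="$1"
incus stop "$instance" --force || true
until incus delete "$instance"; do
    echo "delete failed (dataset busy?), retrying in 5s" >&2
    sleep 5
done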
@adam:robins.wtf adamcstephens: Hmm, that's good to know. I'm going to update our nixos test to see if I can reproduce it there 15:02:44
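For reference, a rough sketch of exercising those tests locally from a nixpkgs checkout; the exact test attribute is an assumption (check nixos/tests/incus/ for the real names):

# build/run one of the incus NixOS tests (attribute name assumed)
nix-build -A nixosTests.incus.container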
@adam:robins.wtf adamcstephens: I really regret not trying harder to make a python package out of these nixos tests when i refactored them 15:27:44
@adam:robins.wtf adamcstephens: 😢 i'm not seeing the issue in our nixos test 15:46:22
@c0ba1t:matrix.org Cobalt: Have you tried rapidly starting/stopping containers? It might also help to have a long running compute intensive process inside of the containers. 15:47:40
@adam:robins.wtf adamcstephens: not yet, i'll keep trying to throw things at it 15:48:12
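Along the lines of that suggestion, a rough reproducer sketch: launch a batch of ephemeral containers with a CPU-hungry process inside, stop them all at once, and look for instances or datasets that refuse to go away. The image name and counts are assumptions:

#!/usr/bin/env bash
# crude attempt at reproducing the busy-dataset issue
set -u
image="images:alpine/edge"   # assumption: any small cached image will do

for i in $(seq 1 20); do
    incus launch "$image" "repro-$i" --ephemeral
    # keep a core busy inside the container
    incus exec "repro-$i" -- sh -c 'nohup yes > /dev/null 2>&1 &'
done

sleep 10

# stop everything at once; ephemeral instances should delete themselves
for i in $(seq 1 20); do
    incus stop "repro-$i" --force &
done
wait

# anything still listed here (or any lingering dataset) points at the bug
incus list repro
zfs list | grep repro || true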
22 Jan 2025
@adam:robins.wtf adamcstephens: re-ran my upgrade on my non-LTS host, but to 6.12 this time, and not seeing the bridge issue again. could have been a kernel bug, but not looking into it any further since i only have so many spoons. 13:11:43
@adam:robins.wtf adamcstephens: still want to diagnose the ZFS issue, but the only reliable reproducer I've found is on my running host where running switch-to-configuration boot on another generation breaks during restart 13:14:30
@adam:robins.wtf adamcstephens: Cobalt: have you tried creating a simple reproducer? 13:15:03
@c0ba1t:matrix.org Cobalt:
In reply to @adam:robins.wtf: "Cobalt: have you tried creating a simple reproducer?"

No, didn't have the time for it yet. Our testing is not really made for this kind of error so it might not be easily doable.

The farthest I got was to let our software start a bunch (20) ephemeral container instances and then issue a shutdown. This led to 1-2 containers having the ZFS issue sometimes.

13:19:07
@c0ba1t:matrix.org Cobalt: * No, didn't have the time for it yet. Our testing setup is not really made for this kind of error so it might not be easily doable. The farthest I got was to let our software start a bunch (20) ephemeral container instances and then issue a shutdown. This led to 1-2 containers having the ZFS issue sometimes. 13:19:56
@adam:robins.wtf adamcstephens: have you tried any different kernels, or zfs 2.3? 13:22:27
@c0ba1t:matrix.org Cobalt: No, we are on NixOS stable with a hardened 6.12 kernel. As mentioned previously, we have a workaround in place, so it hasn't come up (to us) again 22:26:46
@c0ba1t:matrix.org Cobalt: As far as I know we also aren't logging this because it didn't have observable side-effects after a retry. I will look into it but our prod environment is a somewhat bad place to test with kernels etc. 22:27:57
@c0ba1t:matrix.org Cobalt: * As far as I know we also aren't logging this because it didn't have observable side-effects after a retry. I will look into it but our prod environment is a somewhat bad place to test with kernels etc. I can't really justify downtime/debugging in production at the OS level for a resolved issue. 22:29:24
