| 4 Apr 2023 |
hexa | stuck at kexinit | 13:05:12 |
hexa | if I didn't know any better I would assume MTU 😛 | 13:05:30 |
cole-h | [ 550.001721] mlx5_core 0001:01:00.1: wait_func:1137:(pid 18057): MODIFY_CQ(0x403) canceled on out of queue timeout.
[ 550.001723] mlx5_core 0001:01:00.0: wait_func:1137:(pid 18053): MODIFY_CQ(0x403) canceled on out of queue timeout.
[ 551.221694] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 551.227603] rcu: 60-...0: (4 GPs behind) idle=d9e4/1/0x4000000000000000 softirq=1016/1016 fqs=54368
[ 551.236726] (detected by 34, t=115624 jiffies, g=13597, q=153057 ncpus=80)
[ 551.243675] Task dump for CPU 60:
[ 551.246977] task:kworker/u160:5 state:R running task stack:0 pid:815 ppid:2 flags:0x0000000a
[ 551.256879] Workqueue: efi_rts_wq efi_call_rts
[ 551.261313] Call trace:
[ 551.263747] __switch_to+0xf0/0x170
[ 551.267226] 0xffff081f5b486ac0
[ 556.145647] mlx5_core 0001:01:00.0: wait_func:1137:(pid 18065): ACCESS_REG(0x805) canceled on out of queue timeout.
[ 558.193622] mlx5_core 0001:01:00.0: wait_func:1137:(pid 18068): ACCESS_REG(0x805) canceled on out of queue timeout.
| 13:05:34 |
cole-h | lol | 13:05:36 |
hexa | low entropy? | 13:05:38 |
hexa | that call trace is magnificent | 13:06:10 |
hexa | __switch_to! | 13:06:14 |
cole-h | [ 605.297123] INFO: task kworker/u160:3:519 blocked for more than 483 seconds.
[ 605.324779] Tainted: P O 6.1.22 #1-NixOS
[ 605.330601] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.338417] task:kworker/u160:3 state:D stack:0 pid:519 ppid:2 flags:0x00000008
[ 605.346758] Workqueue: events_freezable_power_ sync_hw_clock
[ 605.352409] Call trace:
[ 605.354844] __switch_to+0xf0/0x170
[ 605.358325] __schedule+0x30c/0x1254
[ 605.361889] schedule+0x58/0xec
[ 605.365017] schedule_timeout+0x14c/0x180
[ 605.369017] __wait_for_common+0xd4/0x250
[ 605.373017] wait_for_completion+0x28/0x34
[ 605.377102] virt_efi_set_time+0x114/0x190
[ 605.381188] efi_set_time+0x84/0xc0
[ 605.384664] rtc_set_time+0xc0/0x1c4
[ 605.388229] sync_hw_clock+0x1ac/0x230
[ 605.391966] process_one_work+0x1f4/0x460
[ 605.395966] worker_thread+0x188/0x4e0
[ 605.399704] kthread+0xe0/0xe4
[ 605.402747] ret_from_fork+0x10/0x20
[ 605.406326] INFO: task kworker/7:1H:808 blocked for more than 362 seconds.
[ 605.413189] Tainted: P O 6.1.22 #1-NixOS
[ 605.419009] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 605.426826] task:kworker/7:1H state:D stack:0 pid:808 ppid:2 flags:0x00000008
[ 605.435165] Workqueue: kblockd blk_mq_timeout_work
[ 605.439946] Call trace:
[ 605.442381] __switch_to+0xf0/0x170
[ 605.445858] __schedule+0x30c/0x1254
[ 605.449423] schedule+0x58/0xec
[ 605.452551] schedule_timeout+0x14c/0x180
[ 605.456550] __wait_for_common+0xd4/0x250
[ 605.460548] wait_for_completion+0x28/0x34
[ 605.464633] __wait_rcu_gp+0x194/0x1c4
[ 605.468371] synchronize_rcu+0x68/0xa0
[ 605.472110] blk_mq_timeout_work+0x198/0x1dc
[ 605.476369] process_one_work+0x1f4/0x460
[ 605.480368] worker_thread+0x188/0x4e0
[ 605.484106] kthread+0xe0/0xe4
[ 605.487149] ret_from_fork+0x10/0x20
| 13:06:14 |
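Both dumps point the same way: sync_hw_clock is stuck in virt_efi_set_time on the efi_rts_wq workqueue, i.e. the hang is inside the firmware's EFI runtime services rather than in the kernel proper. A minimal sketch of a one-off test, assuming the box uses systemd-boot (efi=noruntime itself is a documented kernel parameter):

    # At the boot menu, press `e` and append efi=noruntime to the kernel
    # command line, then boot once. This disables EFI runtime services
    # (SetTime, GetVariable, ...), so virt_efi_set_time can no longer wedge.
    cat /proc/cmdline       # confirm the parameter took effect after boot
    dmesg | grep -i efi     # look for the runtime-services-disabled message

If the stalls disappear with runtime services off, the regression hunt moves from the kernel to this firmware's SetTime implementation.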
cole-h | I'm gonna bonk it again | 13:06:39 |
hexa | want to try a previous regeneration? | 13:06:59 |
hexa | if you even have that 😄 | 13:07:08 |
cole-h | not yet (because it's not easy, if possible lol) | 13:07:16 |
cole-h | telling Equinix to reboot the box is much easier hehe | 13:07:30 |
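Worth noting for later: as long as older generations have not been garbage-collected, switching back is a couple of commands. A sketch assuming the stock profile path (the generation number 122 is made up):

    # List the system generations still on disk:
    nix-env --list-generations --profile /nix/var/nix/profiles/system
    # Make an older generation (122 here, hypothetical) the boot default:
    nix-env --switch-generation 122 --profile /nix/var/nix/profiles/system
    /nix/var/nix/profiles/system/bin/switch-to-configuration boot

Picking the older entry straight from the boot menu also works for a single boot, which fits the "tell Equinix to reboot it" workflow.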
hexa | could very well be a kernel regression | 13:07:31 |
cole-h | lovely | 13:07:39 |
hexa | because who tests lts kernels, right? | 13:08:03 |
hexa | you just backport stuff into it and move on | 13:08:11 |
cole-h | lmao | 13:08:15 |
raitobezarius | In reply to @hexa:lossy.network ("you just backport stuff into it and move on"): greg k-h enters the channel | 14:47:01 |
cole-h | Found this thread: https://lkml.org/lkml/2023/3/16/765
So while it's not 6.2 as that thread mentions, it may be the same problem | 14:47:34 |
hexa | kernel downgrade when | 14:52:46 |
hexa | would also be interesting to know what its previous kernel version was | 14:55:45 |
cole-h | 🤷 the box is unpinned, but likely 6.1.21 was its previous version | 14:56:08 |
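The guess is checkable even on an unpinned box: every surviving generation carries a kernel symlink whose store path names the version. A sketch, assuming a stock NixOS layout:

    uname -r                                          # kernel currently running
    ls -l /nix/var/nix/profiles/system-*-link/kernel
    # each link resolves into /nix/store/...-linux-<version>/, so the
    # previous generation's kernel version can be read straight off the path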
cole-h | ok, 6.1.21 is also busted
[ 110.726426] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 110.732340] rcu: 44-...0: (0 ticks this GP) idle=9b44/1/0x4000000000000000 softirq=2863/2863 fqs=2462
[ 110.741636] (detected by 70, t=5255 jiffies, g=14289, q=8797 ncpus=80)
[ 110.748238] Task dump for CPU 44:
[ 110.751540] task:kworker/u160:1 state:R running task stack:0 pid:419 ppid:2 flags:0x0000000a
[ 110.761443] Workqueue: efi_rts_wq efi_call_rts
[ 110.765878] Call trace:
[ 110.768312] __switch_to+0xf0/0x170
[ 110.771791] 0xffff07ff85645b80
| 15:38:21 |
cole-h | nvm it's still 6.1.22 somehow | 15:39:29 |
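The "somehow" has a quick check: NixOS keeps separate links for the generation that was booted and the one last activated, so a mismatch is visible at a glance:

    uname -r                      # what the running kernel reports
    readlink /run/booted-system   # generation the machine actually booted
    readlink /run/current-system  # generation last switched to
    # If these differ, the downgrade was only staged and the box came
    # back up on the old boot entry, still running 6.1.22.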
cole-h | [ 242.441034] INFO: task kworker/u160:0:9 blocked for more than 120 seconds.
[ 242.447910] Tainted: P O 6.1.22 #1-NixOS
[ 242.453735] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.461553] task:kworker/u160:0 state:D stack:0 pid:9 ppid:2 flags:0x00000008
[ 242.469895] Workqueue: events_freezable_power_ sync_hw_clock
[ 242.475548] Call trace:
[ 242.477986] __switch_to+0xf0/0x170
[ 242.481467] __schedule+0x30c/0x1254
[ 242.485035] schedule+0x58/0xec
[ 242.488164] schedule_timeout+0x14c/0x180
[ 242.492165] __wait_for_common+0xd4/0x250
[ 242.496165] wait_for_completion+0x28/0x34
[ 242.500251] virt_efi_set_time+0x114/0x190
[ 242.504339] efi_set_time+0x84/0xc0
[ 242.507818] rtc_set_time+0xc0/0x1c4
[ 242.511385] sync_hw_clock+0x1ac/0x230
[ 242.515123] process_one_work+0x1f4/0x460
[ 242.519124] worker_thread+0x188/0x4e0
[ 242.522863] kthread+0xe0/0xe4
[ 242.525908] ret_from_fork+0x10/0x20
| 15:39:43 |
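The warning names its own knob; silencing it is possible but only hides the symptom while the SetTime call stays wedged. A sketch (120 seconds is the stock default):

    sysctl kernel.hung_task_timeout_secs               # threshold, 120 by default
    echo 0 > /proc/sys/kernel/hung_task_timeout_secs   # 0 disables the reports
    # the EFI call is still stuck; this only stops the messages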
hexa | In reply to @cole-h:matrix.org ("not yet (because it's not easy, if possible lol)") 🤡 | 15:40:13 |
cole-h | oh I missed something lol, let's try again | 15:55:36 |
cole-h | [ 109.650254] rcu: rcu_sched kthread timer wakeup didn't happen for 3034 jiffies! g13841 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 109.661456] rcu: Possible timer handling issue on cpu=44 timer-softirq=210
[ 109.668404] rcu: rcu_sched kthread starved for 3040 jiffies! g13841 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=44
[ 109.678737] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 109.687682] rcu: RCU grace-period kthread stack dump:
[ 109.692720] task:rcu_sched state:I stack:0 pid:14 ppid:2 flags:0x00000008
[ 109.701057] Call trace:
[ 109.703490] __switch_to+0xf0/0x170
[ 109.706967] __schedule+0x30c/0x1254
[ 109.710530] schedule+0x58/0xec
[ 109.713659] schedule_timeout+0xa4/0x180
[ 109.717571] rcu_gp_fqs_loop+0x138/0x4ac
[ 109.721483] rcu_gp_kthread+0x1d4/0x210
[ 109.725307] kthread+0xe0/0xe4
[ 109.728350] ret_from_fork+0x10/0x20
welp
| 16:22:23 |
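One tying-together observation: the starved CPU here (44) is the same CPU the efi_rts_wq worker was dumped on at 15:38, which fits a firmware call that never returns and starves everything scheduled behind it. A quick way to see what is sitting on a given CPU, sketched for CPU 44 (ps may itself stall if the machine is far enough gone):

    # PSR is the processor each task last ran on:
    ps -eo pid,psr,stat,comm | awk '$2 == 44'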