ceph - sepia - 2024-09-20

Timestamp (UTC) | Message
2024-09-20T00:37:54.807Z
<nehaojha> ```[WRN] UPGRADE_REDEPLOY_DAEMON: Upgrading daemon iscsi.iscsi.reesi002.hjvvyq on host reesi002 failed.
    Upgrade daemon: iscsi.iscsi.reesi002.hjvvyq: cephadm exited with an error code: 1, stderr: Non-zero exit code 125 from /usr/bin/podman container inspect --format {{.State.Status}} ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a-iscsi.iscsi.reesi002.hjvvyq
/usr/bin/podman: stderr Error: error inspecting object: no such container ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a-iscsi.iscsi.reesi002.hjvvyq
Deploy daemon iscsi.iscsi.reesi002.hjvvyq ...
Creating ceph-iscsi config...
Write file: /var/lib/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a/iscsi.iscsi.reesi002.hjvvyq/iscsi-gateway.cfg
Write file: /var/lib/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a/iscsi.iscsi.reesi002.hjvvyq/tcmu-runner-entrypoint.sh
Failed to trim old cgroups /sys/fs/cgroup/system.slice/system-ceph\x2d28f7427e\x2d5558\x2d4ffd\x2dae1a\x2d51ec3042759a.slice/ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@iscsi.iscsi.reesi002.hjvvyq.service
Non-zero exit code 1 from systemctl start ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@iscsi.iscsi.reesi002.hjvvyq
systemctl: stderr Job for ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@iscsi.iscsi.reesi002.hjvvyq.service failed because the control process exited with error code.
systemctl: stderr See "systemctl status ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@iscsi.iscsi.reesi002.hjvvyq.service" and "journalctl -xeu ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@iscsi.iscsi.reesi002.hjvvyq.service" for details.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/var/lib/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/__main__.py", line 5579, in <module>
  File "/var/lib/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/__main__.py", line 5567, in main
  File "/var/lib/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/__main__.py", line 3049, in command_deploy_from
  File "/var/lib/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/__main__.py", line 3084, in _common_deploy
  File "/var/lib/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/__main__.py", line 3104, in _deploy_daemon_container
  File "/var/lib/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/__main__.py", line 1088, in deploy_daemon
  File "/var/lib/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/__main__.py", line 1224, in deploy_daemon_units
  File "/var/lib/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a/cephadm.a58127a8eed242cae13849ddbebcb9931d7a5410f406f2d264e3b1ed31d9605e/cephadmlib/call_wrappers.py", line 307, in call_throws
RuntimeError: Failed command: systemctl start ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@iscsi.iscsi.reesi002.hjvvyq: Job for ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@iscsi.iscsi.reesi002.hjvvyq.service failed because the control process exited with error code.
See "systemctl status ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@iscsi.iscsi.reesi002.hjvvyq.service" and "journalctl -xeu ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@iscsi.iscsi.reesi002.hjvvyq.service" for details.```
2024-09-20T00:38:20.415Z
<nehaojha> @Ilya Dryomov tagging you here as well in case you can help
2024-09-20T00:39:46.391Z
<nehaojha> <https://tracker.ceph.com/issues/51361#note-8>
2024-09-20T01:44:31.227Z
<Dan Mick> after some very blunt instruments, I've managed to restart the reesi002 iscsi service using the older (19.1.1) container.  I have no idea what was going wrong with the 19.2.0 one.  The python rbd api server was hanging unkillably with no information (strace hung as well).
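Dan doesn't say which blunt instruments were used; one way to do this through the orchestrator would be to pause the upgrade and redeploy just the stuck daemon from the previous image (a sketch only; the image repository and tag are assumptions):
```
ceph orch upgrade pause
ceph orch daemon redeploy iscsi.iscsi.reesi002.hjvvyq quay.io/ceph/ceph:v19.1.1
```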
2024-09-20T01:44:39.404Z
<Dan Mick> it looks like services are coming back up.
2024-09-20T07:35:06.770Z
<Ilya Dryomov> The trace is not exactly the same, so I can't say for sure whether that tracker ticket is relevant
2024-09-20T07:47:43.870Z
<Ilya Dryomov> The unkillable hang is likely caused by this:
```Sep 19 22:44:23 reesi002 kernel: [   94.132179] db_root: cannot be changed: target drivers registered
Sep 19 22:44:23 reesi002 kernel: [  243.103961] INFO: task rbd-target-api:6558 blocked for more than 120 seconds.
Sep 19 22:44:23 reesi002 kernel: [  243.104025]       Not tainted 5.15.0-116-generic #126-Ubuntu
Sep 19 22:44:23 reesi002 kernel: [  243.104042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 19 22:44:23 reesi002 kernel: [  243.104063] task:rbd-target-api  state:D stack:    0 pid: 6558 ppid:  6548 flags:0x00000006
Sep 19 22:44:23 reesi002 kernel: [  243.104070] Call Trace:
Sep 19 22:44:23 reesi002 kernel: [  243.104073]  <TASK>
Sep 19 22:44:23 reesi002 kernel: [  243.104077]  __schedule+0x24e/0x590
Sep 19 22:44:23 reesi002 kernel: [  243.104085]  schedule+0x69/0x110
Sep 19 22:44:23 reesi002 kernel: [  243.104088]  schedule_timeout+0x105/0x140
Sep 19 22:44:23 reesi002 kernel: [  243.104093]  __wait_for_common+0xae/0x150
Sep 19 22:44:23 reesi002 kernel: [  243.104096]  ? usleep_range_state+0x90/0x90
Sep 19 22:44:23 reesi002 kernel: [  243.104101]  wait_for_completion+0x24/0x30
Sep 19 22:44:23 reesi002 kernel: [  243.104107]  tcmu_netlink_event_send+0x171/0x2b0 [target_core_user]
Sep 19 22:44:23 reesi002 kernel: [  243.104116]  tcmu_destroy_device+0xb9/0x110 [target_core_user]
Sep 19 22:44:23 reesi002 kernel: [  243.104123]  target_free_device+0x50/0x110 [target_core_mod]
Sep 19 22:44:23 reesi002 kernel: [  243.104155]  target_core_dev_release+0x15/0x20 [target_core_mod]
Sep 19 22:44:23 reesi002 kernel: [  243.104173]  config_item_cleanup+0x5d/0x100
Sep 19 22:44:23 reesi002 kernel: [  243.104178]  config_item_put+0x35/0x50
Sep 19 22:44:23 reesi002 kernel: [  243.104181]  configfs_rmdir+0x1f7/0x390
Sep 19 22:44:23 reesi002 kernel: [  243.104187]  ? may_delete+0x111/0x2b0
Sep 19 22:44:23 reesi002 kernel: [  243.104191]  vfs_rmdir+0x86/0x1c0
Sep 19 22:44:23 reesi002 kernel: [  243.104195]  do_rmdir+0x173/0x1a0
Sep 19 22:44:23 reesi002 kernel: [  243.104199]  __x64_sys_rmdir+0x42/0x70
Sep 19 22:44:23 reesi002 kernel: [  243.104203]  x64_sys_call+0x1656/0x1fa0
Sep 19 22:44:23 reesi002 kernel: [  243.104209]  do_syscall_64+0x56/0xb0
Sep 19 22:44:23 reesi002 kernel: [  243.104215]  ? do_syscall_64+0x63/0xb0
Sep 19 22:44:23 reesi002 kernel: [  243.104221]  ? exit_to_user_mode_prepare+0x37/0xb0
Sep 19 22:44:23 reesi002 kernel: [  243.104227]  ? syscall_exit_to_user_mode+0x2c/0x50
Sep 19 22:44:23 reesi002 kernel: [  243.104231]  ? x64_sys_call+0x1a81/0x1fa0
Sep 19 22:44:23 reesi002 kernel: [  243.104236]  ? do_syscall_64+0x63/0xb0
Sep 19 22:44:23 reesi002 kernel: [  243.104241]  ? exit_to_user_mode_prepare+0x37/0xb0
Sep 19 22:44:23 reesi002 kernel: [  243.104246]  ? syscall_exit_to_user_mode+0x2c/0x50
Sep 19 22:44:23 reesi002 kernel: [  243.104250]  ? x64_sys_call+0x1e1d/0x1fa0
Sep 19 22:44:23 reesi002 kernel: [  243.104254]  ? do_syscall_64+0x63/0xb0
Sep 19 22:44:23 reesi002 kernel: [  243.104259]  ? syscall_exit_to_user_mode+0x2c/0x50
Sep 19 22:44:23 reesi002 kernel: [  243.104263]  ? x64_sys_call+0x1a81/0x1fa0
Sep 19 22:44:23 reesi002 kernel: [  243.104267]  ? do_syscall_64+0x63/0xb0
Sep 19 22:44:23 reesi002 kernel: [  243.104272]  ? exc_page_fault+0x89/0x170
Sep 19 22:44:23 reesi002 kernel: [  243.104276]  entry_SYSCALL_64_after_hwframe+0x6c/0xd6
Sep 19 22:44:23 reesi002 kernel: [  243.104281] RIP: 0033:0x7fe87359131b
Sep 19 22:44:23 reesi002 kernel: [  243.104285] RSP: 002b:00007fffad0ce238 EFLAGS: 00000206 ORIG_RAX: 0000000000000054
Sep 19 22:44:23 reesi002 kernel: [  243.104289] RAX: ffffffffffffffda RBX: 000055d06cf5b230 RCX: 00007fe87359131b
Sep 19 22:44:23 reesi002 kernel: [  243.104292] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007fe86ebe4470
Sep 19 22:44:23 reesi002 kernel: [  243.104294] RBP: 00007fffad0ce250 R08: 000000000000000a R09: 0000000000000000
Sep 19 22:44:23 reesi002 kernel: [  243.104296] R10: 00007fe8739211f8 R11: 0000000000000206 R12: 00000000ffffff9c
Sep 19 22:44:23 reesi002 kernel: [  243.104298] R13: 000055d06cf5b230 R14: 000055d06cf59940 R15: 00007fe86eb9cb88
Sep 19 22:44:23 reesi002 kernel: [  243.104301]  </TASK>```
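For reference, a generic way to spot the wedged tasks and pull the same kernel stacks without waiting for the hung-task detector (run as root on reesi002; pid 6558 is taken from the trace above):
```
# List tasks stuck in uninterruptible (D) sleep
ps -eo pid,stat,wchan:30,comm | awk 'NR==1 || $2 ~ /^D/'

# Kernel stack of the stuck rbd-target-api task
cat /proc/6558/stack

# Or dump all blocked tasks to the kernel log
echo w > /proc/sysrq-trigger
```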
2024-09-20T07:48:11.813Z
<Ilya Dryomov> And this:
```Sep 19 22:44:23 reesi002 kernel: [  243.104322] INFO: task ework-thread:23811 blocked for more than 120 seconds.
Sep 19 22:44:23 reesi002 kernel: [  243.104344]       Not tainted 5.15.0-116-generic #126-Ubuntu
Sep 19 22:44:23 reesi002 kernel: [  243.104361] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 19 22:44:23 reesi002 kernel: [  243.104382] task:ework-thread    state:D stack:    0 pid:23811 ppid:  8100 flags:0x00000002
Sep 19 22:44:23 reesi002 kernel: [  243.104386] Call Trace:
Sep 19 22:44:23 reesi002 kernel: [  243.104388]  <TASK>
Sep 19 22:44:23 reesi002 kernel: [  243.104390]  __schedule+0x24e/0x590
Sep 19 22:44:23 reesi002 kernel: [  243.104394]  schedule+0x69/0x110
Sep 19 22:44:23 reesi002 kernel: [  243.104396]  schedule_preempt_disabled+0xe/0x20
Sep 19 22:44:23 reesi002 kernel: [  243.104399]  rwsem_down_read_slowpath+0x33b/0x390
Sep 19 22:44:23 reesi002 kernel: [  243.104404]  down_read+0x43/0xa0
Sep 19 22:44:23 reesi002 kernel: [  243.104408]  walk_component+0x136/0x1c0
Sep 19 22:44:23 reesi002 kernel: [  243.104411]  link_path_walk.part.0.constprop.0+0x23f/0x3a0
Sep 19 22:44:23 reesi002 kernel: [  243.104415]  ? path_init+0x2c0/0x3f0
Sep 19 22:44:23 reesi002 kernel: [  243.104418]  path_openat+0xb5/0x2b0
Sep 19 22:44:23 reesi002 kernel: [  243.104423]  do_filp_open+0xb2/0x160
Sep 19 22:44:23 reesi002 kernel: [  243.104427]  ? __check_object_size+0x1d/0x30
Sep 19 22:44:23 reesi002 kernel: [  243.104431]  ? alloc_fd+0x53/0x180
Sep 19 22:44:23 reesi002 kernel: [  243.104438]  do_sys_openat2+0x9f/0x160
Sep 19 22:44:23 reesi002 kernel: [  243.104444]  __x64_sys_openat+0x55/0x90
Sep 19 22:44:23 reesi002 kernel: [  243.104448]  x64_sys_call+0x1a55/0x1fa0
Sep 19 22:44:23 reesi002 kernel: [  243.104454]  do_syscall_64+0x56/0xb0
Sep 19 22:44:23 reesi002 kernel: [  243.104460]  ? __do_softirq+0xd9/0x2e7
Sep 19 22:44:23 reesi002 kernel: [  243.104467]  ? exit_to_user_mode_prepare+0x37/0xb0
Sep 19 22:44:23 reesi002 kernel: [  243.104472]  ? irqentry_exit_to_user_mode+0xe/0x20
Sep 19 22:44:23 reesi002 kernel: [  243.104476]  ? irqentry_exit+0x1d/0x30
Sep 19 22:44:23 reesi002 kernel: [  243.104480]  ? common_interrupt+0x55/0xa0
Sep 19 22:44:23 reesi002 kernel: [  243.104483]  entry_SYSCALL_64_after_hwframe+0x6c/0xd6
Sep 19 22:44:23 reesi002 kernel: [  243.104487] RIP: 0033:0x7f3862c417c4
Sep 19 22:44:23 reesi002 kernel: [  243.104489] RSP: 002b:00007f3854e74b00 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
Sep 19 22:44:23 reesi002 kernel: [  243.104493] RAX: ffffffffffffffda RBX: 00007f3854e76640 RCX: 00007f3862c417c4
Sep 19 22:44:23 reesi002 kernel: [  243.104495] RDX: 0000000000000001 RSI: 00007f3854e74c20 RDI: 00000000ffffff9c
Sep 19 22:44:23 reesi002 kernel: [  243.104497] RBP: 00007f3854e74c20 R08: 0000000000000000 R09: 00007f3854e74957
Sep 19 22:44:23 reesi002 kernel: [  243.104499] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000001
Sep 19 22:44:23 reesi002 kernel: [  243.104501] R13: 000000000000000b R14: 00007f3862bcd980 R15: 0000000000000000
Sep 19 22:44:23 reesi002 kernel: [  243.104504]  </TASK>```
2024-09-20T07:48:59.907Z
<Ilya Dryomov> `ework-thread` belongs to tcmu-runner, so it looks like it managed to deadlock itself
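One way to confirm that mapping from the host (a small sketch; 23811 is the thread id reported in the trace above):
```
# Which process (thread group) does thread 23811 belong to?
grep -E '^(Name|Tgid|PPid)' /proc/23811/status

# Name and state of the owning process
ps -o pid,stat,comm -p "$(awk '/^Tgid:/ {print $2}' /proc/23811/status)"
```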
2024-09-20T07:51:26.037Z
<Ilya Dryomov> @Xiubo Li Did you see this before?
2024-09-20T07:54:48.854Z
<Ilya Dryomov> > after some very blunt instruments, I've managed to restart the reesi002 iscsi service using the older (19.1.1) container. I have no idea what was going wrong with the 19.2.0 one.
This is weird because there should be no code changes between 19.1.1 and 19.2.0
2024-09-20T07:54:59.012Z
<Ilya Dryomov> It should be literally the same `ceph-iscsi-3.8-1.el9.noarch.rpm` and `tcmu-runner-1.5.2-99.g1bdb239.el9.x86_64.rpm` packages
2024-09-20T07:57:15.782Z
<Ilya Dryomov> Perhaps rtslib got upgraded in the underlying CentOS Stream?
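If that theory needs checking, the package versions baked into the two images can be compared directly (a sketch; the image repository/tags and the python3-rtslib package name are assumptions):
```
for tag in v19.1.1 v19.2.0; do
  echo "== $tag =="
  podman run --rm --entrypoint rpm quay.io/ceph/ceph:$tag \
    -q python3-rtslib ceph-iscsi tcmu-runner
done
```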
2024-09-20T08:38:36.762Z
<Dan Mick> Containers can be examined.  Dunno.  6558 was indeed the pid that was misbehaving.  I think there's a persistent problem with upgrade: some kernel state isn't reset, and the new container can't start because it fails to initialize something in the kernel
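If leftover kernel-side LIO state is the suspicion, it can be inspected from the host, since configfs entries belong to the host kernel and survive the container being torn down (a sketch; clearing is destructive and only safe with no initiators connected, and `targetcli` may not be installed on the reesi hosts):
```
# Any target/backstore objects left behind by the old container?
ls /sys/kernel/config/target/core/ /sys/kernel/config/target/iscsi/ 2>/dev/null

# Wipe the whole LIO configuration (destructive)
targetcli clearconfig confirm=True
```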
2024-09-20T09:17:51.600Z
<Jose J Palacios-Perez> Hi @Adam Kraitman I managed to get a container with ceph running, I was planning to set a symlink from  `/var/lib/containers/` to a subdir in my home directory, so the space in `/dev/mapper/cl-root` is not exhausted (currently got only 3.4 G free). I'll hold my horses after the upgrade then. Please give me a shout when you plan to fresh install since I will be running the container continuously to collect perf results. Many thanks!
2024-09-20T12:05:40.536Z
<Xiubo Li> This is probably because some I/Os were stuck and the tcmu driver kept waiting
2024-09-20T12:06:09.881Z
<Xiubo Li> In theory, if you restart the tcmu-runner service, it can disappear
2024-09-20T12:07:39.369Z
<Ilya Dryomov> To me, the stack traces above don't seem to be related to I/O
2024-09-20T12:09:39.172Z
<Ilya Dryomov> This was on a freshly rebooted node -- I doubt LIO got fully set up
2024-09-20T12:11:13.171Z
<Ilya Dryomov> It seems like rbd-target-api was in the process of removing a directory in configfs, while tcmu-runner went to open something also in configfs
2024-09-20T12:12:12.742Z
<Xiubo Li> ```Sep 19 22:44:23 reesi002 kernel: [  243.104101]  wait_for_completion+0x24/0x30
Sep 19 22:44:23 reesi002 kernel: [  243.104107]  tcmu_netlink_event_send+0x171/0x2b0 [target_core_user]```
This one
2024-09-20T12:12:41.108Z
<Xiubo Li> It's waiting for tcmu-runner to respond
2024-09-20T12:13:02.481Z
<Ilya Dryomov> Right, but tcmu-runner can't respond because its ework-thread got taken out
2024-09-20T12:14:15.026Z
<Xiubo Li> Then just restart the tcmu-runner service; it will reset the ringbuffer in LIO/TCMU
2024-09-20T12:14:55.374Z
<Xiubo Li> We have resolved this issue in tcmu-runner
2024-09-20T12:15:30.737Z
<Ilya Dryomov> tcmu-runner is sitting in the kernel trying to grab a semaphore
2024-09-20T12:15:53.132Z
<Ilya Dryomov> ... so it's a deadlock with both processes in uninterruptible sleep
2024-09-20T12:16:31.093Z
<Ilya Dryomov> The service can't be restarted in that scenario
2024-09-20T12:29:12.317Z
<Xiubo Li> The second trace could be caused by the first one?
2024-09-20T12:29:36.559Z
<Xiubo Li> It won't block restarting the userspace tcmu-runner
2024-09-20T13:00:22.918Z
<Ilya Dryomov> So you mean start a new instance of tcmu-runner instead of restarting the one that is stuck?
2024-09-20T13:00:58.997Z
<Ilya Dryomov> Or the same but for rbd-target-api service?
2024-09-20T13:01:08.527Z
<Xiubo Li> No, just restart it. And it will clear the ringbuffer in kernel space.
2024-09-20T13:01:31.188Z
<Xiubo Li> Only the tcmu-runner service
2024-09-20T13:01:45.856Z
<Ilya Dryomov> Well, a restart (as I understand it) means killing the existing process
2024-09-20T13:01:56.487Z
<Xiubo Li> This is what we did before when hitting this.
2024-09-20T13:02:34.654Z
<Ilya Dryomov> A process that is stuck in uninterruptible sleep can't be killed
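A quick way to see that on the stuck node (generic sketch, using the rbd-target-api pid from the earlier trace):
```
kill -9 6558                  # SIGKILL is queued...
sleep 5
ps -o pid,stat,comm -p 6558   # ...but the task stays in state D; the signal can
                              # only be delivered once it leaves the kernel wait
```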
2024-09-20T13:03:00.347Z
<Ilya Dryomov> Are you saying that you saw these _exact_ traces before?
2024-09-20T13:03:45.380Z
<Xiubo Li> Yeah, in gluster-block cu cases. If it's in a container, just restarting the container will be okay. Killing it will work too, as I remember.
2024-09-20T13:04:20.475Z
<Xiubo Li> As I remember, several times
2024-09-20T13:06:55.097Z
<Xiubo Li> ```commit 5bf4f822c5d195a5588908eefffeae36ef9e0080
Author: Mike Christie <mchristi@redhat.com>
Date:   Thu Dec 14 20:01:00 2017 -0600

    libtcmu: fix unclean shutdown and restart
    
    If we restart a daemon using libtcmu while IO is in flight
    the kernel could have commands partially completed. This
    patch has us block the device so new IO is stopped, and
    then we reset the ring to a clean state.```
2024-09-20T13:07:05.454Z
<Xiubo Li> It seems to be this commit
2024-09-20T13:08:56.868Z
<Ilya Dryomov> I doubt there was any I/O going on there
This node was just rebooted
2024-09-20T13:25:38.190Z
<Xiubo Li> Not only the I/Os, but also the SCSI cmds to tcmu-runner itself could get stuck for some reason.
2024-09-20T13:25:45.459Z
<Xiubo Li> ```Sep 19 22:44:23 reesi002 kernel: [  243.104101]  wait_for_completion+0x24/0x30```
2024-09-20T15:56:44.720Z
<Guillaume Abrioux> i'm getting 403 errors from <https://qa-proxy.ceph.com>
2024-09-20T15:56:59.782Z
<Guillaume Abrioux> trying to access [https://qa-proxy.ceph.com/teuthology/akupczyk-2024-09-18_17:27:22-rados-aclamk-testin[…]2024-09-18-1004-distro-default-smithi/7911462/teuthology.log](https://qa-proxy.ceph.com/teuthology/akupczyk-2024-09-18_17:27:22-rados-aclamk-testing-nauvoo-2024-09-18-1004-distro-default-smithi/7911462/teuthology.log)
2024-09-20T16:06:07.529Z
<Laura Flores> +1, it was just brought up but no decisions made. We discussed the idea of adding a Shaman build at some point to address any dependency issues.
2024-09-20T16:13:50.179Z
<Casey Bodley> thanks, hoping we can prioritize this once squid is finally out. we'll want to backport 24.04 builds/testing to squid when it's ready. and for 'make check', the 22.04 builders are blocking us from upgrading compiler versions (clang in particular)
2024-09-20T16:36:12.560Z
<Zack Cerza> The cephfs mount on teuthology is broken; root can't even access it. The lab's down until that can be resolved.
2024-09-20T17:18:22.362Z
<Zack Cerza> I rebooted teuthology.front; the cephfs mount came back. But root fs access is _very_ slow
2024-09-20T17:19:22.830Z
<Zack Cerza> log access is back
2024-09-20T17:22:23.256Z
<Guillaume Abrioux> thanks @Zack Cerza
2024-09-20T18:12:31.655Z
<Zack Cerza> ok, jobs are running again
2024-09-20T18:42:08.917Z
<Dan Mick> Are there mechanisms to ensure the configfs access is shut down cleanly if there are I/Os in flight when the shutdown happens?
2024-09-20T18:43:09.524Z
<Dan Mick> might the issue, or some of the issues, be from chance timing of the shutdown vs the I/O state?
