ceph - sepia - 2024-07-24

Timestamp (UTC) | Message
2024-07-24T07:15:52.308Z
<Ilya Dryomov> @Zack Cerza I have some additional data for the
> Re: recent uptick of test jobs that get stuck and cleaned up only after 12h timeout
thread.  In my krbd run, I have at least 7 jobs that got stuck at or around
```2024-07-24T01:56:04.054 INFO:teuthology.run_tasks:Running task internal.base...
2024-07-24T01:56:04.061 INFO:teuthology.task.internal:Creating test directory...
2024-07-24T01:56:04.062 DEBUG:teuthology.orchestra.run.smithi003:> mkdir -p -m0755 -- /home/ubuntu/cephtest
2024-07-24T01:56:04.067 DEBUG:teuthology.orchestra.run.smithi016:> mkdir -p -m0755 -- /home/ubuntu/cephtest
2024-07-24T01:56:04.075 DEBUG:teuthology.orchestra.run.smithi089:> mkdir -p -m0755 -- /home/ubuntu/cephtest```
shortly after rebooting into the testing kernel.  Remembering the
> Something must have flushed when the job reached the 12 hour mark, likely as part of it getting killed
point, I killed one of them and checked `teuthology.log`.  This showed up after the kill:
```2024-07-24T06:58:41.243 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_a8aed60fc62d4ff39d2216b360f018eba7518cfe/teuthology/run_tasks.py", line 109, in run_tasks
    manager.__enter__()
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_teuthology_a8aed60fc62d4ff39d2216b360f018eba7518cfe/teuthology/task/internal/__init__.py", line 41, in base
    run.wait(
  File "/home/teuthworker/src/git.ceph.com_teuthology_a8aed60fc62d4ff39d2216b360f018eba7518cfe/teuthology/orchestra/run.py", line 479, in wait
    proc.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_a8aed60fc62d4ff39d2216b360f018eba7518cfe/teuthology/orchestra/run.py", line 143, in wait
    status = self._get_exitstatus()
  File "/home/teuthworker/src/git.ceph.com_teuthology_a8aed60fc62d4ff39d2216b360f018eba7518cfe/teuthology/orchestra/run.py", line 192, in _get_exitstatus
    status = self._stdout_buf.channel.recv_exit_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_a8aed60fc62d4ff39d2216b360f018eba7518cfe/virtualenv/lib/python3.10/site-packages/paramiko/channel.py", line 400, in recv_exit_status
    self.status_event.wait()
  File "src/gevent/event.py", line 163, in gevent._gevent_cevent.Event.wait
  File "src/gevent/_abstract_linkable.py", line 521, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait
  File "src/gevent/_abstract_linkable.py", line 487, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 490, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 442, in gevent._gevent_c_abstract_linkable.AbstractLinkable._AbstractLinkable__wait_to_be_notified
  File "src/gevent/_abstract_linkable.py", line 451, in gevent._gevent_c_abstract_linkable.AbstractLinkable._switch_to_hub
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
gevent.exceptions.LoopExit: This operation would block forever
	Hub: <Hub '' at 0x7f9e40f9be70 epoll default pending=0 ref=0 fileno=4 thread_ident=0x7f9e435ce740>
	Handles:
[]
2024-07-24T06:58:41.530 ERROR:teuthology.util.sentry: Sentry event: <https://sentry.ceph.com/organizations/ceph/?query=839c45d450f941faa45b97f3bdacccea>
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_a8aed60fc62d4ff39d2216b360f018eba7518cfe/teuthology/run_tasks.py", line 109, in run_tasks
    manager.__enter__()
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/teuthworker/src/git.ceph.com_teuthology_a8aed60fc62d4ff39d2216b360f018eba7518cfe/teuthology/task/internal/__init__.py", line 41, in base
    run.wait(
  File "/home/teuthworker/src/git.ceph.com_teuthology_a8aed60fc62d4ff39d2216b360f018eba7518cfe/teuthology/orchestra/run.py", line 479, in wait
    proc.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_a8aed60fc62d4ff39d2216b360f018eba7518cfe/teuthology/orchestra/run.py", line 143, in wait
    status = self._get_exitstatus()
  File "/home/teuthworker/src/git.ceph.com_teuthology_a8aed60fc62d4ff39d2216b360f018eba7518cfe/teuthology/orchestra/run.py", line 192, in _get_exitstatus
    status = self._stdout_buf.channel.recv_exit_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_a8aed60fc62d4ff39d2216b360f018eba7518cfe/virtualenv/lib/python3.10/site-packages/paramiko/channel.py", line 400, in recv_exit_status
    self.status_event.wait()
  File "src/gevent/event.py", line 163, in gevent._gevent_cevent.Event.wait
  File "src/gevent/_abstract_linkable.py", line 521, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait
  File "src/gevent/_abstract_linkable.py", line 487, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 490, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 442, in gevent._gevent_c_abstract_linkable.AbstractLinkable._AbstractLinkable__wait_to_be_notified
  File "src/gevent/_abstract_linkable.py", line 451, in gevent._gevent_c_abstract_linkable.AbstractLinkable._switch_to_hub
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
gevent.exceptions.LoopExit: This operation would block forever
	Hub: <Hub '' at 0x7f9e40f9be70 epoll default pending=0 ref=1 fileno=4 thread_ident=0x7f9e435ce740>
	Handles:
[]
2024-07-24T06:58:41.533 DEBUG:teuthology.run_tasks:Unwinding manager internal.base
2024-07-24T06:58:41.540 DEBUG:teuthology.run_tasks:Unwinding manager kernel
2024-07-24T06:58:41.555 DEBUG:teuthology.run_tasks:Unwinding manager console_log```
<https://qa-proxy.ceph.com/teuthology/dis-2024-07-24_01:37:18-krbd-main-wip-exclusive-option-states-default-smithi/7815575/teuthology.log>
Have you seen this before?
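(For context: the LoopExit at the bottom of that trace is gevent refusing to park forever. The greenlet waiting on paramiko's `status_event` yields to the hub, the hub has no pending I/O or callbacks left, so the wait can never be satisfied. A minimal sketch with plain gevent, not teuthology code, that reproduces the same class of error:)
```python
# Minimal sketch (plain gevent, not teuthology code) of the failure mode in
# the traceback above: the only runnable greenlet waits on an Event that
# nothing will ever set, so when it switches to an otherwise idle hub gevent
# raises LoopExit instead of blocking forever -- the same shape as paramiko's
# recv_exit_status() parking on status_event.wait().
from gevent.event import Event
from gevent.exceptions import LoopExit

ev = Event()            # never set by anyone
try:
    ev.wait()           # main greenlet switches to a hub with nothing to do
except LoopExit as exc:
    print("LoopExit:", exc)
```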
2024-07-24T07:26:12.684Z
<Ilya Dryomov> These are links to the other 6 jobs (for a total of ~20 nodes taken out of the pool), all showing the same traces:
<https://qa-proxy.ceph.com/teuthology/dis-2024-07-24_01:37:18-krbd-main-wip-exclusive-option-states-default-smithi/7815587/teuthology.log>
<https://qa-proxy.ceph.com/teuthology/dis-2024-07-24_01:37:18-krbd-main-wip-exclusive-option-states-default-smithi/7815617/teuthology.log>
<https://qa-proxy.ceph.com/teuthology/dis-2024-07-24_01:37:18-krbd-main-wip-exclusive-option-states-default-smithi/7815697/teuthology.log>
<https://qa-proxy.ceph.com/teuthology/dis-2024-07-24_01:37:18-krbd-main-wip-exclusive-option-states-default-smithi/7815745/teuthology.log>
<https://qa-proxy.ceph.com/teuthology/dis-2024-07-24_01:37:18-krbd-main-wip-exclusive-option-states-default-smithi/7815747/teuthology.log>
<https://qa-proxy.ceph.com/teuthology/dis-2024-07-24_01:37:18-krbd-main-wip-exclusive-option-states-default-smithi/7815752/teuthology.log>
2024-07-24T12:43:22.823Z
<Patrick Donnelly> @Adam Kraitman @Dan Mick vossi01 has become unresponsive. Can one of you have a look and maybe kick it (i.e. reboot)?
2024-07-24T13:54:04.801Z
<Adam Kraitman> I saw those errors; it seems like something we can ignore for now. The server is up now:
```Jul 24 12:56:19 vossi01 irqbalance[2041]: IRQ 234 affinity is now unmanaged
Jul 24 12:56:19 vossi01 irqbalance[2041]: Cannot change IRQ 262 affinity: Input/output error
Jul 24 12:56:19 vossi01 irqbalance[2041]: IRQ 262 affinity is now unmanaged
Jul 24 12:56:19 vossi01 irqbalance[2041]: Cannot change IRQ 224 affinity: Input/output error
Jul 24 12:56:19 vossi01 irqbalance[2041]: IRQ 224 affinity is now unmanaged
Jul 24 12:56:19 vossi01 irqbalance[2041]: Cannot change IRQ 252 affinity: Input/output error
Jul 24 12:56:19 vossi01 irqbalance[2041]: IRQ 252 affinity is now unmanaged
Jul 24 12:56:19 vossi01 irqbalance[2041]: Cannot change IRQ 280 affinity: Input/output error
Jul 24 12:56:19 vossi01 irqbalance[2041]: IRQ 280 affinity is now unmanaged
Jul 24 12:56:19 vossi01 irqbalance[2041]: Cannot change IRQ 242 affinity: Input/output error
Jul 24 12:56:19 vossi01 irqbalance[2041]: IRQ 242 affinity is now unmanaged
Jul 24 12:56:19 vossi01 irqbalance[2041]: Cannot change IRQ 270 affinity: Input/output error
Jul 24 12:56:19 vossi01 irqbalance[2041]: IRQ 270 affinity is now unmanaged
Jul 24 12:56:19 vossi01 irqbalance[2041]: Cannot change IRQ 232 affinity: Input/output error
Jul 24 12:56:19 vossi01 irqbalance[2041]: IRQ 232 affinity is now unmanaged
Jul 24 12:56:19 vossi01 irqbalance[2041]: Cannot change IRQ 260 affinity: Input/output error
Jul 24 12:56:19 vossi01 irqbalance[2041]: IRQ 260 affinity is now unmanaged
Jul 24 12:56:19 vossi01 irqbalance[2041]: Cannot change IRQ 222 affinity: Input/output error
Jul 24 12:56:19 vossi01 irqbalance[2041]: IRQ 222 affinity is now unmanaged
Jul 24 12:56:19 vossi01 irqbalance[2041]: Cannot change IRQ 250 affinity: Input/output error
Jul 24 12:56:19 vossi01 irqbalance[2041]: IRQ 250 affinity is now unmanaged
Jul 24 12:56:19 vossi01 irqbalance[2041]: Cannot change IRQ 279 affinity: Input/output error
Jul 24 12:56:19 vossi01 irqbalance[2041]: IRQ 279 affinity is now unmanaged
Jul 24 12:56:19 vossi01 irqbalance[2041]: Cannot change IRQ 240 affinity: Input/output error
Jul 24 12:56:19 vossi01 irqbalance[2041]: IRQ 240 affinity is now unmanaged
Jul 24 12:56:19 vossi01 irqbalance[2041]: Cannot change IRQ 269 affinity: Input/output error
Jul 24 12:56:19 vossi01 irqbalance[2041]: IRQ 269 affinity is now unmanaged
Jul 24 12:56:19 vossi01 irqbalance[2041]: Cannot change IRQ 0 affinity: Input/output error
Jul 24 12:56:19 vossi01 irqbalance[2041]: IRQ 0 affinity is now unmanaged```
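(Those irqbalance lines mean the kernel rejected writes to the affinity mask of those IRQs with EIO, after which irqbalance stops managing them. A hedged sketch for spot-checking an IRQ's current mask, assuming a Linux host with procfs mounted; the IRQ number is only a placeholder:)
```python
# Hedged sketch: read the same /proc/irq/<N>/smp_affinity file that
# irqbalance writes to. A failing write there is what produces the
# "Cannot change IRQ <N> affinity: Input/output error" lines above.
import sys

def irq_affinity(irq: int) -> str:
    """Return the hex CPU bitmask currently assigned to an IRQ."""
    with open(f"/proc/irq/{irq}/smp_affinity") as f:
        return f.read().strip()

if __name__ == "__main__":
    irq = int(sys.argv[1]) if len(sys.argv) > 1 else 0   # IRQ 0 as a placeholder
    print(f"IRQ {irq} affinity mask: {irq_affinity(irq)}")
```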
2024-07-24T14:23:53.564Z
<Patrick Donnelly> cc @Venky Shankar
2024-07-24T15:56:28.573Z
<Zack Cerza> I have not seen this one before, so thanks for all the information. I'll see if I can make anything of it. Interesting that it doesn't seem to involve much that's changed recently.
2024-07-24T19:59:22.687Z
<Josh Durgin> @Dan Mick could the recent relabelling have caused the Jenkins queue to back up? I see it's at a peak now compared to the last week, and jobs like <https://jenkins.ceph.com/job/ceph-pull-requests/> are waiting on bionic, which seems wrong
2024-07-24T20:28:35.347Z
<Dan Mick> sure, it's possible
2024-07-24T20:28:44.840Z
<Dan Mick> the label system is a long way from sensible
2024-07-24T21:59:42.511Z
<Dan Mick> all the huge amd64 debian machines have bionic labels
2024-07-24T21:59:52.508Z
<Dan Mick> I changed the job's required tag to jammy and it made no difference
2024-07-24T22:00:11.199Z
<Dan Mick> there are 24 of them
2024-07-24T22:00:15.532Z
<Dan Mick> they're just all busy
