ceph - sepia - 2024-11-13

Timestamp (UTC) Message
2024-11-13T07:58:04.609Z
<Rongqi Sun> Seems like all nodes with the label ‘[arm64&&!centos7&&!centos8&&!centos9](https://jenkins.ceph.com/label/arm64&&!centos7&&!centos8&&!centos9)’, which are used for ARM CI, are offline. <https://jenkins.ceph.com/job/ceph-pull-requests-arm64/>
2024-11-13T08:04:56.894Z
<Rongqi Sun> A few months ago, ARM CI was not stable because of an unresolved crimson-seastar issue. But for about 20 days now, it has been healthy enough.
2024-11-13T08:06:25.584Z
<Rongqi Sun> From my observation of two weeks of make check results, x86 CI failed 49.46% of the time, while ARM failed 38.75%.
2024-11-13T08:08:03.118Z
<Rongqi Sun> Of course, most failures are caused by the code in the specific PR, not by the architecture.
2024-11-13T08:08:33.483Z
<Rongqi Sun> A few months ago, ARM CI was not stable because of an unresolved crimson-seastar issue. But since I fixed the issue about 20 days ago, it has been healthy enough.
2024-11-13T08:35:59.811Z
<rzarzynski> yup, looks like some limited outage. [o08.front.sepia.ceph.com](http://o08.front.sepia.ceph.com) is offline while [o06.front.sepia.ceph.com](http://o06.front.sepia.ceph.com) is fine. CC: @Adam Kraitman
2024-11-13T12:19:58.750Z
<Guillaume Abrioux> some braggi nodes are offline too
2024-11-13T12:20:02.452Z
<Guillaume Abrioux> @Adam Kraitman ^^^
2024-11-13T12:20:24.898Z
<Guillaume Abrioux> <https://2.jenkins.ceph.com/label/vagrant&&libvirt&&(braggi%7C%7Cadami)/>
2024-11-13T13:42:35.688Z
<Teoman Onay> Hi! All braggi nodes are offline, is it expected? Any idea when they will be back? I cannot run cephadm-ansible CI without them. Thx: https://files.slack.com/files-pri/T1HG3J90S-F080LL6U7DK/download/image.png
2024-11-13T13:43:46.556Z
<Teoman Onay> @Adam Kraitman Hi! All braggi nodes are offline, is it expected? Any idea when they will be back? I cannot run cephadm-ansible CI without them. Thx
2024-11-13T14:13:58.377Z
<Adam Kraitman> Hey they are back online
2024-11-13T14:14:11.219Z
<Teoman Onay> 👍
2024-11-13T14:14:14.520Z
<Teoman Onay> Thx
2024-11-13T14:36:39.984Z
<Adam Kraitman> It's running again, but I would check those core dumps
```Nov 12 07:01:17 o08 systemd[1]: Started Process Core Dump (PID 2035108/UID 0).
Nov 12 07:01:18 o08 systemd-coredump[2035109]: Process 2028800 (ceph-mon) of user 1322 dumped core.#012#012Stack trace of thread 2028829:#012#0  0x00007ffff5e8b94c __pthread_kill_implementation (libc.so.6 + 0x8b94c)#012#1  0x00007ffff5e3e646 raise (libc.so.6 + 0x3e646)#012#2  0x00005555564a3311 n/a (/home/jayaprakash/ceph/build/bin/ceph-mon (deleted) + 0xf4f311)#012ELF object binary architecture: AMD x86-64
Nov 12 07:01:18 o08 systemd[1]: systemd-coredump@388-2035108-0.service: Deactivated successfully.
Nov 12 07:01:19 o08 systemd[1]: Started Process Core Dump (PID 2035118/UID 0).
Nov 12 07:01:20 o08 systemd-coredump[2035119]: Process 2028882 (ceph-mon) of user 1322 dumped core.#012#012Stack trace of thread 2028911:#012#0  0x00007ffff5e8b94c __pthread_kill_implementation (libc.so.6 + 0x8b94c)#012#1  0x00007ffff5e3e646 raise (libc.so.6 + 0x3e646)#012#2  0x00005555564a3311 n/a (/home/jayaprakash/ceph/build/bin/ceph-mon (deleted) + 0xf4f311)#012ELF object binary architecture: AMD x86-64
Nov 12 07:01:20 o08 systemd[1]: systemd-coredump@389-2035118-0.service: Deactivated successfully.
Nov 12 07:01:21 o08 systemd[1]: Started Process Core Dump (PID 2035125/UID 0).
Nov 12 07:01:22 o08 systemd-coredump[2035126]: Process 2028841 (ceph-mon) of user 1322 dumped core.#012#012Stack trace of thread 2028870:#012#0  0x00007ffff5e8b94c __pthread_kill_implementation (libc.so.6 + 0x8b94c)#012#1  0x00007ffff5e3e646 raise (libc.so.6 + 0x3e646)#012#2  0x00005555564a3311 n/a (/home/jayaprakash/ceph/build/bin/ceph-mon (deleted) + 0xf4f311)#012ELF object binary architecture: AMD x86-64```
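The `#012` sequences in the excerpt above look like escaped newlines (octal 012 is LF), which is how syslog daemons such as rsyslog typically write control characters into a single log line. A minimal sketch for making the stack traces readable, assuming that escaping scheme; the function name `unescape_syslog` and the shortened sample line are illustrative, not part of any Ceph or sepia tooling:

```python
import re

# Matches "#" followed by a three-digit octal code for a control character
# (#000 to #037). Stack frame markers such as "#0" or "#12" are left alone
# because they are not zero-padded to three digits.
_ESCAPE = re.compile(r"#(0[0-3][0-7])")


def unescape_syslog(line: str) -> str:
    """Turn escapes such as #012 back into the characters they stand for."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), line)


if __name__ == "__main__":
    # Shortened from the o08 excerpt above; assumed to be rsyslog-style output.
    sample = (
        "Process 2028800 (ceph-mon) of user 1322 dumped core."
        "#012#012Stack trace of thread 2028829:"
        "#012#0  0x00007ffff5e8b94c __pthread_kill_implementation (libc.so.6 + 0x8b94c)"
        "#012#1  0x00007ffff5e3e646 raise (libc.so.6 + 0x3e646)"
    )
    print(unescape_syslog(sample))
```

Restricting the pattern to zero-padded codes keeps the frame numbers (`#0`, `#1`, `#2`) intact, which a broader "any three digits" pattern might not.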
2024-11-13T14:40:18.633Z
<Adam Kraitman> Problem fixed
2024-11-13T15:35:14.528Z
<Adam Kupczyk> @Adam Kraitman Yesterday we had a problem with ceph-fuse that completely hung, and caused all commands referring to mounted "/teuto" to hang.
This was the reason for me to manually reboot o08 yesterday.
2024-11-13T16:25:27.812Z
<Casey Bodley> given that Ceph still supports arm packages for upstream releases, i think it's important that we keep the CI working. in the past, the arm check would break for months because nobody felt responsible for fixing it. now that we have folks like Rongqi committed to maintaining it, i agree that it should be required
2024-11-13T22:51:19.062Z
<Laura Flores> @Dan Mick Sentry seems down: <https://sentry.ceph.com/organizations/ceph/?query=c0854a8745974ee7b5d9b8c5b49e4205>
Can you take a look?
2024-11-13T23:05:46.033Z
<Dan Mick> seems back
2024-11-13T23:06:02.301Z
<Dan Mick> it had hit a kernel deadlock of some kind
