ceph - sepia - 2024-12-20

Timestamp (UTC)
Message
2024-12-20T01:44:27.383Z
<Rongqi Sun> The waiting queue for ARM make check has grown rapidly because the number of ARM servers used for CI has dropped; there are now only 1 confusa node and 3 omani nodes. Are the other ARM servers being used for other purposes, or have they been taken offline? @Dan Mick @Adam Kraitman
2024-12-20T02:28:07.036Z
<Dan Mick> did someone turn off [ceph.github.io](http://ceph.github.io)?
2024-12-20T02:28:34.366Z
<Dan Mick> I don't know of anything that's happened to them on purpose
2024-12-20T03:13:11.162Z
<Dan Mick> omani001: idle (buster,jammy,arm64,huge,installed-os-jammy,xenial)

omani002: idle (sepia,arm64,gigantic,huge,centos9,installed-os-centos9)

omani003: idle (sepia,arm64,gigantic,huge,centos9,installed-os-centos9)

omani004: offline (jammy,focal,arm64,bionic,huge,installed-os-jammy,xenial)
omani005: idle (jammy,focal,arm64,bionic,huge,installed-os-jammy,xenial)

omani006: temporarily offline: Oct 27 Disconnected by akraitman: Hardware issue needs further investigation (jammy,focal,arm64,bionic,huge,installed-os-jammy,xenial)
omani007: offline (arm64,centos8,huge,installed-os-centos8)
omani008: idle (sepia,arm64,gigantic,huge,centos9,installed-os-centos9)

omani009: 1 active builds (jammy,arm64,bionic,huge,installed-os-jammy,xenial)
<https://jenkins.ceph.com/job/ceph-pull-requests-arm64/66152/>

omani010: idle (sepia,arm64,gigantic,huge,centos9,installed-os-centos9)

sulcata02: temporarily offline: Jul 10 Disconnected by dmick: Kernel crash dumps were filling up /var/crash.  Suspect hardware issue. (sepia,arm64,gigantic,huge,centos9,installed-os-centos9)
confusa01: offline (sepia,arm64,gigantic,centos8,huge,installed-os-centos8)
confusa02: offline (sepia,arm64,gigantic,centos8,huge,installed-os-centos8)
confusa03: offline (sepia,arm64,gigantic,huge,centos9,installed-os-centos9)
confusa04: idle (sepia,arm64,gigantic,huge,centos9,installed-os-centos9)

confusa05: offline (sepia,arm64,gigantic,huge,centos7)
confusa06: offline (sepia,arm64,gigantic,huge,centos7,installed-os-centos7)
confusa07: idle (sepia,arm64,gigantic,huge,centos9,installed-os-centos9)

confusa08: temporarily offline: Aug 25 Disconnected by akraitman: Seeing this in the ipmi console
 {94016}[Hardware Error]:   section_type: memory error
[554312.388350] {94016}[Hardware Error]:    error_status: Storage error in DRAM memory (0x0000000000000400)
[554312.397881] {94016}[Hardware Error]:   node:4 card:1032 rank:10 bank:65535 row:0 column:0
[554312.406234] {94016}[Hardware Error]:   error_type: 2, single-bit ECC
[554312.412661] {94016}[Hardware Error]:  Error 1, type: info
[554312.418150] {94016}[Hardware Error]:  fru_text: ecc_errc_count_l
[554312.424245] {94016}[Hardware Error]:   section_type: general processor error
[554312.431367] {94016}[Hardware Error]:   requestor_id: 0x000000000000ffff
[554312.438055] {94016}[Hardware Error]:  Error 2, type: info
[554312.443528] {94016}[Hardware Error]:  fru_text: ecc_errc_count_h (sepia,arm64,gigantic,huge,centos9,installed-os-centos9)
confusa10: idle (sepia,arm64,gigantic,huge,centos9,installed-os-centos9)

confusa11: 1 active builds (jammy,sepia,focal,arm64,bionic,gigantic,huge,installed-os-jammy,xenial)
<https://jenkins.ceph.com/job/ceph-pull-requests-arm64/66151/>

confusa12: temporarily offline: Nov 21 ?? (jammy,sepia,focal,arm64,bionic,gigantic,huge,installed-os-jammy,xenial)
confusa13: offline (sepia,arm64,gigantic,huge,centos9,installed-os-centos9)
confusa14: offline (jammy,sepia,focal,arm64,bionic,gigantic,huge,installed-os-jammy,xenial)
Idle: 9  Busy: 2 Offline: 13
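The Idle/Busy/Offline totals above can be re-derived from the listing itself. The following is a minimal sketch, assuming each node entry starts with a hostname followed by a colon and that continuation lines (build URLs, kernel log lines) never do. The function names `classify` and `tally` are made up for illustration and are not part of any Jenkins tooling.

```python
import re
from collections import Counter

# Matches the start of a node entry, e.g. "omani004: offline (...)".
# Build URLs and "[Hardware Error]" log lines do not match and are skipped.
NODE_RE = re.compile(r"^(?P<name>[a-z]+\d+):\s*(?P<status>.*)$")


def classify(status: str) -> str:
    """Map a status string from the listing to idle / busy / offline."""
    if status.startswith("idle"):
        return "idle"
    if "active builds" in status:
        return "busy"
    # Covers both "offline" and "temporarily offline: <reason>".
    return "offline"


def tally(listing: str) -> Counter:
    """Count nodes per state in a pasted status listing."""
    counts = Counter()
    for line in listing.splitlines():
        match = NODE_RE.match(line.strip())
        if match:
            counts[classify(match.group("status"))] += 1
    return counts


# A few entries copied from the listing above.
sample = """\
omani004: offline (jammy,focal,arm64,bionic,huge,installed-os-jammy,xenial)
omani005: idle (jammy,focal,arm64,bionic,huge,installed-os-jammy,xenial)
omani009: 1 active builds (jammy,arm64,bionic,huge,installed-os-jammy,xenial)
<https://jenkins.ceph.com/job/ceph-pull-requests-arm64/66152/>
confusa12: temporarily offline: Nov 21 ?? (jammy,sepia,focal,arm64,bionic,gigantic,huge,installed-os-jammy,xenial)
"""

print(tally(sample))
```

Treating "temporarily offline" the same as "offline" matches how the summary line counts them: the full listing has 13 nodes in one of those two states.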
2024-12-20T03:13:15.505Z
<Dan Mick> that's a lot offline
2024-12-20T03:15:48.900Z
<Dan Mick> omani004 is unresponsive, but powered on.  powercycling
2024-12-20T03:19:19.299Z
<Dan Mick> nothing in the journalctl log
2024-12-20T03:19:28.528Z
<Dan Mick> shrug?
2024-12-20T03:20:50.896Z
<Dan Mick> omani007 is not ssh-able; console showing gibberish
2024-12-20T03:21:30.986Z
<Dan Mick> after power cycle, still showing gibberish; must have wrong baud rate set
2024-12-20T03:24:23.801Z
<Dan Mick> console is still gibberish after setting SoL baudrate (<shrug?>), waiting to see if it comes up
2024-12-20T03:30:32.424Z
<Dan Mick> nope.  it seems as though it's displaying a grub screen (in the small preview that the IPMI website shows), but I can't use the Java console because Java won't accept self-signed certificates anymore and I can't type on the serial console
2024-12-20T03:34:15.636Z
<Dan Mick> confusa01's jenkins service dies with a huge java backtrace.  maybe renewing its agent.jar would help.  but I'm done for the night.
2024-12-20T03:40:30.540Z
<Dan Mick> I've added you to vossi04 and folio13
2024-12-20T03:45:29.396Z
<Slava Dubeyko> Sounds great! Thanks. 🙂
2024-12-20T03:47:34.475Z
<Dan Mick> Welcome!  And our experience caused me to update the vpn docs and tools.
2024-12-20T03:48:18.448Z
<Slava Dubeyko> Makes a lot of sense 🙂
2024-12-20T06:01:25.603Z
<Rongqi Sun> Which of them are used for ARM make check as originally designed?
2024-12-20T09:15:34.976Z
<Alexander Indenbaum> @Dan Mick, that sounds great! 😊 Thanks so much for your help! Would submitting a PR to the squid branch ensure inclusion in the downstream 8.1?
2024-12-20T09:23:36.495Z
<Alexander Indenbaum> <https://github.com/ceph/ceph/pull/61154>
2024-12-20T16:11:17.082Z
<David Galloway> Not I
2024-12-20T18:30:08.228Z
<Dan Mick> @Alexander Indenbaum I'm not the expert on how things move from upstream to downstream I'm afraid; not certain how that happens
2024-12-20T21:41:43.966Z
<Dan Mick> I'm not sure what the question means.  These are all Jenkins builder machines, all arm64 hosts, and are all used by Jenkins for whatever job needs them
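One way to read "used by Jenkins for whatever job needs them" is that a job is matched to builders through their labels (the comma-separated lists in the status listing). The sketch below shows that idea for a plain "all labels required" match only; real Jenkins label expressions also allow `||` and `!`. The label sets are copied from three entries in the listing, but the required labels in the example are hypothetical: the log does not say which labels the ceph-pull-requests-arm64 job actually asks for.

```python
# Label sets copied from three entries in the status listing.
NODES = {
    "omani001": {"buster", "jammy", "arm64", "huge", "installed-os-jammy", "xenial"},
    "omani002": {"sepia", "arm64", "gigantic", "huge", "centos9", "installed-os-centos9"},
    "confusa11": {"jammy", "sepia", "focal", "arm64", "bionic", "gigantic", "huge",
                  "installed-os-jammy", "xenial"},
}


def eligible(nodes, required):
    """Return the names of nodes whose label set contains every required label."""
    need = set(required)
    return sorted(name for name, labels in nodes.items() if need <= labels)


# HYPOTHETICAL requirement, chosen only to illustrate the matching.
print(eligible(NODES, ["arm64", "jammy"]))
```

Under this reading, every node in the listing carries `arm64`, so which ones a given job can land on depends on the other labels it requests and on which nodes are online.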
