ceph - sepia - 2024-10-11

Timestamp (UTC) | Message
2024-10-11T05:18:41.123Z
<Sunil Angadi> facing similar issue <https://shaman.ceph.com/builds/ceph/wip-sangadi2-testing-2024-10-10-1448/>
@yuriw Does it work now?
2024-10-11T05:27:50.464Z
<Sunil Angadi> Tried it, it failed again
<https://shaman.ceph.com/builds/ceph/wip-sangadi2-testing-2024-10-10-1448/>
2024-10-11T05:50:04.665Z
<Guillaume Abrioux> ```Cloning repository <https://github.com/ceph/ceph.git>
 > git init /home/jenkins-build/build/workspace/ceph-pull-requests # timeout=10
Fetching upstream changes from <https://github.com/ceph/ceph.git>
 > git --version # timeout=10
 > git --version # 'git version 2.34.1'
 > git fetch --tags --force --progress --depth=1 -- <https://github.com/ceph/ceph.git> +refs/pull/60223/*:refs/remotes/origin/pr/60223/* # timeout=20
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --force --progress --depth=1 -- <https://github.com/ceph/ceph.git> +refs/pull/60223/*:refs/remotes/origin/pr/60223/*" returned status code 128:
stdout: 
stderr: fatal: unable to access '<https://github.com/ceph/ceph.git/>': Could not resolve host: github.com

	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2846)```
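A `Could not resolve host` failure like the one above can be triaged one resolver layer at a time; a minimal sketch (the hostname and the fallback server `8.8.8.8` are examples):

```shell
#!/bin/sh
# Triage "Could not resolve host" step by step.
triage_dns() {
    host="$1"
    # 1. The libc resolver -- the same path git/curl use:
    getent hosts "$host" || echo "libc lookup for $host failed"
    # 2. What resolvers is this host configured with?
    cat /etc/resolv.conf 2>/dev/null
    # 3. Bypass the local resolver and ask a public server directly
    #    (dig ships in the bind-utils/dnsutils package):
    dig +short "$host" @8.8.8.8 2>/dev/null || echo "direct query for $host failed"
}

triage_dns github.com
```

If step 1 fails but step 3 succeeds, the problem is local resolver configuration rather than upstream DNS.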
2024-10-11T05:54:50.839Z
<Guillaume Abrioux> looks like there's a dns issue in sepia or something like that?
2024-10-11T05:56:55.230Z
<Guillaume Abrioux> @Dan Mick just came across this <https://jenkins.ceph.com/job/ceph-docs/>
is this job still relevant? I see the last success was in 2021 and there's a job that's been waiting for 3 months
2024-10-11T06:17:07.888Z
<Shraddha Agrawal> Folks, I am not able to access pulpito (<https://pulpito.ceph.com/>), returns 502 Bad Gateway.
2024-10-11T06:20:23.730Z
<jcollin> +1
2024-10-11T06:45:13.574Z
<Vallari Agrawal> For now you can use <https://pulpito-ng.ceph.com/> (after connecting to sepia VPN)
2024-10-11T08:02:11.191Z
<Igor Golikov> Hi, I am not able to ssh to teuthology.front.sepia.ceph.com:
got `port 22: Network is unreachable`. VPN connects without any errors.
2024-10-11T08:17:19.452Z
<jcollin> same here.
2024-10-11T08:35:20.202Z
<Vallari Agrawal> For now you can use <https://pulpito-ng.ceph.com/> (after connecting to sepia VPN)

update: this isn't working either now (paddles is down)
2024-10-11T13:19:53.533Z
<Zac Dover> Docs aren't building:
2024-10-11T13:19:54.272Z
<Zac Dover> <https://github.com/ceph/ceph/pull/60248>
2024-10-11T13:20:46.204Z
<Zac Dover> ```ERROR: Error cloning remote repo 'origin'
[Checks API] No suitable checks publisher found.
Setting status of 856505079bc837c65b6a9a0c70cd5f6220419ac0 to FAILURE with url <https://jenkins.ceph.com/job/ceph-pr-docs/107739/> and message: 'Docs: failed with errors
 '```
2024-10-11T14:21:04.900Z
<yuriw> pulpito is down `502 Bad Gateway`
@Zack Cerza @Adam Kraitman ^
2024-10-11T14:25:51.279Z
<yuriw> No, I still can't build squid =>

<https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos9,DIST=centos9,MACHINE_SIZE=gigantic/83742//consoleFull>

Failed:
  dbus-daemon-1:1.12.20-8.el9.x86_64
  libstoragemgmt-1.10.1-1.el9.x86_64
  rpcbind-1.2.6-7.el9.x86_64
Error: Transaction failed


Error: building at STEP "RUN echo "=== INSTALLING ===" ; dnf install -y --setopt=install_weak_deps=False --setopt=skip_missing_names_on_install=False --enablerepo=crb $(cat packages.txt)": while running runtime: exit status 1
Fri Oct 11 02:41:56 AM UTC 2024 :: rm -fr /tmp/install-deps.3458658
Build step 'Execute shell' marked build as failure

ref: <https://tracker.ceph.com/issues/68447>
@Laura Flores @Dan Mick ^^^
2024-10-11T14:31:53.089Z
<yuriw> looks like the `teuthology` box is down too
2024-10-11T14:32:29.561Z
<yuriw> cc: @Laura Flores
2024-10-11T14:35:13.330Z
<yuriw> `pulpito` and `teuthology` are down `502 Bad Gateway`
@Zack Cerza @Adam Kraitman ^
2024-10-11T15:27:46.371Z
<Laura Flores> @badone you should be able to access it with either the sepia VPN, or by changing the ref to `quay.ceph.io/ceph-ci/ceph`
2024-10-11T15:28:14.285Z
<Laura Flores> Thx @Dan Mick, approved
2024-10-11T15:33:53.510Z
<Laura Flores> @Adam Kraitman are you able to take a look into the above problems? ^
2024-10-11T16:40:17.361Z
<Zack Cerza> alright, so basically every host we have in RHEV is down once again. can't get to the LRC dashboard or the wiki.
2024-10-11T16:40:59.339Z
<Dan Mick> Oh good.  Be right there.
2024-10-11T16:53:15.804Z
<Zack Cerza> tried to see why the ceph dashboard might be down; reesi006 had a very old copy of cephadm, and when I went to fetch the 19.2.0 version:
`curl: (6) Could not resolve host: github.com`
```root@reesi006:~# cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
# 127.0.0.53 is the systemd-resolved stub resolver.
# run "systemd-resolve --status" to see details about the actual nameservers.

nameserver 127.0.0.53
search front.sepia.ceph.com sepia.ceph.com
root@reesi006:~# systemd-resolve --status
systemd-resolve: command not found```
🫠
2024-10-11T17:07:11.429Z
<Dan Mick> 3 of 4 vhosts are down in rhev
2024-10-11T17:07:55.601Z
<Dan Mick> looks like it's complaining about iscsi
2024-10-11T17:10:49.719Z
<Dan Mick> reesi002's iscsi service down, poking
2024-10-11T17:12:35.681Z
<Zack Cerza> starting around 12h ago, lots of messages like: `Host hv04 cannot access the Storage Domain(s) LRC_ISCSI attached to the Data Center sepia. Setting Host state to Non-Operational.`
the last data from the teuthology exporter were received right around then
2024-10-11T17:12:49.896Z
<Dan Mick> nothing obvious, rebooting reesi002
2024-10-11T17:13:54.669Z
<Dan Mick> there is still a potential misconfiguration in the iscsi services that may be affecting failover/restart.  maybe iscsi died for some reason and was unable to restart.  The problem is the reconfiguration experiment may bring the cluster down too.  Maybe I should try that now while it's down anyway
2024-10-11T17:24:23.193Z
<Laura Flores> Thanks @Dan Mick and @Zack Cerza
2024-10-11T17:27:07.319Z
<Dan Mick> looks like reboot hung, powercycling
2024-10-11T17:42:07.896Z
<Dan Mick> ...and discovering the BIOS is set to hang on SMART warning, sigh
2024-10-11T17:48:15.200Z
<Dan Mick> iscsi service up, failover configuration updated, vmhosts back active, VMs starting
2024-10-11T17:49:10.030Z
<Dan Mick> teuthology up
2024-10-11T17:50:14.951Z
<Dan Mick> resolv.conf was probably written with an older OS version. The command has been renamed to `resolvectl`; `resolvectl status` shows what you were after
2024-10-11T17:53:49.753Z
<Zack Cerza> is the SMART warning something we should be concerned about?
2024-10-11T17:56:03.199Z
<Zack Cerza> yeah thanks - I had found that and noticed it was configured to only point at internal nameservers; I was thinking about prepending the external one, but also noticed it wouldn't serve records for github.com, and that made me not-so-confident in making changes
2024-10-11T18:01:15.388Z
<Zack Cerza> huh, teuthology didn't actually fully go down, that's nice
2024-10-11T18:05:20.653Z
<Dan Mick> it has the external one too
2024-10-11T18:07:05.022Z
<yuriw> No, I still can't build squid =>

<https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos9,DIST=centos9,MACHINE_SIZE=gigantic/83742//consoleFull>

Failed:
  dbus-daemon-1:1.12.20-8.el9.x86_64
  libstoragemgmt-1.10.1-1.el9.x86_64
  rpcbind-1.2.6-7.el9.x86_64
Error: Transaction failed


Error: building at STEP "RUN echo "=== INSTALLING ===" ; dnf install -y --setopt=install_weak_deps=False --setopt=skip_missing_names_on_install=False --enablerepo=crb $(cat packages.txt)": while running runtime: exit status 1
Fri Oct 11 02:41:56 AM UTC 2024 :: rm -fr /tmp/install-deps.3458658
Build step 'Execute shell' marked build as failure

ref: <https://tracker.ceph.com/issues/68447>
@Laura Flores @Dan Mick ^^^
2024-10-11T18:15:12.528Z
<Dan Mick> DNS Servers: 172.21.0.1 172.21.0.2 158.69.67.47
2024-10-11T18:16:44.425Z
<Dan Mick> SMART: well, many of the reesis have a 'bad' status, but smartctl doesn't report very much wear and the drives seem OK.  It's a little worrisome but...
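A health pass like the one Dan describes can be scripted with `smartctl`; a sketch only (the device names are examples, and `smartctl` needs root plus the smartmontools package):

```shell
#!/bin/sh
# Sketch: print the overall SMART verdict for each disk, plus the
# attributes that usually indicate real wear.
check_smart() {
    for dev in "$@"; do
        echo "== $dev =="
        smartctl -H "$dev" 2>/dev/null     # overall PASSED/FAILED verdict
        smartctl -A "$dev" 2>/dev/null | grep -Ei 'wear|realloc|pending'
        :   # keep going even if a drive reports nothing
    done
}

check_smart /dev/sda /dev/sdb
```

A "bad" overall status with low wear counters, as seen here, is worth tracking but not necessarily urgent.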
2024-10-11T18:17:13.559Z
<Zack Cerza> huh. `/etc/resolvconf/resolv.conf.d/*` only contain the first two. guess I have some reading to do on resolvectl
2024-10-11T18:26:46.745Z
<Dan Mick> resolvconf is an older system and is probably not active on that host
2024-10-11T18:27:40.342Z
<Dan Mick> it's really difficult to tell sometimes. resolvconf was a "run scripts on network events and twiddle resolv.conf contents" system; systemd-resolved is "I am the DNS server and I will handle all splits and servers"
2024-10-11T18:28:16.761Z
<Dan Mick> and honestly systemd-resolved seems to work a lot closer to how I expect
2024-10-11T18:28:44.508Z
<Dan Mick> but, like all systemd tools, is a little magic and you have to discover things that matter instead of just being told them in an organized way
2024-10-11T18:33:08.370Z
<Zack Cerza> yeah, I had also looked through the systemd-resolved.service manpage, looking at the referenced config files and directories, and couldn't find one that existed and also had actual configuration in it. I'd missed `/run/systemd/resolve/resolv.conf`, which seems to be the Actual Config File
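The layering Zack describes can be inspected directly; a sketch, assuming a systemd-resolved host (it falls back gracefully where the files or tools are absent):

```shell
#!/bin/sh
# /etc/resolv.conf points at the 127.0.0.53 stub, while the upstream
# servers systemd-resolved actually uses are written to
# /run/systemd/resolve/resolv.conf.
show_dns_layers() {
    echo "--- stub config (/etc/resolv.conf) ---"
    cat /etc/resolv.conf 2>/dev/null
    echo "--- real upstreams (/run/systemd/resolve/resolv.conf) ---"
    cat /run/systemd/resolve/resolv.conf 2>/dev/null \
        || echo "systemd-resolved is not managing this host"
    # systemd-resolve --status was renamed; resolvectl is the current name:
    command -v resolvectl >/dev/null 2>&1 && resolvectl status \
        || echo "resolvectl not available"
}

show_dns_layers
```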
2024-10-11T18:33:30.888Z
<Zack Cerza> just one more dns configuration system bro i swear. just one more
2024-10-11T18:46:16.978Z
<Christina Meno> Nice work Dan! Would you please notify the sepia mailing list that you’ve recovered those iscsi based services?
2024-10-11T18:51:20.203Z
<Dan Mick> <insert xkcd about standards here>
2024-10-11T18:52:18.331Z
<Dan Mick> yes
2024-10-11T18:53:41.562Z
<Christina Meno> thank you
2024-10-11T19:45:44.775Z
<Laura Flores> Okay, so services should be back up thanks to @Dan Mick and @Zack Cerza. The only remaining issue I'm aware of, which isn't about the lab but about container creation, is that this still needs to be merged/backported: <https://github.com/ceph/ceph/pull/60255>

I reran the api check and will merge it when it's ready.
2024-10-11T19:46:05.474Z
<Laura Flores> @yuriw and everyone else, if there are still issues you notice in the lab, pls post here.
2024-10-11T19:58:39.965Z
<Dan Mick> I'm gonna go ahead and force-merge that fix.  the api test isn't going to touch that code and it's a critical fix.
2024-10-11T19:59:06.752Z
<Dan Mick> (and api succeeded anyway)
2024-10-11T20:00:07.576Z
<Dan Mick> Laura: for backport, is it the same process as before?  (invent a tracker, run the scripts etc.)?
2024-10-11T20:38:28.779Z
<badone> @Laura Flores I can access it, just couldn't log in 🙂
