ceph - sepia - 2024-10-11

Timestamp (UTC) | Message
2024-10-11T05:18:41.123Z
<Sunil Angadi> facing similar issue <https://shaman.ceph.com/builds/ceph/wip-sangadi2-testing-2024-10-10-1448/>
@yuriw Does it work now?
2024-10-11T05:27:50.464Z
<Sunil Angadi> Tried it, it failed again
<https://shaman.ceph.com/builds/ceph/wip-sangadi2-testing-2024-10-10-1448/>
2024-10-11T05:50:04.665Z
<Guillaume Abrioux> ```Cloning repository <https://github.com/ceph/ceph.git>
 > git init /home/jenkins-build/build/workspace/ceph-pull-requests # timeout=10
Fetching upstream changes from <https://github.com/ceph/ceph.git>
 > git --version # timeout=10
 > git --version # 'git version 2.34.1'
 > git fetch --tags --force --progress --depth=1 -- <https://github.com/ceph/ceph.git> +refs/pull/60223/*:refs/remotes/origin/pr/60223/* # timeout=20
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --force --progress --depth=1 -- <https://github.com/ceph/ceph.git> +refs/pull/60223/*:refs/remotes/origin/pr/60223/*" returned status code 128:
stdout: 
stderr: fatal: unable to access '<https://github.com/ceph/ceph.git/>': Could not resolve host: github.com

	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2846)```
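A `Could not resolve host` failure like the one above can be triaged one resolver layer at a time; a minimal sketch (the hostname and the fallback server `8.8.8.8` are examples):

```shell
#!/bin/sh
# Triage "Could not resolve host" step by step.
triage_dns() {
    host="$1"
    # 1. The libc resolver -- the same path git/curl use:
    getent hosts "$host" || echo "libc lookup for $host failed"
    # 2. What resolvers is this host configured with?
    cat /etc/resolv.conf 2>/dev/null
    # 3. Bypass the local resolver and ask a public server directly
    #    (dig ships in the bind-utils/dnsutils package):
    dig +short "$host" @8.8.8.8 2>/dev/null || echo "direct query for $host failed"
}

triage_dns github.com
```

If step 1 fails but step 3 succeeds, the problem is local resolver configuration rather than upstream DNS.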
2024-10-11T05:54:50.839Z
<Guillaume Abrioux> looks like there's a dns issue in sepia or something like that?
2024-10-11T05:56:55.230Z
<Guillaume Abrioux> @Dan Mick just came across this <https://jenkins.ceph.com/job/ceph-docs/>
is this job still relevant? I see the last success was in 2021 and there's a job that's been waiting for 3 months
2024-10-11T06:17:07.888Z
<Shraddha Agrawal> Folks, I am not able to access pulpito (<https://pulpito.ceph.com/>), returns 502 Bad Gateway.
2024-10-11T06:20:23.730Z
<jcollin> +1
2024-10-11T06:45:13.574Z
<Vallari Agrawal> For now you can use <https://pulpito-ng.ceph.com/> (after connecting to sepia VPN)
2024-10-11T08:02:11.191Z
<Igor Golikov> Hi, I am not able to ssh to teuthology.front.sepia.ceph.com:
got `port 22: Network is unreachable`. VPN connects without any errors.
2024-10-11T08:17:19.452Z
<jcollin> same here.
2024-10-11T08:35:20.202Z
<Vallari Agrawal> For now you can use <https://pulpito-ng.ceph.com/> (after connecting to sepia VPN)

update: this isn't working either now (paddles is down)
2024-10-11T13:19:53.533Z
<Zac Dover> Docs aren't building:
2024-10-11T13:19:54.272Z
<Zac Dover> <https://github.com/ceph/ceph/pull/60248>
2024-10-11T13:20:46.204Z
<Zac Dover> ```ERROR: Error cloning remote repo 'origin'
[Checks API] No suitable checks publisher found.
Setting status of 856505079bc837c65b6a9a0c70cd5f6220419ac0 to FAILURE with url <https://jenkins.ceph.com/job/ceph-pr-docs/107739/> and message: 'Docs: failed with errors
 '```
2024-10-11T14:21:04.900Z
<yuriw> pulpito is down `502 Bad Gateway`
@Zack Cerza @Adam Kraitman ^
2024-10-11T14:25:51.279Z
<yuriw> No, I still can't build squid =>

<https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos9,DIST=centos9,MACHINE_SIZE=gigantic/83742//consoleFull>

Failed:
  dbus-daemon-1:1.12.20-8.el9.x86_64
  libstoragemgmt-1.10.1-1.el9.x86_64
  rpcbind-1.2.6-7.el9.x86_64
Error: Transaction failed


Error: building at STEP "RUN echo "=== INSTALLING ===" ; dnf install -y --setopt=install_weak_deps=False --setopt=skip_missing_names_on_install=False --enablerepo=crb $(cat packages.txt)": while running runtime: exit status 1
Fri Oct 11 02:41:56 AM UTC 2024 :: rm -fr /tmp/install-deps.3458658
Build step 'Execute shell' marked build as failure

ref: <https://tracker.ceph.com/issues/68447>
@Laura Flores @Dan Mick ^^^
2024-10-11T14:31:53.089Z
<yuriw> looks like the `teuthology` box is down too
2024-10-11T14:32:29.561Z
<yuriw> cc: @Laura Flores
2024-10-11T14:35:13.330Z
<yuriw> `pulpito` and `teuthology` are down `502 Bad Gateway`
@Zack Cerza @Adam Kraitman ^
2024-10-11T15:27:46.371Z
<Laura Flores> @badone you should be able to access it with either the sepia VPN, or by changing the ref to `quay.ceph.io/ceph-ci/ceph`
2024-10-11T15:28:14.285Z
<Laura Flores> Thx @Dan Mick, approved
2024-10-11T15:33:53.510Z
<Laura Flores> @Adam Kraitman are you able to take a look into the above problems? ^
2024-10-11T16:40:17.361Z
<Zack Cerza> alright, so basically every host we have in RHEV is down once again. can't get to the LRC dashboard or the wiki.
2024-10-11T16:40:59.339Z
<Dan Mick> Oh good.  Be right there.
2024-10-11T16:53:15.804Z
<Zack Cerza> tried to see why the ceph dashboard might be down; reesi006 had a very old copy of cephadm, and when I went to fetch the 19.2.0 version:
`curl: (6) Could not resolve host: github.com`
```root@reesi006:~# cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
# 127.0.0.53 is the systemd-resolved stub resolver.
# run "systemd-resolve --status" to see details about the actual nameservers.

nameserver 127.0.0.53
search front.sepia.ceph.com sepia.ceph.com
root@reesi006:~# systemd-resolve --status
systemd-resolve: command not found```
🫠
2024-10-11T17:07:11.429Z
<Dan Mick> 3 of 4 vhosts are down in rhev
2024-10-11T17:07:55.601Z
<Dan Mick> looks like it's complaining about iscsi
2024-10-11T17:10:49.719Z
<Dan Mick> reesi002's iscsi service down, poking
2024-10-11T17:12:35.681Z
<Zack Cerza> starting around 12h ago, lots of messages like: `Host hv04 cannot access the Storage Domain(s) LRC_ISCSI attached to the Data Center sepia. Setting Host state to Non-Operational.`
the last data from the teuthology exporter were received right around then
2024-10-11T17:12:49.896Z
<Dan Mick> nothing obvious, rebooting reesi002
2024-10-11T17:13:54.669Z
<Dan Mick> there is still a potential misconfiguration in the iscsi services that may be affecting failover/restart.  maybe iscsi died for some reason and was unable to restart.  The problem is the reconfiguration experiment may bring the cluster down too.  Maybe I should try that now while it's down anyway
2024-10-11T17:24:23.193Z
<Laura Flores> Thanks @Dan Mick and @Zack Cerza
2024-10-11T17:27:07.319Z
<Dan Mick> looks like reboot hung, powercycling
2024-10-11T17:42:07.896Z
<Dan Mick> ...and discovering the BIOS is set to hang on SMART warning, sigh
2024-10-11T17:48:15.200Z
<Dan Mick> iscsi service up, failover configuration updated, vmhosts back active, VMs starting
2024-10-11T17:49:10.030Z
<Dan Mick> teuthology up
2024-10-11T17:50:14.951Z
<Dan Mick> resolv.conf was probably written with an older OS version. The command has been renamed to `resolvectl`; `resolvectl status` shows what you were after
2024-10-11T17:53:49.753Z
<Zack Cerza> is the SMART warning something we should be concerned about?
2024-10-11T17:56:03.199Z
<Zack Cerza> yeah thanks - I had found that and noticed it was configured to only point at internal nameservers; I was thinking about prepending the external one, but also noticed it wouldn't serve records for github.com, and that made me not-so-confident in making changes
2024-10-11T18:01:15.388Z
<Zack Cerza> huh, teuthology didn't actually fully go down, that's nice
2024-10-11T18:05:20.653Z
<Dan Mick> it has the external one too
2024-10-11T18:07:05.022Z
<yuriw> No, I still can't build squid =>

<https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos9,DIST=centos9,MACHINE_SIZE=gigantic/83742//consoleFull>

Failed:
  dbus-daemon-1:1.12.20-8.el9.x86_64
  libstoragemgmt-1.10.1-1.el9.x86_64
  rpcbind-1.2.6-7.el9.x86_64
Error: Transaction failed


Error: building at STEP "RUN echo "=== INSTALLING ===" ; dnf install -y --setopt=install_weak_deps=False --setopt=skip_missing_names_on_install=False --enablerepo=crb $(cat packages.txt)": while running runtime: exit status 1
Fri Oct 11 02:41:56 AM UTC 2024 :: rm -fr /tmp/install-deps.3458658
Build step 'Execute shell' marked build as failure

ref: <https://tracker.ceph.com/issues/68447>
@Laura Flores @Dan Mick ^^^
2024-10-11T18:15:12.528Z
<Dan Mick> DNS Servers: 172.21.0.1 172.21.0.2 158.69.67.47
2024-10-11T18:16:44.425Z
<Dan Mick> SMART: well, many of the reesis have a 'bad' status, but smartctl doesn't report very much wear and the drives seem OK.  It's a little worrisome but...
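A health pass like the one Dan describes can be scripted with `smartctl`; a sketch only (the device names are examples, and `smartctl` needs root plus the smartmontools package):

```shell
#!/bin/sh
# Sketch: print the overall SMART verdict for each disk, plus the
# attributes that usually indicate real wear.
check_smart() {
    for dev in "$@"; do
        echo "== $dev =="
        smartctl -H "$dev" 2>/dev/null     # overall PASSED/FAILED verdict
        smartctl -A "$dev" 2>/dev/null | grep -Ei 'wear|realloc|pending'
        :   # keep going even if a drive reports nothing
    done
}

check_smart /dev/sda /dev/sdb
```

A "bad" overall status with low wear counters, as seen here, is worth tracking but not necessarily urgent.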
2024-10-11T18:17:13.559Z
<Zack Cerza> huh. `/etc/resolvconf/resolv.conf.d/*` only contain the first two. guess I have some reading to do on resolvectl
2024-10-11T18:26:46.745Z
<Dan Mick> resolvconf is an older system and is probably not active on that host
2024-10-11T18:27:40.342Z
<Dan Mick> it's really difficult to tell sometimes. resolvconf was a "run scripts on network events and twiddle resolv.conf contents" system; systemd-resolved is "I am the DNS server and I will handle all splits and servers"
2024-10-11T18:28:16.761Z
<Dan Mick> and honestly systemd-resolved seems to work a lot closer to how I expect
2024-10-11T18:28:44.508Z
<Dan Mick> but, like all systemd tools, is a little magic and you have to discover things that matter instead of just being told them in an organized way
2024-10-11T18:33:08.370Z
<Zack Cerza> yeah, I had also looked through the systemd-resolved.service manpage, looking at the referenced config files and directories, and couldn't find one that existed and also had actual configuration in it. I'd missed `/run/systemd/resolve/resolv.conf`, which seems to be the Actual Config File
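The layering Zack describes can be inspected directly; a sketch, assuming a systemd-resolved host (it falls back gracefully where the files or tools are absent):

```shell
#!/bin/sh
# /etc/resolv.conf points at the 127.0.0.53 stub, while the upstream
# servers systemd-resolved actually uses are written to
# /run/systemd/resolve/resolv.conf.
show_dns_layers() {
    echo "--- stub config (/etc/resolv.conf) ---"
    cat /etc/resolv.conf 2>/dev/null
    echo "--- real upstreams (/run/systemd/resolve/resolv.conf) ---"
    cat /run/systemd/resolve/resolv.conf 2>/dev/null \
        || echo "systemd-resolved is not managing this host"
    # systemd-resolve --status was renamed; resolvectl is the current name:
    command -v resolvectl >/dev/null 2>&1 && resolvectl status \
        || echo "resolvectl not available"
}

show_dns_layers
```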
2024-10-11T18:33:30.888Z
<Zack Cerza> just one more dns configuration system bro i swear. just one more
2024-10-11T18:46:16.978Z
<Christina Meno> Nice work Dan! Would you please notify the sepia mailing list that you’ve recovered those iscsi based services?
2024-10-11T18:51:20.203Z
<Dan Mick> <insert xkcd about standards here>
2024-10-11T18:52:18.331Z
<Dan Mick> yes
2024-10-11T18:53:41.562Z
<Christina Meno> thank you
2024-10-11T19:45:44.775Z
<Laura Flores> Okay, so services should be back up thanks to @Dan Mick and @Zack Cerza. The only remaining issue I'm aware of, which isn't about the lab but about container creation, is that this still needs to be merged/backported: <https://github.com/ceph/ceph/pull/60255>

I reran the api check and will merge it when it's ready.
2024-10-11T19:46:05.474Z
<Laura Flores> @yuriw and everyone else, if there are still issues you notice in the lab, pls post here.
2024-10-11T19:58:39.965Z
<Dan Mick> I'm gonna go ahead and force-merge that fix.  the api test isn't going to touch that code and it's a critical fix.
2024-10-11T19:59:06.752Z
<Dan Mick> (and api succeeded anyway)
2024-10-11T20:00:07.576Z
<Dan Mick> Laura: for backport, is it the same process as before?  (invent a tracker, run the scripts etc.)?
2024-10-11T20:38:28.779Z
<badone> @Laura Flores I can access it, just couldn't log in 🙂
