2024-10-11T05:18:41.123Z | <Sunil Angadi> Facing a similar issue: <https://shaman.ceph.com/builds/ceph/wip-sangadi2-testing-2024-10-10-1448/>
@yuriw Does it work for you now? |
2024-10-11T05:27:50.464Z | <Sunil Angadi> Tried it again, and it failed again:
<https://shaman.ceph.com/builds/ceph/wip-sangadi2-testing-2024-10-10-1448/> |
2024-10-11T05:50:04.665Z | <Guillaume Abrioux> ```Cloning repository https://github.com/ceph/ceph.git
> git init /home/jenkins-build/build/workspace/ceph-pull-requests # timeout=10
Fetching upstream changes from https://github.com/ceph/ceph.git
> git --version # timeout=10
> git --version # 'git version 2.34.1'
> git fetch --tags --force --progress --depth=1 -- https://github.com/ceph/ceph.git +refs/pull/60223/*:refs/remotes/origin/pr/60223/* # timeout=20
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --force --progress --depth=1 -- https://github.com/ceph/ceph.git +refs/pull/60223/*:refs/remotes/origin/pr/60223/*" returned status code 128:
stdout:
stderr: fatal: unable to access 'https://github.com/ceph/ceph.git/': Could not resolve host: github.com
at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2846)``` |
2024-10-11T05:54:50.839Z | <Guillaume Abrioux> looks like there's a dns issue in sepia or something like that? |
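A quick way to confirm a suspected DNS failure like the one in the Jenkins log above is to ask the system resolver directly from the affected host. The following is only a minimal sketch: the helper name `can_resolve` and its injectable `resolver` parameter are assumptions made for illustration, not anything used in the sepia lab tooling.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves, False on a name-resolution error.

    `resolver` defaults to the system resolver (socket.getaddrinfo); it is a
    parameter only so the check can be exercised without network access.
    """
    try:
        resolver(hostname, None)
        return True
    except socket.gaierror:
        # Raised by getaddrinfo for failures such as "Name or service not known".
        return False


# Example use on a build host (performs real DNS lookups):
#   for host in ("github.com", "teuthology.front.sepia.ceph.com"):
#       print(host, "resolves" if can_resolve(host) else "does NOT resolve")
```

Running the check for both an external name and an internal one helps distinguish "no DNS at all" from "internal nameservers are up but do not forward external queries".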
2024-10-11T05:56:55.230Z | <Guillaume Abrioux> @Dan Mick just came across this <https://jenkins.ceph.com/job/ceph-docs/>
Is this job still relevant? I see the last success was in 2021 and there's a job that has been waiting for 3 months |
2024-10-11T06:17:16.787Z | <Shraddha Agrawal> Folks, I am not able to access pulpito (<https://pulpito.ceph.com/>), returns 502 Bad Gateway. |
2024-10-11T06:20:23.730Z | <jcollin> +1 |
2024-10-11T06:45:13.574Z | <Vallari Agrawal> For now you can use [https://pulpito-ng.ceph.com/](https://pulpito-ng.ceph.com/) (after connecting to sepia VPN) |
2024-10-11T08:02:11.191Z | <Igor Golikov> Hi, I am not able to ssh to teuthology.front.sepia.ceph.com:
I get `port 22: Network is unreachable`. The VPN connects without any errors. |
2024-10-11T08:17:19.452Z | <jcollin> same here. |
2024-10-11T08:35:20.202Z | <Vallari Agrawal> For now you can use <https://pulpito-ng.ceph.com/> (after connecting to sepia VPN)
update: this isn't working either now (paddles is down) |
2024-10-11T13:19:53.533Z | <Zac Dover> Docs aren't building: |
2024-10-11T13:19:54.272Z | <Zac Dover> <https://github.com/ceph/ceph/pull/60248> |
2024-10-11T13:20:46.204Z | <Zac Dover> ```ERROR: Error cloning remote repo 'origin'
[Checks API] No suitable checks publisher found.
Setting status of 856505079bc837c65b6a9a0c70cd5f6220419ac0 to FAILURE with url https://jenkins.ceph.com/job/ceph-pr-docs/107739/ and message: 'Docs: failed with errors
'```
|
2024-10-11T14:21:04.900Z | <yuriw> pulpito is down `502 Bad Gateway`
@Zack Cerza @Adam Kraitman ^ |
2024-10-11T14:25:51.279Z | <yuriw> No I still can't build squid =>
[https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVA[…]entos9,DIST=centos9,MACHINE_SIZE=gigantic/83742//consoleFull](https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos9,DIST=centos9,MACHINE_SIZE=gigantic/83742//consoleFull)
Failed:
dbus-daemon-1:1.12.20-8.el9.x86_64 libstoragemgmt-1.10.1-1.el9.x86_64
rpcbind-1.2.6-7.el9.x86_64 Error: Transaction failed
Error: building at STEP "RUN echo "=== INSTALLING ===" ; dnf install -y --setopt=install_weak_deps=False --setopt=skip_missing_names_on_install=False --enablerepo=crb $(cat packages.txt)": while running runtime: exit status 1
Fri Oct 11 02:41:56 AM UTC 2024 :: rm -fr /tmp/install-deps.3458658
Build step 'Execute shell' marked build as failure
ref: <https://tracker.ceph.com/issues/68447>
@Laura Flores @Dan Mick ^^^ |
2024-10-11T14:31:53.089Z | <yuriw> looks like the `teuthology` box is down too |
2024-10-11T14:32:29.561Z | <yuriw> cc: @Laura Flores |
2024-10-11T14:35:13.330Z | <yuriw> `pulpito` and `teuthology` are down `502 Bad Gateway`
@Zack Cerza @Adam Kraitman ^ |
2024-10-11T15:27:46.371Z | <Laura Flores> @badone you should be able to access it with either the sepia VPN, or by changing the ref to `quay.ceph.io/ceph-ci/ceph` |
2024-10-11T15:28:14.285Z | <Laura Flores> Thx @Dan Mick, approved |
2024-10-11T15:33:53.510Z | <Laura Flores> @Adam Kraitman are you able to take a look into the above problems? ^ |
2024-10-11T16:40:17.361Z | <Zack Cerza> alright, so basically every host we have in RHEV is down once again. can't get to the LRC dashboard or the wiki. |
2024-10-11T16:40:59.339Z | <Dan Mick> Oh good. Be right there. |
2024-10-11T16:53:15.804Z | <Zack Cerza> tried to see why the ceph dashboard might be down; reesi006 had a very old copy of cephadm, and when I went to fetch the 19.2.0 version:
`curl: (6) Could not resolve host: github.com`
```root@reesi006:~# cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
# 127.0.0.53 is the systemd-resolved stub resolver.
# run "systemd-resolve --status" to see details about the actual nameservers.
nameserver 127.0.0.53
search front.sepia.ceph.com sepia.ceph.com
root@reesi006:~# systemd-resolve --status
systemd-resolve: command not found```
🫠 |
2024-10-11T17:07:11.429Z | <Dan Mick> 3 of 4 vhosts are down in rhev |
2024-10-11T17:07:55.601Z | <Dan Mick> looks like it's complaining about iscsi |
2024-10-11T17:10:49.719Z | <Dan Mick> reesi002's iscsi service down, poking |
2024-10-11T17:12:35.681Z | <Zack Cerza> starting around 12h ago, lots of messages like: `Host hv04 cannot access the Storage Domain(s) LRC_ISCSI attached to the Data Center sepia. Setting Host state to Non-Operational.`
the last data from the teuthology exporter were received right around then |
2024-10-11T17:12:49.896Z | <Dan Mick> nothing obvious, rebooting reesi002 |
2024-10-11T17:13:54.669Z | <Dan Mick> there is still a potential misconfiguration in the iscsi services that may be affecting failover/restart. maybe iscsi died for some reason and was unable to restart. The problem is the reconfiguration experiment may bring the cluster down too. Maybe I should try that now while it's down anyway |
2024-10-11T17:24:23.193Z | <Laura Flores> Thanks @Dan Mick and @Zack Cerza |
2024-10-11T17:27:07.319Z | <Dan Mick> looks like reboot hung, powercycling |
2024-10-11T17:42:07.896Z | <Dan Mick> ...and discovering the BIOS is set to hang on SMART warning, sigh |
2024-10-11T17:48:15.200Z | <Dan Mick> iscsi service up, failover configuration updated, vmhosts back active, VMs starting |
2024-10-11T17:49:10.030Z | <Dan Mick> teuthology up |
2024-10-11T17:50:14.951Z | <Dan Mick> resolv.conf was probably written with an older OS version. The command has been renamed to `resolvectl`; `resolvectl status` shows what you were after |
2024-10-11T17:53:49.753Z | <Zack Cerza> is the SMART warning something we should be concerned about? |
2024-10-11T17:56:03.199Z | <Zack Cerza> yeah thanks - I had found that and noticed it was configured to only point at internal nameservers; i was thinking about prepending the external one, but also noticed it wouldn't serve records for [github.com](http://github.com), and that made me not-so-confident in making changes |
2024-10-11T18:01:15.388Z | <Zack Cerza> huh, teuthology didn't actually fully go down, that's nice |
2024-10-11T18:05:20.653Z | <Dan Mick> it has the external one too |
2024-10-11T18:07:05.022Z | <yuriw> No I still can't build squid =>
[https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVA[…]entos9,DIST=centos9,MACHINE_SIZE=gigantic/83742//consoleFull](https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos9,DIST=centos9,MACHINE_SIZE=gigantic/83742//consoleFull)
Failed:
dbus-daemon-1:1.12.20-8.el9.x86_64 libstoragemgmt-1.10.1-1.el9.x86_64
rpcbind-1.2.6-7.el9.x86_64 Error: Transaction failed
Error: building at STEP "RUN echo "=== INSTALLING ===" ; dnf install -y --setopt=install_weak_deps=False --setopt=skip_missing_names_on_install=False --enablerepo=crb $(cat packages.txt)": while running runtime: exit status 1
Fri Oct 11 02:41:56 AM UTC 2024 :: rm -fr /tmp/install-deps.3458658
Build step 'Execute shell' marked build as failure
ref: <https://tracker.ceph.com/issues/68447>
@Laura Flores @Dan Mick ^^^ |
2024-10-11T18:15:12.528Z | <Dan Mick> DNS Servers: 172.21.0.1 172.21.0.2 158.69.67.47 |
2024-10-11T18:16:44.425Z | <Dan Mick> SMART: well, many of the reesis have a 'bad' status, but smartctl doesn't report very much wear and the drives seem OK. It's a little worrisome but... |
2024-10-11T18:17:13.559Z | <Zack Cerza> huh. `/etc/resolvconf/resolv.conf.d/*` only contain the first two. guess I have some reading to do on resolvectl |
2024-10-11T18:26:46.745Z | <Dan Mick> resolvconf is an older system and is probably not active on that host |
2024-10-11T18:27:40.342Z | <Dan Mick> it's really difficult to tell sometimes. resolvconf was a "run scripts on network events and twiddle resolv.conf contents" system. systemd-resolved is "I am the DNS server and I will handle all splits and servers" |
2024-10-11T18:28:16.761Z | <Dan Mick> and honestly systemd-resolved seems to work a lot closer to how I expect |
2024-10-11T18:28:44.508Z | <Dan Mick> but, like all systemd tools, is a little magic and you have to discover things that matter instead of just being told them in an organized way |
2024-10-11T18:33:08.370Z | <Zack Cerza> yeah, I had also looked through the systemd-resolved.service manpage, looking at the referenced config files and directories, and couldn't find one that existed and also had actual configuration in it. I'd missed `/run/systemd/resolve/resolv.conf`, which seems to be the Actual Config File |
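The difference between the stub `/etc/resolv.conf` shown earlier and the file systemd-resolved keeps under `/run/systemd/resolve/resolv.conf` can be checked mechanically. This is a rough sketch, not lab tooling: the function names `parse_resolv_conf` and `uses_stub_resolver` are hypothetical, and it only handles the `nameserver` and `search` keywords of the resolv.conf format.

```python
STUB_ADDRESS = "127.0.0.53"  # systemd-resolved's local stub listener


def parse_resolv_conf(text):
    """Return (nameservers, search_domains) from resolv.conf-style text."""
    nameservers, search = [], []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "search":
            # A later search line replaces an earlier one.
            search = fields[1:]
    return nameservers, search


def uses_stub_resolver(nameservers):
    """True if the file only points at the systemd-resolved stub."""
    return nameservers == [STUB_ADDRESS]


# Example use on a host (the paths are the ones mentioned in this thread):
#   for path in ("/etc/resolv.conf", "/run/systemd/resolve/resolv.conf"):
#       with open(path) as f:
#           servers, domains = parse_resolv_conf(f.read())
#       print(path, servers, domains, "stub" if uses_stub_resolver(servers) else "upstream")
```

If the first file reports only the stub address, the upstream servers that actually answer queries are the ones listed in the second file or in `resolvectl status`.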
2024-10-11T18:33:30.888Z | <Zack Cerza> just one more dns configuration system bro i swear. just one more |
2024-10-11T18:46:16.978Z | <Christina Meno> Nice work Dan! Would you please notify the sepia mailing list that you’ve recovered those iscsi based services? |
2024-10-11T18:51:20.203Z | <Dan Mick> <insert xkcd about standards here> |
2024-10-11T18:52:18.331Z | <Dan Mick> yes |
2024-10-11T18:53:41.562Z | <Christina Meno> thank you |
2024-10-11T19:45:44.775Z | <Laura Flores> Okay, so services should be back up thanks to @Dan Mick and @Zack Cerza. The only remaining issue I'm aware of, which isn't about the lab but about container creation, is that this still needs to be merged/backported: <https://github.com/ceph/ceph/pull/60255>
I reran the api check and will merge it when it's ready. |
2024-10-11T19:46:05.474Z | <Laura Flores> @yuriw and everyone else, if there are still issues you notice in the lab, pls post here. |
2024-10-11T19:58:39.965Z | <Dan Mick> I'm gonna go ahead and force-merge that fix. the api test isn't going to touch that code and it's a critical fix. |
2024-10-11T19:59:06.752Z | <Dan Mick> (and api succeeded anyway) |
2024-10-11T20:00:07.576Z | <Dan Mick> Laura: for backport, is it the same process as before? (invent a tracker, run the scripts etc.)? |
2024-10-11T20:38:28.779Z | <badone> @Laura Flores I can access it, just couldn't log in 🙂 |