2024-06-07T04:34:49.995Z | <Zac Dover> I've run into another failure during a routine string update in the /src directory: " 262 - unittest-omap-manager (Failed)" |
2024-06-07T04:34:54.826Z | <Zac Dover> <https://jenkins.ceph.com/job/ceph-pull-requests-arm64/57594/consoleFull#1073670958e840cee4-f4a4-4183-81dd-42855615f2c1> |
2024-06-07T04:35:06.128Z | <Zac Dover> <https://github.com/ceph/ceph/pull/57923> |
2024-06-07T04:35:22Z | <Zac Dover> I'm sorry for spamming this channel, but I don't know what else to do. |
2024-06-07T13:17:41.175Z | <Rishabh Dave> @Patrick Donnelly @Zack Cerza @Dan Mick I am using a Quincy nightly build from shaman to test reproducibility of a bug. Scheduling a run for the ~6-day-old Quincy build (`658e3c7068357222a961b3107ed1c91a5ab3a893`) works fine, but scheduling a run for the ~1-day-old Quincy build (`086e633da00cf25bd1c1c7d658229b6617c08335`) does not. I get a scheduling error - `teuthology.exceptions.ScheduleFailError: Scheduling rishabh-2024-06-07_13:12:04-fs:functional-quincy-testing-default-smithi failed: '086e633da00cf25bd1c1c7d658229b6617c08335' not found in repo: ceph-ci.git!`. |
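That `ScheduleFailError` means the sha could not be matched against the refs of ceph-ci.git. A quick way to check this by hand is `git ls-remote`; the sketch below fakes the ls-remote output (the branch names are invented) so it runs offline, and the live command is shown in a comment:

```shell
# Hypothetical triage for the ScheduleFailError above: teuthology resolves the
# sha against ceph-ci.git, so check whether the sha appears in that remote's
# refs at all. A live check would look like:
#   git ls-remote https://github.com/ceph/ceph-ci.git | grep "^<sha>"
# The ls-remote output below is faked (invented branch names) so this snippet
# runs without network access.
refs='658e3c7068357222a961b3107ed1c91a5ab3a893	refs/heads/wip-quincy-old
086e633da00cf25bd1c1c7d658229b6617c08335	refs/heads/wip-quincy-new'
sha='658e3c7068357222a961b3107ed1c91a5ab3a893'
if printf '%s\n' "$refs" | grep -q "^$sha"; then
    echo "found $sha"
else
    echo "missing $sha"
fi
```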
2024-06-07T13:18:23.212Z | <Milind Changire> +1 |
2024-06-07T16:12:40.594Z | <Casey Bodley> seeing lots of dead teuthology jobs due to
> teuthology.exceptions.AnsibleFailedError: ['Error getting key from: https://raw.githubusercontent.com/ceph/keys/autogenerated/ssh/@all.pub'] |
2024-06-07T16:25:27.985Z | <Christopher Hoffman> Some sort of dns issue maybe?
```vossi04 ~]$ curl https://raw.githubusercontent.com/ceph/keys/autogenerated/ssh/@all.pub
curl: (6) Could not resolve host: raw.githubusercontent.com``` |
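Curl's `(6) Could not resolve host` points at name resolution rather than anything HTTP-level. One way to separate the two failure modes is `getent hosts`, which goes through the same libc resolver path most clients use. A minimal sketch (the helper name is made up for illustration):

```shell
# Rough DNS triage: does the libc resolver return an address for the name?
# getent consults /etc/nsswitch.conf, so it sees the same view curl does.
check_dns() {
    if getent hosts "$1" > /dev/null 2>&1; then
        echo "resolves"
    else
        echo "no-resolve"
    fi
}

check_dns localhost                    # sanity check; should always resolve
check_dns raw.githubusercontent.com    # the failing name from the curl above
```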
2024-06-07T16:42:13.185Z | <Patrick Donnelly> probably related to centos8 fallout |
2024-06-07T16:43:19.597Z | <Zack Cerza> I'm seeing it on various machines in sepia as well |
2024-06-07T16:43:30.720Z | <Ronen Friedman> I am unable to ssh into o10.front.... or teuthology.front.
Is it a general problem? |
2024-06-07T16:44:09.896Z | <Zack Cerza> @Kamoltat (Junior) Sirivadhna so it wasn't just vossi01 it seems |
2024-06-07T16:46:44.512Z | <Kamoltat (Junior) Sirivadhna> yep so I’ve been trying to push to github.com/ceph/<my-username> (using vossi03) as the usual dev workflow but kept getting an error saying host doesn’t recognize git, guessing dns issues |
2024-06-07T16:49:02.258Z | <Zack Cerza> yes I'm getting `status: REFUSED` on vossi03 trying to `dig github.com` and `status: SERVFAIL` on vossi01, among others |
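For reference, dig's `status:` field is the DNS rcode: `REFUSED` means the server declined to answer (commonly a recursion ACL), while `SERVFAIL` means it tried and failed, e.g. upstream resolution broke. A small offline sketch of pulling the rcode out of a dig header (the header line is a canned example, not real output from these hosts):

```shell
# Extract the rcode from a dig response header. The header below is a canned
# example; normally you'd feed in real `dig github.com` output.
header=';; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 48315'
rcode=$(printf '%s\n' "$header" | sed -n 's/.*status: \([A-Z]*\),.*/\1/p')
echo "$rcode"    # REFUSED
```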
2024-06-07T16:52:29.765Z | <Zack Cerza> not just github:
```Jun 07 15:22:24 vpn-pub.localdomain named[22844]: client @0x7f83980a9060 8.43.84.3#46648 (download.copr.fedorainfracloud.org): query (cache) 'download.copr.fedorainfracloud.org/AAAA/IN' denied``` |
2024-06-07T16:59:10.922Z | <Zack Cerza> both ns1 and vpn-pub can resolve github.com locally |
2024-06-07T17:06:08.981Z | <Zack Cerza> @Dan Mick |
2024-06-07T17:09:59.708Z | <Dan Mick> There is a WAN link error reported for RDU, about 3 hours old |
2024-06-07T17:17:05.133Z | <Zack Cerza> ah that seems important |
2024-06-07T17:17:33.798Z | <Zack Cerza> do you know why we'd be seeing all the denials from vpn-pub though? |
2024-06-07T17:20:23.784Z | <Dan Mick> Not sure what that transaction was that caused that error or exactly what the error means. I doubt vpn-pub does ipv6 at all, maybe that? |
2024-06-07T17:23:37.934Z | <Zack Cerza> this is caused by the DNS issues we're seeing; git.ceph.com mirrors haven't been able to update |
2024-06-07T17:24:37.074Z | <Zack Cerza> pausing the queue because of the ongoing network issues |
2024-06-07T17:24:50.490Z | <Dan Mick> ...and the DNS issues are surely because of the upstream provider outage. Another cable cut.
Don't know why the redundant link isn't working again. |
2024-06-07T17:25:36.434Z | <Zack Cerza> they just love cutting those cables huh |
2024-06-07T17:27:33.189Z | <Christopher Hoffman> This is working on teuthology.front.sepia.com
`curl https://raw.githubusercontent.com/ceph/keys/autogenerated/ssh/@all.pub` |
2024-06-07T17:28:13.290Z | <nehaojha> and I was wondering if it is just me |
2024-06-07T17:29:52.305Z | <Christopher Hoffman> If it's due to a redundant link being down, is it possible their LACP or trunk is only set to hash at layer 2? That could explain why, when links go down, only a subset of hosts are affected |
2024-06-07T17:31:03.749Z | <Zack Cerza> if by "that error" you mean the line I pasted, that's one of hundreds per second on vpn-pub |
2024-06-07T17:31:29.016Z | <Zack Cerza> hm, but are they all v6? checking |
2024-06-07T17:31:50.160Z | <Zack Cerza> nope, v4 also |
2024-06-07T17:33:28.909Z | <Christopher Hoffman> Do we use bonding of network interfaces on the nameservers? |
2024-06-07T17:41:57.916Z | <Dan Mick> ok, let's try and log into vpn-pub and search around for logs to see what "denials" means |
2024-06-07T17:43:38.557Z | <Dan Mick> @Christopher Hoffman no, the redundancy I'm talking about is at the data center for the WAN uplinks |
2024-06-07T17:43:49.310Z | <Dan Mick> denials: ok, journalctl, from named |
2024-06-07T17:46:15.958Z | <Laura Flores> Looks like CI checks on PRs are also affected by the DNS issues:
<https://jenkins.ceph.com/job/ceph-pull-requests/136449/console>
```No credentials specified
Wiping out workspace first.
Cloning the remote Git repository
Using shallow clone with depth 1
Honoring refspec on initial clone
Cloning repository https://github.com/ceph/ceph.git
> git init /home/jenkins-build/build/workspace/ceph-pull-requests # timeout=10
Fetching upstream changes from https://github.com/ceph/ceph.git
> git --version # timeout=10
> git --version # 'git version 2.34.1'
> git fetch --tags --force --progress --depth=1 -- https://github.com/ceph/ceph.git +refs/pull/57934/*:refs/remotes/origin/pr/57934/* # timeout=20
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --force --progress --depth=1 -- https://github.com/ceph/ceph.git +refs/pull/57934/*:refs/remotes/origin/pr/57934/*" returned status code 128:
stdout:
stderr: fatal: unable to access 'https://github.com/ceph/ceph.git/': Could not resolve host: github.com```
I know this is a known issue, but in case anyone is wondering about the CI checks, I'm posting for awareness. |
2024-06-07T17:46:44.378Z | <Dan Mick> seems like that message may mean "you tried to send me a recursive query but I don't do that" |
2024-06-07T17:50:26.239Z | <Dan Mick> ns1.front handled a query to www.google.com |
2024-06-07T17:55:03.350Z | <Dan Mick> all of those denials are coming from WAN addresses (probably gw) which might not be hip to the idea that vpn-pub should only be used for certain domains, but probably wouldn't be trying if the public DNS wasn't failing |
2024-06-07T17:57:41.706Z | <Zack Cerza> ohhh that makes some sense. is there a reason we shouldn't allow recursion from gw? |
2024-06-07T17:59:41.287Z | <Dan Mick> I mean it's extra load on vpn-pub to say "no" a thousand times, but it should be temporary, and probably it's the famous old "linux took forever to really wrap around domain-specific lookup in the client" |
2024-06-07T17:59:49.201Z | <Dan Mick> (although of course Windows took even longer) |
2024-06-07T18:06:16.780Z | <Dan Mick> actually the address is most often .3 which is probably the gateway to get to vpn, so it could be hosts on the lab net too |
2024-06-07T18:06:21.981Z | <Dan Mick> with the original request |
2024-06-07T18:07:09.862Z | <Dan Mick> maybe the local ns'es could be told not to consult vpn-pub except for sepia |
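One way to express both ideas in BIND (refuse recursion from outside while still serving the lab's own zone authoritatively) is an ACL plus `allow-recursion`. The fragment below is a hypothetical sketch, not the lab's actual config; the network ranges and file path are invented for the example:

```
// Hypothetical named.conf fragment -- illustrative only.
// Addresses and zone file paths are invented for the example.
acl lab-nets {
    10.0.0.0/8;        // example internal range
    172.21.0.0/16;     // example VPN range
};

options {
    // answer recursive queries only for internal clients;
    // everyone else gets REFUSED, as in the dig output earlier
    allow-recursion { lab-nets; };
};

// still serve the lab's own zone authoritatively to anyone
zone "sepia.ceph.com" {
    type master;
    file "/var/named/db.sepia.ceph.com";
};
```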
2024-06-07T21:17:08.939Z | <Sepia OpenShift> [FIRING:1] teuthology (SmithiQueuePaused metrics teuthology.front.sepia.ceph.com:61764 teuthology-exporter smithi 2099989 openshift-user-workload-monitoring/user-workload teuthology-exporter warning) | https://console-openshift-console.apps.os.sepia.ceph.com/monitoring/#/alerts?receiver=%23sepia |
2024-06-07T22:11:00.945Z | <Dan Mick> well the DNS storm seems to have passed through without me finding a culprit between other things |
2024-06-07T22:12:54.704Z | <Dan Mick> poking around a bit I don't have problems. Anyone else still seeing issues? |