ceph - sepia - 2024-06-07

Timestamp (UTC) | Message
2024-06-07T04:34:49.995Z
<Zac Dover> I've run into another failure during a routine string update in the /src directory: "262 - unittest-omap-manager (Failed)"
2024-06-07T04:34:54.826Z
<Zac Dover> <https://jenkins.ceph.com/job/ceph-pull-requests-arm64/57594/consoleFull#1073670958e840cee4-f4a4-4183-81dd-42855615f2c1>
2024-06-07T04:35:06.128Z
<Zac Dover> <https://github.com/ceph/ceph/pull/57923>
2024-06-07T04:35:22Z
<Zac Dover> I'm sorry for spamming this channel, but I don't know what else to do.
2024-06-07T13:17:41.175Z
<Rishabh Dave> @Patrick Donnelly @Zack Cerza @Dan Mick I am using a Quincy nightly build from shaman to test the reproducibility of a bug. Scheduling a run for a ~6-day-old Quincy build (`658e3c7068357222a961b3107ed1c91a5ab3a893`) works fine, but scheduling a run for a ~1-day-old Quincy build (`086e633da00cf25bd1c1c7d658229b6617c08335`) does not. I get a scheduling error: `teuthology.exceptions.ScheduleFailError: Scheduling rishabh-2024-06-07_13:12:04-fs:functional-quincy-testing-default-smithi failed: '086e633da00cf25bd1c1c7d658229b6617c08335' not found in repo: ceph-ci.git!`.
2024-06-07T13:18:23.212Z
<Milind Changire> +1
2024-06-07T16:12:40.594Z
<Casey Bodley> seeing lots of dead teuthology jobs due to
> teuthology.exceptions.AnsibleFailedError: ['Error getting key from: <https://raw.githubusercontent.com/ceph/keys/autogenerated/ssh/@all.pub>']
2024-06-07T16:25:27.985Z
<Christopher Hoffman> Some sort of dns issue maybe?
```
[vossi04 ~]$ curl https://raw.githubusercontent.com/ceph/keys/autogenerated/ssh/@all.pub
curl: (6) Could not resolve host: raw.githubusercontent.com
```
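For context, curl exit code 6 means the failure happened at name resolution, before any HTTP traffic was attempted. A minimal sketch (plain Python stdlib, nothing lab-specific; the helper name is hypothetical) of checking that same thing directly through the system resolver:

```python
import socket

def can_resolve(host):
    """Return True if `host` resolves via the system resolver.

    curl's error 6 ("Could not resolve host") corresponds to this
    lookup failing, so checking resolution directly helps separate
    DNS problems from HTTP-level ones.
    """
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False
```

Running this for each affected hostname from a few lab machines would show whether the failure is resolver-side rather than per-service.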
2024-06-07T16:42:13.185Z
<Patrick Donnelly> probably related to centos8 fallout
2024-06-07T16:43:19.597Z
<Zack Cerza> I'm seeing it on various machines in sepia as well
2024-06-07T16:43:30.720Z
<Ronen Friedman> I am unable to ssh into o10.front.... or teuthology.front.
Is it a general problem?
2024-06-07T16:44:09.896Z
<Zack Cerza> @Kamoltat (Junior) Sirivadhna so it wasn't just vossi01 it seems
2024-06-07T16:46:44.512Z
<Kamoltat (Junior) Sirivadhna> yep so I’ve been trying to push to github.com/ceph/<my-username> (using vossi03) as the usual dev workflow but kept getting an error saying the host doesn’t recognize git; guessing DNS issues
2024-06-07T16:49:02.258Z
<Zack Cerza> yes I'm getting `status: REFUSED` on vossi03 trying to `dig github.com`, and `status: SERVFAIL` on vossi01, among others
2024-06-07T16:52:29.765Z
<Zack Cerza> not just github:
```
Jun 07 15:22:24 vpn-pub.localdomain named[22844]: client @0x7f83980a9060 8.43.84.3#46648 (download.copr.fedorainfracloud.org): query (cache) 'download.copr.fedorainfracloud.org/AAAA/IN' denied
```
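When these denials arrive at "hundreds per second", tallying them by source address is the quickest way to find the noisiest clients. A hedged sketch of a parser for journal lines in the format pasted above (the field layout is an assumption based on that one log line; other BIND versions may format the client field differently):

```python
import re
from collections import Counter

# Pattern modeled on the pasted named log line:
#   ... named[PID]: client @0x... IP#PORT (qname): query (cache) 'QNAME/TYPE/IN' denied
DENIED_RE = re.compile(
    r"named\[\d+\]: client \S+ (?P<ip>[\d.]+)#\d+ "
    r"\((?P<qname>[^)]+)\): query \(cache\) '(?P<query>[^']+)' denied"
)

def count_denials_by_client(lines):
    """Tally denied queries per source IP to spot the noisiest clients."""
    counts = Counter()
    for line in lines:
        m = DENIED_RE.search(line)
        if m:
            counts[m.group("ip")] += 1
    return counts
```

Fed with `journalctl -u named` output, this would have shown immediately that most denials came from one address (the `.3` gateway noted later in the thread).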
2024-06-07T16:59:10.922Z
<Zack Cerza> both ns1 and vpn-pub can resolve github.com locally
2024-06-07T17:06:08.981Z
<Zack Cerza> @Dan Mick
2024-06-07T17:09:59.708Z
<Dan Mick> There is a WAN link error reported for RDU, about 3 hours old
2024-06-07T17:17:05.133Z
<Zack Cerza> ah that seems important
2024-06-07T17:17:33.798Z
<Zack Cerza> do you know why we'd be seeing all the denials from vpn-pub though?
2024-06-07T17:20:23.784Z
<Dan Mick> Not sure what the transaction was that caused that error or exactly what the error means. I doubt vpn-pub does IPv6 at all, maybe that?
2024-06-07T17:23:37.934Z
<Zack Cerza> this is caused by the DNS issues we're seeing; git.ceph.com mirrors haven't been able to update
2024-06-07T17:24:37.074Z
<Zack Cerza> pausing the queue because of the ongoing network issues
2024-06-07T17:24:50.490Z
<Dan Mick> ...and the DNS issues are surely because of the upstream provider outage.  Another cable cut.
  Don't know why the redundant link isn't working again.
2024-06-07T17:25:36.434Z
<Zack Cerza> they just love cutting those cables huh
2024-06-07T17:27:33.189Z
<Christopher Hoffman> This is working on teuthology.front.sepia.com
`curl https://raw.githubusercontent.com/ceph/keys/autogenerated/ssh/@all.pub`
2024-06-07T17:28:13.290Z
<nehaojha> and I was wondering if it is just me
2024-06-07T17:29:52.305Z
<Christopher Hoffman> If it's due to a redundant link being down, is it possible their LACP or trunk is only set to hash at layer 2? That could explain why only a subset of hosts is affected when links go down.
2024-06-07T17:31:03.749Z
<Zack Cerza> if by "that error" you mean the line I pasted, that's one of hundreds per second on vpn-pub
2024-06-07T17:31:29.016Z
<Zack Cerza> hm, but are they all v6? checking
2024-06-07T17:31:50.160Z
<Zack Cerza> nope, v4 also
2024-06-07T17:33:28.909Z
<Christopher Hoffman> Do we use bonding of network interfaces on the nameservers?
2024-06-07T17:41:57.916Z
<Dan Mick> ok, let's try and log into vpn-pub and search around for logs to see what "denials" means
2024-06-07T17:43:38.557Z
<Dan Mick> @Christopher Hoffman no, the redundancy I'm talking about is at the data center for the WAN uplinks
2024-06-07T17:43:49.310Z
<Dan Mick> denials: ok, journalctl, from named
2024-06-07T17:46:15.958Z
<Laura Flores> Looks like CI checks on PRs are also affected by the DNS issues:
<https://jenkins.ceph.com/job/ceph-pull-requests/136449/console>
```
No credentials specified
Wiping out workspace first.
Cloning the remote Git repository
Using shallow clone with depth 1
Honoring refspec on initial clone
Cloning repository https://github.com/ceph/ceph.git
 > git init /home/jenkins-build/build/workspace/ceph-pull-requests # timeout=10
Fetching upstream changes from https://github.com/ceph/ceph.git
 > git --version # timeout=10
 > git --version # 'git version 2.34.1'
 > git fetch --tags --force --progress --depth=1 -- https://github.com/ceph/ceph.git +refs/pull/57934/*:refs/remotes/origin/pr/57934/* # timeout=20
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --force --progress --depth=1 -- https://github.com/ceph/ceph.git +refs/pull/57934/*:refs/remotes/origin/pr/57934/*" returned status code 128:
stdout: 
stderr: fatal: unable to access 'https://github.com/ceph/ceph.git/': Could not resolve host: github.com
```
I know this is a known issue, but in case anyone is wondering about the CI checks, I'm posting for awareness.
2024-06-07T17:46:44.378Z
<Dan Mick> seems like that message may mean "you tried to send me a recursive query but I don't do that"
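For reference, that "denied" response is what BIND produces when a client outside its recursion ACL sends a recursive query. A minimal `named.conf` sketch of the options involved (the ACL name and networks are placeholders, not the actual vpn-pub configuration):

```
// Hypothetical named.conf fragment -- not the actual vpn-pub config.
acl "trusted" {
    127.0.0.0/8;
    10.0.0.0/8;      // placeholder for the lab networks
};

options {
    recursion yes;
    allow-recursion { trusted; };   // queries from outside the ACL are logged as denied
};
```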
2024-06-07T17:50:26.239Z
<Dan Mick> ns1.front handled a query to www.google.com
2024-06-07T17:55:03.350Z
<Dan Mick> all of those denials are coming from WAN addresses (probably gw), which might not be hip to the idea that vpn-pub should only be used for certain domains, but probably wouldn't be trying if public DNS weren't failing
2024-06-07T17:57:41.706Z
<Zack Cerza> ohhh that makes some sense. is there a reason we shouldn't allow recursion from gw?
2024-06-07T17:59:41.287Z
<Dan Mick> I mean it's extra load on vpn-pub to say "no" a thousand times, but it should be temporary, and probably it's the famous old "linux took forever to really wrap around domain-specific lookup in the client"
2024-06-07T17:59:49.201Z
<Dan Mick> (although of course Windows took even longer)
2024-06-07T18:06:16.780Z
<Dan Mick> actually the address is most often .3 which is probably the gateway to get to vpn, so it could be hosts on the lab net too
2024-06-07T18:06:21.981Z
<Dan Mick> with the original request
2024-06-07T18:07:09.862Z
<Dan Mick> maybe the local ns'es could be told not to consult vpn-pub except for sepia
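That selective forwarding can be expressed in BIND with a per-zone `forwarders` block; a hedged sketch for the local nameservers (the zone name follows the thread, the address is a placeholder for vpn-pub):

```
// Hypothetical fragment: consult vpn-pub only for the sepia zone,
// and resolve everything else through the normal resolution path.
zone "sepia.ceph.com" {
    type forward;
    forward only;
    forwarders { 192.0.2.10; };   // placeholder address for vpn-pub
};
```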
2024-06-07T21:17:08.939Z
<Sepia OpenShift> [FIRING:1] teuthology (SmithiQueuePaused metrics teuthology.front.sepia.ceph.com:61764 teuthology-exporter smithi 2099989 openshift-user-workload-monitoring/user-workload teuthology-exporter warning) | https://console-openshift-console.apps.os.sepia.ceph.com/monitoring/#/alerts?receiver=%23sepia
2024-06-07T22:11:00.945Z
<Dan Mick> well the DNS storm seems to have passed through without me finding a culprit, between other things
2024-06-07T22:12:54.704Z
<Dan Mick> poking around a bit I don't have problems.  Anyone else still seeing issues?

Any issues? Please create an issue here and use the infra label.