ceph - sepia - 2024-06-13

Timestamp (UTC) / Message
2024-06-13T06:33:59.962Z
<Adam Kraitman> It happened while you ran this bulk operation on 8 issues?
2024-06-13T06:55:45.371Z
<Leonid Usov> no, unrelated. Happened when I was working with a single issue.
2024-06-13T06:57:16.459Z
<Leonid Usov> I posted that screenshot immediately when I got it, plus a few seconds to find the thread. Hopefully, you can find relevant traces in the logs
2024-06-13T11:22:17.144Z
<Leonid Usov> Can we please configure the `QA Approved` status of the `QA Run` tracker type to be “closing” or “resolving”, i.e. a ticket in that status will be shown as ~~completed~~, and issues that are linked as “blocked by” this ticket will be resolvable.
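(As an aside, a hedged way to verify such a change once it's made: Redmine's REST API exposes whether a status counts as closed. The API-key header and jq usage below are assumptions, not the actual config steps.)
```
# rough sketch: list issue statuses and check whether "QA Approved" is flagged as closed;
# adjust the key/endpoint for the actual tracker setup
curl -s -H "X-Redmine-API-Key: $REDMINE_API_KEY" \
  https://tracker.ceph.com/issue_statuses.json |
  jq '.issue_statuses[] | select(.name == "QA Approved") | {name, is_closed}'
```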
2024-06-13T15:03:33.874Z
<Adam Kraitman> Maybe there is some plugin that adds that functionality to Redmine. I can test it if you find something in that list that does it: <https://www.redmine.org/plugins?utf8=%E2%9C%93&page=1&sort=&v=5.1>
2024-06-13T15:18:29.513Z
<Leonid Usov> Hm.. What does the `QA Closed` state mean? Should we move to that after approving a qa run?
2024-06-13T15:19:46.925Z
<Leonid Usov> OK it’s actually documented. Let me try and see if that state works; if so, it’s a better fit
2024-06-13T15:19:50.891Z
<Leonid Usov> https://files.slack.com/files-pri/T1HG3J90S-F077TA2N31U/download/image.png
2024-06-13T15:19:52.325Z
<Leonid Usov> <https://tracker.ceph.com/projects/ceph-qa/wiki>
2024-06-13T15:43:16.981Z
<yuriw> `QA Closed` means what it says "closed" and all PRs were merged
2024-06-13T15:44:51.272Z
<Leonid Usov> Yes, this worked. We’ll follow suit by closing approved runs
2024-06-13T15:46:49.201Z
<Leonid Usov> Are there other reasons to have a QA Run in the QA Closed state besides being approved first?
2024-06-13T15:48:08.361Z
<yuriw> It could be that for whatever reason you decide not to test, "untag" the PRs, and stop testing this batch; then you'd close the tracker IMO
2024-06-13T15:48:24.671Z
<yuriw> that's not too often tho
2024-06-13T15:53:50.880Z
<Leonid Usov> yeah.. so that was the reason I hadn’t considered “Closed” before. Without an explicit Rejected state, Closed becomes ambiguous. And there may be value in leaving a run in a final state that records approval
2024-06-13T16:40:23.466Z
<Patrick Donnelly> @Adam Kraitman @Dan Mick can this be addressed easily? <https://tracker.ceph.com/issues/66337>
2024-06-13T16:40:35.315Z
<Patrick Donnelly> I'd like to rotate the client.admin key but this (at least) blocks that
2024-06-13T16:56:47.303Z
<Yuval Lifshitz> i am trying to test my shaman build, which looks all green: <https://shaman.ceph.com/builds/ceph/wip-yuval-64305/7e679576a8082e7a83db4ceb2120950fe445aa4a/>
but i get this error in teuthology:
```teuthology.exceptions.ScheduleFailError: Scheduling yuvalif-2024-06-13_16:55:06-rgw:notifications-wip-yuval-64305-distro-default-smithi failed: Packages for os_type 'centos', flavor default and ceph hash '7e679576a8082e7a83db4ceb2120950fe445aa4a' not found```
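(A hedged way to cross-check what shaman actually built for that sha1; the /api/search endpoint and its fields are assumptions about the shaman API, not something confirmed in this thread.)
```
# sketch: list distro/flavor combinations with ready packages for this hash
curl -s 'https://shaman.ceph.com/api/search/?project=ceph&sha1=7e679576a8082e7a83db4ceb2120950fe445aa4a&status=ready' |
  jq -r '.[] | "\(.distro)/\(.distro_version) \(.flavor) \(.status)"'
```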
2024-06-13T16:56:58.873Z
<Yuval Lifshitz> is this related to the centos8 change?
2024-06-13T16:58:39.939Z
<yuriw> You're likely hitting the issue that centos8 went EOL and was removed
Try running on c9 only and/or modify the tests not to use c8
2024-06-13T17:02:47.381Z
<Yuval Lifshitz> all of the related tests are pointing to centos_latest.yaml and ubuntu_latest.yaml
2024-06-13T17:02:59.447Z
<Yuval Lifshitz> no centos8 there
2024-06-13T17:03:13.829Z
<yuriw> are you sure those point to c9?
2024-06-13T17:06:39.783Z
<Yuval Lifshitz> yes. i think this has been the case for quite some time
2024-06-13T17:10:02.364Z
<Yuval Lifshitz> this was done about a year ago: <https://github.com/ceph/ceph/commit/a85f50c24bd478144fa02df39af47944cb7bc33e>
2024-06-13T17:53:22.292Z
<Adam King> Someone else (forget who) told me to add `--distro centos --distro-version 9` to my teuthology-suite commands and that seems to work. It also doesn't filter the jobs to only the centos 9 ones, in case you wanted ubuntu jobs as well.
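(For illustration, roughly what that looks like on the command line; this reuses the branch and suite from the thread, but treat it as a sketch rather than the exact command anyone ran.)
```
# sketch: pin the default distro to centos 9 without filtering out the ubuntu jobs
teuthology-suite -v -m smithi -s rgw:notifications -c wip-yuval-64305 \
  --ceph-repo ceph-ci --distro centos --distro-version 9
```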
2024-06-13T18:00:11.441Z
<Dan Mick> it can at least be addressed.  fog and signer are not hard; chacra is a little more complex just because restarting it can interrupt builds and we don't have a great interlock for that.
2024-06-13T18:01:02.114Z
<Dan Mick> I noticed this week that our OCP instance is also using the LRC as backing store for some things, and I'm not at all sure I know what secret(s) it's using.  Have you cataloged that at all?
2024-06-13T18:19:49.904Z
<Zack Cerza> Is your teuthology copy out-of-date? The default CentOS version was updated to 9 a few weeks ago.
2024-06-13T18:22:15.792Z
<Zack Cerza> This worked for me just now: `teuthology-suite -v --owner zmc -s rgw:notifications -m smithi --priority 9000 -c wip-yuval-64305 --ceph-repo ceph-ci`
2024-06-13T18:32:40.810Z
<Patrick Donnelly> the three instances noted in the ticket are the places I saw the admin credential being used
2024-06-13T18:32:55.970Z
<Patrick Donnelly> if we get rid of those three, I can probably see better if anything else is using the admin key even outside of cephfs
2024-06-13T18:44:11.444Z
<Dan Mick> my teuthology was out of date.  make sure you pull the new commits @Yuval Lifshitz
2024-06-13T18:45:12.303Z
<Dan Mick> yeah, because there are definitely consumers of rbd and rgw too, and I'd be shocked if none of them were using client.admin
2024-06-13T18:46:39.318Z
<yuriw> I wonder, if adding `--distro centos --distro-version 9` helps, why do we have to use it?  It seems like it should not be required
2024-06-13T18:47:18.230Z
<Dan Mick> agreed.  according to Zack's post it's not
2024-06-13T18:47:24.255Z
<Yuval Lifshitz> thanks! updating teuthology fixed the issue
2024-06-13T18:49:31.551Z
<yuriw> I actually keep teuthology updates automatic via crontab, but sometimes it requires rerunning `bootstrap` and that has to be done manually AFAIK
Maybe we can rerun `bootstrap` on schedule as well 🤷‍♂️
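(A hypothetical sketch of what that could look like; the schedule, paths, and log file are placeholders, and chaining `bootstrap` after the pull is just the idea floated above.)
```
# hypothetical crontab entry: pull teuthology nightly and rerun bootstrap so
# dependency changes get picked up as well; paths and timing are placeholders
0 5 * * * cd $HOME/teuthology && git pull --ff-only && ./bootstrap >> $HOME/teuthology-update.log 2>&1
```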
2024-06-13T18:51:12.976Z
<yuriw> what do you do @Zack Cerza to stay kosher?
2024-06-13T19:31:17.791Z
<Dan Mick> this is a really dumb question, but: where does radosgw store its ceph auth secret?
2024-06-13T19:32:32.051Z
<Dan Mick> I would have expected a keyring file in /etc/ceph on the rgw host
2024-06-13T19:33:52.690Z
<Dan Mick> oh.  /var/lib/ceph/<instance> for containerized daemons.  nm
2024-06-13T19:38:25.587Z
<Dan Mick> ok, rgw isn't affected; its ceph auth is local to the rgw host (in the LRC).  (of course).  I think OCP only uses RGW.
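(For anyone retracing this check, a rough sketch of where to look; the exact layout under /var/lib/ceph is deployment-specific, so treat the paths as assumptions.)
```
# classic location for keyrings on a non-containerized host
ls /etc/ceph/*.keyring 2>/dev/null
# containerized daemons keep their keyring under their per-instance data dir
sudo find /var/lib/ceph -maxdepth 3 -name keyring 2>/dev/null
```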
2024-06-13T19:43:33.627Z
<Zack Cerza> I guess when I see something unexpected I check what version I have and update if I can - both for things I work on, and things I don't
2024-06-13T19:43:43.115Z
<Zack Cerza> I suppose some sort of update notification could be useful here though?
2024-06-13T20:28:43.129Z
<Dan Mick> @Patrick Donnelly fog didn't seem to have an issue mounting a cephfs without a ceph.conf, but signer is having issues and seems to be demanding a ceph.conf.  Did something change in later kernel versions or something?  I don't know why it would need it; in both cases the mon addresses are in the fstab line
2024-06-13T20:29:00.731Z
<Dan Mick>
```
did not load config file, using default settings.
2024-06-13T13:21:40.396-0700 7f852f48bf40 -1 Errors while parsing config file!
2024-06-13T13:21:40.396-0700 7f852f48bf40 -1 can't open ceph.conf: (2) No such file or directory
2024-06-13T13:21:40.396-0700 7f852f48bf40 -1 Errors while parsing config file!
2024-06-13T13:21:40.396-0700 7f852f48bf40 -1 can't open ceph.conf: (2) No such file or directory
unable to get monitor info from DNS SRV with service name: ceph-mon
2024-06-13T13:21:40.400-0700 7f852f48bf40 -1 failed for service _ceph-mon._tcp
2024-06-13T13:21:40.400-0700 7f852f48bf40 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory
```
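(For reference, a sketch of the kind of fstab line under discussion: mon addresses inline, a subdirectory mount, and a dedicated cephx user. Every address, name, and path below is a placeholder, not the real LRC configuration.)
```
# placeholder kernel-cephfs fstab entry mounting the /signer subdirectory as a
# non-admin cephx user whose secret lives in a file
10.0.0.1:6789,10.0.0.2:6789,10.0.0.3:6789:/signer  /mnt/signer  ceph  name=signer,secretfile=/etc/ceph/signer.secret,noatime,_netdev  0 0
```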
2024-06-13T20:32:27.278Z
<Dan Mick> oh.  those are lies.
2024-06-13T20:32:37.676Z
<Dan Mick> perhaps the issue is that the cephfs auth doesn't allow access to /signer
2024-06-13T20:36:56.029Z
<Dan Mick> ok, stuck; I can't find a cephfs named signer
2024-06-13T20:58:47.552Z
<Patrick Donnelly> sorry, was afk
2024-06-13T20:58:52.701Z
<Patrick Donnelly> you figured out signer?
2024-06-13T20:58:55.944Z
<Patrick Donnelly> from the ticket update i just saw
2024-06-13T20:59:46.900Z
<Dan Mick> yeah.  I don't know what the issue was; perhaps a transient failure is the best I've got.  I was distracted by 1) the error noise and 2) not being certain that fstab entries using subdirs were actually supported
2024-06-13T21:00:05.337Z
<Dan Mick> but it's working.  I've just restarted chacra and am making sure it's working
2024-06-13T21:01:02.130Z
<Dan Mick> looks good
2024-06-13T21:02:20.667Z
<Patrick Donnelly> I'm seeing this new one:
2024-06-13T21:02:26.827Z
<Patrick Donnelly> https://files.slack.com/files-pri/T1HG3J90S-F0781ST8PK5/download/untitled
2024-06-13T21:02:38.735Z
<Patrick Donnelly> fyi Dan I'm just using this command:
2024-06-13T21:02:58.074Z
<Patrick Donnelly> https://files.slack.com/files-pri/T1HG3J90S-F078ELBSGAD/download/untitled
2024-06-13T21:03:17.797Z
<Patrick Donnelly> then grep `admin`
2024-06-13T21:03:40.319Z
<Dan Mick> yeah, at one point I had only changed 'secret' and not 'name' in the mount line
2024-06-13T21:03:49.270Z
<Patrick Donnelly> oh
2024-06-13T21:04:22.198Z
<Dan Mick> there should be no active session for name=admin from signer now
2024-06-13T21:04:44.222Z
<Patrick Donnelly> I still see 2
2024-06-13T21:04:48.116Z
<Patrick Donnelly> perhaps lazy unmounts?
2024-06-13T21:04:56.655Z
<Dan Mick> I just used umount
2024-06-13T21:05:02.124Z
<Patrick Donnelly> O.o
2024-06-13T21:05:09.620Z
<Patrick Donnelly> well let's give it a little time to see
2024-06-13T21:05:28.486Z
<Patrick Donnelly> okay so if that's truly fixed then we're pretty close to done
2024-06-13T21:05:30.279Z
<Dan Mick> hm.  "mount | grep ceph" shows four mounts of the same fs
2024-06-13T21:05:37.191Z
<Dan Mick> wt...
2024-06-13T21:05:38.302Z
<Patrick Donnelly> overlaid mounts?
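(One hedged way to check for that; the mountpoint below is a placeholder.)
```
# sketch: findmnt shows every ceph mount, including several stacked on one target
findmnt -t ceph
# unmount repeatedly until nothing is left mounted at the (placeholder) mountpoint
while mountpoint -q /mnt/signer; do sudo umount /mnt/signer; done
```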
2024-06-13T21:06:18.238Z
<Patrick Donnelly> I have to go afk again but you can use `for v in $(ceph mon stat --format=json-pretty | jq -r '.quorum | map(.name) | .[]'); do ceph tell mon.$v sessions ; done  | less` to hunt for more uses of the admin key
2024-06-13T21:06:20.410Z
<Patrick Donnelly> i see at least 4 more
2024-06-13T21:06:23.752Z
<Dan Mick> well it beats me, but I umounted 4 times
2024-06-13T21:06:30.691Z
<Dan Mick> none failed, and now mount shows none
2024-06-13T21:06:44.543Z
<Dan Mick> now remounted
2024-06-13T21:07:23.329Z
<Patrick Donnelly> admin sessions appear to be gone
2024-06-13T21:07:31.538Z
<Patrick Donnelly> so all cephfs mounts with client.admin are gone; yay!
2024-06-13T21:07:35.981Z
<Dan Mick> wacky.  cool
2024-06-13T21:07:51.661Z
<Patrick Donnelly> but ya, you can use the above command I pasted to hunt for more uses
2024-06-13T21:08:02.404Z
<Patrick Donnelly> those are mon client sessions so it should capture everything
2024-06-13T21:08:08.482Z
<Patrick Donnelly> except transient clients (like some crontab job)
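(Putting the pieces of that together, a hedged one-liner for the hunt; the grep pattern assumes admin sessions show up as `client.admin` in the dump.)
```
# sketch: dump sessions from every quorum mon and keep only client.admin entries
for v in $(ceph mon stat --format=json-pretty | jq -r '.quorum | map(.name) | .[]'); do
  ceph tell mon.$v sessions
done | grep 'client.admin'
```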
2024-06-13T21:08:26.765Z
<Patrick Donnelly> Thanks Dan!
2024-06-13T21:08:47.901Z
<Dan Mick> yw
2024-06-13T21:08:50.897Z
<Dan Mick> yw
2024-06-13T21:09:45.328Z
<Dan Mick> if and when you do rotate the key, please lmk
2024-06-13T22:43:46.747Z
<yuriw> Definitely 👍   I am not even sure how to check the version.
I just do git pull and bootstrap

Any issue? Please create an issue here and use the infra label.