ceph - cephfs - 2024-07-09

Timestamp (UTC) | Message
2024-07-09T05:40:38.608Z
<Venky Shankar> Hey @Jos Collin
2024-07-09T05:43:50.431Z
<Jos Collin> RBD Mirror doesn't use Throttle.h. It uses its own Throttler.h `ceph/src/tools/rbd_mirror/Throttler.h`
2024-07-09T05:44:13.331Z
<Jos Collin> hi @Venky Shankar
2024-07-09T05:45:18.689Z
<Venky Shankar> I'll let you know soon how to proceed with the quincy runs for this backport
2024-07-09T05:45:24.280Z
<Venky Shankar> since it's a bit tricky
2024-07-09T05:45:52.413Z
<Jos Collin> ok, I'll wait for your instructions.
2024-07-09T05:46:33.818Z
<Jos Collin> @Venky Shankar RBD Mirror doesn't use Throttle.h. It uses its own Throttler.h `ceph/src/tools/rbd_mirror/Throttler.h`
2024-07-09T05:59:23.517Z
<Jos Collin> Not sure why rbd mirror does that. Would that be a better option for cephfs-mirror too?
2024-07-09T06:46:07.741Z
<Dhairya Parmar> @Venky Shankar
2024-07-09T07:12:12.233Z
<Venky Shankar> @Jos Collin - <https://tracker.ceph.com/issues/66520#note-2>
2024-07-09T07:56:15.138Z
<Jos Collin> @Venky Shankar ok, so what do we do for the quincy QA batches? Just use --filter-out and run fewer jobs (resulting in 10 to 35 jobs)?
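For context, trimming a suite with --filter-out looks roughly like the sketch below. This is only an illustration: the suite name, branch, and filter expression are placeholders loosely based on this conversation, not the exact command used for these batches.

```
# Hypothetical teuthology-suite invocation that filters facets out of the
# fs suite for a quincy backport branch; the flags shown exist in
# teuthology-suite, but the filter string and priority are placeholders.
teuthology-suite \
    --suite fs \
    --ceph wip-jcollin-testing-20240624.084813-quincy \
    --machine-type smithi \
    --filter-out "<facets-not-supported-on-quincy>" \
    --priority 100
```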
2024-07-09T07:56:58.839Z
<Venky Shankar> can the user use a new fs for now?
2024-07-09T07:57:03.311Z
<Venky Shankar> to make progress?
2024-07-09T07:57:17.672Z
<Venky Shankar> I mean it's data loss, but the fs is in a really bad state
2024-07-09T07:57:58.002Z
<Venky Shankar> they could follow the disaster recovery steps again and see if things improve
2024-07-09T07:58:22.132Z
<Venky Shankar> that's sub-optimal
2024-07-09T07:58:23.065Z
<Venky Shankar> 😕
2024-07-09T07:58:38.295Z
<Venky Shankar> so, max 35 jobs after filtering out?
2024-07-09T07:58:43.146Z
<Venky Shankar> in whole fs suite
2024-07-09T07:58:43.384Z
<Venky Shankar> ?
2024-07-09T07:58:58.517Z
<Jos Collin> Let me get those numbers for each batch.
2024-07-09T08:01:19.565Z
<Jos Collin> @Venky Shankar Yes it's around 32 to 37 jobs in 4 different batches.
2024-07-09T08:01:47.800Z
<Venky Shankar> let me see what can be done
2024-07-09T08:02:14.659Z
<Venky Shankar> merging PRs with such a tiny set of runs is asking for trouble
2024-07-09T08:02:34.223Z
<Jos Collin> <https://pulpito.ceph.com/?branch=wip-jcollin-testing-20240624.084813-quincy>, <https://pulpito.ceph.com/?branch=wip-jcollin-testing-20240621.095223-quincy>, <https://pulpito.ceph.com/?branch=wip-jcollin-testing-20240621.062340-quincy>, <https://pulpito.ceph.com/?branch=wip-jcollin-testing-20240621.040312-quincy>. These are those runs.
2024-07-09T08:02:55.226Z
<Jos Collin> Yeah, agree
2024-07-09T08:04:09.166Z
<Venky Shankar> kk
2024-07-09T08:04:11.181Z
<Venky Shankar> give me some time
2024-07-09T08:04:49.330Z
<Jos Collin> ok
2024-07-09T08:37:14.207Z
<Xiubo Li> @Erich Weiler I raised a PR to fix this <https://github.com/ceph/ceph/pull/58474>, could you help verify it?
2024-07-09T09:16:03.695Z
<Dhairya Parmar> disaster recovery steps? you mean journal reset?
2024-07-09T09:16:42.639Z
<Dhairya Parmar> I scrolled through <https://docs.ceph.com/en/latest/cephfs/disaster-recovery/> but I'm confused about which steps to recommend
2024-07-09T09:22:57.030Z
<Venky Shankar> this one: <https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts>
2024-07-09T09:23:30.815Z
<Venky Shankar> (the whole process)
2024-07-09T09:24:48.515Z
<Dhairya Parmar> all the steps mentioned on the page?
2024-07-09T09:25:53.711Z
<Venky Shankar> I think the session reset can be avoided. It's hard to recommend which steps to skip without understanding the underlying issue, which in this case is pretty worrisome.
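For reference, the journal-recovery portion of that expert procedure looks roughly like the sketch below; rank 0 and the filesystem name are placeholders, and the disaster-recovery-experts documentation linked above remains the authoritative sequence.

```
# Rough sketch of the journal-recovery steps from the disaster-recovery-experts
# guide; <fs_name> and the backup path are placeholders.
cephfs-journal-tool --rank=<fs_name>:0 journal export backup.bin
cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries summary
cephfs-journal-tool --rank=<fs_name>:0 journal reset
# The session-table reset (cephfs-table-tool all reset session) is the step
# suggested above as the one that could potentially be skipped.
```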
2024-07-09T09:26:56.636Z
<Dhairya Parmar> then it mostly looks to me like a journal reset
2024-07-09T09:27:00.994Z
<Dhairya Parmar> which they already did
2024-07-09T09:27:51.416Z
<Venky Shankar> sorry, I don't understand the problem well enough. It's a new MDS crash and the details are really thin.
2024-07-09T09:44:45.927Z
<Dhairya Parmar> okay
2024-07-09T10:48:34.699Z
<Milind Changire> @Kotresh H R fyi - tracker for the config_notify block issue - <https://tracker.ceph.com/issues/66009>
2024-07-09T11:07:38.091Z
<Kotresh H R> Sorry, my bad, it does use a per-module finisher thread
<https://tracker.ceph.com/issues/66009#note-28>
2024-07-09T11:10:55.999Z
<Kotresh H R> And the GIL is taken before calling the module's config_notify method
2024-07-09T11:11:41.130Z
<Milind Changire> yes
2024-07-09T11:47:13.356Z
<Jos Collin> @Venky Shankar Shall we merge this PR <https://github.com/ceph/ceph/pull/56193>, so that QA doesn't keep hitting the repeated mirror failures? Anything blocking it?
2024-07-09T11:47:42.603Z
<Venky Shankar> yeh soon
2024-07-09T11:47:48.345Z
<Venky Shankar> tests have passed
2024-07-09T11:47:55.817Z
<Venky Shankar> but I need to prepare the run wiki
2024-07-09T11:48:03.344Z
<Venky Shankar> so, it's going to get merged soon
2024-07-09T11:48:07.536Z
<Venky Shankar> probably tomorrow
2024-07-09T11:49:40.343Z
<Jos Collin> ok 👍
2024-07-09T12:40:00.865Z
<Ivan Clayson> Hi, I am the original poster regarding this error and I'm more than happy to share further details of this failure. We got the filesystem up and running following a journal reset and a scrub, followed by a repair scrub of where metadata damage was listed (all the damage was in one directory). We then left the filesystem to be backed up to over the weekend, where the first attempt led to the first log here (<https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz>). I thought this was due to the damage not being fully repaired, or the whole directory itself being an issue, so I repeated the journal reset steps and effectively what we did last time, but that seems to have led to this issue. My thinking was that I could just clear the problematic directory of all the data on the filesystem and then just sync from the live FS. Do you think we'll just need to start over with this filesystem then?
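For reference, a scrub-and-repair pass of the kind described above would look roughly like this; the filesystem name and damaged path are placeholders, not the exact command history from that cluster.

```
# Illustrative scrub/repair commands; <fs_name> and <damaged_dir> are placeholders.
ceph tell mds.<fs_name>:0 damage ls
ceph tell mds.<fs_name>:0 scrub start /<damaged_dir> recursive,repair
ceph tell mds.<fs_name>:0 scrub status
```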
2024-07-09T12:42:50.360Z
<Dhairya Parmar> welcome @Ivan Clayson, pinging @Venky Shankar @Xiubo Li for their insights
2024-07-09T15:41:34.479Z
<Erich Weiler> @Xiubo Li Sure - just let me know what you need me to do. Are you asking me to test it? Is there a build close to Reef 18.2.1-2 that I can try? And I've only been using RPMs from the RHEL/CentOS repos; is there another way to get updated RPMs?
2024-07-09T15:43:40.230Z
<Erich Weiler> The OS I am using is RHEL 9.3
2024-07-09T15:44:43.840Z
<Erich Weiler> I see this in my repos area:

```[root@pr-md-01 yum.repos.d]# cat CentOS-Ceph-Reef.repo
# CentOS-Ceph-Reef.repo
#
# Please see https://wiki.centos.org/SpecialInterestGroup/Storage for more
# information

[centos-ceph-reef]
name=CentOS-$stream - Ceph Reef
metalink=https://mirrors.centos.org/metalink?repo=centos-storage-sig-ceph-reef-9-stream&arch=$basearch
#baseurl=http://mirror.stream.centos.org/SIGs/$stream/$basearch/storage-ceph-reef/
gpgcheck=0
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-Storage

[centos-ceph-reef-test]
name=CentOS-$stream - Ceph Reef Testing
baseurl=https://buildlogs.centos.org/centos/$stream/storage/$basearch/ceph-reef/
gpgcheck=0
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-Storage

[centos-ceph-reef-source]
name=CentOS-$stream - Ceph Reef Source
baseurl=http://mirror.stream.centos.org/SIGs/$stream/storage/source/ceph-reef/
gpgcheck=0
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-Storage```
2024-07-09T15:45:27.104Z
<Erich Weiler> Will the fix appear in the `[centos-ceph-reef-test]` repository?
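For reference, one way to watch for and pull a fixed build from the SIG test repo would be something like the following; the package list is illustrative, and whether the fix lands in that repo is not confirmed here.

```
# Illustrative check for updated Ceph packages in the centos-ceph-reef-test repo;
# the package names are examples, and availability of the fix there is an assumption.
dnf --enablerepo=centos-ceph-reef-test check-update 'ceph*'
dnf --enablerepo=centos-ceph-reef-test update ceph-common ceph-mds
```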
