2024-07-09T05:40:38.608Z | <Venky Shankar> Hey @Jos Collin |
2024-07-09T05:43:50.431Z | <Jos Collin> RBD Mirror doesn't use Throttle.h. It uses its own Throttler.h `ceph/src/tools/rbd_mirror/Throttler.h` |
2024-07-09T05:44:13.331Z | <Jos Collin> hi @Venky Shankar |
2024-07-09T05:45:18.689Z | <Venky Shankar> I'll let you know soon how to proceed with the quincy runs for this backport |
2024-07-09T05:45:24.280Z | <Venky Shankar> since it's a bit tricky |
2024-07-09T05:45:52.413Z | <Jos Collin> ok, I'll wait for your instructions. |
2024-07-09T05:46:33.818Z | <Jos Collin> @Venky Shankar RBD Mirror doesn't use Throttle.h. It uses its own Throttler.h `ceph/src/tools/rbd_mirror/Throttler.h` |
2024-07-09T05:59:23.517Z | <Jos Collin> Not sure why rbd-mirror does that. Would that be a better option for cephfs-mirror too? |
2024-07-09T06:46:07.741Z | <Dhairya Parmar> @Venky Shankar |
2024-07-09T07:12:12.233Z | <Venky Shankar> @Jos Collin - <https://tracker.ceph.com/issues/66520#note-2> |
2024-07-09T07:56:15.138Z | <Jos Collin> @Venky Shankar ok, so what do we do for the quincy QA batches? Just use filter-out and run the smaller set of jobs (resulting in 10 to 35 jobs)? |
2024-07-09T07:56:58.839Z | <Venky Shankar> can the user use a new fs for now? |
2024-07-09T07:57:03.311Z | <Venky Shankar> to make progress? |
2024-07-09T07:57:17.672Z | <Venky Shankar> I mean it's data loss, but the fs is in a really bad state |
2024-07-09T07:57:58.002Z | <Venky Shankar> they could follow the disaster recovery steps again and see if things improve |
2024-07-09T07:58:22.132Z | <Venky Shankar> that's sub-optimal |
2024-07-09T07:58:23.065Z | <Venky Shankar> 😕 |
2024-07-09T07:58:38.295Z | <Venky Shankar> so, max 35 jobs after filtering out? |
2024-07-09T07:58:43.146Z | <Venky Shankar> in the whole fs suite |
2024-07-09T07:58:43.384Z | <Venky Shankar> ? |
2024-07-09T07:58:58.517Z | <Jos Collin> Let me get those numbers for each batch. |
2024-07-09T08:01:19.565Z | <Jos Collin> @Venky Shankar Yes it's around 32 to 37 jobs in 4 different batches. |
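(The filter-out approach discussed above would be an invocation roughly along these lines; the branch name is taken from one of the batches listed below, while the suite, machine type and filter string are illustrative placeholders.)
```
# illustrative teuthology-suite invocation for a reduced quincy fs run;
# the --filter-out value is a placeholder, not the actual fragments excluded here
teuthology-suite --suite fs \
  --ceph wip-jcollin-testing-20240624.084813-quincy \
  --machine-type smithi \
  --filter-out "fragment-to-exclude"
```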
2024-07-09T08:01:47.800Z | <Venky Shankar> let me see what can be done |
2024-07-09T08:02:14.659Z | <Venky Shankar> merging PRs with such a tiny set of runs is asking for trouble |
2024-07-09T08:02:34.223Z | <Jos Collin> <https://pulpito.ceph.com/?branch=wip-jcollin-testing-20240624.084813-quincy>, <https://pulpito.ceph.com/?branch=wip-jcollin-testing-20240621.095223-quincy>, <https://pulpito.ceph.com/?branch=wip-jcollin-testing-20240621.062340-quincy>, <https://pulpito.ceph.com/?branch=wip-jcollin-testing-20240621.040312-quincy>. These are those runs. |
2024-07-09T08:02:55.226Z | <Jos Collin> Yeah, agree |
2024-07-09T08:04:09.166Z | <Venky Shankar> kk |
2024-07-09T08:04:11.181Z | <Venky Shankar> give me some time |
2024-07-09T08:04:49.330Z | <Jos Collin> ok |
2024-07-09T08:37:14.207Z | <Xiubo Li> @Erich Weiler I raised a PR to fix this <https://github.com/ceph/ceph/pull/58474>, could you help verify it? |
2024-07-09T09:16:03.695Z | <Dhairya Parmar> disaster recovery steps? you mean journal reset? |
2024-07-09T09:16:42.639Z | <Dhairya Parmar> I scrolled through <https://docs.ceph.com/en/latest/cephfs/disaster-recovery/> but I'm confused about which steps to recommend |
2024-07-09T09:22:57.030Z | <Venky Shankar> this one: <https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts> |
2024-07-09T09:23:30.815Z | <Venky Shankar> (the whole process) |
2024-07-09T09:24:48.515Z | <Dhairya Parmar> all the steps mentioned on the page? |
2024-07-09T09:25:53.711Z | <Venky Shankar> I think the session reset can be avoided. It's hard to recommend which steps to skip without understanding the underlying issue, which in this case is pretty worrisome. |
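(For reference, the sequence on that page comes down to roughly the commands below; this is an illustrative sketch only, the fs name and rank are placeholders, and the session-table reset is the step Venky suggests may be skippable.)
```
# back up the journal before touching anything
cephfs-journal-tool --rank=<fs_name>:0 journal export backup.bin

# recover whatever dentries can be salvaged from the journal, then truncate it
cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries summary
cephfs-journal-tool --rank=<fs_name>:0 journal reset

# wipe the session table (the step that can likely be skipped here)
cephfs-table-tool all reset session

# reset the MDS map so the filesystem can be brought back up
ceph fs reset <fs_name> --yes-i-really-mean-it
```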
2024-07-09T09:26:56.636Z | <Dhairya Parmar> then it mostly looks to me like a journal reset |
2024-07-09T09:27:00.994Z | <Dhairya Parmar> which they already did |
2024-07-09T09:27:51.416Z | <Venky Shankar> sorry, I don't understand the problem well enough. It's a new MDS crash and the details are really thin. |
2024-07-09T09:44:45.927Z | <Dhairya Parmar> okay |
2024-07-09T10:48:34.699Z | <Milind Changire> @Kotresh H R fyi - tracker for the config_notify block issue - <https://tracker.ceph.com/issues/66009> |
2024-07-09T11:07:38.091Z | <Kotresh H R> Sorry, my bad, it does use the per-module finisher thread
<https://tracker.ceph.com/issues/66009#note-28> |
2024-07-09T11:10:55.999Z | <Kotresh H R> And the GIL is taken before calling the module's config_notify method |
2024-07-09T11:11:41.130Z | <Milind Changire> yes |
2024-07-09T11:47:13.356Z | <Jos Collin> @Venky Shankar Shall we merge this PR <https://github.com/ceph/ceph/pull/56193>, so that QA won't keep hitting the repeated mirror failures? Anything blocking? |
2024-07-09T11:47:42.603Z | <Venky Shankar> yeh soon |
2024-07-09T11:47:48.345Z | <Venky Shankar> tests have passed |
2024-07-09T11:47:55.817Z | <Venky Shankar> but I need to prepare the run wiki |
2024-07-09T11:48:03.344Z | <Venky Shankar> so, it's going to get merged soon |
2024-07-09T11:48:07.536Z | <Venky Shankar> probably tomorrow |
2024-07-09T11:49:40.343Z | <Jos Collin> ok 👍 |
2024-07-09T12:40:00.865Z | <Ivan Clayson> Hi, I am the original poster with regard to this error and I'm more than happy to share further details of this failure. We got the filesystem up and running following a journal reset, then performing a scrub followed by a repair scrub of where the metadata damage was listed (all the damage was in 1 directory). We then left the filesystem to be backed up over the weekend, where the first attempt led to the first log here (<https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz>). I thought this was due to the damage not being fully repaired, or the whole directory itself being an issue, so I repeated the journal reset steps and effectively what we did last time, but that seems to have led to this issue. My thinking was that I could just clear the problematic directory of all its data on the filesystem and then just sync from the live FS. Do you think we'll just need to start over with this filesystem then? |
2024-07-09T12:42:50.360Z | <Dhairya Parmar> welcome @Ivan Clayson, pinging @Venky Shankar @Xiubo Li for their insights |
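(The scrub-with-repair step Ivan describes would typically look something like the following; this is an illustrative sketch, with the fs name and the damaged directory path as placeholders.)
```
# list the damage entries the MDS has recorded, then scrub/repair the affected directory;
# <fs_name> and the path are placeholders
ceph tell mds.<fs_name>:0 damage ls
ceph tell mds.<fs_name>:0 scrub start /path/to/damaged/dir recursive,repair
ceph tell mds.<fs_name>:0 scrub status
```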
2024-07-09T15:41:34.479Z | <Erich Weiler> @Xiubo Li Sure - just let me know what you need me to do. Are you asking me to test it? Is there a build close to Reef 18.2.1-2 that I can try? And, I’ve only been using rpms from the RHEL/CentOS repos, is there another way to get updated rpms? |
2024-07-09T15:43:40.230Z | <Erich Weiler> The OS I am using is RHEL 9.3 |
2024-07-09T15:44:43.840Z | <Erich Weiler> I see this in my repos area:
```[root@pr-md-01 yum.repos.d]# cat CentOS-Ceph-Reef.repo
# CentOS-Ceph-Reef.repo
#
# Please see https://wiki.centos.org/SpecialInterestGroup/Storage for more
# information
[centos-ceph-reef]
name=CentOS-$stream - Ceph Reef
metalink=https://mirrors.centos.org/metalink?repo=centos-storage-sig-ceph-reef-9-stream&arch=$basearch
#baseurl=http://mirror.stream.centos.org/SIGs/$stream/$basearch/storage-ceph-reef/
gpgcheck=0
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-Storage
[centos-ceph-reef-test]
name=CentOS-$stream - Ceph Reef Testing
baseurl=https://buildlogs.centos.org/centos/$stream/storage/$basearch/ceph-reef/
gpgcheck=0
enabled=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-Storage
[centos-ceph-reef-source]
name=CentOS-$stream - Ceph Reef Source
baseurl=http://mirror.stream.centos.org/SIGs/$stream/storage/source/ceph-reef/
gpgcheck=0
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-Storage``` |
2024-07-09T15:45:27.104Z | <Erich Weiler> Will the fix appear in the `[centos-ceph-reef-test]` repository? |
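(If a build with the fix does land in that repo, pulling it in would look roughly like this; the repo id matches the file above, which already has it enabled, and the package glob is illustrative.)
```
# the centos-ceph-reef-test repo is already enabled=1 in the file above;
# --enablerepo is shown only to make the source explicit
dnf --enablerepo=centos-ceph-reef-test update 'ceph*'
```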