2025-01-10T02:57:20.525Z | <Austin Axworthy> Looks like we may be running into <https://tracker.ceph.com/issues/64717>
The resolve stage has been working on recalc_auth_bits for about 2.5 hours now without going laggy. This is the longest stretch yet since raising the beacon grace very high.
It looks like a waiting game; it is progressing so far and hopefully does not crash. |
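For context, the "beacon grace" knob being raised here is `mds_beacon_grace`; a minimal sketch of bumping it so the mon tolerates a long resolve without marking the MDS laggy (the 3600s value and config scopes are examples, not what was actually used):

```bash
# Raise the beacon grace so the mon does not fail over an MDS that is busy
# in resolve/rejoin (3600s is an example value; restore the default once
# recovery finishes).
ceph config set mds mds_beacon_grace 3600
ceph config set mon mds_beacon_grace 3600   # the mon also consults this when deciding "laggy"

# Verify, and keep an eye on recovery progress
ceph config get mds mds_beacon_grace
ceph fs status
```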
2025-01-10T12:28:18.400Z | <Dhairya Parmar> @Venky Shankar closed this tracker, added notes <https://tracker.ceph.com/issues/69315>. |
2025-01-10T12:28:57.424Z | <Dhairya Parmar> can you take a look at <https://tracker.ceph.com/issues/67673#note-2>? |
2025-01-10T12:39:23.035Z | <Venky Shankar> checking |
2025-01-10T13:55:42.731Z | <Patrick Donnelly> @Venky Shankar another thing to note for <https://github.com/ceph/ceph/pull/60746> is that ceph.dir.casesensitive=0 on a subvolumegroup is inherited by all of its subvolumes, but it must be (a) set just after creating the subvolumegroup, and (b) it affects all metadata / subvolume names |
2025-01-10T13:55:58.512Z | <Patrick Donnelly> I think that's okay but wanted to point it out. |
2025-01-10T13:56:27.687Z | <Patrick Donnelly> whereas for a subvolume only the data directory (uuid) has the vxattr set, not the metadata |
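A rough sketch of the ordering Patrick describes, with the vxattr applied immediately after group creation so later subvolumes inherit it (volume/group/mount names are placeholders, and `ceph.dir.casesensitive` comes from the PR above, so this assumes a build that carries it):

```bash
# Create the group, then set the vxattr on its directory right away
ceph fs subvolumegroup create cephfs mygroup                 # placeholder names
GROUP_PATH=$(ceph fs subvolumegroup getpath cephfs mygroup)

# Inherited by everything created under the group afterwards,
# including metadata / subvolume names (per the note above)
setfattr -n ceph.dir.casesensitive -v 0 "/mnt/cephfs${GROUP_PATH}"

# Subvolumes created from here on pick up the setting
ceph fs subvolume create cephfs mysubvol --group_name mygroup
```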
2025-01-10T13:57:01.465Z | <Venky Shankar> kk. could not get to this today. |
2025-01-10T13:57:09.497Z | <Venky Shankar> It's on my list though! |
2025-01-10T14:14:55.950Z | <Patrick Donnelly> no worries |
2025-01-10T21:16:37.920Z | <Bailey Allison> so an update on this one..... we've now got both mds ranks 0 and 1 in the rejoin state, which they have been in for ~12 hours now. doing some research and finding a few different sources recommending enabling mds_wipe_sessions and deleting the mdsx_openfiles.x objects from the metadata pool. don't want to restart the mds to apply that config change currently because it could potentially take it down for another day, and can't inject into the daemons either. would there be any value in deleting the mdsx_openfiles.x objects from the metadata pool, or would that also require an mds restart to even do anything? |
2025-01-10T21:17:35.205Z | <Bailey Allison> also version is 17.2.7 from public repos |
2025-01-10T21:25:55.923Z | <Bailey Allison> other thing is, of course, a lot of those posts and related bugs are a couple of years old, so curious whether that would even still be applicable to do on quincy |
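For reference, a sketch of what those older posts describe (pool and object names are placeholders; the openfiles table is only a recovery-speed hint that the MDS reads early in startup/rejoin, so whether removing it mid-rejoin changes anything is exactly the open question above, and `mds_wipe_sessions`, if it still exists on 17.2.7, drops client sessions and is normally only picked up on restart):

```bash
# List the per-rank openfiles objects in the metadata pool
# ("cephfs_metadata" is a placeholder pool name)
rados -p cephfs_metadata ls | grep -E '^mds[0-9]+_openfiles'

# Back an object up before touching it
rados -p cephfs_metadata get mds0_openfiles.0 mds0_openfiles.0.bak

# Remove it (the MDS rebuilds this hint on its own)
rados -p cephfs_metadata rm mds0_openfiles.0

# The session-wipe option, read by the MDS at startup
ceph config set mds mds_wipe_sessions true
```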
2025-01-10T22:30:19.383Z | <Dan van der Ster> @Bailey Allison are they still in rejoin? what does perf top show on the ceph-mds processes? |
2025-01-10T22:31:14.803Z | <Bailey Allison> they are still in rejoin, give me a few mins to go check with the team (currently on vacation :) ) |
2025-01-10T22:54:12.708Z | <Bailey Allison> I was able to grab a screenshot from a team member that's connected, going to see if I can get in myself to get better info: https://files.slack.com/files-pri/T1HG3J90S-F088W78MUV6/download/image.png |
2025-01-10T22:58:13.711Z | <Dan van der Ster> yeah that doesn't look too useful. Is the MDS CPU idle?
if not, we need `perf top -p $(pidof ceph-mds)` |
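If the interactive `perf top` view is hard to read or copy over a remote session, a short recorded profile can be dumped to a text file instead (the 30s window is arbitrary; if two ceph-mds daemons run on the host, pass a single pid rather than the whole `pidof` output):

```bash
# Sample one MDS for 30 seconds with call graphs
perf record -g -p $(pidof ceph-mds) -- sleep 30

# Write a plain-text report that can be copied out of the session
perf report --stdio > mds-perf-report.txt

# perf top itself also has a text mode
perf top --stdio -p $(pidof ceph-mds)
```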
2025-01-10T23:00:33.499Z | <Bailey Allison> working on that now I just got connected |
2025-01-10T23:01:25.326Z | <Bailey Allison> mds cpu not idle, around 50% for both daemons |
2025-01-10T23:06:36.227Z | <Dan van der Ster> did you get the perf top -p <pid> output? |
2025-01-10T23:08:23.484Z | <Bailey Allison> waiting for it to load, the host is chugging |
2025-01-10T23:13:57.311Z | <Bailey Allison> finally loaded, sadly I can't copy the text out but I can get a screenshot from the remote session, don't see anything in there besides the fourth line: https://files.slack.com/files-pri/T1HG3J90S-F0884K7HSTX/download/image.png |
2025-01-10T23:15:18.204Z | <Dan van der Ster> that's really weird.
I wonder if the wallclock profiler will show anything... if you can even compile that on the host
<https://github.com/markhpc/uwpmp> |
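A very rough sketch of building and running that wallclock profiler; the build step, binary name, and flags here are assumptions recalled from the project's README, so check the repo and `--help` before relying on them:

```bash
git clone https://github.com/markhpc/uwpmp.git
cd uwpmp
make                      # build per the repo README (assumed)

# Attach to the MDS and collect wall-clock samples
# (binary name and -p/-n flags are assumptions -- verify with --help)
./unwindpmp -p $(pidof ceph-mds) -n 100 > mds-wallclock.txt
```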
2025-01-10T23:19:11.940Z | <Dan van der Ster> @Bailey Allison any kernel errors etc messages in dmesg ? that looks like nasty swap issues |
2025-01-10T23:22:39.857Z | <Bailey Allison> it 100% is, we're about 200 GB into swap currently, but nothing much related in dmesg sadly |
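A quick way to see how much of that swap belongs to the MDS daemons themselves versus everything else on the host:

```bash
# Per-daemon resident and swapped memory
for pid in $(pidof ceph-mds); do
    echo "== ceph-mds pid $pid =="
    grep -E 'VmRSS|VmSwap' /proc/"$pid"/status
done

# Host-wide picture: free memory, swap devices, and swap-in/out rate
free -h
swapon --show
vmstat 1 5
```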
2025-01-10T23:23:04.839Z | <Bailey Allison> we'll try to get profiler up at least |
2025-01-10T23:27:39.087Z | <Bailey Allison> checking deeper on the only ceph-mds entry doing anything in perf top shows this (sorry for not pasting the text of these): https://files.slack.com/files-pri/T1HG3J90S-F088A16A1HA/download/image.png |
2025-01-10T23:27:59.066Z | <Bailey Allison> I'm gonna see if we can free up some memory on the node too..... |
2025-01-10T23:28:10.245Z | <Dan van der Ster> i'm just trying to get at least the thread name for that process -- `perf` must not be working on that host...
does `top -H` give a clue? |
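If perf stays unusable on that host, thread names and states can also be pulled straight from ps/procfs:

```bash
# Per-thread name, state, and CPU for each ceph-mds, busiest threads first
for pid in $(pidof ceph-mds); do
    echo "== ceph-mds pid $pid =="
    ps -T -p "$pid" -o spid,comm,state,%cpu --sort=-%cpu | head -20
done
```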
2025-01-10T23:28:52.526Z | <Dan van der Ster> (you had symbols earlier on the ceph-mds -- it's weird they are not working now either) |
2025-01-10T23:34:58.279Z | <Bailey Allison> filtered by %MEM too, without it ctdb was spamming output, going to stop that service too: https://files.slack.com/files-pri/T1HG3J90S-F088A1VBM8C/download/image.png |
2025-01-10T23:37:27.633Z | <Dan van der Ster> both mds's in ms_dispatch, both in D state.
and very low Avail mem...
I think this is mostly caused by too little memory now.
Try your best to free up memory until there's more space for the mds to malloc faster: https://files.slack.com/files-pri/T1HG3J90S-F088WASEFRN/download/image.png |
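A few hedged options for clawing back memory without restarting the stuck daemons (the cache value is an example, and a lower cache target only helps once the MDS is responsive enough to trim):

```bash
# Shrink the MDS cache target so it trims once it can (8 GiB is an example)
ceph config set mds mds_cache_memory_limit 8589934592

# Stop non-essential services competing for RAM (ctdb was mentioned above)
systemctl stop ctdb

# Drop clean page/dentry caches -- only frees reclaimable memory, but safe
sync; echo 3 > /proc/sys/vm/drop_caches

# Watch available memory and swap-in rate while rejoin continues
free -h
vmstat 1
```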
2025-01-10T23:45:29.279Z | <Bailey Allison> that makes sense to me, thank you very much for the help friend!
now, I'm off to empty water from the titanic with a bucket..... |