ceph - cephfs - 2025-01-13

2025-01-13T00:53:03.744Z
<Bailey Allison> so to give an update on the above.....still going

doing some napkin math with my limited understanding of the inner workings of cephfs, plus the fact that we're deep into swap and it's all we've got, it seems like it might take forever
2025-01-13T00:53:55.408Z
<Bailey Allison> after increasing debug logs on an mds and watching the reconnect/rejoin process, it seems like rejoin relies on the session map to get into rejoin_scour_survivor_replicas
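For reference, a minimal sketch of raising MDS debug logging and following the rejoin progress; the daemon name is a placeholder and the log path may differ under cephadm (logs live under /var/log/ceph/<fsid>/ there):

```
# Bump MDS debug verbosity at runtime (20 is very chatty)
ceph tell mds.<name> config set debug_mds 20
ceph tell mds.<name> config set debug_ms 1

# Follow rejoin-related lines in that daemon's log
tail -f /var/log/ceph/ceph-mds.<name>.log | grep -i rejoin
```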
2025-01-13T00:54:31.075Z
<Bailey Allison> is there any value at this point in resetting the session map? is that something dangerous to do, or what are the possible side effects of that?
2025-01-13T00:54:59.681Z
<Bailey Allison> would that also require an mds restart for it to realize it's been reset?
2025-01-13T13:48:23.777Z
<Austin Axworthy> To add to Bailey above, the replay/resolve completes in about 12 hours (known timeframe), whereas the rejoin has been running for approximately 3 days. The thought here is whether we need to restart the MDS or not when resetting the sessions, and if this would help us skip or lessen the required time in rejoin. Specifically the rejoin_scour_survivor_replicas, which is the task it has been in for most of the rejoin process. The idea is to possibly use the following command to reset them.

```cephfs-table-tool all reset session```
But we were wondering what this could possibly affect, and whether there are any further steps required before or after the MDS becomes active to ensure consistency?
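A hedged sketch of what that could look like here; the filesystem name and backup paths are placeholders, and the journal export is just a precaution, not a documented prerequisite:

```
# With the MDS daemons stopped, keep a safety copy of each rank's journal
cephfs-journal-tool --rank=<fs_name>:0 journal export /root/mds0.journal.bin
cephfs-journal-tool --rank=<fs_name>:1 journal export /root/mds1.journal.bin

# Then reset the session table for all ranks
cephfs-table-tool all reset session
```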
2025-01-13T14:04:04.379Z
<Venky Shankar> @Austin Axworthy I don't think resetting the session would help in shortening the rejoin time (or resolve).
2025-01-13T14:06:21.519Z
<Venky Shankar> From what I'm seeing from @Dan van der Ster's update in the tracker -- and I was looking at the relevant code today -- there is a (recursive) iteration of directory fragments in the metadata cache, and if that's running for over 12 hrs (resolve state), then either the number of items is really huge or, more likely, there is a bug in there which causes such long iterations.
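One way to get a rough sense of how many items that iteration has to walk is via the admin socket on the MDS host; the daemon name is a placeholder and the counter sections may vary by release:

```
# Cache memory usage and item counts for the MDS
ceph daemon mds.<name> cache status
ceph daemon mds.<name> perf dump mds_mem
```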
2025-01-13T14:07:59.100Z
<Venky Shankar> I will have to look at it closely.
2025-01-13T14:24:08.087Z
<Austin Axworthy> Just wanted to throw a breakdown of each MDS and the time it took in each state:

Rank 0
• Replay - 3.5 hours
• Resolve - 7 hours
• Rejoin - in progress - approximately 72+ hours
Rank 1
• Replay - 4 hours
• Resolve - 10 hours
• Rejoin - in progress - approximately 72+ hours
2025-01-13T14:26:48.564Z
<Austin Axworthy> As we do not think resetting the sessions would help here, I wanted to ask if the Advanced: Metadata repair tools docs would help us, specifically recover_dentries, followed by the journal truncation and the subsequent steps?

We know there is an associated risk with this process, and we are not at this point yet. Just wanted to get your thoughts
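For reference, the rough shape of that documented sequence (per rank, with the filesystem offline); <fs_name> is a placeholder and exact flags can vary by release, so this is a sketch of the docs' flow rather than a recommendation:

```
# Recover what can be recovered from each rank's journal into the backing store,
# then truncate that journal
cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries summary
cephfs-journal-tool --rank=<fs_name>:0 journal reset
cephfs-journal-tool --rank=<fs_name>:1 event recover_dentries summary
cephfs-journal-tool --rank=<fs_name>:1 journal reset

# The same procedure also resets the session table before bringing MDS daemons back
cephfs-table-tool all reset session
```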
2025-01-13T14:43:44.450Z
<Bailey Allison> to clarify too
2025-01-13T14:43:52.851Z
<Bailey Allison> currently stuck in rejoin state past resolve
2025-01-13T15:20:56.444Z
<Venky Shankar> @Austin Axworthy yeh -- the recover_dentries+journal reset trick has been used in the past to work around huge journal replay and rejoin times.
2025-01-13T15:21:28.818Z
<Venky Shankar> (past == last week, btw)
2025-01-13T15:35:18.100Z
<Bailey Allison> true, so just to confirm, journal reset is good to get around a long rejoin specifically?
2025-01-13T15:58:11.493Z
<Venky Shankar> @Bailey Allison Yes -- but note, for recover_dentries, you'd need the fs to be offline...
2025-01-13T15:58:31.009Z
<Bailey Allison> yes okay thanks venky
2025-01-13T15:58:41.308Z
<Bailey Allison> that'd be okay too
2025-01-13T15:58:58.333Z
<Bailey Allison> do we need to set it down too, or just have each daemon offline?
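Either way of taking the filesystem offline would look roughly like this; <fs_name> is a placeholder:

```
# Mark the filesystem down
ceph fs set <fs_name> down true
# or fail the filesystem, which also fails its MDS daemons
ceph fs fail <fs_name>

# Confirm no rank is active before running the offline repair tools
ceph fs status <fs_name>
```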
2025-01-13T16:51:03.516Z
<Mark Nelson (nhm)> @Venky Shankar Does that time spent in create_subtree_map  match your expectations with that recursive iteration of directory fragments?
2025-01-13T19:55:29.657Z
<Bailey Allison> the client is setting up another system for us with about 1TB of RAM that we're going to move the MDS daemons to, so they're no longer using about 200GB of swap each; hopefully that should make things much faster
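If the cluster is cephadm-managed (an assumption, not something confirmed in this thread), moving the MDS daemons could look roughly like this; host and filesystem names are placeholders:

```
# Add the new 1TB-RAM host to the orchestrator and pin the MDS placement to it
ceph orch host add <new-host>
ceph orch apply mds <fs_name> --placement="2 <new-host>"
```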
2025-01-13T20:06:00.489Z
<Bailey Allison> also going to set up logrotate to prune log files over a certain size so we can jack up debugging and still have some info from the logs too
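A minimal logrotate sketch for that; the path glob and size are guesses, and size-based rotation only takes effect as often as logrotate itself runs, so a more frequent cron entry may be needed:

```
# /etc/logrotate.d/ceph-mds-debug (illustrative)
/var/log/ceph/*/ceph-mds.*.log {
    size 10G
    rotate 2
    compress
    missingok
    copytruncate
}
```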
