ceph - cephfs - 2024-06-18

2024-06-18T05:14:17.734Z
<Kotresh H R> checking
2024-06-18T05:55:12.399Z
<Jos Collin> @Leonid Usov @Venky Shankar Could you please approve <https://github.com/ceph/ceph/pull/56701>. The run <https://pulpito.ceph.com/leonidus-2024-06-16_14:17:14-fs-wip-leonidus-testing-20240616.070940-reef-distro-default-smithi/> looks good.
2024-06-18T08:53:13.855Z
<Rishabh Dave> @Kotresh H R any opinion?
2024-06-18T10:12:35.311Z
<Rishabh Dave> @Venky Shankar There is only one review comment on the `clone stats` PR that needs fixing, the one where we want the destination path to point to the subvol UUID dir instead of the subvol base dir. I've fixed it here - <https://github.com/ceph/ceph/pull/54620/commits/b6ebc933c496108b8bc8825e2b75b5ab1c1fe4a2>. If possible, can you please take a quick look and let me know if it looks okay?

Since, according to last Friday's conversation, we're looking forward to the final QA run and merge after this, if this commit looks okay I'll proceed to do a few runs with teuthology to ensure that the tests on the PR don't reveal any more race conditions.
2024-06-18T10:13:06.707Z
<Venky Shankar> I have to get some backport work done
2024-06-18T10:13:16.474Z
<Venky Shankar> which is urgent, so the review would have to wait till tomorrow
2024-06-18T10:13:39.839Z
<Venky Shankar> and the qa bits need review (I haven't looked at those closely)
2024-06-18T10:13:54.184Z
<Venky Shankar> let's not call it final QA runs yet šŸ™‚
2024-06-18T10:16:43.850Z
<Rishabh Dave> i meant i can run just the tests added on this PR a few times (because they've failed in the past due to race conditions), so that we can move a step towards the final QA run.
2024-06-18T10:17:05.560Z
<Venky Shankar> kk, you can continue running tests
2024-06-18T10:17:10.655Z
<Venky Shankar> no problem
2024-06-18T10:17:20.056Z
<Rishabh Dave> if you can review just that commit (14 LoC), it would be great. i'll wait otherwise. no issues.
2024-06-18T10:17:42.904Z
<Venky Shankar> Better to wait -- I'll review it tomorrow in totality.
2024-06-18T10:17:53.822Z
<Rishabh Dave> > you can continue running tests
yeah, i'm unsure if that new commit is how we want it.
2024-06-18T10:17:55.908Z
<Rishabh Dave> cool
2024-06-18T10:18:02.916Z
<Rishabh Dave> i'll wait.
2024-06-18T10:44:07.394Z
<Kotresh H R> In the code, does this refer to the subvolume being cloned?
2024-06-18T10:44:21.431Z
<Rishabh Dave> yes
2024-06-18T10:45:08.236Z
<Rishabh Dave> should we be concerned about the src subvol instead of the dst subvol?
2024-06-18T10:45:24.965Z
<Kotresh H R> Then I don't get Venky's comment in the PR. How can the clone be removed with the --retain-snapshots option?
2024-06-18T10:46:14.091Z
<Kotresh H R> But Venky's point is still valid in the sense that we should be relying on the rstats of the uuid directory rather than the subvolume base.
2024-06-18T10:46:27.876Z
<Kotresh H R> Subvolume base does contain other metadata files too.
2024-06-18T10:47:07.662Z
<Kotresh H R> For the src subvolume, aren't you fetching rstats from the snapshot?
2024-06-18T10:47:19.422Z
<Rishabh Dave> i am.
2024-06-18T10:47:57.354Z
<Venky Shankar> rstat() of base would report stats aggregated from everything under it, wouldn't it? That would include all snaps.
2024-06-18T10:47:59.131Z
<Kotresh H R> ok, in src too it's better to fetch the rstats of the subvolume root
2024-06-18T10:48:33.120Z
<Kotresh H R> >  rstat() of base would report stats aggregated from everything under it, isn't it? That would include all snaps..
Yes, right. So we should always rely on subvolume root
2024-06-18T10:48:38.874Z
<Kotresh H R> in all the places
2024-06-18T10:48:40.910Z
<Rishabh Dave> @Kotresh H R subvolume root meaning the UUID dir or base dir?
2024-06-18T10:48:53.942Z
<Kotresh H R> subvolume root = UUID
2024-06-18T10:49:09.512Z
<Rishabh Dave> got it, was confirming.
2024-06-18T10:49:54.995Z
<Kotresh H R> I don't recollect exactly, but the snapshot path varies between subvolume versions after the retain-snapshots feature was introduced
2024-06-18T10:50:23.172Z
<Kotresh H R> But I think the get snapshot path interface should take care of it
2024-06-18T10:50:35.367Z
<Rishabh Dave> on the dst subvol side, fetching rstats on the subvol base dir is fine because it is still under creation, right?

although i agree it's ideal and better to fetch from the dst subvol root rather than the dst subvol base.
2024-06-18T10:51:29.021Z
<Kotresh H R> > on the dst subvol side, fetching rstats on the subvol base dir is fine because it is still under creation, right?
For progress stats to be correct, both src and dst should rely on subvolume_root
2024-06-18T10:51:35.416Z
<Rishabh Dave> > But I think the get snapshot path interface should take care of it
yes, we are using that on the src subvol side, so it's taken care of.
2024-06-18T10:52:10.567Z
<Rishabh Dave> right.
2024-06-18T10:52:39.289Z
<Rishabh Dave> so on the src side we already do that, and i've added this commit to do it on the dst subvol side as well - <https://github.com/ceph/ceph/pull/54620/commits/b6ebc933c496108b8bc8825e2b75b5ab1c1fe4a2>
2024-06-18T10:55:54.034Z
<Kotresh H R> ok
2024-06-18T10:59:38.334Z
<Rishabh Dave> @Kotresh H R was there some reason to not allow `get_path` op while subvol is not in `STATE_COMPLETE` ?

i've allowed it now since we need to do the `get_path` op while the dst subvol is being created during cloning. <https://github.com/ceph/ceph/pull/54620/commits/b6ebc933c496108b8bc8825e2b75b5ab1c1fe4a2#diff-2afc20b5874dc7ad850a970bcba2be7246f04c94d6ab0a44e2140f8491feaf44R220>
2024-06-18T10:59:46.948Z
<Rishabh Dave> @Venky Shankar In the context of this comment <https://github.com/ceph/ceph/pull/54620#discussion_r1637695939>, `dst_path` should be the path to the subvolume UUID dir, so that `rsize` doesn't include snapshots of the subvolume?
2024-06-18T11:08:16.604Z
<Kotresh H R> get_path didn't make sense on a subvolume which is not complete yet
2024-06-18T11:08:46.038Z
<Kotresh H R> But isn't that for the external subvolume op?
2024-06-18T11:09:46.231Z
<Kotresh H R> Internally you can directly call the subvolume functions right ?
2024-06-18T11:22:50.054Z
<Rishabh Dave> internally, yes, but in this case i need to open the subvol and therefore need to use the right SubvolOp
2024-06-18T11:26:52.404Z
<Kotresh H R> ok
2024-06-18T13:30:47.155Z
<Dhairya Parmar> @Leonid Usov I was taking a look at this tracker <https://tracker.ceph.com/issues/66107> and wanted to understand what "roots" are in this log line: `sending ack q-map[v:(19:133) roots:1/0] to the leader 4294`. Reading the `QuiesceMap` code, it looks to me like the word "root" means a node of the tree, i.e. every node in the quiesce map is a representation of a quiesce state?
2024-06-18T13:44:56.385Z
<Leonid Usov> roots here are ā€œquiesce rootsā€. Those are paths into the FS that need to be quiesced, including everything below them - so, they are the roots of the quiesced trees
2024-06-18T13:45:32.353Z
<Leonid Usov> the notation above means that there was an acknowledgement carrying 1 active root and 0 inactive roots
2024-06-18T13:46:44.721Z
<Leonid Usov> since it's a diff_ack, we can conclude that the active root thatā€™s part of this ack is in a state that differs from what the quiesce db assumes at version 19:133
2024-06-18T13:47:33.799Z
<Leonid Usov> usually, this means that the root got QUIESCED by this agent, while the DB still believes that the root is QUIESCING
2024-06-18T13:51:58.221Z
<Leonid Usov> however, it could (theoretically) be the opposite: an agent may report a root as QUIESCING when the DB believes that itā€™s QUIESCED.
This transition is valid DB-wise, though it is not supported on the agent currently.
2024-06-18T13:52:58.035Z
<Leonid Usov> From quiescedb.h:
```template <class CharT, class Traits>
static std::basic_ostream<CharT, Traits>&
operator<<(std::basic_ostream<CharT, Traits>& os, const QuiesceMap& map)
{

  size_t active = 0, inactive = 0;

  for (auto&& [_, r] : map.roots) {
    if (r.state < QS__TERMINAL) {
      ++active;
    } else {
      ++inactive;
    }
  }

  return os << "q-map[v:" << map.db_version << " roots:" << active << "/" << inactive << "]";
}```
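(The `r.state < QS__TERMINAL` comparison above is what produces the active/inactive counts in the `roots:1/0` notation. A simplified sketch of the state ordering it relies on; this is illustrative only, the real enum in quiescedb.h has more members:)
```
enum QuiesceState : uint8_t {
  QS__INVALID,
  QS_QUIESCING,  // active: quiesce requested, awaiting confirmation
  QS_QUIESCED,   // active: quiesce confirmed by the agent
  QS_RELEASING,  // active: release requested, awaiting confirmation
  QS__TERMINAL,  // everything from here on counts as inactive
  QS_RELEASED = QS__TERMINAL,
};
```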
2024-06-18T14:31:27.649Z
<Dhairya Parmar> so `QuiesceMap` is just the representation of the root states, while `QuiesceSet` contains the roots that need to be quiesced
2024-06-18T14:32:26.370Z
<Leonid Usov> correct. In the QuiesceSet, the roots are referred to as ā€œmembersā€. For a given set, its members are necessarily unique roots, but the same root may be a member of more than one set.
2024-06-18T14:33:21.119Z
<Leonid Usov> when the quiesce db generates a quiesce map, it scans all active quiesce sets and aggregates all set members into a unique list of quiesce roots
2024-06-18T14:34:16.475Z
<Dhairya Parmar> and where exactly does that happen?
2024-06-18T14:34:52.128Z
<Leonid Usov> https://files.slack.com/files-pri/T1HG3J90S-F078QLCJMK6/download/untitled.cpp
2024-06-18T14:36:25.773Z
<Leonid Usov> that code got updated recently in <https://github.com/ceph/ceph/pull/57980>, pending review. I found an error with how the TTL is aggregated and fixed it
2024-06-18T14:47:23.111Z
<Dhairya Parmar> ```        // for a quiesce map, we want to report active roots as either QUIESCING or RELEASING
        // this is to make sure that clients always have a reason to report back and confirm
        // the quiesced state.
        auto requested = set.get_requested_member_state();```
the `requested` here means the requested state of all the roots inside the set `set`, right?
2024-06-18T14:48:02.196Z
<Leonid Usov> yes
2024-06-18T14:48:58.136Z
<Dhairya Parmar> okay, the name `get_requested_member_state`  confused me a bit, `get_requested_members_state` would've been easier
2024-06-18T14:49:23.838Z
<Dhairya Parmar> at first glance it looked like it was trying to fetch state of each member in the set
2024-06-18T14:49:36.277Z
<Leonid Usov> i donā€™t think this is a big enough change, so itā€™s within the personal taste margin. I understand this as the state for a member
2024-06-18T14:49:46.567Z
<Dhairya Parmar> okay
2024-06-18T14:50:03.993Z
<Leonid Usov> on the other hand, you are free to refactor it
2024-06-18T14:50:40.589Z
<Dhairya Parmar> šŸ˜…
2024-06-18T14:51:04.543Z
<Dhairya Parmar> that would be the last thing to do i guess
2024-06-18T14:51:13.998Z
<Leonid Usov> this loop is per member, so i think the name has some reasoning to it
2024-06-18T14:51:50.490Z
<Leonid Usov> there is another function in the set:
```  /// @brief The effective state of a member is just a max of its parent set state and its own state
  ///        The exception is when the set is releasing: we want to consider any ack from peers
  ///        that confirms quiesced state of the member to be treated as RELEASED.
  /// @param reported_state the reported state of the member
  /// @return the effective member state
  QuiesceState get_effective_member_state(QuiesceState reported_state) const```
2024-06-18T14:53:15.734Z
<Leonid Usov> these two functions are named similarly. the difference is that the `effective` version takes the reported member state and combines it with the set state, whereas the `requested` version doesnā€™t care about the reported member state
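(Based on that doc comment, a minimal self-contained sketch of the effective-state rule; the free function, the reduced enum, and all names here are illustrative, not the verbatim Ceph implementation:)
```
#include <algorithm>

// Illustrative ordering only; see the enum sketch above for the full active/terminal split.
enum QuiesceState { QS_QUIESCING, QS_QUIESCED, QS_RELEASING, QS_RELEASED };

// Effective member state: the max of the set's state and the member's
// reported state, except that while the set is releasing, an ack that
// confirms QUIESCED is promoted to RELEASED.
QuiesceState effective_member_state(QuiesceState set_state, QuiesceState reported_state)
{
  if (set_state == QS_RELEASING && reported_state == QS_QUIESCED) {
    return QS_RELEASED;
  }
  return std::max(set_state, reported_state);
}
```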
2024-06-18T14:54:41.049Z
<Dhairya Parmar> oh okay
2024-06-18T14:56:22.356Z
<Dhairya Parmar> `get_requested_member_state` makes sense now. the loop just means we iterate through the entire set and get the requested state, but since it is `set.get_requested_member_state()`, doesn't that mean all the `requested` values would be the same for all members of that particular set?
2024-06-18T14:59:01.123Z
<Leonid Usov> thatā€™s correct. You could take that call out of the inner loop
2024-06-18T14:59:39.974Z
<Leonid Usov> though Iā€™m sure the compiler is doing that for us, as that method is marked `const`
2024-06-18T15:01:46.866Z
<Leonid Usov> well, not so sure. Probably itā€™d be best to move that call before the loop over the members
2024-06-18T15:06:02.886Z
<Dhairya Parmar> agreed, although from the readability perspective, since this func is part of `QuiesceSet` and not of `Members members` (the word "member" in the func name made me assume this initially), and since this state applies to all the members, i'd reword it as `get_requested_members_state` and move the call out of the inner loop. it would look like this now:
```void QuiesceDbManager::calculate_quiesce_map(QuiesceMap &map)
{
  map.roots.clear();
  map.db_version = db.version();
  auto db_age = db.get_age();

  for(auto & [set_id, set]: db.sets) {
    if (set.is_active()) {
      // we only report active sets;
      auto requested = set.get_requested_members_state();
      for(auto & [root, member]: set.members) {
        if (member.excluded) {
          continue;
        }
        // for a quiesce map, we want to report active roots as either QUIESCING or RELEASING
        // this is to make sure that clients always have a reason to report back and confirm
        // the quiesced state.
        auto ttl = get_root_ttl(set, member, db_age);
        auto root_it = map.roots.try_emplace(root, QuiesceMap::RootInfo { requested, ttl }).first;

        // the min below resolves conditions when members representing the same root have different state/ttl
        // e.g. if at least one member is QUIESCING then the root should be QUIESCING
        root_it->second.state = std::min(root_it->second.state, requested);
        root_it->second.ttl = std::min(root_it->second.ttl, ttl);
      }
    }
  }
}```
2024-06-18T15:06:14.115Z
<Dhairya Parmar> this is probably hard to read, just a sec
2024-06-18T15:06:38.082Z
<Dhairya Parmar> here <https://termbin.com/dq50>
2024-06-18T15:07:30.300Z
<Leonid Usov> iā€™m reluctant about the renaming, see above, but the move of the call outside the inner loop makes sense
2024-06-18T15:07:36.796Z
<Dhairya Parmar> okay
2024-06-18T15:08:12.770Z
<Dhairya Parmar> should this be my first quiesce db pr šŸ˜…? a minor refactor?
2024-06-18T15:08:20.003Z
<Leonid Usov> if you go for the renaming, then the second function should be renamed, too
2024-06-18T15:08:33.473Z
<Dhairya Parmar> i'll leave that renaming for now
2024-06-18T15:09:17.530Z
<Leonid Usov> > should this be my first quiesce db pr šŸ˜…? a minor refactor?
I would approve such a PR if you are ready to go for it. Though please note that with this feature being part of squid, you may need to backport all refactors
2024-06-18T15:09:35.767Z
<Dhairya Parmar> whoops
2024-06-18T15:10:01.918Z
<Leonid Usov> well, thatā€™s what I had to do with all my fixes
2024-06-18T15:10:40.810Z
<Dhairya Parmar> then i think it's better to create a tracker for refactors (more may come up later as i read more code) and do them all at once?
2024-06-18T15:11:04.105Z
<Leonid Usov> yes, that would be my approach for such refactors
2024-06-18T15:11:19.025Z
<Dhairya Parmar> alright
2024-06-18T15:11:23.529Z
<Dhairya Parmar> thanks šŸ™‚
