ceph - cephfs - 2024-06-25

Timestamp (UTC)Message
2024-06-25T08:17:19.112Z
<Jos Collin> @Venky Shankar @Rishabh Dave This change conflicts in reef. <https://tracker.ceph.com/issues/66075>. Are you backporting this to reef/squid? Then I could cherry-pick it to my backport PR.
2024-06-25T08:23:36.171Z
<Jos Collin> @Venky Shankar @Rishabh Dave This change conflicts in reef/squid. <https://tracker.ceph.com/issues/66075>. Are you backporting this to reef/squid? Then I could cherry-pick it to my backport PR.
2024-06-25T08:24:41.128Z
<Rishabh Dave> checking
2024-06-25T08:29:30.321Z
<Rishabh Dave> this needs to be backported because the PR that introduces this issue has been backported to squid and reef.
2024-06-25T08:32:38.215Z
<Rishabh Dave> @Jos Collin i'll post the backports.
2024-06-25T08:38:02.623Z
<Jos Collin> Wait, I'll add those commits to my backport PR. Otherwise, my PR has conflicts.
2024-06-25T08:38:21.858Z
<Rishabh Dave> already done. you can cherry-pick those commits.
2024-06-25T08:38:41.652Z
<Rishabh Dave> squid: <https://github.com/ceph/ceph/pull/58253>
2024-06-25T08:38:51.425Z
<Rishabh Dave> reef: <https://github.com/ceph/ceph/pull/58254>
2024-06-25T08:39:26.303Z
<Rishabh Dave> @Jos Collin can you please post link to the PR which you are backporting?
2024-06-25T08:39:55.728Z
<Jos Collin> It's already created a while ago. Before your PR: <https://github.com/ceph/ceph/pulls>
2024-06-25T08:40:34.895Z
<Jos Collin> <https://github.com/ceph/ceph/pull/58252>, <https://github.com/ceph/ceph/pull/58251>.
2024-06-25T08:41:50.377Z
<Rishabh Dave> ohkay, i see why you were seeing the conflict.
2024-06-25T08:42:43.876Z
<Dhairya Parmar> @Venky Shankar some upstream user just faced <https://tracker.ceph.com/issues/61009>
2024-06-25T08:44:09.438Z
<Jos Collin> Yeah, your commit should be in between to resolve the conflicts.
2024-06-25T08:44:42.245Z
<Rishabh Dave> i thought you meant you wanted me to do my backport so that you can cherry-pick from it.
2024-06-25T08:45:05.665Z
<Jos Collin> No, I meant, You could close your PR, I'll do your backport.
2024-06-25T08:45:48.318Z
<Jos Collin> Thus avoids later rebases.
2024-06-25T08:47:05.300Z
<Rishabh Dave> got it. if you had told me that you have already cherry-picked it i would've not created those backports.
2024-06-25T08:47:34.133Z
<Rishabh Dave> anyways, since you already have those commits, i'll close those backport PRs.
2024-06-25T08:47:37.744Z
<Jos Collin> sorry it caused confusion
2024-06-25T08:47:42.779Z
<Rishabh Dave> can you take over the backport tickets?
2024-06-25T08:47:55.257Z
<Jos Collin> yes, I'll take it now.
2024-06-25T08:48:08.710Z
<Rishabh Dave> cool. close my backport PRs.
2024-06-25T08:52:26.416Z
<Jos Collin> Updated  the trackers 👍
2024-06-25T08:57:53.113Z
<Rishabh Dave> cool. closed my backport PRs.
2024-06-25T08:58:13.416Z
<Rishabh Dave> cool, i've closed my backport PRs in favour of your backport PRs.
2024-06-25T08:59:02.915Z
<Dhairya Parmar> CCed you
2024-06-25T11:42:54.434Z
<Dhairya Parmar> hey @Leonid Usov
2024-06-25T11:43:44.753Z
<Dhairya Parmar> from your response in PR about ``next_event_at_age` is used [here](https://github.com/ceph/ceph/pull/57946/files/b3edbbabdbf64c39fb91da3f47fcfce25bd3c93c#diff-f5065f0feaead10f74178b07ecb08b30d186380ef35684c0364024fe42f62a04R87) and [here](https://github.com/ceph/ceph/pull/57946/files/b3edbbabdbf64c39fb91da3f47fcfce25bd3c93c#diff-f5065f0feaead10f74178b07ecb08b30d186380ef35684c0364024fe42f62a04R91). It will cause the main thread to make a pass at or after the designated age.`

If the db thread has some work, and the boostrap_interval is non-zero, what happens in that case?
2024-06-25T11:44:32.581Z
<Leonid Usov> if it has work, then it won’t sleep, and the next_event_at_age will be updated with a new, slightly smaller bootstrap_interval
2024-06-25T11:45:02.747Z
<Leonid Usov> if it has work, then it won’t sleep, and the next_event_at_age will be updated with a new, slightly smaller bootstrap_interval - given that the work the db had didn’t cause the bootstrap to complete
2024-06-25T11:47:20.261Z
<Leonid Usov> but we try to avoid that by having a dedicated sleep waiting for new peer listings regardless of the other work that’s already there:
```          if (pending_db_updates.empty()) {
            // we are waiting here because if requests/acks aren't empty
            // the code above will skip the sleep due to the `db_thread_has_work`
            // returning true, causing a busy-loop of the quiesce manager thread.
            // This sleep may be interrupted by the submit_condition, in which case
            // we will re-consider everything and may end up here again, but with a shorter
            // bootstrap_delay.
            dout(5) << "bootstrap: waiting for new peers with pending acks: " << pending_acks.size()
              << " requests: " << pending_requests.size()
              << ". Wait timeout: " << bootstrap_delay << dendl;
            submit_condition.wait_for(ls, bootstrap_delay);
          } ```
2024-06-25T11:47:37.689Z
<Leonid Usov> that comment should have explained it
2024-06-25T11:47:46.435Z
<Leonid Usov> it’s a pity it didn’t work
2024-06-25T11:54:15.314Z
<Dhairya Parmar> shouldn't this be `if (pending_db_updates.empty() || !pending_acks.empty() || !pending_requests.empty())` then?
2024-06-25T11:54:19.446Z
<Leonid Usov> the `dout` could be better. Please update it to
```... << "bootstrap: waiting for new peers; pending acks: " << ...```
2024-06-25T11:54:39.120Z
<Dhairya Parmar> if we're waiting on peers for the acks and requests(pending ones)
2024-06-25T11:55:12.171Z
<Dhairya Parmar> because
```while (!requests.empty()) {
	pending_requests.emplace_front(std::move(requests.back()));
	requests.pop_back();
}
while (!acks.empty()) {
	pending_acks.emplace_front(std::move(acks.back()));
	acks.pop_back();
}```
can be no-ops right
2024-06-25T11:55:12.476Z
<Leonid Usov> no the bootstrap only cares about the db_updates, because that’s how new peers are discovered
2024-06-25T11:56:14.364Z
<Leonid Usov> the only reason for us to go for another spin from this bootstrap branch is if we have non-empty db updates, meaning that some of the peers we’re waiting for responded
2024-06-25T11:56:42.046Z
<Leonid Usov> the only reason for us to go for another spin from this bootstrap branch is if we have non-empty db updates, meaning that some of the peers we’re waiting for have responded
2024-06-25T11:57:47.808Z
<Dhairya Parmar> okay i get that but what im pointing to is that if there are no pending db updates, we shouldnt go for a sleep right
2024-06-25T11:57:57.289Z
<Leonid Usov> >  shouldn’t this be `if (pending_db_updates.empty() || !pending_acks.empty() || !pending_requests.empty())` then?
it could be, but functionally that would be equivalent
2024-06-25T11:58:23.221Z
<Leonid Usov> >  if there are no pending db updates, we shouldnt go for a sleep right
the opposite, we should sleep if there are no pending db update
2024-06-25T11:58:24.727Z
<Leonid Usov> >  if there are no pending db updates, we shouldnt go for a sleep right
the opposite, we should sleep if there are no pending db updates
2024-06-25T11:59:02.156Z
<Dhairya Parmar> from what i understand from this code snippet
```          while (!requests.empty()) {
            pending_requests.emplace_front(std::move(requests.back()));
            requests.pop_back();
          }
          while (!acks.empty()) {
            pending_acks.emplace_front(std::move(acks.back()));
            acks.pop_back();
          }
          if (pending_db_updates.empty()) {
            dout(5) << "bootstrap: waiting for new peers with pending acks: " << pending_acks.size()
              << " requests: " << pending_requests.size()
              << ". Wait timeout: " << bootstrap_delay << dendl;
            submit_condition.wait_for(ls, bootstrap_delay);
          }```
is that we first gather the pending reqs+acks and then sleep on it cause we're waiting for new peers for the pending reqs. but what if the `requests` is empty?
2024-06-25T11:59:07.461Z
<Dhairya Parmar> the while loop is a no-op
2024-06-25T11:59:30.518Z
<Dhairya Parmar> so is it necessary to wait?
2024-06-25T11:59:34.420Z
<Leonid Usov> those while loops are to make sure we don’t lose acks and requests that have been submitted while we’re bootstrapping
2024-06-25T11:59:57.915Z
<Leonid Usov> here they are all getting moved from the pending lists
```      decltype(pending_acks) acks(std::move(pending_acks));
      decltype(pending_requests) requests(std::move(pending_requests));
      decltype(pending_db_updates) db_updates(std::move(pending_db_updates));```
2024-06-25T12:00:16.460Z
<Leonid Usov> so we need to put them back before we go for the next iteration
2024-06-25T12:01:10.978Z
<Leonid Usov> we’re also using `emplace_front` in those loops to preserve the original order of the acks and the requests
2024-06-25T12:01:32.516Z
<Leonid Usov> then we enter the sleep unless there are pending db updates
2024-06-25T12:01:36.916Z
<Dhairya Parmar> okay so what if we have
```bootstrap: waiting for new peers with pending acks: 0 requests: 0. Wait timeout: 15```
2024-06-25T12:01:55.712Z
<Dhairya Parmar> we still wait right?
2024-06-25T12:02:02.277Z
<Leonid Usov> this means there are no pending acks or requests, but some peers are unknown
2024-06-25T12:02:18.368Z
<Dhairya Parmar> oh!
2024-06-25T12:02:19.837Z
<Leonid Usov> this means there are no pending acks or requests, but some peers are unknown, and we sleep waiting for their discovery responses which arrive as db listings
2024-06-25T12:03:43.298Z
<Leonid Usov> ```      decltype(pending_acks) acks(std::move(pending_acks));
      decltype(pending_requests) requests(std::move(pending_requests));
      decltype(pending_db_updates) db_updates(std::move(pending_db_updates));```
this can be changed to
```      auto acks(std::move(pending_acks));
      auto requests(std::move(pending_requests));
      auto db_updates(std::move(pending_db_updates));```
2024-06-25T12:04:14.220Z
<Leonid Usov> ```      decltype(pending_acks) acks(std::move(pending_acks));
      decltype(pending_requests) requests(std::move(pending_requests));
      decltype(pending_db_updates) db_updates(std::move(pending_db_updates));```
this can be changed to
```      auto acks(std::move(pending_acks));
      auto requests(std::move(pending_requests));
      auto db_updates(std::move(pending_db_updates));```
or
```      auto acks       = std::move(pending_acks);
      auto requests   = std::move(pending_requests));
      auto db_updates = std::move(pending_db_updates));```
2024-06-25T12:04:28.964Z
<Leonid Usov> ```      decltype(pending_acks) acks(std::move(pending_acks));
      decltype(pending_requests) requests(std::move(pending_requests));
      decltype(pending_db_updates) db_updates(std::move(pending_db_updates));```
this can be changed to
```      auto acks(std::move(pending_acks));
      auto requests(std::move(pending_requests));
      auto db_updates(std::move(pending_db_updates));```
or
```      auto acks       = std::move(pending_acks);
      auto requests   = std::move(pending_requests);
      auto db_updates = std::move(pending_db_updates);```
2024-06-25T12:04:39.907Z
<Dhairya Parmar> since the comment starts with `we are waiting here because if requests/acks aren't empty` it confused me and made me think that the empty reqs case is not being considered.
2024-06-25T12:05:05.324Z
<Leonid Usov> well, there are more lines to the comment:
```            // we are waiting here because if requests/acks aren't empty
            // the code above will skip the sleep due to the `db_thread_has_work`
            // returning true, causing a busy-loop of the quiesce manager thread.```
2024-06-25T12:05:48.504Z
<Leonid Usov> feel free to rephrase this for clarity
2024-06-25T12:06:07.425Z
<Leonid Usov> I often do this to my own comments when I read them again and don’t understand
2024-06-25T12:06:20.376Z
<Leonid Usov> I often do this to my own comments when I read them again and don’t understand, which happens regularly
2024-06-25T12:07:38.442Z
<Leonid Usov> actually, now that I think about it, there might be something missing in this code
2024-06-25T12:07:53.359Z
<Leonid Usov> for example, if there’s a pending membership update, we won’t consider it here
2024-06-25T12:08:24.226Z
<Dhairya Parmar> and how bad is it?
2024-06-25T12:08:27.082Z
<Dhairya Parmar> to non consider?
2024-06-25T12:08:40.213Z
<Dhairya Parmar> to not consider?
2024-06-25T12:08:57.019Z
<Leonid Usov> not very good. We want to process that update asap, and without considering it we will sleep up to bootstrap delay
2024-06-25T12:09:31.054Z
<Leonid Usov> BTW, why do you have 15 as the bootstrap delay? It should be around 1 second
2024-06-25T12:09:46.011Z
<Dhairya Parmar> it was just an example, i didnt actully run it 😛
2024-06-25T12:09:50.569Z
<Leonid Usov> ah ok
2024-06-25T12:11:43.709Z
<Leonid Usov> and alternative approach - a more robust one - would be to never sleep anywhere else but at the beginning of the main loop, but also consider the `bootstrapping` condition there, so that we ignore pending acks and requests. That could be an argument to `db_thread_has_work(bool leader_bootstrapping = false)`
2024-06-25T12:14:11.516Z
<Dhairya Parmar> beginning of the main loop?
2024-06-25T12:14:42.759Z
<Leonid Usov> here
```    while (!db_thread_has_work()) {
      auto db_age = db.get_age();
      if (next_event_at_age <= db_age) {
        break;
      }
      dout(20) << "db idle, age: " << db_age << " next_event_at_age: " << next_event_at_age << dendl;
      auto timeout = std::min(max_wait, next_event_at_age - db_age);
      submit_condition.wait_for(ls, timeout);
    }```
2024-06-25T12:15:21.732Z
<Leonid Usov> will be
```while (!db_thread_has_work(leader_is_bootstrapping)) { ...```
2024-06-25T12:16:59.264Z
<Leonid Usov> then you won’t need a dedicated sleep in the bootstrap branch, and will avoid bugs like the one we have with the membership because there will be a single function that considers if there is work to be done
2024-06-25T12:17:26.782Z
<Leonid Usov> actually, you can know if the leader is bootstrapping without additional arguments:
2024-06-25T12:17:57.681Z
<Leonid Usov> `bool bootstrapping = is_leader && peers.contains_unknown_peers`
2024-06-25T12:18:19.841Z
<Leonid Usov> peers will be empty for a non-leader, so you can drop the is_leader
2024-06-25T12:19:01.680Z
<Leonid Usov> it’s just not completely trivial, but we still do that every time we call `leader_bootstrap`:
```  // check that we've heard from all peers in this epoch
  std::unordered_set<QuiesceInterface::PeerId> unknown_peers;
  for (auto&& [peer, info] : peers) {
    if (info.diff_map.db_version.epoch < membership.epoch && info.diff_map.db_version.set_version == 0) {
      if (peer != [membership.me](http://membership.me)) {
        unknown_peers.insert(peer);
      }
    }
  }```
2024-06-25T12:22:08.603Z
<Leonid Usov> ok.. it’s not that straight forward, sorry. this check is only valid after we process the membership
2024-06-25T12:22:25.605Z
<Leonid Usov> so maybe passing it via a boolean argument is the best option
2024-06-25T12:24:40.879Z
<Leonid Usov> feel free to update the interface of the `leader_bootstrap` method. You want it to
• give you an idea about whether the bootstrap is complete, in other words, there are no unknown peers
• provide you with a delay or next_event_at_age in case it wants to re-query unknown peers
You could achieve it by changing the return type to an optional. then, `nullopt` return will mean that there are no unknown peers, and a value would mean the `next_event_at`
2024-06-25T12:25:48.927Z
<Leonid Usov> ok, that’s what we have already, we just use `zero` as the special value to denote “no unknown peers”
2024-06-25T12:26:47.319Z
<Leonid Usov> so yeah, we just need to add a boolean and set it to true if the response is non-zero, and then use that boolean in the next call to `db_thread_has_work`, masking the `acks` and `requests` if the arg is true
2024-06-25T12:27:05.460Z
<Leonid Usov> but an optional would be cleaner
2024-06-25T12:28:07.679Z
<Dhairya Parmar> a bit more to process at one go 😅
2024-06-25T12:30:30.785Z
<Leonid Usov> good, I’ll get back to MDLog
2024-06-25T12:30:36.614Z
<Leonid Usov> good, I’ll get back to MDLog 🙂
2024-06-25T15:49:45.729Z
<Venky Shankar> oh!
2024-06-25T15:49:53.059Z
<Venky Shankar> thx for adding me
2024-06-25T15:50:10.553Z
<Venky Shankar> have you proposed the workaround yet @Dhairya Parmar
2024-06-25T15:50:11.707Z
<Venky Shankar> ?
2024-06-25T15:57:52.355Z
<Dhairya Parmar> not yet but they seem to easily reproduce this crash
2024-06-25T16:20:25.784Z
<Venky Shankar> lets get them up stable first
2024-06-25T16:20:33.719Z
<Venky Shankar> do you mean they have a reproducer?
2024-06-25T16:20:53.347Z
<Venky Shankar> or are they running into the crash which on mds failover, will happen over and over
2024-06-25T16:27:52.625Z
<Venky Shankar> @Patrick Donnelly around? could you please kill [vshankar-2024-06-24_04:32:30-fs-wip-vshankar-testing-20240617.094036-debug-testing-default-smithi](https://pulpito.ceph.com/vshankar-2024-06-24_04:32:30-fs-wip-vshankar-testing-20240617.094036-debug-testing-default-smithi/#collapseOne) run?
2024-06-25T17:14:03.078Z
<Erich Weiler> Hey all, is @Xiubo Li still around?  I’ve been emailing him about some debug logs for a few days and he’s not responded.  This is regarding:
<https://tracker.ceph.com/issues/65607>
Or maybe someone else can take a look?
2024-06-25T17:15:44.067Z
<Erich Weiler> Hey all, is @Xiubo Li still around?  I’ve been emailing him about some debug logs for a few days and he’s not responded.  This is regarding:
<https://tracker.ceph.com/issues/65607>
Or maybe someone else can take a look?  Possibly this is related to this also:
<https://tracker.ceph.com/issues/62123>
2024-06-25T18:06:10.420Z
<Dhairya Parmar> both which is actually a good sign for us since we starved to get a reproducer
2024-06-25T18:07:16.275Z
<Dhairya Parmar> venky, i was thinking how about we expose the preallocated inodes somewhere? maybe thro' ceph-dencoder or the journal tool? this way at least debugging can be made easier
2024-06-25T18:23:14.243Z
<Dhairya Parmar> Hey erich, he's on PTO
2024-06-25T18:27:44.657Z
<Erich Weiler> Thanks @Dhairya Parmar I’ll wait till he gets back

Any issue? please create an issue here and use the infra label.