2024-06-25T08:17:19.112Z | <Jos Collin> @Venky Shankar @Rishabh Dave This change conflicts in reef. <https://tracker.ceph.com/issues/66075>. Are you backporting this to reef/squid? Then I could cherry-pick it to my backport PR. |
2024-06-25T08:23:36.171Z | <Jos Collin> @Venky Shankar @Rishabh Dave This change conflicts in reef/squid. <https://tracker.ceph.com/issues/66075>. Are you backporting this to reef/squid? Then I could cherry-pick it to my backport PR. |
2024-06-25T08:24:41.128Z | <Rishabh Dave> checking |
2024-06-25T08:29:30.321Z | <Rishabh Dave> this needs to be backported because the PR that introduces this issue has been backported to squid and reef. |
2024-06-25T08:32:38.215Z | <Rishabh Dave> @Jos Collin i'll post the backports. |
2024-06-25T08:38:02.623Z | <Jos Collin> Wait, I'll add those commits to my backport PR. Otherwise, my PR has conflicts. |
2024-06-25T08:38:21.858Z | <Rishabh Dave> already done. you can cherry-pick those commits. |
2024-06-25T08:38:41.652Z | <Rishabh Dave> squid: <https://github.com/ceph/ceph/pull/58253> |
2024-06-25T08:38:51.425Z | <Rishabh Dave> reef: <https://github.com/ceph/ceph/pull/58254> |
2024-06-25T08:39:26.303Z | <Rishabh Dave> @Jos Collin can you please post link to the PR which you are backporting? |
2024-06-25T08:39:55.728Z | <Jos Collin> It's already created a while ago. Before your PR: <https://github.com/ceph/ceph/pulls> |
2024-06-25T08:40:34.895Z | <Jos Collin> <https://github.com/ceph/ceph/pull/58252>, <https://github.com/ceph/ceph/pull/58251>. |
2024-06-25T08:41:50.377Z | <Rishabh Dave> ohkay, i see why you were seeing the conflict. |
2024-06-25T08:42:43.876Z | <Dhairya Parmar> @Venky Shankar some upstream user just faced <https://tracker.ceph.com/issues/61009> |
2024-06-25T08:44:09.438Z | <Jos Collin> Yeah, your commit should be in between to resolve the conflicts. |
2024-06-25T08:44:42.245Z | <Rishabh Dave> i thought you meant you wanted me to do my backport so that you can cherry-pick from it. |
2024-06-25T08:45:05.665Z | <Jos Collin> No, I meant, You could close your PR, I'll do your backport. |
2024-06-25T08:45:48.318Z | <Jos Collin> Thus avoids later rebases. |
2024-06-25T08:47:05.300Z | <Rishabh Dave> got it. if you had told me that you have already cherry-picked it i would've not created those backports. |
2024-06-25T08:47:34.133Z | <Rishabh Dave> anyways, since you already have those commits, i'll close those backport PRs. |
2024-06-25T08:47:37.744Z | <Jos Collin> sorry it caused confusion |
2024-06-25T08:47:42.779Z | <Rishabh Dave> can you take over the backport tickets? |
2024-06-25T08:47:55.257Z | <Jos Collin> yes, I'll take it now. |
2024-06-25T08:48:08.710Z | <Rishabh Dave> cool. close my backport PRs. |
2024-06-25T08:52:26.416Z | <Jos Collin> Updated the trackers 👍 |
2024-06-25T08:57:53.113Z | <Rishabh Dave> cool. closed my backport PRs. |
2024-06-25T08:58:13.416Z | <Rishabh Dave> cool, i've closed my backport PRs in favour of your backport PRs. |
2024-06-25T08:59:02.915Z | <Dhairya Parmar> CCed you |
2024-06-25T11:42:54.434Z | <Dhairya Parmar> hey @Leonid Usov |
2024-06-25T11:43:44.753Z | <Dhairya Parmar> from your response in PR about ``next_event_at_age` is used [here](https://github.com/ceph/ceph/pull/57946/files/b3edbbabdbf64c39fb91da3f47fcfce25bd3c93c#diff-f5065f0feaead10f74178b07ecb08b30d186380ef35684c0364024fe42f62a04R87) and [here](https://github.com/ceph/ceph/pull/57946/files/b3edbbabdbf64c39fb91da3f47fcfce25bd3c93c#diff-f5065f0feaead10f74178b07ecb08b30d186380ef35684c0364024fe42f62a04R91). It will cause the main thread to make a pass at or after the designated age.`
If the db thread has some work, and the boostrap_interval is non-zero, what happens in that case? |
2024-06-25T11:44:32.581Z | <Leonid Usov> if it has work, then it won’t sleep, and the next_event_at_age will be updated with a new, slightly smaller bootstrap_interval |
2024-06-25T11:45:02.747Z | <Leonid Usov> if it has work, then it won’t sleep, and the next_event_at_age will be updated with a new, slightly smaller bootstrap_interval - given that the work the db had didn’t cause the bootstrap to complete |
2024-06-25T11:47:20.261Z | <Leonid Usov> but we try to avoid that by having a dedicated sleep waiting for new peer listings regardless of the other work that’s already there:
``` if (pending_db_updates.empty()) {
// we are waiting here because if requests/acks aren't empty
// the code above will skip the sleep due to the `db_thread_has_work`
// returning true, causing a busy-loop of the quiesce manager thread.
// This sleep may be interrupted by the submit_condition, in which case
// we will re-consider everything and may end up here again, but with a shorter
// bootstrap_delay.
dout(5) << "bootstrap: waiting for new peers with pending acks: " << pending_acks.size()
<< " requests: " << pending_requests.size()
<< ". Wait timeout: " << bootstrap_delay << dendl;
submit_condition.wait_for(ls, bootstrap_delay);
} ``` |
2024-06-25T11:47:37.689Z | <Leonid Usov> that comment should have explained it |
2024-06-25T11:47:46.435Z | <Leonid Usov> it’s a pity it didn’t work |
2024-06-25T11:54:15.314Z | <Dhairya Parmar> shouldn't this be `if (pending_db_updates.empty() || !pending_acks.empty() || !pending_requests.empty())` then? |
2024-06-25T11:54:19.446Z | <Leonid Usov> the `dout` could be better. Please update it to
```... << "bootstrap: waiting for new peers; pending acks: " << ...``` |
2024-06-25T11:54:39.120Z | <Dhairya Parmar> if we're waiting on peers for the acks and requests(pending ones) |
2024-06-25T11:55:12.171Z | <Dhairya Parmar> because
```while (!requests.empty()) {
pending_requests.emplace_front(std::move(requests.back()));
requests.pop_back();
}
while (!acks.empty()) {
pending_acks.emplace_front(std::move(acks.back()));
acks.pop_back();
}```
can be no-ops right |
2024-06-25T11:55:12.476Z | <Leonid Usov> no the bootstrap only cares about the db_updates, because that’s how new peers are discovered |
2024-06-25T11:56:14.364Z | <Leonid Usov> the only reason for us to go for another spin from this bootstrap branch is if we have non-empty db updates, meaning that some of the peers we’re waiting for responded |
2024-06-25T11:56:42.046Z | <Leonid Usov> the only reason for us to go for another spin from this bootstrap branch is if we have non-empty db updates, meaning that some of the peers we’re waiting for have responded |
2024-06-25T11:57:47.808Z | <Dhairya Parmar> okay i get that but what im pointing to is that if there are no pending db updates, we shouldnt go for a sleep right |
2024-06-25T11:57:57.289Z | <Leonid Usov> > shouldn’t this be `if (pending_db_updates.empty() || !pending_acks.empty() || !pending_requests.empty())` then?
it could be, but functionally that would be equivalent |
2024-06-25T11:58:23.221Z | <Leonid Usov> > if there are no pending db updates, we shouldnt go for a sleep right
the opposite, we should sleep if there are no pending db update |
2024-06-25T11:58:24.727Z | <Leonid Usov> > if there are no pending db updates, we shouldnt go for a sleep right
the opposite, we should sleep if there are no pending db updates |
2024-06-25T11:59:02.156Z | <Dhairya Parmar> from what i understand from this code snippet
``` while (!requests.empty()) {
pending_requests.emplace_front(std::move(requests.back()));
requests.pop_back();
}
while (!acks.empty()) {
pending_acks.emplace_front(std::move(acks.back()));
acks.pop_back();
}
if (pending_db_updates.empty()) {
dout(5) << "bootstrap: waiting for new peers with pending acks: " << pending_acks.size()
<< " requests: " << pending_requests.size()
<< ". Wait timeout: " << bootstrap_delay << dendl;
submit_condition.wait_for(ls, bootstrap_delay);
}```
is that we first gather the pending reqs+acks and then sleep on it cause we're waiting for new peers for the pending reqs. but what if the `requests` is empty? |
2024-06-25T11:59:07.461Z | <Dhairya Parmar> the while loop is a no-op |
2024-06-25T11:59:30.518Z | <Dhairya Parmar> so is it necessary to wait? |
2024-06-25T11:59:34.420Z | <Leonid Usov> those while loops are to make sure we don’t lose acks and requests that have been submitted while we’re bootstrapping |
2024-06-25T11:59:57.915Z | <Leonid Usov> here they are all getting moved from the pending lists
``` decltype(pending_acks) acks(std::move(pending_acks));
decltype(pending_requests) requests(std::move(pending_requests));
decltype(pending_db_updates) db_updates(std::move(pending_db_updates));``` |
2024-06-25T12:00:16.460Z | <Leonid Usov> so we need to put them back before we go for the next iteration |
2024-06-25T12:01:10.978Z | <Leonid Usov> we’re also using `emplace_front` in those loops to preserve the original order of the acks and the requests |
2024-06-25T12:01:32.516Z | <Leonid Usov> then we enter the sleep unless there are pending db updates |
2024-06-25T12:01:36.916Z | <Dhairya Parmar> okay so what if we have
```bootstrap: waiting for new peers with pending acks: 0 requests: 0. Wait timeout: 15``` |
2024-06-25T12:01:55.712Z | <Dhairya Parmar> we still wait right? |
2024-06-25T12:02:02.277Z | <Leonid Usov> this means there are no pending acks or requests, but some peers are unknown |
2024-06-25T12:02:18.368Z | <Dhairya Parmar> oh! |
2024-06-25T12:02:19.837Z | <Leonid Usov> this means there are no pending acks or requests, but some peers are unknown, and we sleep waiting for their discovery responses which arrive as db listings |
2024-06-25T12:03:43.298Z | <Leonid Usov> ``` decltype(pending_acks) acks(std::move(pending_acks));
decltype(pending_requests) requests(std::move(pending_requests));
decltype(pending_db_updates) db_updates(std::move(pending_db_updates));```
this can be changed to
``` auto acks(std::move(pending_acks));
auto requests(std::move(pending_requests));
auto db_updates(std::move(pending_db_updates));``` |
2024-06-25T12:04:14.220Z | <Leonid Usov> ``` decltype(pending_acks) acks(std::move(pending_acks));
decltype(pending_requests) requests(std::move(pending_requests));
decltype(pending_db_updates) db_updates(std::move(pending_db_updates));```
this can be changed to
``` auto acks(std::move(pending_acks));
auto requests(std::move(pending_requests));
auto db_updates(std::move(pending_db_updates));```
or
``` auto acks = std::move(pending_acks);
auto requests = std::move(pending_requests));
auto db_updates = std::move(pending_db_updates));``` |
2024-06-25T12:04:28.964Z | <Leonid Usov> ``` decltype(pending_acks) acks(std::move(pending_acks));
decltype(pending_requests) requests(std::move(pending_requests));
decltype(pending_db_updates) db_updates(std::move(pending_db_updates));```
this can be changed to
``` auto acks(std::move(pending_acks));
auto requests(std::move(pending_requests));
auto db_updates(std::move(pending_db_updates));```
or
``` auto acks = std::move(pending_acks);
auto requests = std::move(pending_requests);
auto db_updates = std::move(pending_db_updates);``` |
2024-06-25T12:04:39.907Z | <Dhairya Parmar> since the comment starts with `we are waiting here because if requests/acks aren't empty` it confused me and made me think that the empty reqs case is not being considered. |
2024-06-25T12:05:05.324Z | <Leonid Usov> well, there are more lines to the comment:
``` // we are waiting here because if requests/acks aren't empty
// the code above will skip the sleep due to the `db_thread_has_work`
// returning true, causing a busy-loop of the quiesce manager thread.``` |
2024-06-25T12:05:48.504Z | <Leonid Usov> feel free to rephrase this for clarity |
2024-06-25T12:06:07.425Z | <Leonid Usov> I often do this to my own comments when I read them again and don’t understand |
2024-06-25T12:06:20.376Z | <Leonid Usov> I often do this to my own comments when I read them again and don’t understand, which happens regularly |
2024-06-25T12:07:38.442Z | <Leonid Usov> actually, now that I think about it, there might be something missing in this code |
2024-06-25T12:07:53.359Z | <Leonid Usov> for example, if there’s a pending membership update, we won’t consider it here |
2024-06-25T12:08:24.226Z | <Dhairya Parmar> and how bad is it? |
2024-06-25T12:08:27.082Z | <Dhairya Parmar> to non consider? |
2024-06-25T12:08:40.213Z | <Dhairya Parmar> to not consider? |
2024-06-25T12:08:57.019Z | <Leonid Usov> not very good. We want to process that update asap, and without considering it we will sleep up to bootstrap delay |
2024-06-25T12:09:31.054Z | <Leonid Usov> BTW, why do you have 15 as the bootstrap delay? It should be around 1 second |
2024-06-25T12:09:46.011Z | <Dhairya Parmar> it was just an example, i didnt actully run it 😛 |
2024-06-25T12:09:50.569Z | <Leonid Usov> ah ok |
2024-06-25T12:11:43.709Z | <Leonid Usov> and alternative approach - a more robust one - would be to never sleep anywhere else but at the beginning of the main loop, but also consider the `bootstrapping` condition there, so that we ignore pending acks and requests. That could be an argument to `db_thread_has_work(bool leader_bootstrapping = false)` |
2024-06-25T12:14:11.516Z | <Dhairya Parmar> beginning of the main loop? |
2024-06-25T12:14:42.759Z | <Leonid Usov> here
``` while (!db_thread_has_work()) {
auto db_age = db.get_age();
if (next_event_at_age <= db_age) {
break;
}
dout(20) << "db idle, age: " << db_age << " next_event_at_age: " << next_event_at_age << dendl;
auto timeout = std::min(max_wait, next_event_at_age - db_age);
submit_condition.wait_for(ls, timeout);
}``` |
2024-06-25T12:15:21.732Z | <Leonid Usov> will be
```while (!db_thread_has_work(leader_is_bootstrapping)) { ...``` |
2024-06-25T12:16:59.264Z | <Leonid Usov> then you won’t need a dedicated sleep in the bootstrap branch, and will avoid bugs like the one we have with the membership because there will be a single function that considers if there is work to be done |
2024-06-25T12:17:26.782Z | <Leonid Usov> actually, you can know if the leader is bootstrapping without additional arguments: |
2024-06-25T12:17:57.681Z | <Leonid Usov> `bool bootstrapping = is_leader && peers.contains_unknown_peers` |
2024-06-25T12:18:19.841Z | <Leonid Usov> peers will be empty for a non-leader, so you can drop the is_leader |
2024-06-25T12:19:01.680Z | <Leonid Usov> it’s just not completely trivial, but we still do that every time we call `leader_bootstrap`:
``` // check that we've heard from all peers in this epoch
std::unordered_set<QuiesceInterface::PeerId> unknown_peers;
for (auto&& [peer, info] : peers) {
if (info.diff_map.db_version.epoch < membership.epoch && info.diff_map.db_version.set_version == 0) {
if (peer != [membership.me](http://membership.me)) {
unknown_peers.insert(peer);
}
}
}``` |
2024-06-25T12:22:08.603Z | <Leonid Usov> ok.. it’s not that straight forward, sorry. this check is only valid after we process the membership |
2024-06-25T12:22:25.605Z | <Leonid Usov> so maybe passing it via a boolean argument is the best option |
2024-06-25T12:24:40.879Z | <Leonid Usov> feel free to update the interface of the `leader_bootstrap` method. You want it to
• give you an idea about whether the bootstrap is complete, in other words, there are no unknown peers
• provide you with a delay or next_event_at_age in case it wants to re-query unknown peers
You could achieve it by changing the return type to an optional. then, `nullopt` return will mean that there are no unknown peers, and a value would mean the `next_event_at` |
2024-06-25T12:25:48.927Z | <Leonid Usov> ok, that’s what we have already, we just use `zero` as the special value to denote “no unknown peers” |
2024-06-25T12:26:47.319Z | <Leonid Usov> so yeah, we just need to add a boolean and set it to true if the response is non-zero, and then use that boolean in the next call to `db_thread_has_work`, masking the `acks` and `requests` if the arg is true |
2024-06-25T12:27:05.460Z | <Leonid Usov> but an optional would be cleaner |
2024-06-25T12:28:07.679Z | <Dhairya Parmar> a bit more to process at one go 😅 |
2024-06-25T12:30:30.785Z | <Leonid Usov> good, I’ll get back to MDLog |
2024-06-25T12:30:36.614Z | <Leonid Usov> good, I’ll get back to MDLog 🙂 |
2024-06-25T15:49:45.729Z | <Venky Shankar> oh! |
2024-06-25T15:49:53.059Z | <Venky Shankar> thx for adding me |
2024-06-25T15:50:10.553Z | <Venky Shankar> have you proposed the workaround yet @Dhairya Parmar |
2024-06-25T15:50:11.707Z | <Venky Shankar> ? |
2024-06-25T15:57:52.355Z | <Dhairya Parmar> not yet but they seem to easily reproduce this crash |
2024-06-25T16:20:25.784Z | <Venky Shankar> lets get them up stable first |
2024-06-25T16:20:33.719Z | <Venky Shankar> do you mean they have a reproducer? |
2024-06-25T16:20:53.347Z | <Venky Shankar> or are they running into the crash which on mds failover, will happen over and over |
2024-06-25T16:27:52.625Z | <Venky Shankar> @Patrick Donnelly around? could you please kill [vshankar-2024-06-24_04:32:30-fs-wip-vshankar-testing-20240617.094036-debug-testing-default-smithi](https://pulpito.ceph.com/vshankar-2024-06-24_04:32:30-fs-wip-vshankar-testing-20240617.094036-debug-testing-default-smithi/#collapseOne) run? |
2024-06-25T17:14:03.078Z | <Erich Weiler> Hey all, is @Xiubo Li still around? I’ve been emailing him about some debug logs for a few days and he’s not responded. This is regarding:
<https://tracker.ceph.com/issues/65607>
Or maybe someone else can take a look? |
2024-06-25T17:15:44.067Z | <Erich Weiler> Hey all, is @Xiubo Li still around? I’ve been emailing him about some debug logs for a few days and he’s not responded. This is regarding:
<https://tracker.ceph.com/issues/65607>
Or maybe someone else can take a look? Possibly this is related to this also:
<https://tracker.ceph.com/issues/62123> |
2024-06-25T18:06:10.420Z | <Dhairya Parmar> both which is actually a good sign for us since we starved to get a reproducer |
2024-06-25T18:07:16.275Z | <Dhairya Parmar> venky, i was thinking how about we expose the preallocated inodes somewhere? maybe thro' ceph-dencoder or the journal tool? this way at least debugging can be made easier |
2024-06-25T18:23:14.243Z | <Dhairya Parmar> Hey erich, he's on PTO |
2024-06-25T18:27:44.657Z | <Erich Weiler> Thanks @Dhairya Parmar I’ll wait till he gets back |