ceph - cephfs - 2024-06-25

Timestamp (UTC)	Message
2024-06-25T08:17:19.112Z	<Jos Collin> @Venky Shankar @Rishabh Dave This change conflicts in reef. <https://tracker.ceph.com/issues/66075>. Are you backporting this to reef/squid? Then I could cherry-pick it to my backport PR.
2024-06-25T08:23:36.171Z	<Jos Collin> @Venky Shankar @Rishabh Dave This change conflicts in reef/squid. <https://tracker.ceph.com/issues/66075>. Are you backporting this to reef/squid? Then I could cherry-pick it to my backport PR.
2024-06-25T08:24:41.128Z	<Rishabh Dave> checking
2024-06-25T08:29:30.321Z	<Rishabh Dave> this needs to be backported because the PR that introduces this issue has been backported to squid and reef.
2024-06-25T08:32:38.215Z	<Rishabh Dave> @Jos Collin i'll post the backports.
2024-06-25T08:38:02.623Z	<Jos Collin> Wait, I'll add those commits to my backport PR. Otherwise, my PR has conflicts.
2024-06-25T08:38:21.858Z	<Rishabh Dave> already done. you can cherry-pick those commits.
2024-06-25T08:38:41.652Z	<Rishabh Dave> squid: <https://github.com/ceph/ceph/pull/58253>
2024-06-25T08:38:51.425Z	<Rishabh Dave> reef: <https://github.com/ceph/ceph/pull/58254>
2024-06-25T08:39:26.303Z	<Rishabh Dave> @Jos Collin can you please post link to the PR which you are backporting?
2024-06-25T08:39:55.728Z	<Jos Collin> It's already created a while ago. Before your PR: <https://github.com/ceph/ceph/pulls>
2024-06-25T08:40:34.895Z	<Jos Collin> <https://github.com/ceph/ceph/pull/58252>, <https://github.com/ceph/ceph/pull/58251>.
2024-06-25T08:41:50.377Z	<Rishabh Dave> ohkay, i see why you were seeing the conflict.
2024-06-25T08:42:43.876Z	<Dhairya Parmar> @Venky Shankar some upstream user just faced <https://tracker.ceph.com/issues/61009>
2024-06-25T08:44:09.438Z	<Jos Collin> Yeah, your commit should be in between to resolve the conflicts.
2024-06-25T08:44:42.245Z	<Rishabh Dave> i thought you meant you wanted me to do my backport so that you can cherry-pick from it.
2024-06-25T08:45:05.665Z	<Jos Collin> No, I meant, You could close your PR, I'll do your backport.
2024-06-25T08:45:48.318Z	<Jos Collin> Thus avoids later rebases.
2024-06-25T08:47:05.300Z	<Rishabh Dave> got it. if you had told me that you have already cherry-picked it i would've not created those backports.
2024-06-25T08:47:34.133Z	<Rishabh Dave> anyways, since you already have those commits, i'll close those backport PRs.
2024-06-25T08:47:37.744Z	<Jos Collin> sorry it caused confusion
2024-06-25T08:47:42.779Z	<Rishabh Dave> can you take over the backport tickets?
2024-06-25T08:47:55.257Z	<Jos Collin> yes, I'll take it now.
2024-06-25T08:48:08.710Z	<Rishabh Dave> cool. close my backport PRs.
2024-06-25T08:52:26.416Z	<Jos Collin> Updated the trackers 👍
2024-06-25T08:57:53.113Z	<Rishabh Dave> cool. closed my backport PRs.
2024-06-25T08:58:13.416Z	<Rishabh Dave> cool, i've closed my backport PRs in favour of your backport PRs.
2024-06-25T08:59:02.915Z	<Dhairya Parmar> CCed you
2024-06-25T11:42:54.434Z	<Dhairya Parmar> hey @Leonid Usov
2024-06-25T11:43:44.753Z	<Dhairya Parmar> from your response in PR about ``next_event_at_age` is used [here](https://github.com/ceph/ceph/pull/57946/files/b3edbbabdbf64c39fb91da3f47fcfce25bd3c93c#diff-f5065f0feaead10f74178b07ecb08b30d186380ef35684c0364024fe42f62a04R87) and [here](https://github.com/ceph/ceph/pull/57946/files/b3edbbabdbf64c39fb91da3f47fcfce25bd3c93c#diff-f5065f0feaead10f74178b07ecb08b30d186380ef35684c0364024fe42f62a04R91). It will cause the main thread to make a pass at or after the designated age.` If the db thread has some work, and the boostrap_interval is non-zero, what happens in that case?
2024-06-25T11:44:32.581Z	<Leonid Usov> if it has work, then it won’t sleep, and the next_event_at_age will be updated with a new, slightly smaller bootstrap_interval
2024-06-25T11:45:02.747Z	<Leonid Usov> if it has work, then it won’t sleep, and the next_event_at_age will be updated with a new, slightly smaller bootstrap_interval - given that the work the db had didn’t cause the bootstrap to complete
2024-06-25T11:47:20.261Z	<Leonid Usov> but we try to avoid that by having a dedicated sleep waiting for new peer listings regardless of the other work that’s already there: ``` if (pending_db_updates.empty()) { // we are waiting here because if requests/acks aren't empty // the code above will skip the sleep due to the `db_thread_has_work` // returning true, causing a busy-loop of the quiesce manager thread. // This sleep may be interrupted by the submit_condition, in which case // we will re-consider everything and may end up here again, but with a shorter // bootstrap_delay. dout(5) << "bootstrap: waiting for new peers with pending acks: " << pending_acks.size() << " requests: " << pending_requests.size() << ". Wait timeout: " << bootstrap_delay << dendl; submit_condition.wait_for(ls, bootstrap_delay); } ```
2024-06-25T11:47:37.689Z	<Leonid Usov> that comment should have explained it
2024-06-25T11:47:46.435Z	<Leonid Usov> it’s a pity it didn’t work
2024-06-25T11:54:15.314Z	<Dhairya Parmar> shouldn't this be `if (pending_db_updates.empty() \|\| !pending_acks.empty() \|\| !pending_requests.empty())` then?
2024-06-25T11:54:19.446Z	<Leonid Usov> the `dout` could be better. Please update it to ```... << "bootstrap: waiting for new peers; pending acks: " << ...```
2024-06-25T11:54:39.120Z	<Dhairya Parmar> if we're waiting on peers for the acks and requests(pending ones)
2024-06-25T11:55:12.171Z	<Dhairya Parmar> because ```while (!requests.empty()) { pending_requests.emplace_front(std::move(requests.back())); requests.pop_back(); } while (!acks.empty()) { pending_acks.emplace_front(std::move(acks.back())); acks.pop_back(); }``` can be no-ops right
2024-06-25T11:55:12.476Z	<Leonid Usov> no the bootstrap only cares about the db_updates, because that’s how new peers are discovered
2024-06-25T11:56:14.364Z	<Leonid Usov> the only reason for us to go for another spin from this bootstrap branch is if we have non-empty db updates, meaning that some of the peers we’re waiting for responded
2024-06-25T11:56:42.046Z	<Leonid Usov> the only reason for us to go for another spin from this bootstrap branch is if we have non-empty db updates, meaning that some of the peers we’re waiting for have responded
2024-06-25T11:57:47.808Z	<Dhairya Parmar> okay i get that but what im pointing to is that if there are no pending db updates, we shouldnt go for a sleep right
2024-06-25T11:57:57.289Z	<Leonid Usov> > shouldn’t this be `if (pending_db_updates.empty() \|\| !pending_acks.empty() \|\| !pending_requests.empty())` then? it could be, but functionally that would be equivalent
2024-06-25T11:58:23.221Z	<Leonid Usov> > if there are no pending db updates, we shouldnt go for a sleep right the opposite, we should sleep if there are no pending db update
2024-06-25T11:58:24.727Z	<Leonid Usov> > if there are no pending db updates, we shouldnt go for a sleep right the opposite, we should sleep if there are no pending db updates
2024-06-25T11:59:02.156Z	<Dhairya Parmar> from what i understand from this code snippet ``` while (!requests.empty()) { pending_requests.emplace_front(std::move(requests.back())); requests.pop_back(); } while (!acks.empty()) { pending_acks.emplace_front(std::move(acks.back())); acks.pop_back(); } if (pending_db_updates.empty()) { dout(5) << "bootstrap: waiting for new peers with pending acks: " << pending_acks.size() << " requests: " << pending_requests.size() << ". Wait timeout: " << bootstrap_delay << dendl; submit_condition.wait_for(ls, bootstrap_delay); }``` is that we first gather the pending reqs+acks and then sleep on it cause we're waiting for new peers for the pending reqs. but what if the `requests` is empty?
2024-06-25T11:59:07.461Z	<Dhairya Parmar> the while loop is a no-op
2024-06-25T11:59:30.518Z	<Dhairya Parmar> so is it necessary to wait?
2024-06-25T11:59:34.420Z	<Leonid Usov> those while loops are to make sure we don’t lose acks and requests that have been submitted while we’re bootstrapping
2024-06-25T11:59:57.915Z	<Leonid Usov> here they are all getting moved from the pending lists ``` decltype(pending_acks) acks(std::move(pending_acks)); decltype(pending_requests) requests(std::move(pending_requests)); decltype(pending_db_updates) db_updates(std::move(pending_db_updates));```
2024-06-25T12:00:16.460Z	<Leonid Usov> so we need to put them back before we go for the next iteration
2024-06-25T12:01:10.978Z	<Leonid Usov> we’re also using `emplace_front` in those loops to preserve the original order of the acks and the requests
2024-06-25T12:01:32.516Z	<Leonid Usov> then we enter the sleep unless there are pending db updates
2024-06-25T12:01:36.916Z	<Dhairya Parmar> okay so what if we have ```bootstrap: waiting for new peers with pending acks: 0 requests: 0. Wait timeout: 15```
2024-06-25T12:01:55.712Z	<Dhairya Parmar> we still wait right?
2024-06-25T12:02:02.277Z	<Leonid Usov> this means there are no pending acks or requests, but some peers are unknown
2024-06-25T12:02:18.368Z	<Dhairya Parmar> oh!
2024-06-25T12:02:19.837Z	<Leonid Usov> this means there are no pending acks or requests, but some peers are unknown, and we sleep waiting for their discovery responses which arrive as db listings
2024-06-25T12:03:43.298Z	<Leonid Usov> ``` decltype(pending_acks) acks(std::move(pending_acks)); decltype(pending_requests) requests(std::move(pending_requests)); decltype(pending_db_updates) db_updates(std::move(pending_db_updates));``` this can be changed to ``` auto acks(std::move(pending_acks)); auto requests(std::move(pending_requests)); auto db_updates(std::move(pending_db_updates));```
2024-06-25T12:04:14.220Z	<Leonid Usov> ``` decltype(pending_acks) acks(std::move(pending_acks)); decltype(pending_requests) requests(std::move(pending_requests)); decltype(pending_db_updates) db_updates(std::move(pending_db_updates));``` this can be changed to ``` auto acks(std::move(pending_acks)); auto requests(std::move(pending_requests)); auto db_updates(std::move(pending_db_updates));``` or ``` auto acks = std::move(pending_acks); auto requests = std::move(pending_requests)); auto db_updates = std::move(pending_db_updates));```
2024-06-25T12:04:28.964Z	<Leonid Usov> ``` decltype(pending_acks) acks(std::move(pending_acks)); decltype(pending_requests) requests(std::move(pending_requests)); decltype(pending_db_updates) db_updates(std::move(pending_db_updates));``` this can be changed to ``` auto acks(std::move(pending_acks)); auto requests(std::move(pending_requests)); auto db_updates(std::move(pending_db_updates));``` or ``` auto acks = std::move(pending_acks); auto requests = std::move(pending_requests); auto db_updates = std::move(pending_db_updates);```
2024-06-25T12:04:39.907Z	<Dhairya Parmar> since the comment starts with `we are waiting here because if requests/acks aren't empty` it confused me and made me think that the empty reqs case is not being considered.
2024-06-25T12:05:05.324Z	<Leonid Usov> well, there are more lines to the comment: ``` // we are waiting here because if requests/acks aren't empty // the code above will skip the sleep due to the `db_thread_has_work` // returning true, causing a busy-loop of the quiesce manager thread.```
2024-06-25T12:05:48.504Z	<Leonid Usov> feel free to rephrase this for clarity
2024-06-25T12:06:07.425Z	<Leonid Usov> I often do this to my own comments when I read them again and don’t understand
2024-06-25T12:06:20.376Z	<Leonid Usov> I often do this to my own comments when I read them again and don’t understand, which happens regularly
2024-06-25T12:07:38.442Z	<Leonid Usov> actually, now that I think about it, there might be something missing in this code
2024-06-25T12:07:53.359Z	<Leonid Usov> for example, if there’s a pending membership update, we won’t consider it here
2024-06-25T12:08:24.226Z	<Dhairya Parmar> and how bad is it?
2024-06-25T12:08:27.082Z	<Dhairya Parmar> to non consider?
2024-06-25T12:08:40.213Z	<Dhairya Parmar> to not consider?
2024-06-25T12:08:57.019Z	<Leonid Usov> not very good. We want to process that update asap, and without considering it we will sleep up to bootstrap delay
2024-06-25T12:09:31.054Z	<Leonid Usov> BTW, why do you have 15 as the bootstrap delay? It should be around 1 second
2024-06-25T12:09:46.011Z	<Dhairya Parmar> it was just an example, i didnt actully run it 😛
2024-06-25T12:09:50.569Z	<Leonid Usov> ah ok
2024-06-25T12:11:43.709Z	<Leonid Usov> and alternative approach - a more robust one - would be to never sleep anywhere else but at the beginning of the main loop, but also consider the `bootstrapping` condition there, so that we ignore pending acks and requests. That could be an argument to `db_thread_has_work(bool leader_bootstrapping = false)`
2024-06-25T12:14:11.516Z	<Dhairya Parmar> beginning of the main loop?
2024-06-25T12:14:42.759Z	<Leonid Usov> here ``` while (!db_thread_has_work()) { auto db_age = db.get_age(); if (next_event_at_age <= db_age) { break; } dout(20) << "db idle, age: " << db_age << " next_event_at_age: " << next_event_at_age << dendl; auto timeout = std::min(max_wait, next_event_at_age - db_age); submit_condition.wait_for(ls, timeout); }```
2024-06-25T12:15:21.732Z	<Leonid Usov> will be ```while (!db_thread_has_work(leader_is_bootstrapping)) { ...```
2024-06-25T12:16:59.264Z	<Leonid Usov> then you won’t need a dedicated sleep in the bootstrap branch, and will avoid bugs like the one we have with the membership because there will be a single function that considers if there is work to be done
2024-06-25T12:17:26.782Z	<Leonid Usov> actually, you can know if the leader is bootstrapping without additional arguments:
2024-06-25T12:17:57.681Z	<Leonid Usov> `bool bootstrapping = is_leader && peers.contains_unknown_peers`
2024-06-25T12:18:19.841Z	<Leonid Usov> peers will be empty for a non-leader, so you can drop the is_leader
2024-06-25T12:19:01.680Z	<Leonid Usov> it’s just not completely trivial, but we still do that every time we call `leader_bootstrap`: ``` // check that we've heard from all peers in this epoch std::unordered_set<QuiesceInterface::PeerId> unknown_peers; for (auto&& [peer, info] : peers) { if (info.diff_map.db_version.epoch < membership.epoch && info.diff_map.db_version.set_version == 0) { if (peer != [membership.me](http://membership.me)) { unknown_peers.insert(peer); } } }```
2024-06-25T12:22:08.603Z	<Leonid Usov> ok.. it’s not that straight forward, sorry. this check is only valid after we process the membership
2024-06-25T12:22:25.605Z	<Leonid Usov> so maybe passing it via a boolean argument is the best option
2024-06-25T12:24:40.879Z	<Leonid Usov> feel free to update the interface of the `leader_bootstrap` method. You want it to • give you an idea about whether the bootstrap is complete, in other words, there are no unknown peers • provide you with a delay or next_event_at_age in case it wants to re-query unknown peers You could achieve it by changing the return type to an optional. then, `nullopt` return will mean that there are no unknown peers, and a value would mean the `next_event_at`
2024-06-25T12:25:48.927Z	<Leonid Usov> ok, that’s what we have already, we just use `zero` as the special value to denote “no unknown peers”
2024-06-25T12:26:47.319Z	<Leonid Usov> so yeah, we just need to add a boolean and set it to true if the response is non-zero, and then use that boolean in the next call to `db_thread_has_work`, masking the `acks` and `requests` if the arg is true
2024-06-25T12:27:05.460Z	<Leonid Usov> but an optional would be cleaner
2024-06-25T12:28:07.679Z	<Dhairya Parmar> a bit more to process at one go 😅
2024-06-25T12:30:30.785Z	<Leonid Usov> good, I’ll get back to MDLog
2024-06-25T12:30:36.614Z	<Leonid Usov> good, I’ll get back to MDLog 🙂
2024-06-25T15:49:45.729Z	<Venky Shankar> oh!
2024-06-25T15:49:53.059Z	<Venky Shankar> thx for adding me
2024-06-25T15:50:10.553Z	<Venky Shankar> have you proposed the workaround yet @Dhairya Parmar
2024-06-25T15:50:11.707Z	<Venky Shankar> ?
2024-06-25T15:57:52.355Z	<Dhairya Parmar> not yet but they seem to easily reproduce this crash
2024-06-25T16:20:25.784Z	<Venky Shankar> lets get them up stable first
2024-06-25T16:20:33.719Z	<Venky Shankar> do you mean they have a reproducer?
2024-06-25T16:20:53.347Z	<Venky Shankar> or are they running into the crash which on mds failover, will happen over and over
2024-06-25T16:27:52.625Z	<Venky Shankar> @Patrick Donnelly around? could you please kill [vshankar-2024-06-24_04:32:30-fs-wip-vshankar-testing-20240617.094036-debug-testing-default-smithi](https://pulpito.ceph.com/vshankar-2024-06-24_04:32:30-fs-wip-vshankar-testing-20240617.094036-debug-testing-default-smithi/#collapseOne) run?
2024-06-25T17:14:03.078Z	<Erich Weiler> Hey all, is @Xiubo Li still around? I’ve been emailing him about some debug logs for a few days and he’s not responded. This is regarding: <https://tracker.ceph.com/issues/65607> Or maybe someone else can take a look?
2024-06-25T17:15:44.067Z	<Erich Weiler> Hey all, is @Xiubo Li still around? I’ve been emailing him about some debug logs for a few days and he’s not responded. This is regarding: <https://tracker.ceph.com/issues/65607> Or maybe someone else can take a look? Possibly this is related to this also: <https://tracker.ceph.com/issues/62123>
2024-06-25T18:06:10.420Z	<Dhairya Parmar> both which is actually a good sign for us since we starved to get a reproducer
2024-06-25T18:07:16.275Z	<Dhairya Parmar> venky, i was thinking how about we expose the preallocated inodes somewhere? maybe thro' ceph-dencoder or the journal tool? this way at least debugging can be made easier
2024-06-25T18:23:14.243Z	<Dhairya Parmar> Hey erich, he's on PTO
2024-06-25T18:27:44.657Z	<Erich Weiler> Thanks @Dhairya Parmar I’ll wait till he gets back

ceph - cephfs - 2024-06-25

Any issue? please create an issue here and use the infra label.