2024-06-21T06:03:16.950Z | <Rishabh Dave> @Venky Shankar current output for clone progress bars looks okay? posted here - <https://github.com/ceph/ceph/pull/54620#pullrequestreview-2130881248> |
2024-06-21T06:12:28.353Z | <Dhairya Parmar> hey @Leonid Usov (re: the client bug we discussed in yesterday's sync up) could you go through this tracker <https://tracker.ceph.com/issues/66581> and add/change anything you feel so? |
2024-06-21T06:17:57.286Z | <Leonid Usov> I already did, it’s a nice write up, nothing to add, good work! |
2024-06-21T06:31:37.128Z | <Venky Shankar> yeh, much better |
2024-06-21T06:31:43.378Z | <Venky Shankar> looks really neat |
2024-06-21T06:31:47.813Z | <Venky Shankar> đź‘Ť |
2024-06-21T06:32:28.066Z | <Rishabh Dave> okay, it's ready for review again. i made last set of requested changes yesterday night |
2024-06-21T06:32:57.222Z | <Venky Shankar> I left a comment regarding the ev_id used |
2024-06-21T06:33:10.138Z | <Venky Shankar> should it always be a random string? |
2024-06-21T06:34:33.443Z | <Venky Shankar> (I see that with other modules using progress, but I want to understand if switching to static string could cause issues) |
2024-06-21T06:35:29.893Z | <Rishabh Dave> yeah, i left reply to it yesterday night |
2024-06-21T06:35:53.455Z | <Venky Shankar> kk |
2024-06-21T06:35:58.590Z | <Venky Shankar> haven't checked those yet |
2024-06-21T06:36:13.949Z | <Rishabh Dave> searching for the thread so that i can pass you the link |
2024-06-21T06:36:41.956Z | <Venky Shankar> I'm having a hard time searching due to the number of comments |
2024-06-21T06:37:14.358Z | <Rishabh Dave> same |
2024-06-21T06:37:30.251Z | <Rishabh Dave> <https://github.com/ceph/ceph/pull/54620#discussion_r1647989400> |
2024-06-21T06:37:53.018Z | <Rishabh Dave> there are 2 more replies from below the one in the link |
2024-06-21T06:39:59.189Z | <Venky Shankar> kk |
2024-06-21T06:40:19.738Z | <Venky Shankar> if we move to static strings then the qa test needs to change |
2024-06-21T06:40:37.245Z | <Venky Shankar> since we know the key names in progress event json |
2024-06-21T06:40:37.806Z | <Rishabh Dave> done already. :) |
2024-06-21T06:40:55.371Z | <Venky Shankar> oh, that was quick |
2024-06-21T06:40:57.545Z | <Venky Shankar> nice |
2024-06-21T06:41:12.865Z | <Venky Shankar> I'll have a look and if nothing else is required, will put this to test |
2024-06-21T06:41:22.546Z | <Rishabh Dave> got it done yesteday and put it to test with teuth early morning today |
2024-06-21T06:41:32.089Z | <Rishabh Dave> > I'll have a look and if nothing else is required, will put this to test
yay! |
2024-06-21T07:29:22.547Z | <Rishabh Dave> @Venky Shankar Please reply on this PR too, I left a reply here too yesterday night - <https://github.com/ceph/ceph/pull/58101#discussion_r1648063199> |
2024-06-21T10:49:10.111Z | <Rishabh Dave> ping |
2024-06-21T10:50:03.460Z | <Venky Shankar> will try |
2024-06-21T11:00:34.909Z | <Rishabh Dave> thanks |
2024-06-21T11:14:15.084Z | <Dhairya Parmar> @Leonid Usov im a bit curious about the usage of double underscores for various quiesce state like `QS__FAILURE` |
2024-06-21T11:14:51.137Z | <Dhairya Parmar> while others have single underscore |
2024-06-21T11:14:59.647Z | <Leonid Usov> These denote boundaries |
2024-06-21T11:15:14.468Z | <Leonid Usov> So you would use them in range conditions |
2024-06-21T11:18:10.100Z | <Dhairya Parmar> so it is just for the distinction that these states must be used for fulfilling conditions like
``` if (info.state >= QS__FAILURE) {
// we don't care about roots in failed states
dout(5) << "ignoring a root in a failed state: '" << root << "', " << info.state << dendl;
their_it = map.roots.erase(their_it);
continue;
}```
|
2024-06-21T11:20:42.489Z | <Leonid Usov> yes. Those aren’t real states as per the state diagram. |
2024-06-21T11:21:15.645Z | <Leonid Usov> we have __invalid, __max and different internal boundaries |
2024-06-21T11:21:35.400Z | <Dhairya Parmar> okay |
2024-06-21T11:23:26.057Z | <Dhairya Parmar> and it seems like the <https://tracker.ceph.com/issues/66107> has taken another turn? |
2024-06-21T11:24:08.823Z | <Dhairya Parmar> from what it looks like this is a messenger side problem and not quiesce specific? |
2024-06-21T11:24:16.849Z | <Leonid Usov> well, it looks like we need to create another ticket to investigate the root cause of the missed message |
2024-06-21T11:24:39.192Z | <Leonid Usov> however, I still think that the manager should re-query for an improved resiliency |
2024-06-21T11:24:47.929Z | <Leonid Usov> it won’t hurt |
2024-06-21T11:25:01.525Z | <Leonid Usov> and it will allow us implement the messaging over a lossy connection |
2024-06-21T11:25:30.391Z | <Dhairya Parmar> so i guess you're hinting towards removal of this `if (agent_callback->if_newer < db.version()) {` |
2024-06-21T11:25:35.322Z | <Dhairya Parmar> so i guess you're hinting towards removal of this `if (agent_callback->if_newer < db.version()) {` check |
2024-06-21T11:26:34.086Z | <Leonid Usov> i;m not sure. It’s definitely a simple solution, but as we discussed, I would not like to remove it because then we may get many more unnecessary invocations of the agent. |
2024-06-21T11:27:35.685Z | <Leonid Usov> Please consider identifying the case when you need to “force-notify” an agent. For the replica it’s easy - if there is a listing from the leader, then we force-notify the client regardless of the `if_newer` |
2024-06-21T11:28:16.563Z | <Leonid Usov> and for the leader it shouldn’t be needed, as leader is acked by its client directly |
2024-06-21T11:28:49.389Z | <Leonid Usov> and for the leader it shouldn’t be needed, as the leader is acked by its client directly, avoiding the possibility of a lost ack due to networking |
2024-06-21T11:30:52.069Z | <Leonid Usov> then the only missing part will be to have the leader send the listing to the peer that doesn’t ack for a given timeout, or the last listing was sent longer ago than the timeout |
2024-06-21T11:31:09.538Z | <Leonid Usov> then the only missing part will be to have the leader send the listing to the peer that doesn’t ack for a given timeout, or the last listing was sent longer than the timeout ago |
2024-06-21T11:33:44.678Z | <Leonid Usov> then the only missing part will be to have the leader send the listing to the peer that doesn’t ack for a given timeout, _and_ the last listing was sent longer than the timeout ago |
2024-06-21T11:35:05.942Z | <Dhairya Parmar> are the peers and the replicas here in the same context? |
2024-06-21T14:17:40.102Z | <Leonid Usov> replicas would be a better name for the field that’s called `peers` in the manager |
2024-06-21T14:18:09.598Z | <Leonid Usov> peer should mean someone who’s not me but a participant in the cluster |
2024-06-21T14:18:23.922Z | <Leonid Usov> members = me + peers |
2024-06-21T14:18:53.630Z | <Leonid Usov> replicas = peers of a leader |
2024-06-21T14:20:10.854Z | <Leonid Usov> ah wait there’s a complication with that field, because it includes the leader so that local acks can be taken into account |
2024-06-21T14:20:34.068Z | <Leonid Usov> so, technically, that field should have been `members` |
2024-06-21T17:05:48.714Z | <Dhairya Parmar> 0_o |
2024-06-21T17:06:22.333Z | <Dhairya Parmar> wait, let me summarise this, when you say the leader and the peers, the peers are replicas correct? |
2024-06-21T17:24:26.286Z | <Leonid Usov> Yes. But for a replica it’s peers includes the leader |
2024-06-21T17:24:40.596Z | <Leonid Usov> Yes. And for a replica it’s peers includes the leader |