ceph - cephfs - 2024-06-21

Timestamp (UTC) | Message
2024-06-21T06:03:16.950Z
<Rishabh Dave> @Venky Shankar current output for clone progress bars looks okay? posted here - <https://github.com/ceph/ceph/pull/54620#pullrequestreview-2130881248>
2024-06-21T06:12:28.353Z
<Dhairya Parmar> hey @Leonid Usov (re: the client bug we discussed in yesterday's sync up) could you go through this tracker <https://tracker.ceph.com/issues/66581> and add/change anything you feel is needed?
2024-06-21T06:17:57.286Z
<Leonid Usov> I already did, it’s a nice write up, nothing to add, good work!
2024-06-21T06:31:37.128Z
<Venky Shankar> yeh, much better
2024-06-21T06:31:43.378Z
<Venky Shankar> looks really neat
2024-06-21T06:31:47.813Z
<Venky Shankar> 👍
2024-06-21T06:32:28.066Z
<Rishabh Dave> okay, it's ready for review again. i made the last set of requested changes last night
2024-06-21T06:32:57.222Z
<Venky Shankar> I left a comment regarding the ev_id used
2024-06-21T06:33:10.138Z
<Venky Shankar> should it always be a random string?
2024-06-21T06:34:33.443Z
<Venky Shankar> (I see that with other modules using progress, but I want to understand if switching to static string could cause issues)
2024-06-21T06:35:29.893Z
<Rishabh Dave> yeah, i left a reply to it last night
2024-06-21T06:35:53.455Z
<Venky Shankar> kk
2024-06-21T06:35:58.590Z
<Venky Shankar> haven't checked those yet
2024-06-21T06:36:13.949Z
<Rishabh Dave> searching for the thread so that i can pass you the link
2024-06-21T06:36:41.956Z
<Venky Shankar> I'm having a hard time searching due to the number of comments
2024-06-21T06:37:14.358Z
<Rishabh Dave> same
2024-06-21T06:37:30.251Z
<Rishabh Dave> <https://github.com/ceph/ceph/pull/54620#discussion_r1647989400>
2024-06-21T06:37:53.018Z
<Rishabh Dave> there are 2 more replies below the one in the link
2024-06-21T06:39:59.189Z
<Venky Shankar> kk
2024-06-21T06:40:19.738Z
<Venky Shankar> if we move to static strings then the qa test needs to change
2024-06-21T06:40:37.245Z
<Venky Shankar> since we know the key names in progress event json
2024-06-21T06:40:37.806Z
<Rishabh Dave> done already. :)
2024-06-21T06:40:55.371Z
<Venky Shankar> oh, that was quick
2024-06-21T06:40:57.545Z
<Venky Shankar> nice
2024-06-21T06:41:12.865Z
<Venky Shankar> I'll have a look and if nothing else is required, will put this to test
2024-06-21T06:41:22.546Z
<Rishabh Dave> got it done yesterday and put it to test with teuth early morning today
2024-06-21T06:41:32.089Z
<Rishabh Dave> > I'll have a look and if nothing else is required, will put this to test
yay!
2024-06-21T07:29:22.547Z
<Rishabh Dave> @Venky Shankar Please reply on this PR too; I left a reply there last night - <https://github.com/ceph/ceph/pull/58101#discussion_r1648063199>
2024-06-21T10:49:10.111Z
<Rishabh Dave> ping
2024-06-21T10:50:03.460Z
<Venky Shankar> will try
2024-06-21T11:00:34.909Z
<Rishabh Dave> thanks
2024-06-21T11:14:15.084Z
<Dhairya Parmar> @Leonid Usov i'm a bit curious about the usage of double underscores for various quiesce states like `QS__FAILURE`
2024-06-21T11:14:51.137Z
<Dhairya Parmar> while others have single underscore
2024-06-21T11:14:59.647Z
<Leonid Usov> These denote boundaries
2024-06-21T11:15:14.468Z
<Leonid Usov> So you would use them in range conditions
2024-06-21T11:18:10.100Z
<Dhairya Parmar> so the distinction is just that these states are meant to be used in conditions like
```
    if (info.state >= QS__FAILURE) {
      // we don't care about roots in failed states
      dout(5) << "ignoring a root in a failed state: '" << root << "', " << info.state << dendl;
      their_it = map.roots.erase(their_it);
      continue;
    }
```
2024-06-21T11:20:42.489Z
<Leonid Usov> yes. Those aren’t real states as per the state diagram.
2024-06-21T11:21:15.645Z
<Leonid Usov> we have __invalid, __max and different internal boundaries
2024-06-21T11:21:35.400Z
<Dhairya Parmar> okay
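[Editor's note] The double-underscore convention discussed above can be sketched as follows. This is an illustrative enum, not the actual Ceph `QuiesceState` definition: the `QS__*` entries (`__invalid`, `__max`, and internal boundaries like `QS__FAILURE`) are range markers, not states from the state diagram.

```cpp
#include <cassert>

// Hypothetical sketch of boundary markers in a quiesce-state enum.
// Double-underscore entries delimit ranges for comparisons; single-
// underscore entries are real states. State names here are made up.
enum QuiesceState {
  QS__INVALID = 0,   // boundary: below any valid state
  QS_QUIESCING,
  QS_QUIESCED,
  QS_RELEASING,
  QS_RELEASED,
  QS__FAILURE,       // boundary: everything from here on is a failure
  QS_TIMEDOUT,
  QS_CANCELED,
  QS_FAILED,
  QS__MAX            // boundary: one past the last real state
};

// Range condition in the style of the snippet quoted above.
inline bool is_failed(QuiesceState s) {
  return s >= QS__FAILURE && s < QS__MAX;
}
```

A single comparison against the boundary then covers every failure state at once, instead of enumerating them.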
2024-06-21T11:23:26.057Z
<Dhairya Parmar> and it seems like the <https://tracker.ceph.com/issues/66107> has taken another turn?
2024-06-21T11:24:08.823Z
<Dhairya Parmar> from what it looks like this is a messenger side problem and not quiesce specific?
2024-06-21T11:24:16.849Z
<Leonid Usov> well, it looks like we need to create another ticket to investigate the root cause of the missed message
2024-06-21T11:24:39.192Z
<Leonid Usov> however, I still think that the manager should re-query for an improved resiliency
2024-06-21T11:24:47.929Z
<Leonid Usov> it won’t hurt
2024-06-21T11:25:01.525Z
<Leonid Usov> and it will allow us to implement the messaging over a lossy connection
2024-06-21T11:25:35.322Z
<Dhairya Parmar> so i guess you're hinting towards removal of this `if (agent_callback->if_newer < db.version()) {` check
2024-06-21T11:26:34.086Z
<Leonid Usov> i'm not sure. It’s definitely a simple solution, but as we discussed, I would not like to remove it because then we may get many more unnecessary invocations of the agent.
2024-06-21T11:27:35.685Z
<Leonid Usov> Please consider identifying the case when you need to “force-notify” an agent. For the replica it’s easy - if there is a listing from the leader, then we force-notify the client regardless of the `if_newer`
2024-06-21T11:28:49.389Z
<Leonid Usov> and for the leader it shouldn’t be needed, as the leader is acked by its client directly, avoiding the possibility of a lost ack due to networking
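[Editor's note] The "force-notify" idea above could look roughly like this. All names are illustrative, not the actual `QuiesceDbManager` code: the `if_newer` version filter is kept to avoid unnecessary agent invocations, but a listing received from the leader bypasses it, so a replica whose ack was lost still gets re-notified.

```cpp
#include <cstdint>

// Hypothetical agent-callback bookkeeping: the agent has acked the db
// up to version `if_newer`, and is normally notified only of newer state.
struct AgentCallback {
  uint64_t if_newer = 0;
};

// Notify when the db moved past what the agent acked, OR unconditionally
// when the leader sent a listing (force-notify for lost-ack recovery).
inline bool should_notify(const AgentCallback& cb,
                          uint64_t db_version,
                          bool listing_from_leader) {
  return listing_from_leader || cb.if_newer < db_version;
}
```

The leader itself would not need this path, since it is acked by its client directly and no network hop can drop that ack.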
2024-06-21T11:33:44.678Z
<Leonid Usov> then the only missing part will be to have the leader send the listing to the peer that doesn’t ack for a given timeout, _and_ the last listing was sent longer than the timeout ago
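[Editor's note] The re-send rule in that last message can be sketched as a single predicate. Names and the time representation are assumptions, not the actual manager code: the leader re-sends its listing to a peer that has not acked the current db version _and_ whose last listing was sent longer than the timeout ago.

```cpp
#include <cstdint>

// Hypothetical per-peer bookkeeping kept by the leader.
struct PeerState {
  uint64_t acked_version = 0;  // highest db version acked by this peer
  uint64_t last_sent_at  = 0;  // monotonic time the last listing was sent
};

// Re-send only when the peer is behind AND the last send has aged out;
// the second condition avoids flooding a peer that is merely slow to ack.
inline bool should_resend(const PeerState& p,
                          uint64_t db_version,
                          uint64_t now,
                          uint64_t timeout) {
  return p.acked_version < db_version && (now - p.last_sent_at) > timeout;
}
```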
2024-06-21T11:35:05.942Z
<Dhairya Parmar> are the peers and the replicas here in the same context?
2024-06-21T14:17:40.102Z
<Leonid Usov> replicas would be a better name for the field that’s called `peers` in the manager
2024-06-21T14:18:09.598Z
<Leonid Usov> peer should mean someone who’s not me but a participant in the cluster
2024-06-21T14:18:23.922Z
<Leonid Usov> members = me + peers
2024-06-21T14:18:53.630Z
<Leonid Usov> replicas = peers of a leader
2024-06-21T14:20:10.854Z
<Leonid Usov> ah wait there’s a complication with that field, because it includes the leader so that local acks can be taken into account
2024-06-21T14:20:34.068Z
<Leonid Usov> so, technically, that field should have been `members`
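[Editor's note] The terminology sorted out above (members = me + peers; replicas = the peers of a leader) can be pinned down with a tiny illustration; the types here are hypothetical, not Ceph code.

```cpp
#include <set>
#include <string>

using Member = std::string;

// Hypothetical cluster view: `peers` is everyone except me,
// so the full membership is me plus my peers.
struct ClusterView {
  Member me;
  std::set<Member> peers;

  std::set<Member> members() const {
    auto m = peers;
    m.insert(me);   // members = me + peers
    return m;
  }
};
```

Under this naming, the manager field discussed above would be `members` rather than `peers`, since it includes the leader itself so that local acks count.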
2024-06-21T17:05:48.714Z
<Dhairya Parmar> 0_o
2024-06-21T17:06:22.333Z
<Dhairya Parmar> wait, let me summarise this: when you say the leader and the peers, the peers are the replicas, correct?
2024-06-21T17:24:40.596Z
<Leonid Usov> Yes. And for a replica, its peers include the leader
