ceph - cephfs - 2024-06-24

Timestamp (UTC)Message
2024-06-24T05:22:21.395Z
<Rishabh Dave> @Venky Shankar both mgr/vol PRs needs review - <https://github.com/ceph/ceph/pull/54620> and <https://github.com/ceph/ceph/pull/58101>
2024-06-24T05:23:39.958Z
<Venky Shankar> I'm swamped with other stuff. Will plan it for this week.
2024-06-24T05:23:54.998Z
<Rishabh Dave> okay
2024-06-24T09:47:16.567Z
<Venky Shankar> Hey @Leonid Usov - /a/vshankar-2024-06-18_04:14:30-fs-wip-vshankar-testing-20240617.094253-debug-testing-default-smithi/7761550 - There is a decode failures like
```2024-06-22T08:46:57.004 INFO:tasks.ceph.mds.b.smithi176.stderr:2024-06-22T08:46:57.005+0000 7f70126c2640 -1 quiesce.mds.0 <quiesce_dispatch> failed to decode message of type 1286 v1: End of buffer [buffer:2]```
Is this a known issue?
2024-06-24T09:47:46.938Z
<Venky Shankar> Message type is `src/msg/Message.h:#define MSG_MDS_QUIESCE_DB_ACK     0x506  // quiesce agent ack back to the db`
2024-06-24T09:53:22.189Z
<Leonid Usov> I have seen this once, but I assume this was part of the messenger error injection 
2024-06-24T09:54:54.096Z
<Venky Shankar> How is that related to messenger error injection? Error injection AFAIK, causes the message to not to be delivered (like a lost ack or something), but this is a decode failure which according the error msg is decoding past the bufferlist, isn't it?
2024-06-24T09:55:18.472Z
<Leonid Usov> i could imagine an error injection that corrupts a message
2024-06-24T09:55:33.655Z
<Leonid Usov> but I’ll look closer
2024-06-24T09:55:47.202Z
<Venky Shankar> oh, do we go that far in corrupting message. I'd be surprised - wow!
2024-06-24T10:04:01.105Z
<Leonid Usov> the message arrived empty if I understand correctly
```2024-06-22T08:46:57.005+0000 7f70126c2640  1 -- [v2:172.21.15.176:6837/2441896889,v1:172.21.15.176:6839/2441896889] <== mds.1 v2:172.21.15.176:6836/358368136 4367 ====  ==== 0+0+0 (crc 0 0 0) 0x562296d2d500 con 0x5
622935f6480
2024-06-22T08:46:57.005+0000 7f70126c2640 -1 quiesce.mds.0 <quiesce_dispatch> failed to decode message of type 1286 v1: End of buffer [buffer:2]
2024-06-22T08:46:57.005+0000 7f70126c2640 10 quiesce.mds.0 <quiesce_dispatch> dump:```
2024-06-24T10:06:33.473Z
<Leonid Usov> here is the corresponding log from the sending side. We see that there was some kind of socket error: https://files.slack.com/files-pri/T1HG3J90S-F079F3RAD27/download/untitled
2024-06-24T10:12:56.602Z
<Venky Shankar> Do we know if the sender did encode a bunch of information in the bufferlist?
2024-06-24T10:13:29.809Z
<Leonid Usov> no, we don’t. The code never got to sending, there was a disconnection
2024-06-24T10:13:41.551Z
<Leonid Usov> well, no
2024-06-24T10:13:47.347Z
<Leonid Usov> it sent out the header, apparently
2024-06-24T10:14:43.007Z
<Leonid Usov> ```2024-06-22T08:46:56.658+0000 7fbd249e5640  1 --2- [v2:172.21.15.176:6836/358368136,v1:172.21.15.176:6838/358368136] >> [v2:172.21.15.176:6837/2441896889,v1:172.21.15.176:6839/2441896889] conn(0x558a5ec93600 0x558a5
e8f1180 crc :-1 s=SESSION_ACCEPTING pgs=757 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_reconnect reconnect race detected, this connection loses to existing=0x558a5e6aba80```
2024-06-24T10:16:06.469Z
<Leonid Usov> please check out `remote/smithi176/log/ceph-mds.f.log.gz` for the messenger - related issue
2024-06-24T10:17:46.369Z
<Leonid Usov> i think the reconnect/resend is broken. I won’t be surprised if that’s related if not the same root cause as the investigation we did on the ticket I opened against quiesce db to improve resiliency to lost acks, see my comments on <https://tracker.ceph.com/issues/66107>
2024-06-24T10:18:36.451Z
<Venky Shankar> I'll get to examining the logs only tomorrow (earliest)
2024-06-24T10:18:48.195Z
<Venky Shankar> For now - I'll create a tracker which we can triage
2024-06-24T10:18:50.743Z
<Leonid Usov> anyway, I don’t think this is a quiesce-db related issue
2024-06-24T10:19:26.721Z
<Leonid Usov> we see that in many cases we see some socket level errors and the reconnect / resend mechanism isn’t working as it should
2024-06-24T10:19:37.666Z
<Venky Shankar> If resend is broken, we are in bigger trouble o_O
2024-06-24T10:23:28.717Z
<Venky Shankar> I understand that you resend/reconnet is not working as expected, but what bothers me is the message is getting mangled itself. if an endpoint didn't get the message, its understandable, but if it got (say) half of the message for some reason (and never the leftover half), its worry some IMO.
2024-06-24T10:23:35.697Z
<Leonid Usov> We’ll, the resend functionality is clearly there. I think there’s some edge case we aren’t handling well
2024-06-24T10:24:31.011Z
<Leonid Usov> I agree that I wouldn’t expect the messenger to submit a partially received message
2024-06-24T10:24:53.324Z
<Leonid Usov> but maybe it wasn’t supposed to be testing that? I know there are CRCs, but maybe they are disabled?
2024-06-24T10:53:06.340Z
<Dhairya Parmar> >  then the only missing part will be to have the leader send the listing to the peer that doesn’t ack for a given timeout, _and_ the last listing was sent longer than the timeout ago
@Leonid Usov does this means the the `RootInfo` struct would need another field?
2024-06-24T11:09:25.688Z
<Leonid Usov> I think you meant PeerInfo. And no, I believe you can achieve the logic with the fields available, though it shouldn’t ba a problem to add a field if you really need one. 
2024-06-24T11:12:24.028Z
<Dhairya Parmar> hmm `PeerInfo`  already has
```        QuiesceTimePoint last_activity;
        QuiesceDbVersion last_sent_version;```
so i guess I can just make use of them
2024-06-24T11:12:39.844Z
<Dhairya Parmar> hmm `PeerInfo`  already has
```QuiesceTimePoint last_activity;
QuiesceDbVersion last_sent_version;```
so i guess I can just make use of them
2024-06-24T11:29:27.235Z
<Dhairya Parmar> `using QuiesceClock = ceph::coarse_real_clock;`

@Leonid Usov real time clocks are prone to clock shifts, cephfs-mirror had a cleanup sometime back where it was swapped with monotonic clocks (which are less prone), and IIRC venky or someone had mentioned this in standup that it'd be better to use monotonic clocks wherever possible. So good to switch this too? Or the usage of real time clock is intentional here?
2024-06-24T11:45:05.052Z
<Leonid Usov> Feel free to update that definition as you find appropriate.
The quiesce db itself is designed to not rely on absolute time, but rather on the deltas. It’s still possible that using a monotonic clock will benefit the quiesce db.
2024-06-24T11:47:08.855Z
<Leonid Usov> you can see that all events in the db are measured and recorded as “age”. The only problem could be on the leader where now could potentially become less than `time_zero` I don’t think I have a test for that, but technically, it should not be hard to pick a new time_zero if such condition occurs, without affecting the contents of the DB.
2024-06-24T11:47:19.985Z
<Leonid Usov> yep
2024-06-24T11:52:55.345Z
<Dhairya Parmar> this is more of a cleanup and to reduce the scope of any randomness
2024-06-24T12:07:16.003Z
<Dhairya Parmar> @Leonid Usov I had a [question](https://github.com/ceph/ceph/pull/57946/commits/b3edbbabdbf64c39fb91da3f47fcfce25bd3c93c#r1650918025) in the PR 57946
2024-06-24T12:13:03.966Z
<Dhairya Parmar> another question w.r.t. quiesce main thread:

```if (next_event_at_age <= db_age) {
  break;
}```
`next_event_at_age` is `QuiesceTimeInterval::max()` , so if the db's age is more than the max, i think something has severely gone wrong, isn't it?
2024-06-24T12:13:47.997Z
<Dhairya Parmar> shouldn't this at least be logged?
2024-06-24T12:24:06.989Z
<Leonid Usov> Max is not the only option for that variable
2024-06-24T12:27:25.078Z
<Leonid Usov> I’ll respond on the PR
2024-06-24T12:33:25.195Z
<Dhairya Parmar> okay
2024-06-24T12:49:47.962Z
<Leonid Usov> There is a [slide](https://docs.google.com/presentation/d/1p2h1_-Qwj1gsMZ690nHCs6zYq9oB0hbgliFDCtymfGs/edit#slide=id.g2e43e634214_0_165) about this in my code walkthrough
2024-06-24T13:38:29.615Z
<Venky Shankar> @Patrick Donnelly RFR (a bit on prio): <https://github.com/ceph/ceph/pull/58227>
2024-06-24T13:58:41.581Z
<Venky Shankar> I'm not sure about the CRC part, but certainly the messenger should not send a partial message. We would have seen all kinds of weird bugs if that was the case.
2024-06-24T14:07:34.508Z
<Venky Shankar> @Leonid Usov I'm sure a there is a tracker for this failed run: <https://pulpito.ceph.com/vshankar-2024-06-18_04:14:30-fs-wip-vshankar-testing-20240617.094253-debug-testing-default-smithi/7761356/> (`valgrind error: UninitCondition QuiesceDbVersion::operator<=>(QuiesceDbVersion const&) const QuiesceDbManager::db_thread_has_work() const QuiesceDbManager::quiesce_db_thread_main()`) - I can't find it, do you have it handy?
2024-06-24T14:39:15.707Z
<Leonid Usov> @Patrick Donnelly fixed it already 
2024-06-24T14:42:36.469Z
<Leonid Usov> [https://tracker.ceph.com/issues/66449](https://tracker.ceph.com/issues/66449)
2024-06-24T14:58:44.292Z
<Venky Shankar> great. I have an old branch..
2024-06-24T15:55:29.274Z
<Patrick Donnelly> done!
2024-06-24T15:59:14.988Z
<Patrick Donnelly> RFR <https://github.com/ceph/ceph/pull/53503>

Any issue? please create an issue here and use the infra label.