ceph - ceph-devel - 2024-08-04

Timestamp (UTC) | Message
2024-08-04T19:59:02.171Z
<jmguzman> Hello Guys

We are seeing instability on a production cluster, which I think is due to the mechanism used to declare an OSD down and up.
2024-08-04T20:01:02.688Z
<jmguzman> We have a cluster with 3 hosts. OSD1 is on Host1.
There is a network connectivity problem that prevents Host1 from connecting to Host2 and Host3 (it is an isolated networking problem, related to wrong MAC propagation, nothing to do with Ceph, but likely to happen in large networks).
Since mon_osd_min_down_reporters=2, when Host2 and Host3 report the OSD down, the OSD is declared DOWN,
but then we receive the report from Host1, and therefore the OSD is UP again.
And so on. This produces constant flapping and degrades the control plane to the point of making it unusable.
2024-08-04T20:02:07.822Z
<jmguzman> IMHO, the problem is: there is a threshold to declare the OSD down (mon_osd_min_down_reporters), but no threshold to declare it up.
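
The asymmetry described above can be sketched with a toy model. This is only an illustration of the reasoning in the message, not Ceph's actual monitor logic: the function name `simulate_flapping`, the round-based loop, and the rule that a single message from the OSD's side is enough to mark it up are all assumptions made for the example.

```python
def simulate_flapping(reporting_hosts, min_down_reporters, rounds):
    """Toy model (hypothetical, not Ceph code): return the OSD state per round."""
    state = "up"
    history = []
    for _ in range(rounds):
        if state == "up":
            # Marking down requires enough distinct hosts reporting a failure.
            if len(set(reporting_hosts)) >= min_down_reporters:
                state = "down"
        else:
            # Marking up has no threshold in this model: one report suffices.
            state = "up"
        history.append(state)
    return history


# Host2 and Host3 cannot reach OSD1, so both keep reporting it down.
history = simulate_flapping(["host2", "host3"], min_down_reporters=2, rounds=6)
print(history)  # ['down', 'up', 'down', 'up', 'down', 'up']
```

In this model the state alternates every round for as long as the partition lasts, which matches the flapping pattern described.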
2024-08-04T20:03:36.307Z
<jmguzman> For now we are trying to set mon_osd_min_down_reporters={number_of_hosts}, so that the OSD is declared down only when ALL THE HOSTS declare it down.
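
One way to reason about this workaround is to count how many hosts could report the OSD down at all. The sketch below assumes a simplified rule in which a host reports an OSD down only if it cannot reach it; the helper `reaches_threshold` and the reachability map are hypothetical and do not reflect how Ceph actually counts reporters.

```python
def reaches_threshold(can_reach_osd, min_down_reporters):
    """Toy check (hypothetical): do enough hosts report the OSD down?"""
    # Assumption: only hosts that cannot reach the OSD report it down.
    reporters = [host for host, ok in can_reach_osd.items() if not ok]
    return len(reporters) >= min_down_reporters


# Partition from the scenario: Host1 still reaches its own OSD1,
# while Host2 and Host3 do not.
reach = {"host1": True, "host2": False, "host3": False}

for threshold in (2, len(reach)):
    print(threshold, reaches_threshold(reach, threshold))
# 2 True
# 3 False
```

Under these assumptions, a threshold equal to the number of hosts is never met during the partition, so the OSD would stay up. Whether a genuinely failed OSD would still collect that many reporters is a separate question that this sketch does not answer.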
2024-08-04T20:04:00.935Z
<jmguzman> But I am not sure of the consequences of this...
2024-08-04T20:04:09.775Z
<jmguzman> What do you think?
