ceph - cephadm - 2024-09-20

Timestamp (UTC) | Message
2024-09-20T10:41:58.963Z
<cz tan> hi all, for OSDs deployed with cephadm, where does the OSD read its configuration from: the value set with ceph config set global xxx, or the default value in global.yaml.in? I've found that options which default to false in global.yaml.in still read as false even after setting them to true with ceph config set, unless they are also set to true in ceph.conf. thanks
2024-09-20T10:55:36.464Z
<Eugen Block> Can you please provide an example of a config option that doesn't work as you expect? Maybe you're trying to change options that require a daemon restart? `ceph config help <option>` usually shows whether a daemon restart is required.
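For reference, a minimal sketch of that check, the central override, and verifying what a daemon actually sees; the option name here is only an illustration, not necessarily the one being discussed:
```# Show the description, default, and whether the option can change at runtime
ceph config help osd_scrub_auto_repair

# Set it centrally in the cluster config database (preferred with cephadm)
ceph config set global osd_scrub_auto_repair true

# Check the value a specific daemon is actually running with
ceph config show osd.0 osd_scrub_auto_repair```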
2024-09-20T12:25:39.817Z
<Ken Carlile> so I've basically put my ceph cluster through hell over the last day or two by having it add a bunch of extraneous OSDs, then trying to remove them, but also marking them all out, so it's all sorts of unhappy. There were 3 failed HDDs in there as well
2024-09-20T12:27:16.934Z
<Ken Carlile> So I've let it sit overnight, reactivating PGs and rm/zapping the incorrect OSDs. It seems to have stalled out somewhere along the line. I still have around 30 OSDs left in the ceph orch osd rm status display, and 425 pgs inactive. But it seems to have kind of given up.
2024-09-20T12:28:37.198Z
<Ken Carlile> If I try to stop any of the osd rms, it says:
```Unable to find OSD in the queue: osd.766```
2024-09-20T12:29:00.302Z
<Ken Carlile> If I try to start an rm on the same drive, I get:
```Unable to find OSDs: ['osd.766']```
2024-09-20T12:29:19.109Z
<Ken Carlile> but osd.766 still appears in ceph orch ps, as well as on the host under podman ps
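For context, the checks and removal commands referenced above look roughly like this; osd.766 is the ID from the conversation, and the flags shown are the documented ones:
```# State of the cephadm removal queue
ceph orch osd rm status

# Queue / stop a removal (cephadm expects the numeric OSD ID)
ceph orch osd rm 766 --zap
ceph orch osd rm stop 766

# Where the daemon still shows up
ceph orch ps | grep osd.766
podman ps | grep osd.766     # on the OSD's host```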
2024-09-20T12:29:47.447Z
<Ken Carlile> so I'm kind of at a loss as to what to do
2024-09-20T12:30:31.079Z
<Ken Carlile> Looking at ceph -s, I see that there are no pgs being backfilled or waiting on backfill, just those 425 that are sitting at activating (15) or activating+remapped (410)
2024-09-20T12:32:34.661Z
<Ken Carlile> I am also seeing slow ops reported on two OSDs (which are not ones that I am removing). I don't see any underlying errors on the HDDs/SSDs for those, and while restarting the containers with podman has cleared slow ops whenever I've seen them during this process, it doesn't seem to be clearing these right now.
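A quick sketch of checking for slow ops and bouncing a daemon through the orchestrator rather than podman directly; the OSD ID is a placeholder:
```# Which daemons are reporting slow ops
ceph health detail | grep -i slow

# Restart a daemon via cephadm instead of podman on the host (placeholder ID)
ceph orch daemon restart osd.123```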
2024-09-20T12:43:00.019Z
<Eugen Block> Ken: have you failed the mgr? It tends to get overloaded and give up at some point, resulting in all kinds of wrong numbers.
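Failing over the active mgr is a single command; a standby takes over (the daemon name below is a placeholder):
```# Fail over whichever mgr is currently active
ceph mgr fail

# Or fail a specific daemon
ceph mgr fail mgr.host1.abcdef```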
2024-09-20T12:43:20.072Z
<Ken Carlile> I have not, I will try that.
2024-09-20T12:46:20.326Z
<Ken Carlile> oh yeah, it's so far gone it thinks that all 5 mgrs are active
2024-09-20T12:51:27.977Z
<Ken Carlile> ...ok, now it's telling me that the orch module is not found
2024-09-20T12:53:57.169Z
<Eugen Block> check the startup log of the new active MGR, it should tell you why (hopefully)
2024-09-20T12:56:09.861Z
<Ken Carlile> It's saying no module cephadm
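A quick way to see what the mgr thinks about its modules (it won't fix a module that fails to load, but it shows whether cephadm is enabled at all):
```# List always-on, enabled, and disabled mgr modules
ceph mgr module ls

# (Re-)enable the module if it is merely disabled
ceph mgr module enable cephadm```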
2024-09-20T12:57:15.002Z
<Eugen Block> oh, which ceph version is this?
2024-09-20T12:57:40.763Z
<Ken Carlile> Reef 18.2.4
2024-09-20T12:58:22.066Z
<Eugen Block> could be this [issue](https://tracker.ceph.com/issues/67329) you're hitting.
2024-09-20T12:59:11.027Z
<Ken Carlile> ooh, fun!
2024-09-20T12:59:13.733Z
<Ken Carlile> thanks
2024-09-20T12:59:49.794Z
<Ken Carlile> now I just have to figure out the workaround...
2024-09-20T13:01:33.133Z
<Eugen Block> you'll have to edit the config-key (json), I did write it here in slack somewhere as well, let me quickly search
2024-09-20T13:03:01.465Z
<Eugen Block> Do you have original_weight in this output?
```ceph config-key get mgr/cephadm/osd_remove_queue```
If you do, modify the config-key (make a backup first) by removing the keys/values for original_weight with:
```ceph config-key set …```
Then fail the mgr.
Remove only those entries, nothing else.
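A minimal sketch of that workaround; the file names and the jq step are illustrative, and it assumes the stored value is a JSON array of OSD entries, which matches the traceback shown further down:
```# Back up the current removal queue
ceph config-key get mgr/cephadm/osd_remove_queue > osd_remove_queue.json.bak

# Strip the offending key from every entry (assumes a JSON array of OSD objects)
jq 'map(del(.original_weight))' osd_remove_queue.json.bak > osd_remove_queue.json

# Write it back and fail over the mgr so cephadm reloads it
ceph config-key set mgr/cephadm/osd_remove_queue -i osd_remove_queue.json
ceph mgr fail```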
2024-09-20T13:03:52.669Z
<Ken Carlile> looks like it
2024-09-20T13:07:38.060Z
<Ken Carlile> okie dokie, let's see
2024-09-20T13:07:50.433Z
<Ken Carlile> shit, I missed that last line.
2024-09-20T13:12:01.843Z
<Ken Carlile> put that back..
2024-09-20T13:18:36.026Z
<Ken Carlile> removed those entries, but no dice on the orch running
2024-09-20T13:18:55.626Z
<Eugen Block> did you fail the mgr again?
2024-09-20T13:19:30.549Z
<Ken Carlile> yup, done it a couple of times now
2024-09-20T13:19:59.704Z
<Eugen Block> then you might want to increase debug level and catch some debug logs from mgr startup
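Roughly, that looks like the following; the mgr daemon name is a placeholder:
```# Raise mgr debug logging, then trigger a fresh startup
ceph config set mgr debug_mgr 20
ceph mgr fail

# On the host running the new active mgr, read its startup log
cephadm logs --name mgr.host1.abcdef

# Revert to the default log level afterwards
ceph config rm mgr debug_mgr```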
2024-09-20T13:20:26.021Z
<Ken Carlile> yuppers, ok
2024-09-20T13:21:19.148Z
<Ken Carlile> it's gotta be that... and I must have missed one or two:
```2024-09-20T13:19:18.934+0000 7f4af7532640 -1 mgr load Failed to construct class in 'cephadm'
2024-09-20T13:19:18.934+0000 7f4af7532640 -1 mgr load Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 619, in __init__
    self.to_remove_osds.load_from_store()
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 924, in load_from_store
    osd_obj = OSD.from_json(osd, rm_util=self.rm_util)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 789, in from_json
    return cls(**inp)
TypeError: __init__() got an unexpected keyword argument 'original_weight'```
2024-09-20T13:22:56.921Z
<Eugen Block> looks like it, yes
2024-09-20T13:23:38.629Z
<Ken Carlile> yup, there were 2
2024-09-20T13:23:50.611Z
<Ken Carlile> yay!
2024-09-20T13:23:54.193Z
<Ken Carlile> works now. thank you very much!
2024-09-20T13:24:32.358Z
<Eugen Block> no problem.
2024-09-20T13:24:42.171Z
<Ken Carlile> now to see if it actually starts rm'ing the OSDs. 🙂
2024-09-20T13:27:48.549Z
<Eugen Block> by the way, just a few thoughts. there are a couple of ways to prevent rebalancing when adding new OSDs. you can set crush_initial_weight to 0, so you can inspect your new OSDs first and zap/redeploy them if something is off. Then there's the --dry-run flag for ceph orch apply, that can come in handy as well. Another way would be to use (a) different crush root(s) for newly deployed hosts. But IIUC, you added new OSDs to existing hosts, so in this case a different crush root wouldn't be suitable. But the initial weight = 0 could have been helpful, we use that in our own cluster.
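A sketch of the first two suggestions; the config option is osd_crush_initial_weight, and the spec file name and weight value below are placeholders:
```# New OSDs come up with CRUSH weight 0, so no data moves until you reweight them
ceph config set osd osd_crush_initial_weight 0

# Preview what an OSD spec would deploy before actually applying it
ceph orch apply -i osd_spec.yaml --dry-run

# Once a new OSD checks out, give it its real weight (typically the size in TiB)
ceph osd crush reweight osd.766 9.1```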
2024-09-20T13:28:07.019Z
<Ken Carlile> ok, good to know
2024-09-20T13:29:59Z
<Ken Carlile> unfortunately, all of this doesn't seem to have revived the activating PGs or the OSD removals.
2024-09-20T13:31:58.449Z
<Ken Carlile> although I can see why from the logs. the PGs are getting in the way of the rms
2024-09-20T13:32:03.985Z
<Ken Carlile> so perhaps patience is in order
2024-09-20T13:37:25.975Z
<Eugen Block> I'll be heading into the weekend now, patience is always in order 😉
2024-09-20T18:08:40.388Z
<Ken Carlile> I've stopped all the OSD removals except for the 3 with bad HDDs, which brought most of the PGs back to life. I still have 55 that are stuck in activating+remapped (for between 15 and 28 hours), and they are just not moving
2024-09-20T18:08:54.028Z
<Ken Carlile> oddly, only 2 of those 55 are on OSDs that are marked down and have failed hdds
2024-09-20T18:13:25.798Z
<Ken Carlile> I can't _see_ anything wrong with any of them.
```"recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2024-09-20T01:31:42.466906+0000",
            "might_have_unfound": [],
            "recovery_progress": {
                "backfill_targets": [
                    "35(6)",
                    "298(9)",
                    "460(7)"
                ],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "recovery_ops": [],
                    "read_ops": []
                }
            }
        },
        {
            "name": "Started",
            "enter_time": "2024-09-20T01:31:41.598720+0000"
        }
    ],```
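For reference, the kind of commands that produce output like the recovery_state above; the PG ID is a placeholder:
```# List PGs stuck inactive (e.g. activating+remapped)
ceph pg dump_stuck inactive

# Full peering/recovery state for a single PG (placeholder PG ID)
ceph pg 11.2f query

# Health summary, including slow ops and inactive PG warnings
ceph health detail```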
