2024-10-15T08:59:39.173Z | <Laimis Juzeliūnas> Hi community - seeking some help with a Ceph upgrade from 18.2.4 Reef to 19.2.0 Squid.
We started the upgrade after turning off the autoscaler, as suggested in the official blog, via `ceph orch upgrade start --image quay.io/ceph/ceph:v19.2.0`, and it got stuck during the very first mgr upgrade. For some reason the orchestrator cannot function as it should; all orch commands just started throwing error ENOTSUP:
```root@ceph-node001 ~ # ceph orch ps
Error ENOTSUP: Module 'orchestrator' is not enabled/loaded (required by command 'orch ps'): use `ceph mgr module enable orchestrator` to enable it
root@ceph-node001 ~ # ceph mgr module enable orchestrator
module 'orchestrator' is already enabled (always-on)```
After a while commands simply get stuck, with the mgr log showing this entry:
```debug 2024-10-15T08:54:49.115+0000 7f0789d3e640 0 log_channel(audit) log [DBG] : from='client.21754627 -' entity='client.admin' cmd=[{"prefix": "orch upgrade status", "target": ["mon-mgr", ""]}]: dispatch```
`ceph versions` output shows that the managers are already on 19.2.0.
Any hints on what we could check to proceed? |
2024-10-15T09:05:56.627Z | <Laimis Juzeliūnas> after some time I get some orch output, but eventually commands still fall back to the same error:
```debug 2024-10-15T09:05:24.130+0000 7f5647f18640 -1 mgr.server reply reply (95) Operation not supported Module 'orchestrator' is not enabled/loaded (required by command 'orch upgrade status'): use `ceph mgr module enable orchestrator` to enable it``` |
2024-10-15T09:13:02.279Z | <Laimis Juzeliūnas> logs here: <https://pastebin.com/FMEFDV8Y> |
2024-10-15T11:49:55.221Z | <Brian P> Why would you use untested Ceph so early? |
2024-10-15T13:56:17.694Z | <Adam King> I've seen this happen recently, although in all cases it only occurred just after a mgr startup/restart (I mostly saw it in bootstrap scenarios). I would expect in this case the upgrade could still complete and once cephadm is done restarting daemons this would stop happening. I think it's some bug with the module loading procedure during the mgr startup, but haven't been able to root cause it. |
2024-10-15T14:14:01.583Z | <Laimis Juzeliūnas> Thanks Adam for the reply, we managed to get past this by turning off the balancer and restarting the mgr daemons |
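For anyone hitting the same issue, a minimal sketch of that workaround on a cephadm cluster; `ceph mgr fail` is one way to bounce the active mgr, and the fsid/daemon name in the systemd unit are placeholders:
```# stop the balancer module's background activity
ceph balancer off
# fail over to a standby mgr (one way to restart the active mgr)
ceph mgr fail
# or, if ceph/orch commands hang, restart the mgr containers directly on their hosts
systemctl restart ceph-<fsid>@mgr.<hostname>.<id>.service```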
2024-10-15T14:14:57.097Z | <Adam King> interesting, didn't think the balancer would be involved at all. Glad you're past it anyhow. |
2024-10-15T18:36:31.569Z | <Joshua Blanch> @Adam King For the issue I mentioned during the meeting today <https://tracker.ceph.com/issues/68514#change-278471>, you mentioned a hack I can do to change the unit file. I'll post more in this thread. |
2024-10-15T18:37:29.664Z | <Joshua Blanch> So this is the unit.meta file for an OSD created by a device path:
```{
    "service_name": "osd",
    "ports": [],
    "ip": null,
    "deployed_by": [
        "quay.io/ceph/ceph@sha256:ac06cdca6f2512a763f1ace8553330e454152b82f95a2b6bf33c3f3ec2eeac77",
        "quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906"
    ],
    "rank": null,
    "rank_generation": null,
    "extra_container_args": null,
    "extra_entrypoint_args": null,
    "memory_request": null,
    "memory_limit": null
}``` |
2024-10-15T18:38:06.731Z | <Joshua Blanch> would the idea be to add a service_id and then refresh the orchestrator to pick it up? |
2024-10-15T18:39:02.267Z | <Adam King> yeah, so basically change that line to `"service_name": "osd.foo",` or whatever, run `ceph orch ps --refresh` and once cephadm goes and refreshes the daemons on that host it will put the osd in the `osd.foo` service |
2024-10-15T19:06:16.104Z | <Laimis Juzeliūnas> someone has to do it 🤷 |
2024-10-15T19:12:11.346Z | <Laimis Juzeliūnas> but on a more serious note we were very interested in the new Monitoring: RGW S3 Analytics for metrics per bucket/user |
2024-10-15T19:29:53.822Z | <Joshua Blanch> I didn't have luck getting it to work, but `ceph orch ls` did seem to pick it up; this is on 18.2.4:
```root@ceph-test-1:/# ceph orch ls
NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager   ?:9093,9094  1/1      3m ago     2h   count:1
crash                       4/4      3m ago     2h   *
grafana        ?:3000       1/1      3m ago     2h   count:1
mgr                         2/2      3m ago     2h   count:2
mon                         4/5      3m ago     2h   count:5
node-exporter  ?:9100       4/4      3m ago     2h   *
osd.foo                     1        3m ago     -    <unmanaged>
osd.test                    1        3m ago     -    <unmanaged>
prometheus     ?:9095       1/1      3m ago     2h   count:1
root@ceph-test-1:/# ceph orch set-managed osd.test
No service of name osd.test found. Check "ceph orch ls" for all known services
ceph orch ls --export
service_type: osd
service_id: test
service_name: osd.test
unmanaged: true
spec:
filter_logic: AND
objectstore: bluestore```
I wonder if the service_name just doesn't get saved into self.spec_store.all_specs.keys(), as I don't see any calls to spec_store.save() or anything similar:
<https://github.com/ceph/ceph/blob/7e7aac11cd215de37ed990a494fdcd018f225c55/src/pybind/mgr/cephadm/module.py?plain=1#L2524-L2526>
I was hacking on something similar, trying to give OSD daemons of this type a service_id on creation, but you have to call spec_store.save() somewhere during OSD creation. Although this isn't really a fix, since migration would be tricky for already-existing OSDs, unless we introduce a method to combine/transfer service_names, if that makes sense.
```drive_group = DriveGroupSpec(
    service_id='default',
    placement=PlacementSpec(host_pattern=host_name),
    method=method,
    **drive_group_spec,
)```
<https://github.com/ceph/ceph/blob/7e7aac11cd215de37ed990a494fdcd018f225c55/src/pybind/mgr/orchestrator/module.py?plain=1#L1397-L1401> |
2024-10-15T19:31:11.349Z | <Adam King> if you just pass any `osd.foo` spec file, even an unmanaged one that matches no disks, you might be able to run service commands on that |
2024-10-15T19:31:47.796Z | <Adam King> but yeah. once a proper command to change the spec affinity is created, we'll probably want to create a service to match as part of the process |
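A minimal sketch of such a dummy spec, reusing the hypothetical `osd.foo` name from above; the host pattern and device filter are placeholders intended to match nothing, and `unmanaged: true` keeps cephadm from acting on it:
```# hypothetical osd.foo spec that exists only so service-level commands have a target
cat > osd-foo.yaml <<'EOF'
service_type: osd
service_id: foo
unmanaged: true
placement:
  host_pattern: 'no-such-host-*'
spec:
  data_devices:
    all: true
EOF
ceph orch apply -i osd-foo.yaml```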