ceph - cephadm - 2024-09-18

Timestamp (UTC) | Message
2024-09-18T12:54:23.348Z
<Benard> I have done a bit more analysis on this and found something interesting. I ran the ceph container myself and got a shell on it. When I check dm-0 I can see that it is owned by `64045`, which is `ceph` on the bare-metal server. This of course doesn't map to ceph in the container, so I get a permission denied error.

I chowned the dir to `167:167` myself and ran `ceph-osd ...` again and got the same error. I can see that starting ceph-osd chowns that directory to the wrong `64045` user! Is this a bug or am I missing something?
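For reference, a rough sketch of reproducing that ownership check by hand; the image tag comes from the listing later in the thread, and the bind-mounted path is an assumption:
```# Shell into the OSD image with the data dir mounted; adjust paths for your OSD
docker run --rm -it --entrypoint /bin/bash \
  -v /var/lib/ceph/osd/ceph-0:/var/lib/ceph/osd/ceph-0 \
  quay.io/ceph/ceph:v16
# Inside the container:
id ceph                          # uid=167(ceph) gid=167(ceph) in the image
ls -ln /var/lib/ceph/osd/ceph-0  # numeric owners; 64045 is the host's ceph uid```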
2024-09-18T13:06:47.871Z
<Brian P> `cephadm` is a bug, if that is what you are asking

Have you checked what `unit.run` does?
2024-09-18T13:08:04.504Z
<Benard> Yes, that is where I got all the container commands from. It essentially runs a bunch of docker containers to prepare the directories and then runs the main OSD in a docker container. The problem command is essentially:
`ceph-osd -n osd.0 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true`
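(For anyone following along: the generated unit.run for an adopted daemon lives under the cluster fsid directory; `<fsid>` below is a placeholder, not a literal path.)
```# cephadm writes the full container invocation here
cat /var/lib/ceph/<fsid>/osd.0/unit.run```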
2024-09-18T13:12:41.964Z
<Brian P> I believe this issue is because of running a rhel8 container on ubuntu, also related to uids.
But I don't recall more details.
2024-09-18T13:13:21.826Z
<Brian P> Your friend `juju` installs ceph packages and that is the trigger.
2024-09-18T13:22:22.758Z
<Benard> But I would have thought that cephadm would be able to handle existing ceph packages on the host, since it's able to migrate from other deployments to cephadm?
2024-09-18T13:23:30.533Z
<Brian P> That is a big leap of faith to have in Python code from upstream
2024-09-18T13:26:09.457Z
<Benard> Is cephadm not the de facto and recommended way of deploying ceph now?
2024-09-18T13:27:09.819Z
<Brian P> By whom? Do you know of any big cluster using it?
2024-09-18T13:28:40.297Z
<Brian P> Are you running an old version with that bug though?

Does your unit.run have something like this at the beginning?
```/usr/bin/install -d -m0770 -o 167 -g 167 /var/run/ceph/...```
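A quick way to check for that line across the adopted daemons on the host, assuming the standard cephadm layout:
```grep -n '/usr/bin/install' /var/lib/ceph/*/osd.*/unit.run```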
2024-09-18T13:29:55.005Z
<Benard> It does
2024-09-18T13:30:10.286Z
<Benard> What alternative would you recommend over cephadm?
2024-09-18T13:33:14.677Z
<Brian P> The only 'tricky' part is bootstrapping the cluster, which can be done step by step in <5 minutes.

Day 1 and so on for Ceph should be done manually and with simple packages, since every single 'automation' has been mediocre at best.
Bash gets you a long way, no need to glue things together.
But that is just me.
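(For context, the package-based workflow being described would look roughly like this on Ubuntu; the device name is only an example.)
```# Sketch only: plain packages plus ceph-volume, no orchestrator
apt install ceph-osd
ceph-volume lvm create --data /dev/sdb   # creates and starts ceph-osd@<id>
systemctl status ceph-osd@0```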
2024-09-18T13:34:35.029Z
<Benard> Is there a bug I can look at regarding the permissions?
2024-09-18T13:36:10.033Z
<Brian P> Do these match?
```ls -ld /var/lib/ceph/
drwxr-x--- 12 ceph ceph 4096 Jan 1  1970 /var/lib/ceph/```
```ls -ld /var/lib/ceph/YOUR_CLUSTER/
drwx------ 108 167 167 4096 Jan 1  1970 /var/lib/ceph/YOUR_CLUSTER/```
2024-09-18T13:36:47.461Z
<Brian P> AFAICT this is a `chmod` issue
2024-09-18T13:37:43.676Z
<Benard> Mine looks exactly like the above. ceph:ceph for /var/lib/ceph and 167:167 for /var/lib/ceph/UUID
2024-09-18T13:45:28.998Z
<Brian P> This is just failing the `adopt` command? (does that even work?)
Can you try doing the whole host with `ceph cephadm osd activate $HOST` instead?
2024-09-18T13:46:30.527Z
<Benard> I did that already and, to cut a long story short, it did not work
2024-09-18T13:47:27.348Z
<Brian P> Is cephadm using an ancient developer image?
2024-09-18T13:48:28.797Z
<Brian P> Since this cluster was migrated into cephadm, right?
I would try 'upgrading' it to the same version, basically it will restart whatever you already migrated.
2024-09-18T13:50:02.406Z
<Brian P> Update its internal configs, because it probably does not respect:
`ceph config get global container_image`
2024-09-18T13:50:16.896Z
<Benard> I tried to import an OSD from legacy config to cephadm using `cephadm adopt --style legacy --name osd.0`. However, the disk fails to start, and when I look in the logs I see:
```debug 2024-09-13T16:36:30.778+0000 7ff0b1e45380  1 bdev(0x5563e27f4400 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
debug 2024-09-13T16:36:30.778+0000 7ff0b1e45380 -1 bdev(0x5563e27f4400 /var/lib/ceph/osd/ceph-0/block) open open got: (13) Permission denied
debug 2024-09-13T16:36:30.778+0000 7ff0b1e45380  1 bdev(0x5563e27f4400 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
debug 2024-09-13T16:36:30.778+0000 7ff0b1e45380 -1 bdev(0x5563e27f4400 /var/lib/ceph/osd/ceph-0/block) open open got: (13) Permission denied
debug 2024-09-13T16:36:30.778+0000 7ff0b1e45380  0 osd.0:0.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)```
The dir is owned by `ceph:ceph` so I am not sure what is wrong with it. Anyone seen this before?
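Worth noting the failure in that log is on opening the `block` path, which is normally a symlink to an LVM/dm device, so the device node's ownership matters too; a quick check (the dm-0 name is the one mentioned earlier, not a given):
```ls -l  /var/lib/ceph/osd/ceph-0/block   # the symlink itself
ls -lL /var/lib/ceph/osd/ceph-0/block   # dereferenced: the device it points at
ls -l  /dev/dm-0                        # device node ownership on the host```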
2024-09-18T13:54:56.985Z
<Benard> So far the mons and mgrs have been migrated. I tried to migrate the OSDs using the above methods to no avail.

I am not sure what you mean by upgrading it; I am using v16/pacific, which both the 'old' and the 'new' cluster use.
2024-09-18T13:56:18.939Z
<Brian P> Can you list the container images on `strg-bm`? Just to confirm ONLY current good images are being used.

By upgrading I meant running `ceph orch upgrade`
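For the archive, re-running the upgrade against the version already in use might look like this; the tag is an assumption based on the image list below:
```ceph orch upgrade start --image quay.io/ceph/ceph:v16
ceph orch upgrade status```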
2024-09-18T13:57:30.288Z
<Benard> ```root@strg0-bm:/dev# docker image ls
REPOSITORY          TAG                    IMAGE ID       CREATED        SIZE
ubuntu              latest                 edbfe74c41f8   6 weeks ago    78.1MB
quay.io/ceph/ceph   v16                    3c4eff6082ae   3 months ago   1.19GB
ceph/daemon-base    latest-pacific-devel   41387741ad94   3 years ago    1.2GB```
2024-09-18T13:58:07.090Z
<Brian P> Does `adopt` use `daemon-base` ?
2024-09-18T13:59:18.340Z
<Benard> Doesn't seem to be:
```root@strg0-bm:/var/lib/ceph/osd# cephadm adopt --style legacy --name osd.0
Pulling container image quay.io/ceph/ceph:v16...
Found online OSD at //var/lib/ceph/osd/ceph-0/fsid
objectstore_type is bluestore
...```
2024-09-18T14:00:35.072Z
<Brian P> If you delete that ancient image and run adopt, does it show up again?
2024-09-18T14:02:08.980Z
<Benard> Let me try that
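A rough sketch of that test, using the image name from the listing above:
```docker rmi ceph/daemon-base:latest-pacific-devel
cephadm adopt --style legacy --name osd.0
docker image ls | grep daemon-base   # did the devel image come back?```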
2024-09-18T14:16:41.162Z
<Eugen Block> Run `cephadm --image <your-required-image> adopt ...` or overwrite your global `container_image` to get rid of the latest-pacific-devel image, otherwise it will pull it every time
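Spelled out, the two options might look like this; the 16.2.15 tag is an assumption based on the version mentioned further down:
```# Force the image for this adopt run only
cephadm --image quay.io/ceph/ceph:v16.2.15 adopt --style legacy --name osd.0
# Or pin the default image cluster-wide
ceph config set global container_image quay.io/ceph/ceph:v16.2.15```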
2024-09-18T15:29:42.190Z
<Benard> I just tried the above to no avail.

@Brian P I can confirm that image does not reappear. I am pretty sure it is only used by the mgr to periodically scan for osds or something similar, but adopt does not use it.

@Eugen Block I do not have container_image set (unless there is some default value that is set somewhere). Regardless, even when I explicitly use the latest 16.2.15 image, the issue persists.
2024-09-18T16:40:20.013Z
<Eugen Block> Who are the two users exactly? Can you show their entries from /etc/passwd? Maybe 167 was already in use by the juju-controlled cluster? If that is the case, I’d probably remove the respective packages, ensure the users are gone as well, then reinstall ceph and see if it shows the correct user id, usually 167.
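A quick way to answer that on the host (the names and ids are the ones discussed above):
```getent passwd ceph
getent passwd 167
getent group ceph 167```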
