ceph - cephadm - 2024-09-19

Timestamp (UTC) | Message
2024-09-19T10:07:40.268Z
<Benard> @Eugen Block  There is one user present on the hypervisor. This user has a UID of `64045`. I believe this is the user used by the existing juju deployment to deploy ceph osds.

The other user is the ceph user used by the cephadm containers. It has a UID of `167` and exists only in the container. It seems that when the ceph container starts and chowns files and directories, the `dm-*` devices have their UID/GID set to the baremetal ceph user(`64945`) instead of the container ceph user(`164`) which leads to this permissions problem
2024-09-19T10:08:35.037Z
<Benard> If I remove the package and the user from the baremetals, it will kill all the OSDs on that node, which I ideally don't want to do. I would prefer to migrate 1 daemon at a time and not 1 node at a time
2024-09-19T11:39:40.575Z
<Eugen Block> I'm not sure if those are typos, but you mention 4 different UIDs (64045, 64945, 167, 164). Anyway, I think you'll have to deal with this sooner or later, I mean getting rid of the juju-controlled ceph user. I understand that you don't want to take down an entire host, but hopefully "host" is your failure domain, so if you set the host to noout, you could safely adopt the OSDs (probably). But everything you try with this setup is kind of an experiment anyway. You could try to modify the cephadm file (/usr/sbin/cephadm) on each host to set the uid/gid to the previous one. But with the first upgrade you'll get a different container version, stored in /var/lib/ceph/{FSID}/cephadm.**********, which doesn't contain your ceph user's uid. I haven't had such a case yet, but I still think removing juju's ceph user is the best approach. If you have a test environment, try that there first though.
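
For reference, the per-host flow Eugen describes (set noout for the host, then adopt the legacy OSDs one at a time) roughly looks like the sketch below. This is only a sketch based on the upstream `cephadm adopt --style legacy` adoption path; the hostname `ceph-node1` and the OSD IDs are placeholders, not values from this conversation.

```bash
# Placeholder host name and OSD IDs; adjust for your environment.
# Prevent rebalancing while this host's OSDs restart:
ceph osd set-group noout ceph-node1

# Adopt each legacy (non-cephadm) OSD on this host, one at a time:
for id in 3 7 11; do
    cephadm adopt --style legacy --name "osd.${id}"
    # Rough check that the OSD came back up before moving on:
    ceph osd tree | grep "osd.${id}"
done

# Once all OSDs on the host are up and in again:
ceph osd unset-group noout ceph-node1
ceph -s
```
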
2024-09-19T11:45:20.705Z
<Benard> Pardon, those were typos, let me correct that now
2024-09-19T11:45:35.858Z
<Benard> @Eugen Block  There is one user present on the hypervisor. This user has a UID of `64045`. I believe this is the user used by the existing juju deployment to deploy ceph osds.

The other user is the ceph user used by the cephadm containers. It has a UID of `167` and exists only in the container. It seems that when the ceph container starts and chowns files and directories, the `dm-*` devices have their UID/GID set to the baremetal ceph user(`64045`) instead of the container ceph user(`167`) which leads to this permissions problem
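
A quick way to confirm the mismatch Benard describes is to compare the numeric owner of the `dm-*` device nodes on the host with the UID the containerized ceph user maps to. A minimal sketch, assuming a podman-based cephadm deployment; the `<FSID>` path component is a placeholder:

```bash
# Numeric UID/GID of the device nodes as the host sees them
# (would show 64045 if the legacy juju ceph user still owns them):
ls -ln /dev/dm-*

# UID/GID of the ceph user inside the cephadm container
# (expected to be 167 on the upstream images):
sudo cephadm shell -- id ceph

# Ownership of the OSD data directories managed by cephadm
# (<FSID> is a placeholder for the cluster fsid):
ls -ln /var/lib/ceph/<FSID>/
```
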
2024-09-19T11:51:02.419Z
<Benard> So to clarify, what you are suggesting is that I redeploy an entire host? If I delete the existing juju ceph user and remove the `ceph-osd` package as is recommended, I am basically taking down an entire host and then will have to rebuild it from scratch. Is this the recommended approach?
2024-09-19T11:54:50.760Z
<Eugen Block> I don't know how juju works, tbh. I don't mean rebuilding the host OS-wise, just getting rid of the 64045 user so cephadm can take over the existing OSDs. Would it be an option to take a look together to get a better impression of the environment? Just not this week; next week would work if that's an option for you.
2024-09-19T12:09:36.267Z
<Benard> Perhaps we can do that, I will DM you
2024-09-19T13:40:29.351Z
<Benard> I have figured out the problem. Turns out there were udev rules that were resetting the owners of those devices every time they are accessed 🙃
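
For anyone hitting the same symptom: a udev rule can reassert `OWNER`/`GROUP` on a device node every time a change event fires, silently undoing a manual `chown`. A hedged sketch of how to track down and dry-run such a rule; the rule directories are the standard udev locations and `dm-0` is a placeholder device (the offending rule in this case was presumably shipped with the legacy ceph-osd packaging):

```bash
# Find rules that set an owner/group on device-mapper or ceph devices:
grep -rn "OWNER\|GROUP" /etc/udev/rules.d /run/udev/rules.d /usr/lib/udev/rules.d | grep -i "dm\|ceph"

# Dry-run udev processing for one device to see which rules match
# and what ownership they would apply (dm-0 is a placeholder):
udevadm test /sys/block/dm-0 2>&1 | grep -i "owner\|group\|rules.d"

# After removing or adjusting the offending rule, reload and re-trigger:
udevadm control --reload-rules
udevadm trigger /dev/dm-0
```
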
2024-09-19T14:08:04.974Z
<Eugen Block> Ooohhh, nice catch! I didn’t think of udev, tbh. But as soon as I threatened to take a look, you fixed it, so… do I get some credit? 😂 😉
2024-09-19T14:08:35.102Z
<Benard> You do indeed 😁
