ceph - cephadm - 2024-09-16

Timestamp (UTC) | Message
2024-09-16T10:02:13.599Z
<Benard> I am trying to add an OSD using cephadm (HDD for data and SSD partition for WAL), but it is complaining that the device is a partition. Is it not possible to add partitions to ceph using cephadm? `stderr ceph-volume lvm batch: error: /dev/sdi2 is a partition, please pass LVs or raw block devices`
2024-09-16T10:03:23.564Z
<Benard> This is using ceph pacific
2024-09-16T10:18:12.353Z
<Eugen Block> There was a [thread](https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/OOHAGFPGLGH5BI26A57UQQNQNIYA7KQP/#N77BSCJP2MI47QPR44IMHLY4R7OSGCGI) on the ceph-users mailing list just recently. If you can, don't use partitions but rather entire devices; it just makes your life easier as a ceph administrator. cephadm in particular is designed to automate as much as possible, and fiddling with partitions always requires some level of manual operation. If you really want to go through with it, you can create a PV and LV on that partition and then create your OSD on those. But replacing disks won't be as easy as it is when your OSDs use entire disks.
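(For reference, a rough sketch of that manual PV/LV route; all VG/LV names are placeholders, and whether the resulting LV can be handed straight to the orchestrator or needs to go through `ceph-volume` directly can depend on the release:)

```
# carve the partition into a PV/VG/LV (placeholder VG/LV names)
pvcreate /dev/sdi2
vgcreate ceph-db-vg /dev/sdi2
lvcreate -l 100%FREE -n ceph-db-lv ceph-db-vg

# the LV can then be passed instead of the raw partition, e.g. (hedged, release-dependent):
#   ceph-volume lvm prepare --data /dev/sda --block.db ceph-db-vg/ceph-db-lv
```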
2024-09-16T11:01:14.390Z
<Benard> Thanks @Eugen Block. But what about the use case where I want to use another device for the WAL? This significantly improves performance of HDD OSDs, and I can't use an entire device for the WAL as that would result in a massive amount of wasted space.
2024-09-16T11:01:51.242Z
<Benard> I would have thought that is a fairly common way of deploying OSDs. It is mentioned extensively in the documentation
2024-09-16T11:07:01.010Z
<Eugen Block> correct, but then the entire WAL/DB device is also divided into several LVs to provide faster DB storage for multiple OSDs
2024-09-16T11:08:01.272Z
<Benard> So does this mean that if I want to do this with cephadm I have to manually add all the LVs/VGs before I can add the OSD?
2024-09-16T11:08:11.298Z
<Benard> Specifically for the DB device
2024-09-16T11:09:23.954Z
<Eugen Block> no, that works well with cephadm. if you have let's say 10 HDDs and 2 SSDs, you prepare your osd spec so cephadm deploys 5 HDDs that share one SSD for WAL/DB, and the other 5 HDDs share the other SSD among them
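(A minimal OSD service spec along those lines might look like the sketch below; the service_id and host_pattern are placeholders, and the `rotational` filters are what lets cephadm pair the HDDs with DB/WAL LVs carved out of the SSDs:)

```yaml
service_type: osd
service_id: hdd-with-ssd-db        # placeholder name
placement:
  host_pattern: 'strg*'            # placeholder, match your OSD hosts
spec:
  data_devices:
    rotational: 1                  # HDDs hold the data
  db_devices:
    rotational: 0                  # SSDs get split into shared DB/WAL LVs
```

Applied with `ceph orch apply -i osd-spec.yaml`.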
2024-09-16T11:10:49.093Z
<Benard> What if I want to add 1 HDD to the cluster and have it use part of the SSD for the DB device?
2024-09-16T11:11:26.759Z
<Benard> so for example: `ceph orch daemon add osd strg0-bm:data_devices=/dev/sda,db_devices=/dev/sdi2`
2024-09-16T11:11:34.221Z
<Eugen Block> if there's enough room on the SSD, it should work as well
2024-09-16T11:12:30.934Z
<Benard> There is enough room, but unfortunately when I run that command it gives me the above error
2024-09-16T11:13:21.288Z
<Eugen Block> yes, because you use partitions. if you let ceph handle the LV stuff on raw devices, it's gonna be easier
2024-09-16T11:14:25.633Z
<Benard> So should I pass the entire sdi device to that command?
`db_devices=/dev/sdi`
2024-09-16T11:16:50.150Z
<Eugen Block> you can try, I'm not entirely sure, though. Usually we have larger deployments where we have spec files that regulate the db/wal stuff and sizes etc. What you're trying is something I sometimes attempt on test clusters with only a few disks.
2024-09-16T11:20:56.504Z
<Eugen Block> btw, if you create OSDs via the command line, add a parameter for the DB size, otherwise you might end up with a larger DB device than you need, wasting disk space
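(For example, something like the line below; `block_db_size` is the spec-file field for this, whether the inline `ceph orch daemon add` syntax accepts it as an extra key=value pair may depend on the cephadm release, and the size shown is purely illustrative:)

```
ceph orch daemon add osd strg0-bm:data_devices=/dev/sda,db_devices=/dev/sdi,block_db_size=60G
```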
2024-09-16T11:27:34.717Z
<Eugen Block> we have one cluster with a heterogeneous disk layout, which prevents us from using spec files. in that case we also handle OSD creation/replacement manually, and I create the LVs on the SSDs manually as well. replacing disks has often worked well by simply running "pvmove" from the failing disk to the new one. that way we didn't have to rebalance a disk by taking it out etc.
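(Roughly, with placeholder names; the VG here stands in for the one ceph-volume created for the affected OSD, /dev/sdY for the failing disk and /dev/sdX for its replacement:)

```
pvcreate /dev/sdX                      # new disk
vgextend ceph-block-vg /dev/sdX        # add it to the OSD's VG (placeholder VG name)
pvmove /dev/sdY /dev/sdX               # migrate all extents off the failing disk
vgreduce ceph-block-vg /dev/sdY        # drop the old PV from the VG
pvremove /dev/sdY
```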
2024-09-16T11:28:01.172Z
<Benard> That's very interesting, thanks
2024-09-16T11:29:06.456Z
<Benard> Haven't tried using pvmove, I usually just completely wipe the OSD and redeploy it. I will try that next time
2024-09-16T11:29:37.843Z
<Benard> It sounds to me like there is no way around manually creating and managing the PVs/VGs when replacing faulty disks
2024-09-16T11:29:41.506Z
<Eugen Block> that's the beauty of LVM πŸ™‚
2024-09-16T11:29:58.431Z
<Benard> Which is a shame as other solutions like juju can manage redeploying disks like that just fine
2024-09-16T11:32:25.523Z
<Eugen Block> > It sounds to me like there is no way around manually creating and managing the PVs/VGs when replacing faulty disks
as I said, it really depends. I'll try to replay this use case specifically, I don't want to say anything invalid πŸ˜‰
2024-09-16T11:33:57.177Z
<Eugen Block> did you deploy the other OSDs the same way, so manually via cli?
2024-09-16T11:35:40.857Z
<Eugen Block> I **think** that those existing OSDs will show as "unmanaged" in `ceph orch ls osd` output. and if they're unmanaged, there's nothing the orchestrator can do about it in an automated way. that's why spec files can be quite useful if you want osd stuff to be handled automatically
2024-09-16T11:37:01.765Z
<Benard> The other OSDs were deployed using canonical juju. I am currently testing migrating from juju to cephadm (which is not going well). While that was going on I wanted to test just adding a normal OSD and ran into this problem
2024-09-16T11:37:38.519Z
<Benard> The other OSDs currently don't show up as they haven't been migrated:
`osd                                0  4m ago     -    <unmanaged>`
2024-09-16T11:41:34.253Z
<Eugen Block> okay, I can't say anything about juju. but I do have one small cluster where I just recently added OSDs the same way you tried (but without partitions). those show up as unmanaged, as I thought. Those won't be handled by cephadm in any way. But you can overwrite the service spec so those OSDs are covered as well, enabling cephadm to manage them. I guess it makes sense to go into more detail when you've adopted those OSDs from juju.
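(A hedged sketch of what "overwrite the service spec" could look like in practice: export the current OSD spec, or start a new one, adjust its filters so they cover the devices behind the manually added OSDs, and apply it so cephadm treats them as managed:)

```
ceph orch ls osd --export > osd-spec.yaml
# edit osd-spec.yaml: adjust data_devices/db_devices filters, set unmanaged: false if needed
ceph orch apply -i osd-spec.yaml
```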
2024-09-16T11:42:58.215Z
<Benard> I am not sure I understand. Are you saying that if I add an OSD with `ceph orch daemon add osd...` it won't be managed by cephadm?
2024-09-16T11:46:17.576Z
<Eugen Block> I'll respond later, gotta catch a train
2024-09-16T12:18:31.056Z
<Eugen Block> so you can use cephadm to deploy and remove daemons, but since there won't be a service deployed (like with a spec file), it will be unmanaged, hence no automatic handling by cephadm. if you remove such an osd daemon by `ceph orch osd rm`, it will leave a health warning about stray daemons. you'll have to clean up manually.
2024-09-16T12:28:20.304Z
<Benard> so `ceph orch daemon add` doesn't actually add the daemons properly?
2024-09-16T12:34:01.362Z
<Eugen Block> it does, just not for fully automated management. it's also mentioned in the docs, though maybe they could use some clarification on what it actually means. so here's an example I just executed on my test cluster:
β€’ deploy OSDs manually via `ceph orch daemon add`
β€’ OSDs are deployed, but unmanaged
β€’ removing an OSD works (`ceph orch osd rm --replace --force --zap 3`), leaving the OSD in "destroyed" state for replacement, but zapping doesn't work. so I need to zap that osd manually (`cephadm ceph-volume lvm zap --destroy /dev/vdf`)
β€’ since no service is defined, nothing happens to that OSD until I add a daemon again: `ceph orch daemon add osd host:data_devices=/dev/vdg` 
β€’ the "destroyed" OSD is redeployed successfully, but it's more work than with osd service specs in place
2024-09-16T12:37:28.403Z
<Benard> Thank you very much for clarifying Eugen πŸ‘
2024-09-16T12:39:39.744Z
<Benard> So if I have a disk that I want to replace in the ceph cluster after a failure, what is the correct way? Is it to just replace the disk and apply the OSD service spec again?
2024-09-16T12:43:57.492Z
<Eugen Block> if you already have a service spec in place which covers the failing OSDs (depending on your filters for sizes, rotational flags and what-not), you don't need to apply anything. you just run the rm command with the --replace flag (`ceph orch osd rm --replace --force --zap <OSD_ID>`) and the orchestrator will take care of the rest. it will wipe the corresponding DB device (if present) and redeploy according to the specs with the same osd id.

One of our customers hasn't had many disk failures yet; just recently the first OSD replacement came up since we adopted the cluster with cephadm. They only changed the hard drive and ran the above command, and that was it. They were literally amazed πŸ˜„
2024-09-16T13:06:03.900Z
<Eugen Block> just to be clear, when I talk about the osd spec, I mean that there's a service definition which is actually "managed" by cephadm, not the manual deployment we chatted about, which leads to "unmanaged" services.
2024-09-16T23:54:54.119Z
<Ken Carlile> Is it possible to enable QAT with a cephadm deployed cluster?
