ceph - ceph-devel - 2024-09-13

Timestamp (UTC) | Message
2024-09-13T11:48:16.692Z
<Matthews Jose> Hello all,
2024-09-13T11:48:24.981Z
<Matthews Jose> Why is the primary OSD always involved in read operations in Ceph, even when replica OSDs are available? Specifically, can clients ever directly contact replica OSDs for reads, or must they always go through the primary OSD? In what situations can read operations be offloaded to replica OSDs, and how does the primary OSD manage this process, especially with `balance-reads` enabled?
2024-09-13T11:48:34.011Z
<Matthews Jose> thanks in advance
2024-09-13T13:09:47.872Z
<IcePic> Matthews Jose: As soon as the cluster grows with many OSD hosts and many clients, many of the drives will be doing IO, so you have to spread the IOs in some way, and "have the primary handle it" is one of those ways.
2024-09-13T13:10:40.430Z
<IcePic> Matthews Jose: You can tell ceph to prefer putting the primary on certain OSDs over others if there is a speed difference or such, but I don't think many use that
2024-09-13T13:18:43.579Z
<Matthews Jose> 1. **If I understood correctly, the primary OSD is involved in handling IO to help distribute the workload across multiple OSDs as the cluster grows. Additionally, while it's possible to configure certain OSDs to act as primary based on their speed or performance, this isn't a common practice. Is this correct, and are there specific situations where configuring certain OSDs as primary is particularly beneficial?**
2. **When you mention that the primary OSD handles IO, does that mean the client always contacts the primary OSD first, and then the primary redistributes the traffic to the replica OSDs? How exactly does the primary OSD manage this traffic, especially for read operations?**
2024-09-13T13:32:19.441Z
<IcePic> The primary is only distributing to others for EC pools; for replicated ones it just answers the IO itself
2024-09-13T13:35:24.847Z
<IcePic> and yes, the clients will contact the primary only, and in case it is down or out, the osdmap gets updated and the client knows it needs to talk to the second, third and so on
2024-09-13T13:36:44.661Z
<Matthews Jose> I understand. So basically balanced reads only apply to EC pools, and in all other cases reads are handled by the primary?
2024-09-13T13:37:29.697Z
<IcePic> "balanced" seem to put some value into the sentence that I'm not sure I follow
2024-09-13T13:40:47.430Z
<IcePic> for instance, if I create and use a 10G RBD image, it will be split into 2M or 4M pieces, each ending up on a different PG according to the pseudorandom placement rules. If I then read this 10G image from start to end, it is going to involve 10G/2M reads from "random" PGs, spread over all involved PGs, so some 5000 operations in all. If you have 500 OSDs, on average each OSD will handle 10 2-megabyte 
2024-09-13T13:40:53.557Z
<IcePic> reads, at various times depending on how the random placement laid the data out. 
2024-09-13T13:41:15.179Z
<IcePic> if that isn't "balanced", as far as cluster IO goes, then I don't really know what is
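For concreteness, the arithmetic in that example works out as follows (assuming the 2 MiB piece size mentioned above; with the 4 MiB default RBD object size the counts halve):

```latex
\frac{10\,\mathrm{GiB}}{2\,\mathrm{MiB}} = 5120 \approx 5000 \text{ reads},
\qquad
\frac{5120\ \text{reads}}{500\ \text{OSDs}} \approx 10 \text{ reads per OSD}
```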
2024-09-13T13:44:29.188Z
<Matthews Jose> I understood, but that is the default Ceph behavior. Going through the code in /include/rados.h, I found the Ceph OSD op codes, specifically: balance-reads (hex 0x4003, decimal 16387) and also unbalance-reads (hex 0x4004, decimal 16388). I was wondering what this balance-reads op code was used for.
2024-09-13T13:46:11.897Z
<IcePic> perhaps related https://docs.ceph.com/en/reef/rados/operations/read-balancer/
2024-09-13T13:47:22.527Z
<Matthews Jose> thank you I will have a look 🙂
2024-09-13T13:47:43.643Z
<IcePic> what I mentioned about deliberately setting OSDs up to hold more (or fewer) primary PGs is also explained here: https://docs.redhat.com/en/documentation/red_hat_ceph_storage/1.2.3/html/storage_strategies/primary-affinity#primary-affinity
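For reference, primary affinity is set per OSD from the CLI. A sketch (`osd.2` and the weights are placeholder values; on older releases the `mon osd allow primary affinity` option may need to be enabled first):

```
# Make osd.2 less likely than its peers to be selected as the primary
# of the PGs it participates in (range 0.0 to 1.0, default 1.0).
ceph osd primary-affinity osd.2 0.5

# With affinity 0, osd.2 acts as primary only when no other candidate
# in the acting set is available. Data placement itself is unchanged;
# only the choice of primary (and thus where reads land) shifts.
ceph osd primary-affinity osd.2 0
```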
2024-09-13T13:48:10.812Z
<Matthews Jose> There is some code in `PrimaryLogPG.cc` in the do_op method:
```cpp
if ((m->get_flags() & (CEPH_OSD_FLAG_BALANCE_READS |
                       CEPH_OSD_FLAG_LOCALIZE_READS)) &&
    op->may_read() &&
    !(op->may_write() || op->may_cache())) {
  // balanced reads; any replica will do
  if (!(is_primary() || is_nonprimary())) {
    osd->handle_misdirected_op(this, op);
    return;
  }
}
```
2024-09-13T13:48:30.446Z
<Matthews Jose> They have `CEPH_OSD_FLAG_BALANCE_READS` there, which is what I was wondering about.
2024-09-13T19:19:28.280Z
<Ilya Dryomov> The read balancer (<https://docs.ceph.com/en/reef/rados/operations/read-balancer/>) is not related to `CEPH_OSD_FLAG_BALANCE_READS` or `CEPH_OSD_FLAG_LOCALIZE_READS` flags
2024-09-13T19:26:43.760Z
<Ilya Dryomov> > Why is the primary OSD always involved in read operations in Ceph, even when replica OSDs are available?
This is the default behavior, but with `CEPH_OSD_FLAG_BALANCE_READS` or `CEPH_OSD_FLAG_LOCALIZE_READS` flags it can be changed on a per-op basis
For the former, the read would be sent to a random OSD in the PG (replica set), not necessarily the primary OSD
For the latter, the read would be sent to an OSD in the PG (replica set) that is closest to the client, with closeness/proximity defined in terms of the CRUSH hierarchy
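To make the per-op flag concrete: the public librados API exposes these as `LIBRADOS_OPERATION_BALANCE_READS` and `LIBRADOS_OPERATION_LOCALIZE_READS`. A minimal sketch in C, assuming a default ceph.conf and placeholder pool/object names (error checks on the setup calls omitted for brevity):

```c
#include <rados/librados.h>
#include <stdio.h>

int main(void) {
    rados_t cluster;
    rados_ioctx_t ioctx;
    char buf[4096];
    size_t bytes_read = 0;
    int rval = 0;

    /* Standard connection boilerplate: read ceph.conf, connect. */
    rados_create(&cluster, NULL);
    rados_conf_read_file(cluster, NULL);
    rados_connect(cluster);
    rados_ioctx_create(cluster, "mypool", &ioctx);  /* "mypool" is a placeholder */

    /* Build a read op and dispatch it with BALANCE_READS, so the client
     * may send it to any OSD in the PG rather than only the primary. */
    rados_read_op_t op = rados_create_read_op();
    rados_read_op_read(op, 0, sizeof(buf), buf, &bytes_read, &rval);
    int ret = rados_read_op_operate(op, ioctx, "myobject",
                                    LIBRADOS_OPERATION_BALANCE_READS);
    rados_release_read_op(op);

    if (ret == 0 && rval == 0)
        printf("read %zu bytes from a (possibly non-primary) OSD\n", bytes_read);

    rados_ioctx_destroy(ioctx);
    rados_shutdown(cluster);
    return ret;
}
```

The C++ bindings expose equivalent per-op flags; the key point is that the choice is made by the client per operation, not by cluster-wide configuration.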
2024-09-13T19:28:01.770Z
<Ilya Dryomov> This only applies to replicated pools, for EC pools all ops have to go through the primary OSD (at least as of today -- that might be changing in the near future)
