ceph - cephadm - 2024-06-27

Timestamp (UTC) | Message
2024-06-27T18:54:54.384Z
<Raghu> I am able to reproduce the issue on a totally different cluster. With multisite up and running for about a day, the next day we start to see this message in the logs:
```2024-06-27T13:03:46.577677+0000 mgr.host1.xrdvsx (mgr.193905) 1 : cephadm [WRN] unable to load spec for rgw.realmcephadm.secondaryzone: ServiceSpec: __init__() got an unexpected keyword argument 'rgw_token'```
Next I see the following messages:
```2024-06-27T13:03:52.423123+0000 mgr.host1.xrdvsx (mgr.193905) 13 : cephadm [INF] Removing orphan daemon rgw.realmcephadm.secondaryzone.host2.mdezmy...
2024-06-27T13:03:52.423261+0000 mgr.host1.xrdvsx (mgr.193905) 14 : cephadm [INF] Removing daemon rgw.realmcephadm.secondaryzone.host2.mdezmy from host2 -- ports [8000]```
Then all the RGW instances get shut down automatically.


I made sure that the same token is present on both the primary and secondary sites:
```ceph rgw realm tokens
[
    {
        "realm": "realmcephadm",
        "token": "xxxxxxxx="
    }
]```
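As an aside, the token printed by `ceph rgw realm tokens` is just base64-encoded JSON, so it can be decoded on both sites to confirm the realm name, realm id, endpoint, and credentials actually match. A minimal sketch with a synthetic payload (the field names mirror what a real token decodes to; real tokens embed live sync-user credentials, which is why the one above is redacted):

```python
import base64
import json

def decode_realm_token(token: str) -> dict:
    """Decode a `ceph rgw realm tokens` entry: base64-encoded JSON
    carrying the realm name/id, an endpoint, and sync-user credentials."""
    return json.loads(base64.b64decode(token))

# Synthetic token for illustration only -- never paste a real one
# unredacted, since it contains an access key and secret.
payload = {
    "realm_name": "realmcephadm",
    "realm_id": "00000000-0000-0000-0000-000000000000",
    "endpoint": "http://primary-host:8000",
    "access_key": "EXAMPLEKEY",
    "secret": "EXAMPLESECRET",
}
token = base64.b64encode(json.dumps(payload).encode()).decode()

decoded = decode_realm_token(token)
print(decoded["realm_name"])  # compare this field across both sites
```

Decoding the token on each site and diffing the resulting JSON is a quick way to rule out a token mismatch before digging into the orchestrator.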
2024-06-27T21:06:09.879Z
<Adam King> This looks like a bug in the rgw module. `rgw_token` isn't a real attribute of the rgw spec, so whenever cephadm tries to convert the stored spec from a JSON string back into an actual Python object, it fails, as reported in the first log line you posted. Since the spec can't be loaded, it effectively disappears, and the daemons get removed because there's no longer a service spec for those rgw daemons. It initially works because the rgw module hands the spec to the orchestrator as a Python object; it's only when the module is restarted, the active mgr changes, etc., and the stored JSON string has to be converted back into a Python object, that things fall apart. I'm curious whether `ceph orch ls rgw --export`, run right after the zone create on the secondary cluster, shows a `rgw_token` field. Also, assuming it does, if you took the output of that command, changed `rgw_token` to `rgw_realm_token`, and re-applied the spec using `ceph orch apply -i...`, would you see the same behavior? I think that would stop the service deletion, and if it works it should be straightforward to fix the bug in the rgw module.
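The failure mode described above can be reproduced in miniature: a spec class whose `__init__` doesn't accept `rgw_token` will raise `TypeError` only when the stored JSON is replayed through it. This is an illustrative toy, not cephadm's actual `ServiceSpec` code (class and field names here are stand-ins):

```python
import json

class RGWSpec:
    """Toy stand-in for cephadm's rgw service spec (illustrative only)."""
    def __init__(self, service_id, rgw_realm_token=None):
        self.service_id = service_id
        self.rgw_realm_token = rgw_realm_token

def load_spec(raw):
    """Rebuild a spec object from its stored JSON string, as the mgr
    must do after a restart or failover. Unknown keys blow up here."""
    return RGWSpec(**json.loads(raw))

# In memory, the rgw module handed the orchestrator a ready-made Python
# object, so the bad field name was never validated. After a mgr restart
# or failover the stored JSON goes back through __init__ and the bogus
# key surfaces:
stored = json.dumps({"service_id": "realmcephadm.secondaryzone",
                     "rgw_token": "xxxxxxxx="})
try:
    load_spec(stored)
except TypeError as e:
    print(f"unable to load spec: {e}")  # mirrors the cephadm [WRN] line

# With the field renamed to the accepted name, loading succeeds, which is
# the intuition behind re-applying the spec with rgw_realm_token instead:
fixed = stored.replace("rgw_token", "rgw_realm_token")
spec = load_spec(fixed)
```

The orchestrator then treats daemons whose spec failed to load as orphans, which is why the `Removing orphan daemon` lines follow the load warning.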
