2024-06-24T14:29:42.170Z | <Raghu> Hello
Good morning. We deployed a multisite cluster using cephadm, and the cluster was up and running. We made a small change to the unit.run files to remove --cgroups=split (sed -i 's/--cgroups=split//g' /var/lib/ceph/9d1e5b54-2a79-11ef-9ed5-000f53996140/*/unit.run).
Then we ran podman stop to stop the container and used systemctl to restart the service (systemctl start ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140.target).
After that, all the RGW service files were gone from the secondary cluster, and all the RGW instances on the secondary cluster are dead.
It seems that cephadm is running rm-daemon to do the cleanup.
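One way to check this is to list every rm-daemon invocation recorded in cephadm.log, with its timestamp and daemon name. The following is a minimal Python sketch, not part of cephadm; the helper name `find_rm_daemon_calls` is hypothetical, and it assumes the log layout seen in this log, where a timestamped DEBUG separator line is followed by a `cephadm [...]` line holding the argument list.

```python
import ast
import re

# Matches the leading "YYYY-MM-DD HH:MM:SS,mmm " timestamp of a cephadm.log line.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")


def find_rm_daemon_calls(lines):
    """Return (timestamp, daemon_name) for each logged `cephadm ... rm-daemon` call.

    Assumption: the argument list is logged on its own line as
    "cephadm [<Python list literal>]", right after a timestamped line.
    """
    calls = []
    last_timestamp = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            # Remember the most recent timestamp; the argv line itself has none.
            last_timestamp = match.group(1)
            continue
        if not line.startswith("cephadm ["):
            continue
        try:
            argv = ast.literal_eval(line[len("cephadm "):].strip())
        except (ValueError, SyntaxError):
            continue  # Not a parseable argument list; skip it.
        if "rm-daemon" in argv and "--name" in argv:
            name_index = argv.index("--name") + 1
            if name_index < len(argv):
                calls.append((last_timestamp, argv[name_index]))
    return calls
```

It could be run over the whole file, for example `find_rm_daemon_calls(open(path))`, where `path` points at cephadm.log on the affected host. If the same daemon name shows up repeatedly, that would suggest something is re-issuing the removal rather than a one-off cleanup.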
I tried to re-create all the service files in /var/lib/ceph/9d1e5b54-2a79-11ef-9ed5-000f53996140/rgw.realmcephadm.secondaryzone.host1.aixhjo.
Within a couple of minutes of creating this directory along with the systemctl service files, they were all deleted automatically by a cephadm binary.
It is puzzling why cephadm deletes the service files automatically.
The following are the logs from cephadm.log at the time the files were deleted.
```2024-06-24 14:10:33,294 7f67106bfb80 DEBUG --------------------------------------------------------------------------------
cephadm ['--image', 'cephadmurl.bloomberg.com/sds/jjm-test/ceph-os-18.0.0-305.bb.rgw_account@sha256:3d2487b9be883474d3ebc1d24fb26644c0d4968527683316c998824f81691c43', '--timeout', '895', 'rm-daemon', '--fsid', '9d1e5b54-2a79-11ef-9ed5-000f53996140', '--name', 'rgw.realmcephadm.secondaryzone.host1.aixhjo', '--force', '--tcp-ports', '8000']
2024-06-24 14:10:33,350 7f67106bfb80 INFO Non-zero exit code 3 from systemctl status ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140@rgw.realmcephadm.secondaryzone.host1.aixhjo.service
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140@rgw.realmcephadm.secondaryzone.host1.aixhjo.service - Ceph rgw.realmcephadm.secondaryzone.host1.aixhjo for 9d1e5b54-2a79-11ef-9ed5-000f53996140
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout Loaded: loaded (/etc/systemd/system/ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140@.service; disabled; vendor preset: disabled)
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout Active: inactive (dead)
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:17 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:17.939+0000 7f059563d700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:19 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:19.358+0000 7f0595e3e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:19 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:19.358+0000 7f059663f700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:19 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:19.358+0000 7f059563d700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:20 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:20.973+0000 7f059563d700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
2024-06-24 14:10:33,351 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:20 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:20.973+0000 7f0595e3e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
2024-06-24 14:10:33,351 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:20 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:20.973+0000 7f059663f700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
2024-06-24 14:10:33,351 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:22 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:22.391+0000 7f059663f700 -1 --2- 10.218.64.31:0/3238125650 >> [v2:10.218.65.62:6864/3637088951,v1:10.218.65.62:6865/3637088951] conn(0x55e1bac12400 0x55e1b6753b80 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner peer [v2:10.218.65.62:6864/3637088951,v1:10.218.65.62:6865/3637088951] is using msgr V1 protocol
2024-06-24 14:10:33,351 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:23 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:23.320+0000 7f0593e3a700 -1 received signal: Terminated from /run/podman-init -- /usr/bin/radosgw -n client.rgw.realmcephadm.secondaryzone.host1.aixhjo -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false (PID: 1) UID: 0
2024-06-24 14:10:33,351 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:23 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:23.320+0000 7f05980b1780 -1 shutting down
2024-06-24 14:10:33,475 7f67106bfb80 DEBUG Non-zero exit code 1 from systemctl reset-failed ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140@rgw.realmcephadm.secondaryzone.host1.aixhjo
2024-06-24 14:10:33,475 7f67106bfb80 DEBUG systemctl: stderr Failed to reset failed state of unit ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140@rgw.realmcephadm.secondaryzone.host1.aixhjo.service: Unit ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140@rgw.realmcephadm.secondaryzone.host1.aixhjo.service not loaded.
2024-06-24 14:10:33,650 7f67106bfb80 DEBUG Non-zero exit code 5 from systemctl stop ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service
2024-06-24 14:10:33,650 7f67106bfb80 DEBUG systemctl: stderr Failed to stop ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service: Unit ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service not loaded.
2024-06-24 14:10:33,656 7f67106bfb80 DEBUG Non-zero exit code 1 from systemctl reset-failed ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service
2024-06-24 14:10:33,656 7f67106bfb80 DEBUG systemctl: stderr Failed to reset failed state of unit ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service: Unit ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service not loaded.
2024-06-24 14:10:33,662 7f67106bfb80 DEBUG Non-zero exit code 1 from systemctl disable ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service
2024-06-24 14:10:33,662 7f67106bfb80 DEBUG systemctl: stderr Failed to disable unit: Unit file ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service does not exist.```
This is what i see on the mon logs , which matches with above
```2024-06-24T14:10:33.702+0000 7ffa284d9700 ({"prefix": "auth rm", "entity": "client.rgw.realmcephadm.secondaryzone.host1.aixhjo"} v 0) v1
2024-06-24T14:10:33.702+0000 7ffa284d9700 cmd=[{"prefix": "auth rm", "entity": "client.rgw.realmcephadm.secondaryzone.host1.aixhjo"}]: dispatch
2024-06-24T14:10:33.702+0000 7ffa284d9700 ({"prefix": "config rm", "who": "client.rgw.realmcephadm.secondaryzone.host1.aixhjo", "name": "rgw_frontends"} v 0) v1
2024-06-24T14:10:33.703+0000 7ffa284d9700 cmd=[{"prefix": "config rm", "who": "client.rgw.realmcephadm.secondaryzone.host1.aixhjo", "name": "rgw_frontends"}]: dispatch```
has any one seen anything similar to this in the past ? any suggestions what can be done to troubleshoot this issue ? |
2024-06-24T14:33:04.948Z | <Raghu> Hello
Good morning. We deployed a multisite cluster using cephadm, and the cluster was up and running. We made a small change to the unit.run files to remove --cgroups=split (sed -i 's/--cgroups=split//g' /var/lib/ceph/9d1e5b54-2a79-11ef-9ed5-000f53996140/*/unit.run).
Then we ran podman stop to stop the container and used systemctl to restart the service (systemctl start ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140.target).
After that, all the RGW service files were gone from the secondary cluster, and all the RGW instances on the secondary cluster are dead.
It seems cephadm is running rm-daemon to do the cleanup.
I tried to re-create all the service files under /var/lib/ceph/9d1e5b54-2a79-11ef-9ed5-000f53996140/rgw.realmcephadm.secondaryzone.host1.aixhjo,
but within a couple of minutes of creating this directory and the systemd service files, they were all deleted again automatically by the cephadm binary.
It is puzzling why cephadm deletes the service files automatically.
Below are the entries from cephadm.log from when these files were deleted.
```2024-06-24 14:10:33,294 7f67106bfb80 DEBUG --------------------------------------------------------------------------------
cephadm ['--image', 'cephadmurl.bloomberg.com/sds/jjm-test/ceph-os-18.0.0-305.bb.rgw_account@sha256:3d2487b9be883474d3ebc1d24fb26644c0d4968527683316c998824f81691c43', '--timeout', '895', 'rm-daemon', '--fsid', '9d1e5b54-2a79-11ef-9ed5-000f53996140', '--name', 'rgw.realmcephadm.secondaryzone.host1.aixhjo', '--force', '--tcp-ports', '8000']
2024-06-24 14:10:33,350 7f67106bfb80 INFO Non-zero exit code 3 from systemctl status ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140@rgw.realmcephadm.secondaryzone.host1.aixhjo.service
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140@rgw.realmcephadm.secondaryzone.host1.aixhjo.service - Ceph rgw.realmcephadm.secondaryzone.host1.aixhjo for 9d1e5b54-2a79-11ef-9ed5-000f53996140
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout Loaded: loaded (/etc/systemd/system/ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140@.service; disabled; vendor preset: disabled)
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout Active: inactive (dead)
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:17 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:17.939+0000 7f059563d700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:19 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:19.358+0000 7f0595e3e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:19 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:19.358+0000 7f059663f700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:19 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:19.358+0000 7f059563d700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
2024-06-24 14:10:33,350 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:20 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:20.973+0000 7f059563d700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
2024-06-24 14:10:33,351 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:20 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:20.973+0000 7f0595e3e700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
2024-06-24 14:10:33,351 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:20 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:20.973+0000 7f059663f700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
2024-06-24 14:10:33,351 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:22 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:22.391+0000 7f059663f700 -1 --2- 10.218.64.31:0/3238125650 >> [v2:10.218.65.62:6864/3637088951,v1:10.218.65.62:6865/3637088951] conn(0x55e1bac12400 0x55e1b6753b80 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner peer [v2:10.218.65.62:6864/3637088951,v1:10.218.65.62:6865/3637088951] is using msgr V1 protocol
2024-06-24 14:10:33,351 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:23 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:23.320+0000 7f0593e3a700 -1 received signal: Terminated from /run/podman-init -- /usr/bin/radosgw -n client.rgw.realmcephadm.secondaryzone.host1.aixhjo -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-journald=true --default-log-to-stderr=false (PID: 1) UID: 0
2024-06-24 14:10:33,351 7f67106bfb80 INFO systemctl: stdout Jun 18 13:54:23 host1 ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-rgw-realmcephadm-secondaryzone-host1-aixhjo[105150]: 2024-06-18T13:54:23.320+0000 7f05980b1780 -1 shutting down
2024-06-24 14:10:33,475 7f67106bfb80 DEBUG Non-zero exit code 1 from systemctl reset-failed ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140@rgw.realmcephadm.secondaryzone.host1.aixhjo
2024-06-24 14:10:33,475 7f67106bfb80 DEBUG systemctl: stderr Failed to reset failed state of unit ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140@rgw.realmcephadm.secondaryzone.host1.aixhjo.service: Unit ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140@rgw.realmcephadm.secondaryzone.host1.aixhjo.service not loaded.
2024-06-24 14:10:33,650 7f67106bfb80 DEBUG Non-zero exit code 5 from systemctl stop ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service
2024-06-24 14:10:33,650 7f67106bfb80 DEBUG systemctl: stderr Failed to stop ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service: Unit ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service not loaded.
2024-06-24 14:10:33,656 7f67106bfb80 DEBUG Non-zero exit code 1 from systemctl reset-failed ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service
2024-06-24 14:10:33,656 7f67106bfb80 DEBUG systemctl: stderr Failed to reset failed state of unit ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service: Unit ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service not loaded.
2024-06-24 14:10:33,662 7f67106bfb80 DEBUG Non-zero exit code 1 from systemctl disable ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service
2024-06-24 14:10:33,662 7f67106bfb80 DEBUG systemctl: stderr Failed to disable unit: Unit file ceph-9d1e5b54-2a79-11ef-9ed5-000f53996140-init@rgw.realmcephadm.secondaryzone.host1.aixhjo.service does not exist.```
This is what I see in the mon logs, which matches the above:
```2024-06-24T14:10:33.702+0000 7ffa284d9700 ({"prefix": "auth rm", "entity": "client.rgw.realmcephadm.secondaryzone.host1.aixhjo"} v 0) v1
2024-06-24T14:10:33.702+0000 7ffa284d9700 cmd=[{"prefix": "auth rm", "entity": "client.rgw.realmcephadm.secondaryzone.host1.aixhjo"}]: dispatch
2024-06-24T14:10:33.702+0000 7ffa284d9700 ({"prefix": "config rm", "who": "client.rgw.realmcephadm.secondaryzone.host1.aixhjo", "name": "rgw_frontends"} v 0) v1
2024-06-24T14:10:33.703+0000 7ffa284d9700 cmd=[{"prefix": "config rm", "who": "client.rgw.realmcephadm.secondaryzone.host1.aixhjo", "name": "rgw_frontends"}]: dispatch```
This is what we see in the mgr logs
```2024-06-24T14:10:32.937+0000 7f40f03b2700 0 [cephadm INFO cephadm.serve] Removing orphan daemon rgw.realmcephadm.secondaryzone.host1.aixhjo...
2024-06-24T14:10:32.937+0000 7f40f03b2700 0 log_channel(cephadm) log [INF] : Removing orphan daemon rgw.realmcephadm.secondaryzone.host1.aixhjo...
2024-06-24T14:10:32.937+0000 7f40f03b2700 0 [cephadm INFO cephadm.serve] Removing daemon rgw.realmcephadm.secondaryzone.host1.aixhjo from host1 -- ports [8000]
2024-06-24T14:10:32.937+0000 7f40f03b2700 0 log_channel(cephadm) log [INF] : Removing daemon rgw.realmcephadm.secondaryzone.host1.aixhjo from host1 -- ports [8000]
2024-06-24T14:10:33.702+0000 7f40f03b2700 0 [cephadm INFO cephadm.services.cephadmservice] Removing key for client.rgw.realmcephadm.secondaryzone.host1.aixhjo
2024-06-24T14:10:33.702+0000 7f40f03b2700 0 log_channel(cephadm) log [INF] : Removing key for client.rgw.realmcephadm.secondaryzone.host1.aixhjo```
Has anyone seen anything similar in the past? Any suggestions on how to troubleshoot this issue? |
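For anyone following along: the mgr log excerpt above is the one that names the trigger ("Removing orphan daemon"). A minimal Python sketch for pulling those events out of a saved mgr log is shown here. The regular expression is written only against the line format visible in this thread, and the function name `find_orphan_removals` is hypothetical, so treat it as an illustration rather than a supported tool.

```python
import re

# Pattern based only on the mgr log lines quoted in this thread (assumption:
# other Ceph versions may format the message differently).
ORPHAN_RE = re.compile(
    r"^(?P<ts>\S+) .*\[cephadm INFO cephadm\.serve\] "
    r"Removing orphan daemon (?P<daemon>\S+?)\.\.\.$"
)


def find_orphan_removals(lines):
    """Return (timestamp, daemon_name) for each orphan-removal log line."""
    hits = []
    for line in lines:
        match = ORPHAN_RE.match(line.strip())
        if match:
            hits.append((match.group("ts"), match.group("daemon")))
    return hits


# Lines copied from the mgr log excerpt above.
sample = [
    "2024-06-24T14:10:32.937+0000 7f40f03b2700 0 [cephadm INFO cephadm.serve] "
    "Removing orphan daemon rgw.realmcephadm.secondaryzone.host1.aixhjo...",
    "2024-06-24T14:10:32.937+0000 7f40f03b2700 0 log_channel(cephadm) log [INF] : "
    "Removing orphan daemon rgw.realmcephadm.secondaryzone.host1.aixhjo...",
    "2024-06-24T14:10:33.702+0000 7f40f03b2700 0 [cephadm INFO "
    "cephadm.services.cephadmservice] Removing key for "
    "client.rgw.realmcephadm.secondaryzone.host1.aixhjo",
]

for ts, daemon in find_orphan_removals(sample):
    print(ts, daemon)
```

Only the `cephadm.serve` line is matched, so the duplicate `log_channel(cephadm)` copy of the same event is not counted twice.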
2024-06-24T15:32:53.654Z | <Adam King> `Removing orphan daemon` would imply it doesn't think there is any service spec that matches up with that daemon. Did you remove any rgw services or modify the placement of one of them? |
2024-06-24T15:55:41.189Z | <Raghu> The spec file that I used is:
```placement:
  label: rgwsync
  count_per_host: 2
rgw_zone: secondaryzone
rgw_realm_token: xxxxxxxx
spec:
  rgw_frontend_port: 8000```
Nothing has been changed here.
The rgwsync label was added to a couple of machines and has been the same from the beginning; nothing has changed there.
As the realm, zonegroup and zone are already created, I cannot use this command to create a new RGW instance:
```ceph rgw zone create -i /tmp/rgw2.spec --start-radosgw``` |
2024-06-24T17:02:26.863Z | <Adam King> What does the actual service name show up as in `ceph orch ls`? The log message seems to indicate there was no `rgw.realmcephadm.secondaryzone` service, which is the service it was looking for when it removed that rgw daemon. |
2024-06-24T17:17:42.682Z | <Raghu> I don't see that service at all in the output:
```ceph orch ls
NAME   PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
crash         12/13    9m ago     9d   *
mgr           3/2      9m ago     9d   <unmanaged>
mon           3/5      9m ago     9d   <unmanaged>
osd           154      9m ago     -    <unmanaged>```
I am not sure why the service does not even show up. Most likely cephadm deleted the service as well. |
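To make the "orphan" idea from earlier in the thread concrete: the daemon `rgw.realmcephadm.secondaryzone.host1.aixhjo` was expected to belong to a service named `rgw.realmcephadm.secondaryzone`, and no such service appears in the `ceph orch ls` output. The Python sketch below illustrates that check. It is not cephadm's actual code; the helper names and the naming rule (`<service>.<host>.<random suffix>`, with a host name that contains no dots) are assumptions made for illustration.

```python
def service_name_for(daemon_name: str) -> str:
    """Guess the owning service by dropping the trailing '<host>.<suffix>'.

    Assumption: names look like '<service>.<host>.<suffix>'; shorter names
    such as 'mon.host1' or 'osd.12' map to their first component.
    """
    parts = daemon_name.split(".")
    if len(parts) < 3:
        return parts[0]
    return ".".join(parts[:-2])


def is_orphan(daemon_name: str, known_services: set) -> bool:
    """A daemon is treated as orphaned when no listed service matches it."""
    return service_name_for(daemon_name) not in known_services


# NAME column from the `ceph orch ls` output quoted in this thread.
known_services = {"crash", "mgr", "mon", "osd"}

daemon = "rgw.realmcephadm.secondaryzone.host1.aixhjo"
print(service_name_for(daemon), is_orphan(daemon, known_services))
```

Under these assumptions the rgw daemon has no matching service, which would be consistent with the mgr repeatedly removing it every time its files are re-created by hand.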
2024-06-24T22:10:39.416Z | <Adam King> Cephadm doesn't do any automatic service deletion afaik. Either way, the way to get the daemons back would be to re-apply the spec with `ceph orch apply -i <spec-filepath>` |