2024-08-23T03:53:17.521Z | <Vallari Agrawal> It could be a DNS issue? On smithi049, it gives me 502 Bad Gateway:
```[vallariag@smithi049 ~]$ podman pull quay.ceph.io/ceph-ci/ceph:5eb5dc7942d1c14e2e0a8a24e734a7d4c385aa49
Trying to pull quay.ceph.io/ceph-ci/ceph:5eb5dc7942d1c14e2e0a8a24e734a7d4c385aa49...
Error: parsing image configuration: fetching blob: received unexpected HTTP status: 502 Bad Gateway```
But it works okay on my local machine with sepia VPN connected:
```✗ docker pull quay.ceph.io/ceph-ci/ceph:5eb5dc7942d1c14e2e0a8a24e734a7d4c385aa49
5eb5dc7942d1c14e2e0a8a24e734a7d4c385aa49: Pulling from ceph-ci/ceph
e8b54c863393: Pulling fs layer
db43c217a2c8: Pulling fs layer ``` |
2024-08-23T06:38:15.083Z | <Sunil Angadi> Hi team,
the PR build is failing for `centos` distros:
<https://shaman.ceph.com/builds/ceph/wip-sangadi1-testing-2024-08-22-1458-quincy/>
can somebody please help me root-cause the issue?
[https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=arm64,AVAILABLE_ARCH=arm64,AVAIL[…]entos9,DIST=centos9,MACHINE_SIZE=gigantic/82350//consoleFull](https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=arm64,AVAILABLE_ARCH=arm64,AVAILABLE_DIST=centos9,DIST=centos9,MACHINE_SIZE=gigantic/82350//consoleFull)
what do I need to do to get the build to succeed? |
2024-08-23T07:19:11.472Z | <Sunil Angadi> @Sayalee Raut this might be the reason |
2024-08-23T07:25:27.863Z | <Sayalee Raut> Hey @Sunil Angadi, yes, my shaman build failed yesterday late evening, but now that quay is back up, I will try rebuilding it. Thanks! |
2024-08-23T07:26:56.497Z | <Dan Mick> ah. there is a problem with [quay.ceph.io](http://quay.ceph.io)'s backing store. The web UI was probably working but actual bulk file operations probably were not. Let me work on it. |
2024-08-23T08:22:29.528Z | <Sayalee Raut> Okay, thanks @Dan Mick |
2024-08-23T12:48:43.023Z | <Yaarit> @Dan Mick thanks, I see it is working |
2024-08-23T14:41:57.101Z | <yuriw> I see several c9 builds failed for quincy as well, including @Sayalee Raut's build
@Dan Mick @Laura Flores ^ |
2024-08-23T14:42:48.180Z | <yuriw> [https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVA[…]entos9,DIST=centos9,MACHINE_SIZE=gigantic/82347//consoleFull](https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos9,DIST=centos9,MACHINE_SIZE=gigantic/82347//consoleFull)
I see:
curl: (22) The requested URL returned error: 404
error: skipping <https://3.chacra.ceph.com/r/ceph/wip-saraut02-testing-2024-08-22-1312/cd4c799c9219dc7debfb16701b85f090fcc7074d/centos/9/flavors/default//noarch/ceph-release-1-0.el9.noarch.rpm> - transfer failed |
2024-08-23T14:51:19.791Z | <Casey Bodley> can we consider removing these nodes from pr checks in the meantime? |
2024-08-23T16:52:50.480Z | <Laura Flores> So, for the centos distros, there is an extra step in the job after building packages where a container image is created and pushed to [quay.ceph.io](http://quay.ceph.io). Dan Mick mentioned that [quay.ceph.io](http://quay.ceph.io) is somewhat broken after upgrading the long running cluster (a cluster we have which hosts many of our services, including [quay.ceph.io](http://quay.ceph.io)) two days ago, so builds will break until that is fixed. |
2024-08-23T16:53:11.713Z | <Laura Flores> So @Sunil Angadi you aren't doing anything wrong with how you built the branch; it's an external issue. |
2024-08-23T16:57:10.701Z | <Dan Mick> OK, so, rgw has issues with the rgw.quay service |
2024-08-23T16:58:10.343Z | <Dan Mick> They seem very basic, but I need help with where to start. @Casey Bodley can you spare some time this morning? |
2024-08-23T16:58:30.583Z | <Casey Bodley> i'm all yours |
2024-08-23T16:58:52.074Z | <Dan Mick> so the service was set up with cephadm AFAIK: |
2024-08-23T16:59:07.123Z | <Dan Mick> # ceph orch ls --service-name rgw.quay --export
service_type: rgw
service_id: quay
service_name: rgw.quay
placement:
  count: 2
spec:
  rgw_frontend_type: beast |
2024-08-23T17:00:21.156Z | <Dan Mick> the radosgw daemons got restarted, and fail immediately. q1) how is one supposed to find the logs for such a failing service? How does one even know which hosts were chosen for it to run last, canonically? |
2024-08-23T17:00:43.253Z | <Dan Mick> cephadm health detail seems to help, showing reesi002 and reesi006 |
2024-08-23T17:01:28.413Z | <Dan Mick> on reesi002, the service status shows |
2024-08-23T17:02:22.945Z | <Dan Mick> # systemctl -l --no-pager status ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@rgw.quay.reesi002.anxbdb.service
× ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@rgw.quay.reesi002.anxbdb.service - Ceph rgw.quay.reesi002.anxbdb for 28f7427e-5558-4ffd-ae1a-51ec3042759a
Loaded: loaded (/etc/systemd/system/ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@.service; enabled; vendor preset: enabled)
Active: failed (Result: protocol) since Fri 2024-08-23 08:06:23 UTC; 8h ago
CPU: 1.483s
Aug 23 08:06:23 reesi002 systemd[1]: ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@rgw.quay.reesi002.anxbdb.service: Scheduled restart job, restart counter is at 5.
Aug 23 08:06:23 reesi002 systemd[1]: Stopped Ceph rgw.quay.reesi002.anxbdb for 28f7427e-5558-4ffd-ae1a-51ec3042759a.
Aug 23 08:06:23 reesi002 systemd[1]: ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@rgw.quay.reesi002.anxbdb.service: Consumed 1.483s CPU time.
Aug 23 08:06:23 reesi002 systemd[1]: ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@rgw.quay.reesi002.anxbdb.service: Start request repeated too quickly.
Aug 23 08:06:23 reesi002 systemd[1]: ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@rgw.quay.reesi002.anxbdb.service: Failed with result 'protocol'.
Aug 23 08:06:23 reesi002 systemd[1]: Failed to start Ceph rgw.quay.reesi002.anxbdb for 28f7427e-5558-4ffd-ae1a-51ec3042759a.
Aug 23 14:52:52 reesi002 systemd[1]: /etc/systemd/system/ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@.service:22: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
Aug 23 14:52:54 reesi002 systemd[1]: /etc/systemd/system/ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@.service:22: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
Aug 23 14:52:55 reesi002 systemd[1]: /etc/systemd/system/ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@.service:22: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed. |
2024-08-23T17:02:43.376Z | <Dan Mick> (nothing useful, IOW) |
2024-08-23T17:04:50.044Z | <Dan Mick> journalctl -u is slightly more verbose: |
2024-08-23T17:05:07.830Z | <Dan Mick> deferred set uid:gid to 167:167 (ceph:ceph)
]: ceph version 19.1.1 (1d9f35852eef16b81614e38a05cf88b505cc142b) squid (rc), p>
]: framework: beast
]: framework conf key: port, val: 80
]: rgw main: failed to load zone: (2) No such file or directory |
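The journal lines above are truncated at the terminal width; a hedged way to pull the full, untruncated lines for that unit (unit name taken from the systemctl status output above):
```
journalctl -u ceph-28f7427e-5558-4ffd-ae1a-51ec3042759a@rgw.quay.reesi002.anxbdb.service \
    --no-pager -o cat | tail -n 50
```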
2024-08-23T17:05:34.444Z | <Casey Bodley> hmm, ok |
2024-08-23T17:05:34.767Z | <Dan Mick> the "failed to load zone" error seems useful. I don't see anything like a zone name in the error message, and I don't know where to go from here. |
2024-08-23T17:06:16.936Z | <Dan Mick> AFAIK zones weren't ever configured |
2024-08-23T17:06:33.781Z | <Dan Mick> so it's probably whatever defaults to a non-multi config, but... |
2024-08-23T17:07:19.547Z | <Casey Bodley> my understanding is that cephadm creates a realm/zonegroup/zone for each service. can you confirm that @Adam King? |
2024-08-23T17:09:42.726Z | <Dan Mick> is that configuration something that can be examined somehow? radosgw-admin seemed to fail with a similar "can't help you" error message for the few dumb things I tried |
2024-08-23T17:10:27.985Z | <Casey Bodley> i ssh'ed in but radosgw-admin isn't installed there |
2024-08-23T17:10:33.183Z | <Casey Bodley> i tried 'radosgw-admin zone list' |
2024-08-23T17:10:40.168Z | <Dan Mick> well it's in the container I assume |
2024-08-23T17:11:02.276Z | <Adam King> You can have a zone created for you if you run the rgw spec through the `ceph rgw realm bootstrap -i <spec>` command, but specs just passed to the orchestrator don't create realms/zones/zonegroups. |
2024-08-23T17:11:49.342Z | <Casey Bodley> ok, thanks. so it's probably just expecting a zone/zonegroup named 'default' |
2024-08-23T17:12:57.021Z | <Dan Mick> Note that with cephadm, radosgw daemons are configured via the monitor configuration database instead of via a ceph.conf or the command line. If that configuration isn’t already in place (usually in the `client.rgw.<something>` section), then the radosgw daemons will start up with default settings (e.g., binding to port 80). |
2024-08-23T17:13:18.447Z | <Dan Mick> says the doc. so there ought to be remnants in one of the mon config mechanisms? |
2024-08-23T17:13:59.244Z | <Dan Mick> ceph config dump shows some, but nothing about zones/realms |
2024-08-23T17:14:04.929Z | <Casey Bodley> `rgw_zone` is the relevant config option |
2024-08-23T17:14:39.697Z | <Dan Mick> is there a way to examine the pools to discover the zone name |
2024-08-23T17:15:14.323Z | <Casey Bodley> the rgw pools should have a prefix like {zone-name}.rgw.* |
2024-08-23T17:15:38.756Z | <Dan Mick> 94 default.rgw.control
95 default.rgw.meta
96 default.rgw.log
97 default.rgw.buckets.index
98 default.rgw.buckets.data
99 default.rgw.buckets.non-ec |
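A rough sketch of the two checks being discussed here, namely the `{zone-name}.rgw.*` pool prefix and whether `rgw_zone`/`rgw_realm` are set anywhere in the mon config (the exact grep patterns are illustrative):
```
ceph osd pool ls | grep rgw                      # pool names carry the zone-name prefix
ceph config dump | grep -E 'rgw_(zone|realm)'    # any explicit zone/realm in the mon config?
```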
2024-08-23T17:16:28.302Z | <Dan Mick> so is it possible that radosgw changed to not assume 'default', and now requires explicitly naming the zone in config? |
2024-08-23T17:17:26.115Z | <Casey Bodley> if rgw_zone is empty and there is no rgw_realm configured, it should read the zone named "default" and create it if it doesn't exist |
2024-08-23T17:19:35.919Z | <Casey Bodley> so "failed to load zone: (2) No such file or directory" is confusing |
2024-08-23T17:19:39.027Z | <Dan Mick> the end of `ceph config-key ls | grep rgw`: |
2024-08-23T17:19:49.945Z | <Dan Mick> "config/client.rgw.cephadmin3.rgw0/debug_rgw",
"config/client.rgw.quay.reesi002.anxbdb/container_image",
"config/client.rgw.quay.reesi002.anxbdb/debug_ms",
"config/client.rgw.quay.reesi002.anxbdb/debug_rgw",
"config/client.rgw.quay.reesi002.anxbdb/rgw_frontends",
"config/client.rgw.quay.reesi006.zmjsox/container_image",
"config/client.rgw.quay.reesi006.zmjsox/rgw_frontends",
"config/client.rgw.rgw.quay.reesi002.anxbdb/debug_ms",
"config/client.rgw.rgw.quay.reesi002.anxbdb/debug_rgw",
"config/client.rgw.rgw.quay.reesi002.anxbdb/log_to_file",
"config/client.rgw.rgw.quay.reesi006.zmjsox/debug_rgw",
"config/client.rgw.rgw.quay.reesi006.zmjsox/log_to_file",
"config/client.rgw/debug_rgw",
"config/client.rgw/log_to_file",
"mgr/cephadm/rgw_migration_queue",
"mgr/cephadm/spec.rgw.quay",
"rgw/cert/rgw.quay" |
2024-08-23T17:21:29.929Z | <Dan Mick> I tried an strace, but it seems like everything it's doing is rados ops so that wasn't much help. I tried to add debug_rgw=30 to the cmdline but that didn't seem to help either |
2024-08-23T17:22:25.541Z | <Casey Bodley> debug-ms=1 would log the individual rados ops |
2024-08-23T17:24:48.587Z | <Dan Mick> if you think that's worth a try I can try it |
2024-08-23T17:26:08.661Z | <Casey Bodley> yes please |
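One plausible way to flip that on through the mon config for just these daemons and bounce them; the config section and daemon names are the ones visible in the `config-key ls` output in this thread, and which section the daemons actually read is hedged:
```
ceph config set client.rgw.quay.reesi002.anxbdb debug_ms 1
ceph orch daemon restart rgw.quay.reesi002.anxbdb
# reproduce the failure, read the journal, then turn it back off
ceph config rm client.rgw.quay.reesi002.anxbdb debug_ms
```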
2024-08-23T17:27:15.448Z | <Casey Bodley> or can you capture the output of `radosgw-admin realm default`? |
2024-08-23T17:27:34.933Z | <Casey Bodley> wait no |
2024-08-23T17:28:01.504Z | <Casey Bodley> `radosgw-admin realm list` would show both |
2024-08-23T17:28:34.080Z | <Dan Mick> where would I expect log messages from debug_ms to show up? |
2024-08-23T17:28:56.352Z | <Casey Bodley> journalctl i think? |
2024-08-23T17:29:51.420Z | <Dan Mick> ok yeah |
2024-08-23T17:30:51.200Z | <Dan Mick> Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 <== osd.109 v2:172.21.2.221:6832/4215531678 1 ==== osd_op_reply(3 default.zone.87194b64-d43d-470f-a491-a67115d255a7 [read 0~0] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) ==== 193+0+0 (crc 0 0 0) 0x555da573c780 con 0x555da5774400 |
2024-08-23T17:30:59.703Z | <Dan Mick> is right before the "no such file" generic message |
2024-08-23T17:31:24.583Z | <Casey Bodley> do you see anything about default.realm above that? |
2024-08-23T17:32:00.889Z | <Dan Mick> Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 <== osd.78 v2:172.21.2.224:6850/2039386015 1 ==== osd_op_reply(1 default.realm [read 0~46 out=46b] v0'0 uv2 ondisk = 0) ==== 157+0+46 (crc 0 0 0) 0x555da573c
280 con 0x555da49aac00 |
2024-08-23T17:32:09.038Z | <Casey Bodley> oh, that default.zone.{realm-id} implies that it did find a default realm |
2024-08-23T17:32:09.413Z | <Dan Mick> ISTR there's some way to match replies to requests |
2024-08-23T17:32:37.944Z | <Dan Mick> is it 172.21.2.202:0/2063761546 maybe |
2024-08-23T17:32:48.697Z | <Dan Mick> nah, that's probably just the client addr/nonce |
2024-08-23T17:35:39.739Z | <Casey Bodley> ok, so something at some point created a realm and set it as the default, which is 'sticky' |
2024-08-23T17:35:49.847Z | <Dan Mick> this might be the whole trace |
2024-08-23T17:36:01.943Z | <Dan Mick> `Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 --> [v2:172.21.2.224:6850/2039386015,v1:172.21.2.224:6851/2039386015] -- osd_op(unknown.0.0:1 93.12 93:49953fa1:::default.realm:head [read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e10788506) -- 0x555da5aa9800 con 0x555da49aac00`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.224:6850/2039386015,v1:172.21.2.224:6851/2039386015] conn(0x555da49aac00 0x555da4a1cb00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.224:6850/2039386015,v1:172.21.2.224:6851/2039386015] conn(0x555da49aac00 0x555da4a1cb00 crc :-1 s=READY pgs=1175 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=osd.78 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 <== osd.78 v2:172.21.2.224:6850/2039386015 1 ==== osd_op_reply(1 default.realm [read 0~46 out=46b] v0'0 uv2 ondisk = 0) ==== 157+0+46 (crc 0 0 0) 0x555da573c280 con 0x555da49aac00`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.224:6874/3034296905,v1:172.21.2.224:6875/3034296905] conn(0x555da5aa9c00 0x555da4a1c580 unknown :-1 s=NONE pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).connect`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 --> [v2:172.21.2.224:6874/3034296905,v1:172.21.2.224:6875/3034296905] -- osd_op(unknown.0.0:2 93.6 93:610446fc:::realms.87194b64-d43d-470f-a491-a67115d255a7:head [call version.read in=11b,read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e10788506) -- 0x555da5774000 con 0x555da5aa9c00`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.224:6874/3034296905,v1:172.21.2.224:6875/3034296905] conn(0x555da5aa9c00 0x555da4a1c580 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.224:6874/3034296905,v1:172.21.2.224:6875/3034296905] conn(0x555da5aa9c00 0x555da4a1c580 crc :-1 s=READY pgs=1044 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=osd.79 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 <== osd.79 v2:172.21.2.224:6874/3034296905 1 ==== osd_op_reply(2 realms.87194b64-d43d-470f-a491-a67115d255a7 [call out=48b,read 0~107 out=107b] v0'0 uv3 ondisk = 0) ==== 229+0+155 (crc 0 0 0) 0x555da573c500 con 0x555da5aa9c00`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.221:6832/4215531678,v1:172.21.2.221:6833/4215531678] conn(0x555da5774400 0x555da4a1e680 unknown :-1 s=NONE pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).connect`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 --> [v2:172.21.2.221:6832/4215531678,v1:172.21.2.221:6833/4215531678] -- osd_op(unknown.0.0:3 93.14 93:2d6507ee:::default.zone.87194b64-d43d-470f-a491-a67115d255a7:head [read 0~0] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e10788506) -- 0x555da5774800 con 0x555da5774400`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.221:6832/4215531678,v1:172.21.2.221:6833/4215531678] conn(0x555da5774400 0x555da4a1e680 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: --2- 172.21.2.202:0/2063761546 >> [v2:172.21.2.221:6832/4215531678,v1:172.21.2.221:6833/4215531678] conn(0x555da5774400 0x555da4a1e680 crc :-1 s=READY pgs=1213 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=osd.109 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: -- 172.21.2.202:0/2063761546 <== osd.109 v2:172.21.2.221:6832/4215531678 1 ==== osd_op_reply(3 default.zone.87194b64-d43d-470f-a491-a67115d255a7 [read 0~0] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) ==== 193+0+0 (crc 0 0 0) 0x555da573c780 con 0x555da5774400`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: rgw main: failed to load zone: (2) No such file or directory`
`Aug 23 17:26:58 reesi002 radosgw[1833516]: Couldn't init storage provider (RADOS)` |
2024-08-23T17:36:22.204Z | <Dan Mick> I'll try that radosgw-admin command |
2024-08-23T17:37:23.383Z | <Dan Mick> root@reesi001:/usr/share# radosgw-admin realm list
{
"default_info": "87194b64-d43d-470f-a491-a67115d255a7",
"realms": [
"default"
]
} |
2024-08-23T17:39:22.101Z | <Casey Bodley> i think the simplest workaround is to set `rgw_zone = default` in the mon config. i'm not positive about the exact syntax, but it would look something like `ceph config set client.rgw.rgw.quay rgw_zone default` |
2024-08-23T17:39:38.745Z | <Dan Mick> root@reesi001:/usr/share# radosgw-admin zone list
{
"default_info": "",
"zones": [
"default"
]
} |
2024-08-23T17:40:02.440Z | <Casey Bodley> `client.rgw.quay` also shows up in that config-key output |
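Since it isn't obvious which section the daemons actually read, a hedged version of that workaround would try both candidates seen above and then confirm what took effect:
```
ceph config set client.rgw.quay rgw_zone default
ceph config set client.rgw.rgw.quay rgw_zone default    # the other candidate section
ceph config dump | grep rgw_zone                        # confirm which one took effect
```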
2024-08-23T17:40:36.658Z | <Dan Mick> are there default zone/realm names, and are they "default"? You're implying 'no' |
2024-08-23T17:41:12.545Z | <Dan Mick> and, if we change it, maybe we should change it with a cephadm service ... manifest |
2024-08-23T17:41:16.467Z | <Dan Mick> or whatever it's called |
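If the change were made declaratively, a hypothetical spec based on the exported one above might look like the following; `rgw_zone: default` is added explicitly and `rgw_realm` is deliberately left out, and the filename and field placement under `spec:` are assumptions:
```
cat > rgw-quay.yaml <<'EOF'
service_type: rgw
service_id: quay
placement:
  count: 2
spec:
  rgw_frontend_type: beast
  rgw_zone: default
EOF
ceph orch apply -i rgw-quay.yaml
```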
2024-08-23T17:42:54.077Z | <Casey Bodley> "default" is overloaded here, but something created a realm with the `--default` option which means that everything now tries to run inside that realm |
2024-08-23T17:43:22.839Z | <Casey Bodley> for zones and zonegroups, there's fallback behavior to use the name "default" if nothing is specifically configured |
2024-08-23T17:43:36.475Z | <Dan Mick> ....but the normal state of realm is "not specified"? |
2024-08-23T17:43:48.409Z | <Dan Mick> (for a non-realm config) |
2024-08-23T17:44:51.357Z | <Casey Bodley> the normal state would be no realms exist |
2024-08-23T17:45:03.594Z | <Dan Mick> (I'm also trying to figure out if this is a behavior change that will catch someone else) |
2024-08-23T17:46:13.051Z | <Dan Mick> I'd be very surprised if any human specifically specified a realm for this. I can't be sure (multiple admins) but I'm 99.5% sure the service was configured only with the cephadm service description shown above. So I'm concerned that whatever cephadm did left a state that radosgw is now puking on in the transition from v18-v19 |
2024-08-23T17:46:17.888Z | <Casey Bodley> earlier in the release process we documented this in the release notes, but it's not the same issue:
> * rgw: On startup, radosgw and radosgw-admin now validate the ``rgw_realm``
> config option. Previously, they would ignore invalid or missing realms and
> go on to load a zone/zonegroup in a different realm. If startup fails with
> a "failed to load realm" error, fix or remove the ``rgw_realm`` option. |
2024-08-23T17:47:17.855Z | <Dan Mick> but, where would something have "created a realm with the --default option"? |
2024-08-23T17:47:32.364Z | <Dan Mick> would that not be in the mon config? |
2024-08-23T17:48:00.353Z | <Dan Mick> i.e. what is rgw trying to validate? I don't see a specification |
2024-08-23T17:48:45.432Z | <Casey Bodley> rgw wouldn't have done that itself. possible that someone tried `ceph rgw realm bootstrap -i <spec>`? i don't know if that adds `--default`, but it shouldn't |
2024-08-23T17:49:16.429Z | <Dan Mick> I really doubt it. All the users of that rgw instance are likely unaware that realms even exist |
2024-08-23T17:49:27.028Z | <Casey Bodley> my recollection is that people hit that "failed to load realm" error when they accidentally included an `rgw_realm` field in their rgw service spec |
2024-08-23T17:49:44.202Z | <Casey Bodley> before squid that didn't cause an error |
2024-08-23T17:49:57.279Z | <Dan Mick> but even if they did something like ceph rgw realm, where would that leave info if not the mon config? |
2024-08-23T17:50:32.126Z | <Casey Bodley> it would create rados objects that would be visible to the `radosgw-admin realm list` command |
2024-08-23T17:52:03.310Z | <Dan Mick> so we have a situation where radosgw now does something with rgw_realm in the config; since that doesn't exist, what does it do? |
2024-08-23T17:52:19.559Z | <Dan Mick> sorry to belabor this, I just want to understand the theory to understand the wider impact |
2024-08-23T17:55:12.302Z | <Casey Bodley> if the `rgw_realm` option is not configured, it looks for the rados object `default.realm` and, if found, loads that realm |
2024-08-23T17:55:43.061Z | <Dan Mick> can I look for that object? its ctime/mtime might be useful. what pool? |
2024-08-23T17:56:11.885Z | <Casey Bodley> should be in `rgw.root` |
2024-08-23T17:56:20.187Z | <Dan Mick> .rgw.root? |
2024-08-23T17:56:41.873Z | <Dan Mick> (leading .) |
2024-08-23T17:57:14.545Z | <Casey Bodley> right |
2024-08-23T17:57:54.530Z | <Dan Mick> ($87194b64-d43d-470f-a491-a67115d255a7 |
2024-08-23T17:58:26.748Z | <Dan Mick> (which, save for the leading chars, is what realm list shows) |
2024-08-23T17:59:06.126Z | <Casey Bodley> that's the realm id, yeah |
2024-08-23T17:59:19.462Z | <Dan Mick> modified 2/22/2024 |
2024-08-23T18:00:20.334Z | <Dan Mick> so...what's rgw failing to find? |
2024-08-23T18:01:21.620Z | <Casey Bodley> because it finds a realm, it tries to load the realm's default zone from the rados object `default.zone.87194b64-d43d-470f-a491-a67115d255a7` but finds none |
2024-08-23T18:01:37.918Z | <Casey Bodley> if it hadn't found the realm, it would fall back to zone/zonegroups named "default" |
2024-08-23T18:02:04.852Z | <Casey Bodley> so something created a realm, but didn't put anything in it |
2024-08-23T18:02:15.583Z | <Dan Mick> fwiw: |
2024-08-23T18:02:16.500Z | <Dan Mick> root@reesi001:/usr/share# radosgw-admin realm get-default
default realm: 87194b64-d43d-470f-a491-a67115d255a7 |
2024-08-23T18:02:33.343Z | <Dan Mick> and, ok |
2024-08-23T18:04:16.303Z | <Dan Mick> indeed, there is no default.zone.8... |
2024-08-23T18:04:23.832Z | <Dan Mick> there is a default.zonegroup.8.... |
2024-08-23T18:04:39.193Z | <Casey Bodley> yeah that was in the debug-ms=1 output |
2024-08-23T18:04:58.329Z | <Dan Mick> ugh, can I really not rados get to stdout |
2024-08-23T18:05:40.274Z | <Dan Mick> (well I can specify /dev/stdout as the file) |
2024-08-23T18:06:23.900Z | <Dan Mick> # rados -p .rgw.root ls | sort
default.realm
default.zonegroup.87194b64-d43d-470f-a491-a67115d255a7
period_config.87194b64-d43d-470f-a491-a67115d255a7
periods.87194b64-d43d-470f-a491-a67115d255a7:staging
periods.87194b64-d43d-470f-a491-a67115d255a7:staging.latest_epoch
periods.87254aaf-7abe-4c13-bc70-27b1976ac684.1
periods.87254aaf-7abe-4c13-bc70-27b1976ac684.2
periods.87254aaf-7abe-4c13-bc70-27b1976ac684.3
periods.87254aaf-7abe-4c13-bc70-27b1976ac684.4
periods.87254aaf-7abe-4c13-bc70-27b1976ac684.5
periods.87254aaf-7abe-4c13-bc70-27b1976ac684.6
periods.87254aaf-7abe-4c13-bc70-27b1976ac684.latest_epoch
periods.c752bb6f-a9bf-4160-af9f-da1646fa42de.1
periods.c752bb6f-a9bf-4160-af9f-da1646fa42de.latest_epoch
realms.87194b64-d43d-470f-a491-a67115d255a7
realms.87194b64-d43d-470f-a491-a67115d255a7.control
realms_names.default
zonegroup_info.d3419277-bb38-437f-a610-1aa77b0989df
zonegroups_names.default
zone_info.0332d5b4-b7b6-4a18-ad00-aa1a16c66256
zone_names.default |
2024-08-23T18:07:26.041Z | <Dan Mick> zone_names.default contains 033.. |
2024-08-23T18:08:34.055Z | <Casey Bodley> wow. from the periods.* objects i take that to mean that someone created two other realms with ids `c752bb6f-a9bf-4160-af9f-da1646fa42de` and `87254aaf-7abe-4c13-bc70-27b1976ac684` and deleted them |
2024-08-23T18:08:34.455Z | <Dan Mick> all of those objects are from 02/22 |
2024-08-23T18:09:22.821Z | <Dan Mick> except for realms.87194b64-d43d-470f-a491-a67115d255a7.control, from 8/21 |
2024-08-23T18:09:34.191Z | <Dan Mick> and zone_names.default, from 2018 (!) |
2024-08-23T18:10:50.787Z | <Dan Mick> (you have sudo on reesi001 which is where I'm doing this, btw) |
2024-08-23T18:15:05.649Z | <Dan Mick> so I'm coming back to "this was a working config before the upgrade, even if it had hinky stuff in the rados objects/mon config". Is there a transition that others may need to adjust to? (and I'll point out that I didn't even glance at the release notes, which it now occurs to me was probably a bad idea) |
2024-08-23T18:18:26.076Z | <Casey Bodley> yeah i think we'll need to come up with another release note for this |
2024-08-23T18:18:27.509Z | <Dan Mick> (and when I say "from" I'm talking about mtime from rados stat2) |
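For reference, the kind of commands behind those timestamps, using object names from the listing above (`stat2` is the variant Dan names, and `/dev/stdout` is the trick he mentions earlier):
```
rados -p .rgw.root stat2 default.realm
rados -p .rgw.root stat2 zone_names.default
rados -p .rgw.root get default.realm /dev/stdout
```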
2024-08-23T18:18:37.993Z | <Casey Bodley> i'll discuss with the team |
2024-08-23T18:19:03.048Z | <Casey Bodley> for now i'm considering a `sudo rados -p .rgw.root rm default.realm` to get the LRC going again |
2024-08-23T18:19:28.536Z | <Casey Bodley> any objection? i'm not sure the `ceph config set` for rgw_zone i suggested before will work |
2024-08-23T18:20:43.629Z | <Dan Mick> so if default.realm doesn't exist, it'll use realm name 'default' and open zone_names.default?.. |
2024-08-23T18:21:54.292Z | <Casey Bodley> if rgw_realm is empty and `default.realm` doesn't exist and rgw_zone is empty, it will load `zone_names.default` |
2024-08-23T18:23:14.308Z | <Casey Bodley> <https://github.com/ceph/ceph/blob/squid/src/rgw/driver/rados/rgw_zone.cc#L1192-L1223> is the algorithm i'm looking at |
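A non-authoritative summary of that lookup order, pieced together from Casey's descriptions in this thread rather than from the code itself:
```
# 1. rgw_realm set in config          -> load that realm by name; startup now fails if it's missing
# 2. rgw_realm empty                  -> read .rgw.root/default.realm; if it exists, load that realm
#                                        and then default.zone.<realm-id>   <- the step failing here
# 3. no default.realm, rgw_zone empty -> fall back to zone_names.default / zonegroups_names.default
rados -p .rgw.root ls | sort          # shows which of these objects exist on a given cluster
```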
2024-08-23T18:23:18.607Z | <Dan Mick> so is this a theory: config probably got a specific realm **named** default created at some point, id 8719... |
2024-08-23T18:23:39.760Z | <Dan Mick> evidenced by the 2/22 default.realm and the periods.8719.. files |
2024-08-23T18:24:21.215Z | <Dan Mick> that was innocuous until 19.1.1 when rgw started caring about default.realm and trying to use it as the base for finding the zone, and it can't |
2024-08-23T18:24:47.913Z | <Dan Mick> well. started caring that the contents of default.realm were valid |
2024-08-23T18:25:24.801Z | <Dan Mick> I'll collect copies of all the objects in .rgw.root before changing anything |
2024-08-23T18:26:15.404Z | <Dan Mick> in /root/rgw |
2024-08-23T18:27:33.570Z | <Dan Mick> does that theory of the failure sound right to you? |
2024-08-23T18:30:54.379Z | <Dan Mick> (while you ponder that, I'll remove default.realm, since I have a copy, and try restarting 002) |
2024-08-23T18:31:58.915Z | <Dan Mick> well it came up |
2024-08-23T18:32:07.250Z | <Dan Mick> trying 006 |
2024-08-23T18:33:12.082Z | <Dan Mick> it came up too |
2024-08-23T18:33:19.123Z | <Casey Bodley> wonderful |
2024-08-23T18:33:43.343Z | <Dan Mick> cephadm hasn't noticed yet but yeah, good calls |
2024-08-23T18:35:13.015Z | <Dan Mick> trying ceph orch ps --refresh to get it to notice |
2024-08-23T18:36:08.334Z | <Dan Mick> orch ls --refresh, or time, shows it 2/2 now |
2024-08-23T18:36:11.018Z | <Dan Mick> yay |
2024-08-23T18:36:17.989Z | <Casey Bodley> tyvm Dan |
2024-08-23T18:36:35.372Z | <Casey Bodley> what a mess :( |
2024-08-23T18:36:39.625Z | <Dan Mick> I mean ty, couldn't have done anything without you |
2024-08-23T18:37:20.256Z | <Dan Mick> but yeah, do let's drive this to a potential customer impact eval |
2024-08-23T18:37:30.644Z | <Casey Bodley> this multisite config stuff is and always has been full of land mines. this is what we get for trying to clean up/simplify |
2024-08-23T18:37:40.708Z | <Dan Mick> it's complex |
2024-08-23T18:39:52.879Z | <Casey Bodley> it's complicated, but also tries to guess what the admin meant if it doesn't find a valid configuration. makes it really hard to reason about |
2024-08-23T18:43:42.967Z | <Dan Mick> I'm pretty sure Zack set that rgw instance up. I'll ask him if he remembers any confusion about realm/zone/zonegroup |
2024-08-23T18:44:03.892Z | <Dan Mick> but I do remember him telling me "I just followed the instructions for cephadm and it just worked" (in essence) |
2024-08-23T18:44:21.166Z | <Dan Mick> or, you know, cephadm-behind-ceph-orch |
2024-08-23T18:46:18.409Z | <Dan Mick> So the rgw instances behind [quay.ceph.io](http://quay.ceph.io) are back up, and I believe quay is probably working again. Partial explanation: some preexisting configuration (perhaps questionable) was confusing a new startup behavior in radosgw, and radosgw concluded that it could not start. The config was stored in a rados object; @Casey Bodley walked me through where and why, I backed up all the objects in the .rgw.root pool for the postmortem, removed the default.realm object, and the two gateways are now running without apparent problem. |
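Condensed, the recovery sequence described above looks roughly like this; the restart commands are assumptions, since the thread doesn't show exactly how the gateways were restarted:
```
rados -p .rgw.root rm default.realm                 # only after backing up .rgw.root, as above
ceph orch daemon restart rgw.quay.reesi002.anxbdb
ceph orch daemon restart rgw.quay.reesi006.zmjsox
ceph orch ps --refresh                              # nudge cephadm to re-check daemon status
```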
2024-08-23T19:20:50.276Z | <Casey Bodley> opened <https://tracker.ceph.com/issues/67697> and <https://github.com/ceph/ceph/pull/59422> |
2024-08-23T19:43:09.374Z | <Æmerson> Thank you for all the work from both of you. |
2024-08-23T21:08:42.118Z | <Dan Mick> fwiw, I reviewed the 19.1.0 and 19.1.1 release emails and didn't see anything that would have made me be cautious |
2024-08-23T21:11:32.892Z | <Dan Mick> also checked with zack; he has no memory of specifying anything about realm or zone |
2024-08-23T21:11:57.338Z | <Dan Mick> so I suspect we got at least some of this courtesy of cephadm. I don't know if there's a reasonable way to be sure |